This documentation gives an overview of the typical steps required to aggregate the annotations produced by our pipeline for further classifier training.
- Log in to MN5 and load the required environment:

```bash
ml impi intel mkl hdf5 python/3.11.5-gcc
source /home/frau/frau435699/ehpc17/richard/repositories/marenostrum5-tools/mn5_ml_filter_deployment/venvs/ml_filter_build_974c508ce77f9e9d92996a2e0ff80190b37f9654/bin/activate  # load your own environment instead
```

  The environment is built and deployed using marenostrum5-tools.
- The results of our experiments should be collected in the following folder:

```
/gpfs/projects/ehpc17/results/prompt_based_annotations
```

- We have multiple annotations per document to account for the randomness of our decoding strategy. These scores have to be aggregated into a single score per document, which can be done with ml_filter. To aggregate the scores in all jsonl files under a given directory (e.g. for all of the 37 languages), you can run the following commands:
```bash
start_dir="/gpfs/projects/ehpc17/results/prompt_based_annotations/educational_content/Llama-3.3-70B-Instruct"
target_dir="${start_dir}_aggregated"
find "$start_dir" -type f -name "*.jsonl" | while read -r file; do
    parent_dir=$(dirname "$file")
    python3.11 ml_filter aggregate_scores "$parent_dir" "$target_dir" --aggregation majority --labels 0,1,2,3,4,5 --raw_data_lookup_dir /gpfs/scratch/ehpc17/dqa/data/fineweb_2_500k_both_deduplicated
done
```

- The results of step 3 can be transferred to a machine with internet access and from there uploaded to Hugging Face.
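To make the `--aggregation majority` step concrete, the following is a minimal sketch of how multiple annotations for one document could be reduced to a single score by majority vote. This is an illustration only, not ml_filter's actual implementation; in particular, the tie-breaking rule (lowest label wins) and the filtering of out-of-range labels are assumptions.

```python
from collections import Counter

def aggregate_majority(scores, labels=(0, 1, 2, 3, 4, 5)):
    """Collapse several annotations for one document into a single score.

    Scores outside the allowed label set are dropped (assumption);
    ties are broken in favor of the lower label (assumption).
    Returns None if no valid score remains.
    """
    valid = [s for s in scores if s in labels]
    if not valid:
        return None
    counts = Counter(valid)
    # max by (vote count, -label): most frequent label wins,
    # and on a tie the numerically lower label is preferred
    label, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return label
```

For example, three annotations `[3, 3, 4]` would aggregate to `3`, while an out-of-range score such as `7` would simply be ignored.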