MLE is a machine learning engineer that reproduces a paper, trains a model, or fine-tunes a model for you with a single prompt like: "I want to train ConvNeXt on the CIFAR-10 dataset".
This codebase is a template for a general machine learning workflow that applies a model to a dataset:
preprocess
train
evaluate
For example, suppose we want to fine-tune MedGemma 1.5 on the FLARE-MLLM-2D dataset:
Fine-tune MedGemma 1.5 (https://huggingface.co/google/medgemma-1.5-4b-it) on the FLARE-MLLM-2D dataset (https://huggingface.co/datasets/FLARE-MedFM/FLARE-MLLM-2D). For evaluation, please report:
- Balanced accuracy for the disease diagnostic classification
- Mean Absolute Error (MAE) for cell counting
- F1 score matching via IoU > 0.5 for detection
- F1 score for multi-label classification
- Mean Absolute Error (MAE) for regression
- GREEN score for report generation
For the GREEN score computation, use the implementation in https://github.com/ATATC/GREEN.
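As a concrete illustration of two of the less obvious metrics above, here is a minimal sketch of balanced accuracy and IoU-matched detection F1 in plain Python. These are generic reference implementations, not the ones this codebase ships; boxes are assumed to be (x1, y1, x2, y2) tuples.

```python
# Sketch implementations of two of the metrics above; generic reference
# versions for illustration, not the codebase's own evaluation code.

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: robust to class imbalance."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def detection_f1(pred_boxes, gt_boxes, thr=0.5):
    """Greedily match each prediction to an unmatched ground-truth box
    at IoU > thr, then compute F1 over the resulting matches."""
    matched, tp = set(), 0
    for p in pred_boxes:
        best, best_iou = None, thr
        for j, g in enumerate(gt_boxes):
            if j not in matched and iou(p, g) > best_iou:
                best, best_iou = j, iou(p, g)
        if best is not None:
            matched.add(best)
            tp += 1
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0
```

A true positive here is a prediction matched to a previously unmatched ground-truth box with IoU strictly above the threshold; unmatched predictions count against precision and unmatched ground-truth boxes against recall.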
This repository is the worked example for that prompt.
The commands are the same as on Erbium, but you need to use these flags to specify the paths:
--root_dir path/to/project/root
--input_dir path/to/input/directory
--output_dir path/to/output/directory

Your dataset should be available at "{INPUT_DIR}/{DATASET_NAME}".
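The "{INPUT_DIR}/{DATASET_NAME}" layout convention can be sketched as follows; the paths and dataset name below are illustrative placeholders, not values the tool requires.

```python
from pathlib import Path

# Illustrative placeholder values; substitute the real path you pass
# via --input_dir. The layout convention is "{INPUT_DIR}/{DATASET_NAME}".
input_dir = Path("path/to/input/directory")
dataset_name = "FLARE-MLLM-2D"

dataset_dir = input_dir / dataset_name
print(dataset_dir.as_posix())  # path/to/input/directory/FLARE-MLLM-2D
```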
If you are working inside a fork of MLE, you can install it directly from GitHub.
pip install git+https://github.com/your-username/your-forked-repo

If you cloned MLE and are working locally, upload the source files to "/workspace/app" and install it from there.
cd /workspace/app
pip install -e .
python -m mle preprocess
python -m mle train --num_epochs=1000 --batch_size=2 --learning_rate=0.0004
python -m mle evaluate segmentation

Create a virtual environment and install some critical dependencies first.
module load python/3.12
module load arrow
module load cuda
virtualenv /scratch/${USER}/venv
source /scratch/${USER}/venv/bin/activate
pip install --no-index --upgrade pip
pip install --no-index simpleitk  # critical dependency whose wheel is too slow to build

Note that unlike Erbium, where the file structure is enforced for you, you probably need to create the input and output directories yourself on SLURM clusters.
mkdir /scratch/${USER}/input
mkdir /scratch/${USER}/output

Your dataset should be available at "/scratch/${USER}/input/{DATASET_NAME}".
If you are working inside a fork of MLE, you can install it directly from GitHub.
pip install git+https://github.com/your-username/your-forked-repo

If you cloned MLE and are working locally, upload the source files to "/scratch/${USER}/app" and install it from there.
cd /scratch/${USER}/app
pip install -e .

You can use the dra-config skills to generate the job script, or use the following template.
#!/bin/bash
#SBATCH --job-name=
#SBATCH --account=
#SBATCH --time=
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=
#SBATCH --mem=
#SBATCH --gpus-per-node=h100:1
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
# virtual environment
set -euo pipefail
module --force purge
module load StdEnv/2023 || true
module load python/3.12 || true
module load arrow || true
module load cuda || true
# authentication
...
python -m mle -c slurm -suser ${USER} ...

You can have a JSON or YAML file with the arguments you want to pass to the engine.
Suppose you have "path/to/custom-args.yaml", simply add a flag to the command like:
python -m mle ... --custom_args path/to/custom-args.yaml
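The exact schema of the custom-args file is defined by MLE; as a rough sketch, assuming a flat key-value mapping whose entries override the engine defaults, the merge could look like this (the keys and default values below are hypothetical):

```python
import json

# Hypothetical defaults and keys, for illustration only; the real schema
# is whatever the MLE engine accepts via --custom_args.
defaults = {"num_epochs": 100, "batch_size": 8, "learning_rate": 1e-3}

# Contents of a JSON custom-args file (a YAML file would parse to the
# same dict with a YAML loader).
custom_args_text = '{"num_epochs": 1000, "batch_size": 2}'

# File values override defaults; unspecified keys keep their defaults.
args = {**defaults, **json.loads(custom_args_text)}
print(args["num_epochs"], args["learning_rate"])  # 1000 0.001
```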