Code and data-processing workflow for training FoodProX random-forest models and generating FPro, a continuous food-processing score derived from NOVA-class probability vectors.
FoodProX is a machine-learning framework that uses nutrient composition profiles to infer the degree of food processing. Rather than relying only on discrete NOVA labels, FoodProX generates a four-class probability vector for each food item and projects that vector onto a continuous processing axis, producing the FPro score.
This repository provides a reproducible workflow to:
- load USDA manually labeled food-composition data;
- train random-forest classifiers using nutrient profiles;
- evaluate class-specific ROC-AUC and average precision across cross-validation folds;
- generate FPro scores for foods using fold-averaged class probabilities;
- compare model variants using 58-nutrient and 57-nutrient feature sets.
```
.
├── README.md
├── Score_Generation.ipynb
├── functions_for_evaluation.py
├── scoring.py
├── environment.yml
├── FNDDS_SR_combined_58_nutrients.csv
├── Metrics_GitHub/
├── Outputs_GitHub/
└── .gitignore
```
The notebook also creates a local Models_GitHub/ directory when it is executed. This directory contains trained model binaries and is not tracked in the repository because the files are large and can be regenerated from the notebook.
- `Score_Generation.ipynb`: Main notebook for data loading, model training, cross-validation, scoring, and visualization.
- `functions_for_evaluation.py`: Helper functions for multiclass ROC-AUC (AUC), average precision/AUPRC (AUP), ROC/PR curves, and fold-wise model training/evaluation.
- `scoring.py`: Contains `classify_db`, which applies trained fold-specific models to a food database and computes averaged class probabilities and FPro scores.
- `environment.yml`: Conda environment specification.
- `FNDDS_SR_combined_58_nutrients.csv`: Input food-composition dataset used by the notebook.
- `Metrics_GitHub/`: Cross-validation metrics and curve objects generated by the workflow.
- `Outputs_GitHub/`: Generated FPro output files.
The input NOVA labels are stored in the column `novaclass`.

For model training, the notebook defines:

```python
pythonlabel = novaclass - 1
```

Therefore, the model class order is:

| `pythonlabel` | NOVA class |
|---|---|
| 0 | NOVA 1 |
| 1 | NOVA 2 |
| 2 | NOVA 3 |
| 3 | NOVA 4 |
Rows with novaclass = 0 become pythonlabel = -1 and are excluded from model training.
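The label shift and the exclusion of unlabeled rows can be illustrated with a small pandas sketch. Only the column names `novaclass` and `pythonlabel` come from this README; the `food_code` column and the toy values are assumptions made for the example, and the notebook's own filtering code may differ.

```python
import pandas as pd

# Toy rows; `food_code` is a hypothetical identifier column.
df = pd.DataFrame({
    "food_code": [101, 102, 103, 104, 105],
    "novaclass": [1, 4, 0, 2, 3],
})

# Shift NOVA classes 1..4 to model labels 0..3; unlabeled rows (0) become -1.
df["pythonlabel"] = df["novaclass"] - 1

# Keep only labeled rows for training.
train_df = df[df["pythonlabel"] >= 0].reset_index(drop=True)
print(train_df)
```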
For each food item
This formulation maps the minimally processed vertex to
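The published FoodProX definition (Menichetti et al., 2023) computes FPro from the first and last class probabilities as FPro = (1 - p1 + p4) / 2, which places a food with p1 = 1 at 0 and a food with p4 = 1 at 1. The sketch below assumes this repository follows that definition; the function name `fpro_from_probabilities` is hypothetical and is not part of `scoring.py`.

```python
import numpy as np

def fpro_from_probabilities(proba):
    """Assumed FPro definition: (1 - p1 + p4) / 2 for rows ordered NOVA 1..4."""
    proba = np.asarray(proba, dtype=float)
    return (1.0 - proba[..., 0] + proba[..., 3]) / 2.0

examples = np.array([
    [1.0, 0.0, 0.0, 0.0],      # all mass on NOVA 1
    [0.0, 0.0, 0.0, 1.0],      # all mass on NOVA 4
    [0.25, 0.25, 0.25, 0.25],  # uniform over the four classes
])
print(fpro_from_probabilities(examples))
```

Because the score uses the full probability vector, two foods assigned the same hard class can still receive different FPro values.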
The notebook trains four random-forest model variants:
| Model | Feature set | Training-set definition |
|---|---|---|
| Model 1 | 58 nutrients | Full food profiles (unique food code, NOVA class, nutrient profile) |
| Model 2 | 58 nutrients | Unique NOVA–nutrient profile pairs |
| Model 3 | 57 NDSR-compatible nutrients | Full food profiles (unique food code, NOVA class, nutrient profile) |
| Model 4 | 57 NDSR-compatible nutrients | Unique NOVA–nutrient profile pairs |
The 57-nutrient models are intended for studies that rely on NDSR-compatible nutrient profiles.
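One way to read the two training-set definitions is as two different de-duplication keys. The sketch below is an interpretation, not the notebook's code: the nutrient names, the `food_code` column, and the use of `drop_duplicates` are all assumptions made for illustration.

```python
import pandas as pd

nutrient_columns = ["protein", "fat", "sodium"]  # hypothetical nutrient names

df = pd.DataFrame({
    "food_code": [1, 2, 3, 4],
    "novaclass": [1, 1, 4, 4],
    "protein": [2.0, 2.0, 5.0, 5.0],
    "fat": [0.1, 0.1, 20.0, 21.0],
    "sodium": [3.0, 3.0, 400.0, 400.0],
})

# Full food profiles: one row per unique food code.
full_profiles = df.drop_duplicates(subset=["food_code"])

# Unique NOVA-nutrient pairs: foods that share a class and an identical
# nutrient profile collapse to a single training row.
unique_pairs = df.drop_duplicates(subset=["novaclass"] + nutrient_columns)

print(len(full_profiles), len(unique_pairs))
```

In this toy table, foods 1 and 2 have the same class and the same nutrient values, so they count once under the second definition.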
All models use the same fixed random-forest configuration:
```python
params_defined = {
    "n_estimators": 500,
    "max_features": "sqrt",
    "max_depth": 20
}
```

This choice reflects the objective of the model. The classifier is not used primarily as a discrete NOVA-label predictor; rather, it is used to generate class-probability vectors from which FPro is computed as a continuous projection score. Therefore, the goal is to obtain stable probability-derived scores, not to maximize discrete classification performance through extensive hyperparameter optimization.
We deliberately avoided exhaustive hyperparameter tuning for two reasons. First, the labeled reference set is relatively large compared with the number of nutrient features, and the random-forest model already achieves strong cross-validated discrimination across NOVA classes. Second, hyperparameter tuning would require withholding additional data or introducing a nested model-selection layer, whereas our priority was to expose the model to as much labeled information as possible to improve the stability and resolution of the FPro probability surface.
This decision is also supported by previous FoodProX analyses and by sensitivity checks in the present implementation, which showed that random-forest performance and FPro behavior were stable across a wide range of hyperparameter choices. In particular, allowing deeper trees did not materially change cross-validated AUC or average precision across the evaluated model variants. We therefore retained a single fixed configuration for all models to ensure comparability across nutrient panels and training-set definitions.
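As a rough illustration of how a fixed configuration can be trained and evaluated fold by fold, the sketch below fits scikit-learn random forests with `params_defined` on synthetic data and reports one-vs-rest ROC-AUC and average precision per class. The synthetic data, the choice of 5 stratified folds, and the `random_state` values are assumptions; the repository's own routines live in `functions_for_evaluation.py` and may be organized differently.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Fixed configuration quoted in this README.
params_defined = {"n_estimators": 500, "max_features": "sqrt", "max_depth": 20}

# Synthetic stand-in for the nutrient matrix and the 0..3 class labels.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Assumption: 5 stratified folds; the notebook's actual split may differ.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models, auc_per_fold, aup_per_fold = [], [], []

for train_idx, test_idx in cv.split(X, y):
    model = RandomForestClassifier(random_state=0, n_jobs=-1, **params_defined)
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[test_idx])  # columns follow model.classes_

    # One-vs-rest metrics for each class on the held-out fold.
    auc_per_fold.append([
        roc_auc_score(y[test_idx] == c, proba[:, i])
        for i, c in enumerate(model.classes_)
    ])
    aup_per_fold.append([
        average_precision_score(y[test_idx] == c, proba[:, i])
        for i, c in enumerate(model.classes_)
    ])
    models.append(model)

auc_per_fold = np.array(auc_per_fold)  # shape: (n_folds, n_classes)
aup_per_fold = np.array(aup_per_fold)
print("mean AUC per class:", auc_per_fold.mean(axis=0).round(3))
print("mean AUP per class:", aup_per_fold.mean(axis=0).round(3))
```

Keeping one model per fold, as the `models` list does here, is what later allows probabilities to be averaged across folds.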
Create the conda environment:

```bash
conda env create -f environment.yml
conda activate food_pro_py311
```

Then launch Jupyter:

```bash
jupyter notebook
```

or:

```bash
jupyter lab
```

From the repository root, open `Score_Generation.ipynb` and run all cells.
The notebook creates the following output directories if they do not already exist:
Models_GitHub/
Metrics_GitHub/
Outputs_GitHub/
Metrics_GitHub/ and Outputs_GitHub/ are included in this repository to provide generated metrics and FPro outputs. Models_GitHub/ is generated locally when the notebook is run, but it is excluded from version control because trained model binaries are large and can be regenerated.
The main scoring function is:

```python
from scoring import classify_db
```

Example:

```python
db_scored = classify_db(
    db=input_dataframe,
    model_per_fold=models,
    nut_sel=nutrient_columns
)
```

The function returns the input dataframe with additional columns including fold-level probabilities, averaged probabilities, `FPro`, `std_FPro`, `min_FPro`, `max_FPro`, and final class calls.
To use classify_db directly, users must provide trained fold-specific models. These can be generated by running Score_Generation.ipynb. If pre-trained models are released separately, they should be placed in a local Models_GitHub/ directory before scoring.
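To make the fold-level aggregation concrete, the sketch below averages per-fold probability matrices and derives summary columns named after those listed above. It is not the implementation of `classify_db`: the helper `summarize_folds` is hypothetical, the FPro formula (1 - p1 + p4) / 2 is taken from the FoodProX paper and assumed to apply here, and computing the spread statistics over per-fold FPro values and the class call by argmax are assumptions.

```python
import numpy as np

def summarize_folds(fold_probas, fold_classes):
    """Hypothetical helper: average fold-level probabilities and summarize FPro."""
    # Refuse to average if the folds order their classes differently.
    reference = list(fold_classes[0])
    if any(list(classes) != reference for classes in fold_classes):
        raise ValueError("fold-specific models disagree on class order")

    # Shape: (n_folds, n_foods, 4).
    stacked = np.stack([np.asarray(p, dtype=float) for p in fold_probas])
    mean_proba = stacked.mean(axis=0)

    # Assumed FPro definition from the FoodProX paper: (1 - p1 + p4) / 2.
    fold_fpro = (1.0 - stacked[..., 0] + stacked[..., 3]) / 2.0

    return {
        "p": mean_proba,
        "FPro": fold_fpro.mean(axis=0),
        "std_FPro": fold_fpro.std(axis=0),
        "min_FPro": fold_fpro.min(axis=0),
        "max_FPro": fold_fpro.max(axis=0),
        # Assumed class call: most probable label, shifted back to NOVA 1..4.
        "nova_call": np.asarray(reference)[mean_proba.argmax(axis=1)] + 1,
    }

fold_probas = [
    [[0.8, 0.1, 0.1, 0.0], [0.0, 0.1, 0.2, 0.7]],  # fold 1, two foods
    [[0.6, 0.2, 0.1, 0.1], [0.1, 0.1, 0.2, 0.6]],  # fold 2, same foods
]
fold_classes = [[0, 1, 2, 3], [0, 1, 2, 3]]
summary = summarize_folds(fold_probas, fold_classes)
print(summary["FPro"], summary["nova_call"])
```

Because the assumed formula is linear in the probabilities, the mean of the per-fold FPro values equals the FPro of the fold-averaged probabilities.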
For each model variant, the workflow saves:
- trained fold-specific random-forest models in the local, untracked `Models_GitHub/` directory;
- cross-validation AUC/AUP metrics;
- ROC and precision-recall curve objects;
- train/test split indices;
- scored food database files with FPro values.
Large trained model files are not tracked in the repository. This keeps the repository lightweight and avoids storing generated binary artifacts in Git history. The models can be regenerated by running the notebook from the repository root.
The recommended `.gitignore` includes:

```
Models_GitHub/
__pycache__/
.ipynb_checkpoints/
.DS_Store
*.pyc
```

- The code assumes that input nutrient values are aligned with the nutrient names used during training.
- `classify_db` checks that all fold-specific models have the same class order.
- FPro is computed from class probabilities, not from hard class labels.
- The probability columns `p1`, `p2`, `p3`, and `p4` correspond to NOVA classes 1, 2, 3, and 4.
- `Metrics_GitHub/` and `Outputs_GitHub/` contain generated files from the workflow; they can be regenerated by rerunning the notebook.
- Menichetti G. et al. Machine learning prediction of the degree of food processing. Nature Communications, 2023. https://www.nature.com/articles/s41467-023-37457-1
- Ispirova G., Sebek M., Menichetti G. Informatics for Food Processing. arXiv, 2025. https://arxiv.org/abs/2505.17087