FoodProX-FPro-Score-Generation

Code and data-processing workflow for training FoodProX random-forest models and generating FPro, a continuous food-processing score derived from NOVA-class probability vectors.

Overview

FoodProX is a machine-learning framework that uses nutrient composition profiles to infer the degree of food processing. Rather than relying only on discrete NOVA labels, FoodProX generates a four-class probability vector for each food item and projects that vector onto a continuous processing axis, producing the FPro score.

This repository provides a reproducible workflow to:

load manually labeled USDA food-composition data;
train random-forest classifiers using nutrient profiles;
evaluate class-specific ROC-AUC and average precision across cross-validation folds;
generate FPro scores for foods using fold-averaged class probabilities;
compare model variants using 58-nutrient and 57-nutrient feature sets.
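
The class-specific evaluation step listed above can be sketched as follows. This is a minimal illustration on synthetic data, not the code in functions_for_evaluation.py: the dataset, the 5-fold stratified split, and the reduced `n_estimators=50` (chosen only for speed) are all assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import label_binarize

# Synthetic stand-in for the nutrient matrix and the 0..3 class labels.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
classes = np.arange(4)

auc_per_fold, ap_per_fold = [], []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # Small forest for speed only; not the repository's configuration.
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    proba = clf.predict_proba(X[test_idx])  # shape (n_test, 4)
    # One-vs-rest indicator matrix, one column per class.
    y_bin = label_binarize(y[test_idx], classes=classes)
    auc_per_fold.append(
        [roc_auc_score(y_bin[:, c], proba[:, c]) for c in classes]
    )
    ap_per_fold.append(
        [average_precision_score(y_bin[:, c], proba[:, c]) for c in classes]
    )

auc_per_fold = np.array(auc_per_fold)  # rows: folds, columns: classes
ap_per_fold = np.array(ap_per_fold)
print("mean AUC per class:", auc_per_fold.mean(axis=0))
print("mean AP per class:", ap_per_fold.mean(axis=0))
```

Each class is scored one-vs-rest, so every fold yields four AUC values and four average-precision values that can then be summarized across folds.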

Repository structure

.
├── README.md
├── Score_Generation.ipynb
├── functions_for_evaluation.py
├── scoring.py
├── environment.yml
├── FNDDS_SR_combined_58_nutrients.csv
├── Metrics_GitHub/
├── Outputs_GitHub/
└── .gitignore

The notebook also creates a local Models_GitHub/ directory when it is executed. This directory contains trained model binaries and is not tracked in the repository because the files are large and can be regenerated from the notebook.

Main files

Score_Generation.ipynb
Main notebook for data loading, model training, cross-validation, scoring, and visualization.
functions_for_evaluation.py
Helper functions for multiclass ROC-AUC (AUC), average precision/AUPRC (AUP), ROC/PR curves, and fold-wise model training/evaluation.
scoring.py
Contains classify_db, which applies trained fold-specific models to a food database and computes averaged class probabilities and FPro scores.
environment.yml
Conda environment specification.
FNDDS_SR_combined_58_nutrients.csv
Input food-composition dataset used by the notebook.
Metrics_GitHub/
Cross-validation metrics and curve objects generated by the workflow.
Outputs_GitHub/
Generated FPro output files.

Data and label mapping

The input NOVA labels are stored in the column:

novaclass

For model training, the notebook defines:

pythonlabel = novaclass - 1

Therefore, the model class order is:

`pythonlabel`	NOVA class
0	NOVA 1
1	NOVA 2
2	NOVA 3
3	NOVA 4

Rows with novaclass = 0 become pythonlabel = -1 and are excluded from model training.
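
As a small illustration of this mapping, the sketch below builds a toy dataframe and applies the same shift. Only `novaclass` and `pythonlabel` come from the notebook; the `protein` column, the toy values, and the `>= 0` filter are assumptions for this example.

```python
import pandas as pd

# Toy data: novaclass = 0 marks a row without a usable NOVA label.
df = pd.DataFrame({
    "novaclass": [1, 4, 0, 3, 2],
    "protein": [1.2, 6.5, 3.0, 8.1, 0.0],  # hypothetical nutrient column
})

# Shift NOVA classes 1..4 to zero-based labels 0..3.
df["pythonlabel"] = df["novaclass"] - 1

# Rows with pythonlabel = -1 are dropped before training.
train = df[df["pythonlabel"] >= 0]
print(train)
```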

Definition of FPro

For each food item $k$, the trained classifiers output a probability vector $p^k = (p_1^k, p_2^k, p_3^k, p_4^k)$, where $p_i^k$ is the predicted probability that item $k$ belongs to NOVA class $i$. Formally, $\mathrm{FPro}$ is defined as the orthogonal projection of $p^k$ onto the line within the probability simplex that runs from the minimally processed vertex $a = (1,0,0,0)$ to the ultra-processed vertex $b = (0,0,0,1)$. Writing $d = b - a = (-1,0,0,1)$, the normalized position of the projection along this line is $(p^k - a)\cdot d / \lVert d \rVert^2$, with $\lVert d \rVert^2 = 2$. The score for item $k$ is therefore given by

$\mathrm{FPro}_k = \frac{1 - p_1^k + p_4^k}{2}.$

This formulation maps the minimally processed vertex to $\mathrm{FPro}=0$ and the ultra-processed vertex to $\mathrm{FPro}=1$.
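
The formula can be checked numerically against an explicit orthogonal projection. The sketch below is an illustration only; the function names `fpro_from_probabilities` and `fpro_by_projection` are hypothetical and do not appear in the repository.

```python
import numpy as np

def fpro_from_probabilities(p):
    """Closed-form FPro for one vector or an (n, 4) array of probabilities."""
    p = np.asarray(p, dtype=float)
    return (1.0 - p[..., 0] + p[..., 3]) / 2.0

def fpro_by_projection(p):
    """Normalized position of the projection of p onto the NOVA 1 -> NOVA 4 line."""
    p = np.asarray(p, dtype=float)
    a = np.array([1.0, 0.0, 0.0, 0.0])  # minimally processed vertex
    b = np.array([0.0, 0.0, 0.0, 1.0])  # ultra-processed vertex
    d = b - a
    return (p - a) @ d / (d @ d)

# Random points on the probability simplex.
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(4), size=5)

print(fpro_from_probabilities([1, 0, 0, 0]))  # minimally processed vertex
print(fpro_from_probabilities([0, 0, 0, 1]))  # ultra-processed vertex
print(np.allclose(fpro_from_probabilities(samples), fpro_by_projection(samples)))
```

Because the expression depends only on $p_1^k$ and $p_4^k$, a food assigned entirely to NOVA 2 or NOVA 3 lands at the midpoint of the axis.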

Model variants

The notebook trains four random-forest model variants:

Model	Feature set	Training-set definition
Model 1	58 nutrients	Full food profiles (unique food code, NOVA class, nutrient profile)
Model 2	58 nutrients	Unique NOVA–nutrient profile pairs
Model 3	57 NDSR-compatible nutrients	Full food profiles (unique food code, NOVA class, nutrient profile)
Model 4	57 NDSR-compatible nutrients	Unique NOVA–nutrient profile pairs

The 57-nutrient models are intended for studies that rely on NDSR-compatible nutrient profiles.
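
One plausible reading of the two training-set definitions is sketched below with pandas. This is an interpretation of the table wording, not the notebook's code: the column names `food_code`, `protein`, and `fat` and the toy values are assumptions.

```python
import pandas as pd

nutrient_cols = ["protein", "fat"]  # hypothetical nutrient panel
df = pd.DataFrame({
    "food_code": [101, 102, 103, 104],
    "novaclass": [1, 1, 4, 4],
    "protein": [2.0, 2.0, 5.0, 5.0],
    "fat": [0.5, 0.5, 9.0, 8.0],
})

# Full food profiles: one row per (food code, NOVA class, nutrient profile).
full_profiles = df.drop_duplicates(
    subset=["food_code", "novaclass"] + nutrient_cols
)

# Unique NOVA-nutrient pairs: foods that share a class and an identical
# nutrient profile collapse to a single training row.
unique_pairs = df.drop_duplicates(subset=["novaclass"] + nutrient_cols)

print(len(full_profiles), len(unique_pairs))
```

In this toy example, foods 101 and 102 have the same class and nutrient values, so the second definition keeps only one of them.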

Random-forest hyperparameters

All models use the same fixed random-forest configuration:

params_defined = {
    "n_estimators": 500,
    "max_features": "sqrt",
    "max_depth": 20
}

This choice reflects the objective of the model. The classifier is not used primarily as a discrete NOVA-label predictor; rather, it is used to generate class-probability vectors from which FPro is computed as a continuous projection score. Therefore, the goal is to obtain stable probability-derived scores, not to maximize discrete classification performance through extensive hyperparameter optimization.
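
A minimal sketch of how such a configuration might be passed to scikit-learn's `RandomForestClassifier` to obtain class probabilities is shown below. The synthetic data and the `random_state` argument are assumptions added for reproducibility of the example; they are not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

params_defined = {
    "n_estimators": 500,
    "max_features": "sqrt",
    "max_depth": 20,
}

# Synthetic four-class problem standing in for the nutrient data.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# random_state is an addition for this example only.
clf = RandomForestClassifier(**params_defined, random_state=0)
clf.fit(X, y)

# Each row is a four-class probability vector, ordered as in clf.classes_.
proba = clf.predict_proba(X[:5])
print(clf.classes_)
print(proba.round(3))
```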

We deliberately avoided exhaustive hyperparameter tuning for two reasons. First, the labeled reference set is relatively large compared with the number of nutrient features, and the random-forest model already achieves strong cross-validated discrimination across NOVA classes. Second, hyperparameter tuning would require withholding additional data or introducing a nested model-selection layer, whereas our priority was to expose the model to as much labeled information as possible to improve the stability and resolution of the FPro probability surface.

This decision is also supported by previous FoodProX analyses and by sensitivity checks in the present implementation, which showed that random-forest performance and FPro behavior were stable across a wide range of hyperparameter choices. In particular, allowing deeper trees did not materially change cross-validated AUC or average precision across the evaluated model variants. We therefore retained a single fixed configuration for all models to ensure comparability across nutrient panels and training-set definitions.

Installation

Create the conda environment:

conda env create -f environment.yml
conda activate food_pro_py311

Then launch Jupyter:

jupyter notebook

or:

jupyter lab

Running the notebook

From the repository root, open:

Score_Generation.ipynb

and run all cells.

The notebook creates the following output directories if they do not already exist:

Models_GitHub/
Metrics_GitHub/
Outputs_GitHub/

Metrics_GitHub/ and Outputs_GitHub/ are included in this repository to provide generated metrics and FPro outputs. Models_GitHub/ is generated locally when the notebook is run, but it is excluded from version control because trained model binaries are large and can be regenerated.

Programmatic scoring

The main scoring function is:

from scoring import classify_db

Example:

db_scored = classify_db(
    db=input_dataframe,
    model_per_fold=models,
    nut_sel=nutrient_columns
)

The function returns the input dataframe with additional columns including fold-level probabilities, averaged probabilities, FPro, std_FPro, min_FPro, max_FPro, and final class calls.
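
The fold-averaging idea behind these columns can be sketched as follows. This is not the implementation of classify_db: the helper `average_fold_probabilities`, the `class_call` column name, and the choice of a population standard deviation across folds are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def average_fold_probabilities(fold_probas):
    """Combine per-fold (n_items, 4) probability arrays into summary columns."""
    stacked = np.stack(fold_probas)   # shape (n_folds, n_items, 4)
    mean_p = stacked.mean(axis=0)     # fold-averaged class probabilities
    # FPro computed separately for each fold, shape (n_folds, n_items).
    fpro_folds = (1.0 - stacked[..., 0] + stacked[..., 3]) / 2.0

    out = pd.DataFrame(mean_p, columns=["p1", "p2", "p3", "p4"])
    out["FPro"] = (1.0 - out["p1"] + out["p4"]) / 2.0
    out["std_FPro"] = fpro_folds.std(axis=0)  # assumption: ddof=0 across folds
    out["min_FPro"] = fpro_folds.min(axis=0)
    out["max_FPro"] = fpro_folds.max(axis=0)
    # Hypothetical column: most probable NOVA class (1..4) after averaging.
    out["class_call"] = mean_p.argmax(axis=1) + 1
    return out

fold_1 = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
fold_2 = np.array([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])
summary = average_fold_probabilities([fold_1, fold_2])
print(summary)
```

Since FPro is linear in the probabilities, computing it from the averaged vector gives the same value as averaging the fold-level scores.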

To use classify_db directly, users must provide trained fold-specific models. These can be generated by running Score_Generation.ipynb. If pre-trained models are released separately, they should be placed in a local Models_GitHub/ directory before scoring.

Outputs

For each model variant, the workflow saves:

trained fold-specific random-forest models in the local, untracked Models_GitHub/ directory;
cross-validation AUC/AUP metrics;
ROC and precision-recall curve objects;
train/test split indices;
scored food database files with FPro values.

Large generated files

Large trained model files are not tracked in the repository. This keeps the repository lightweight and avoids storing generated binary artifacts in Git history. The models can be regenerated by running the notebook from the repository root.

The recommended .gitignore includes:

Models_GitHub/

__pycache__/
.ipynb_checkpoints/
.DS_Store
*.pyc

Notes

The code assumes that input nutrient values are aligned with the nutrient names used during training.
classify_db checks that all fold-specific models have the same class order.
FPro is computed from class probabilities, not from hard class labels.
The probability columns p1, p2, p3, and p4 correspond to NOVA classes 1, 2, 3, and 4.
Metrics_GitHub/ and Outputs_GitHub/ contain generated files from the workflow; they can be regenerated by rerunning the notebook.
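
A class-order check of the kind mentioned in the notes could look like the sketch below. It relies on the `classes_` attribute that fitted scikit-learn classifiers expose; the helper name `check_same_class_order` and the stand-in model objects are hypothetical and are not taken from scoring.py.

```python
from types import SimpleNamespace

import numpy as np

def check_same_class_order(models):
    """Return the shared class order, or raise if any fold model differs."""
    reference = np.asarray(models[0].classes_)
    for fold, model in enumerate(models[1:], start=1):
        if not np.array_equal(reference, np.asarray(model.classes_)):
            raise ValueError(
                f"Model for fold {fold} has class order {model.classes_}, "
                f"expected {reference}"
            )
    return reference

# Stand-ins for fitted classifiers; real models would come from training.
consistent = [SimpleNamespace(classes_=np.array([0, 1, 2, 3])) for _ in range(3)]
print(check_same_class_order(consistent))

mismatched = consistent + [SimpleNamespace(classes_=np.array([0, 1, 3, 2]))]
try:
    check_same_class_order(mismatched)
except ValueError as err:
    print("mismatch detected:", err)
```

Averaging probability columns across folds is only meaningful when column i refers to the same class in every model, which is why a check like this matters before computing p1 to p4.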

References

Menichetti G. et al. Machine learning prediction of the degree of food processing. Nature Communications, 2023.
https://www.nature.com/articles/s41467-023-37457-1
Ispirova G., Sebek M., Menichetti G. Informatics for Food Processing. arXiv, 2025.
https://arxiv.org/abs/2505.17087
