Rest of GPR Library#8
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a new Gaussian Process Regression (GPR) module to the SimulationSupport package, aiming to support Post-Newtonian initial-parameter generation/corrections plus an end-to-end GPR training/prediction/cross-validation workflow, along with unit tests and new runtime dependencies.
Changes:
- Added
SimulationSupport.gprwith GPytorch-based training/prediction utilities, PN helper functions, cross-validation, plotting, and checkpoint loading. - Added a large new unit test module covering normalization, training/prediction, parsing, correction application, LOO CV, plotting residuals, and checkpoint restore.
- Updated
pyproject.tomlto include ML/plotting dependencies.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/SimulationSupport/gpr.py |
New GPR/PN library implementation (training, prediction, CV, plotting, checkpoint restore). |
tests/test_gpr.py |
New unit tests for the GPR module and related utilities. |
pyproject.toml |
Adds new dependencies required by the new GPR functionality. |
Comments suppressed due to low confidence (7)
src/SimulationSupport/gpr.py:44
normalize_dataalways reshapesXandYwith.reshape(-1, 1), which breaks multidimensional inputs (N, D) and contradicts the docstring claim that multi-D X is passed through unchanged. HandleX.ndim == 1/Y.ndim == 1separately and only reshape in that case (otherwise fit/transform the 2D arrays directly).
# Reshape X and Y into 2D arrays before fitting
# as StandardScaler expects shape (N, 1) for 1D inputs
X_normalized = scaler_X.fit_transform(X.reshape(-1, 1))
Y_normalized = scaler_Y.fit_transform(Y.reshape(-1, 1))
src/SimulationSupport/gpr.py:680
predict_with_gpr_modeldoes not reshape 1Draw_Xto (N, 1) before converting to a tensor, whiletrain_gpr_modelreshapes 1D inputs during training. If callers pass a 1D array at prediction time, this can produce a shape mismatch in the GP model (kernels/LinearMean expect 2D inputs). Reshape 1D inputs consistently in both training and prediction paths.
# Normalize the input using the model's stored parameters
normalized_X = (raw_X - model.input_mean) / model.input_std
X_tensor = torch.from_numpy(normalized_X).float()
src/SimulationSupport/gpr.py:813
train_model_and_eigenvalue_analysisdefaultsgpr=Nonebut then unconditionally callsgpr.train_gpr_model,gpr.normalize_data, andgpr.predict_with_gpr_model, which will raise an AttributeError unless the caller always passes a module object. Either defaultgprto the current module (or remove the parameter) and call the local functions directly, or raise a clear error ifgpris not provided.
# Train GPR
model, likelihood = gpr.train_gpr_model(X, Y)
# Normalize values - needed for eigenvalue analysis
X_normalized, Y_normalized, scaler_X, scaler_Y = gpr.normalize_data(X, Y)
tests/test_gpr.py:223
setUpClasswrites to a hard-coded path under the repo (Path("SimulationSupport") / "gpr_tests"). Tests should avoid creating/cleaning directories in the working tree because failures can leave artifacts or clash with user checkouts. Prefer usingtempfile.TemporaryDirectory()(like other tests here) and keep all filesystem writes inside it.
cls.test_dir = Path("SimulationSupport") / "gpr_tests"
shutil.rmtree(cls.test_dir, ignore_errors=True)
cls.test_dir.mkdir(parents=True, exist_ok=True)
cls.X, cls.Y = cls.make_data(n=20)
src/SimulationSupport/gpr.py:625
- The training-loop comment states that “early stopping terminates training once the loss plateaus”, but the implementation always runs the full 200 iterations (it only tracks
best_state). Either implement actual early stopping (break when no improvement for N steps) or adjust the comment/docstring so it doesn’t claim behavior that isn’t present.
# Training loop for 200 iterations; 200 was chosen empirically for this dataset.
# In practice, early stopping terminates training once the loss
# plateaus, so the exact value isn't critical. If utilizing a larger
# dataset where more iterations may be needed, increase this value if the loss
# is still decreasing at iteration 200.
src/SimulationSupport/gpr.py:1076
save_gpr_correctedhas a malformed docstring opener (""" ") which leaves a stray quote at the start of the docstring. Clean this up to a standard triple-quoted docstring to avoid confusing generated docs/help output.
):
""" "
Write and save GPR corrected runs to a text file.
Each line contains RunID, ZwickyDays, q, chiA, chiB, D0, Omega0, adot0.
src/SimulationSupport/gpr.py:567
train_gpr_modelnormalizes byinput_std/output_stdcomputed from the training data without guarding against zeros. If a feature column (or the target) is constant, this will divide by zero and propagate NaNs through training/prediction. Add a small epsilon or use a safer scaling strategy (e.g., replace zeros in std with 1.0, similar toStandardScaler).
input_mean = raw_X.mean(axis=0)
input_std = raw_X.std(axis=0)
output_mean = raw_Y.mean()
output_std = raw_Y.std()
# Normalize data column-wise; needed for all dimension > 1
normalized_X = (raw_X - input_mean) / input_std
normalized_Y = (raw_Y - output_mean) / output_std
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "scipy", | ||
| "matplotlib", | ||
| "torch", | ||
| "gpytorch" |
00e1240 to
daafdf5
Compare
There was a problem hiding this comment.
Please remove these other eccentricity-specific functions. Also, we have quite a few diagnostic functions now that do validations and plots (eigenvalue analysis, LOO analysis). Let's put these in a separate file to keep the code manageable. I suggest you move gpr.py -> grp/core.py and put the diagnostic functions in a new file gpr/diagnostics.py. You can also add gpr/__init__.py with the code from .core import *. This way people can still just do from SimulationSupport.gpr import GPRegressionModel without having to import from core.
Complete GPR library with the remaining functions beyond standalone ML utilities pushed previously. Added the functions to run the full pipeline, plot kernel eigenvalue decay, leave-one-out cross-validation, and the functions to restore saved models from disk.