Updated TF attack#142
Conversation
📝 WalkthroughWalkthroughThis PR refines the Tartan Federer membership inference attack implementation by introducing label encoder reuse, improving dataset preprocessing, and refining noise dimension inference. The changes add optional support in Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py (1)
127-145:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAlign categorical dtype with checkpointed
LabelEncoderexpectations.
src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.pybuildscategorical_featureswithto_numpy()(no dtype enforcement), butsrc/midst_toolkit/models/clavaddpm/dataset.pystringifies categorical columns withto_numpy(dtype=np.str_)when constructing the training categorical arrays used forLabelEncoderfitting/saving. This dtype mismatch can causeLabelEncoder.transform()to fail for int/bool categories treated as unseen labels at attack time.🔧 Minimal fix
- categorical_features = {DataSplit.TRAIN.value: data[categorical_column_names].to_numpy()} + categorical_features = {DataSplit.TRAIN.value: data[categorical_column_names].to_numpy(dtype=np.str_)}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py` around lines 127 - 145, The categorical features are created with data[categorical_column_names].to_numpy() which can yield non-string dtypes and mismatch the stringified categories used when fitting/saving LabelEncoder in clavaddpm.dataset; change the construction of categorical_features (and the local all_categorical_features used before encoding) to explicitly convert to strings (e.g., to_numpy(dtype=np.str_) or .astype(str)) so label_encoders[column_index].transform receives the same dtype it was trained on; keep the rest of the loop (noise_scale handling and encoding steps) unchanged and use the existing symbols categorical_features, all_categorical_features, label_encoders, get_categorical_and_numerical_column_names, and DataSplit.TRAIN to locate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py`:
- Around line 468-475: The current loop sets _relation_order to [] for
non-"tabddpm" so noise_dimension is never assigned and later causes
UnboundLocalError; add an explicit guard that validates model_type before
attempting to probe the checkpoint (e.g., check model_type == "tabddpm" and
raise a clear exception like ValueError("unsupported model_type: ...")
otherwise) so you never open first_model_path/_parent_child ckpt for unsupported
types; update the logic around _relation_order, the checkpoint probe using
CustomUnpickler, and ensure get_score()’s error path remains the single source
of truth for unsupported models by raising the clearer error early.
In `@src/midst_toolkit/models/clavaddpm/dataset_utils.py`:
- Around line 103-122: When a label_encoders_path is supplied you must fail fast
instead of mixing preloaded and newly-fitted encoders: after loading
preloaded_encoders (from label_encoders_path) validate that
categorical_column_names is not None and that every name in
categorical_column_names exists as a key in preloaded_encoders; if any are
missing, raise a clear error (or return/raise ValueError) rather than falling
back to fitting per-column. Update the loop that currently checks
preloaded_encoders and conditionally fits (the block using preloaded_encoders,
label_encoder, encoded_labels and the fallback LabelEncoder()) to assume
encoders are present when label_encoders_path was provided and only fit new
encoders when no path was provided; include the check up front so you never mix
cached and freshly-fit encoders.
In `@src/midst_toolkit/models/clavaddpm/dataset.py`:
- Around line 380-390: The Dataset.from_df constructor currently probes the CWD
for attack-specific relative files using the local _le_path loop (import os as
_os and the for _parent ... if _os.path.exists ...), which must be removed;
instead add an explicit optional parameter (e.g., encoder_path=None) to
Dataset.from_df and use that value as the label-encoder path (leave None if not
provided) rather than auto-discovering whitebox_single_table_* files, remove the
os import and the _parent loop, and update callers to pass the encoder_path from
their context so encoder discovery is deterministic and not CWD-dependent.
---
Outside diff comments:
In `@src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py`:
- Around line 127-145: The categorical features are created with
data[categorical_column_names].to_numpy() which can yield non-string dtypes and
mismatch the stringified categories used when fitting/saving LabelEncoder in
clavaddpm.dataset; change the construction of categorical_features (and the
local all_categorical_features used before encoding) to explicitly convert to
strings (e.g., to_numpy(dtype=np.str_) or .astype(str)) so
label_encoders[column_index].transform receives the same dtype it was trained
on; keep the rest of the loop (noise_scale handling and encoding steps)
unchanged and use the existing symbols categorical_features,
all_categorical_features, label_encoders,
get_categorical_and_numerical_column_names, and DataSplit.TRAIN to locate the
change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bb57c420-9245-46a5-8738-a40f7dda137e
📒 Files selected for processing (4)
src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.pysrc/midst_toolkit/models/clavaddpm/dataset.pysrc/midst_toolkit/models/clavaddpm/dataset_utils.pytests/integration/attacks/tartan_federer/test_tartan_federer_attack.py
| _relation_order = [("None", "trans")] if model_type == "tabddpm" else [] | ||
| for _parent, _child in _relation_order: | ||
| _ckpt_path = first_model_path / f"{_parent}_{_child}_ckpt.pkl" | ||
| with open(_ckpt_path, "rb") as _f: | ||
| _probe_model = CustomUnpickler(_f).load() | ||
| noise_dimension = _probe_model.diffusion.num_numerical_features | ||
| log(INFO, f"Noise dimension read from diffusion model: {noise_dimension}") | ||
| break |
There was a problem hiding this comment.
Reject unsupported model_type before probing the checkpoint.
For anything other than tabddpm, _relation_order is empty and noise_dimension is never assigned, so Line 477 fails with UnboundLocalError instead of the clearer unsupported-model error used later in get_score().
🔧 Suggested guard
- _relation_order = [("None", "trans")] if model_type == "tabddpm" else []
+ if model_type != "tabddpm":
+ raise ValueError(f"Unsupported model_type for Tartan Federer attack training: {model_type}")
+ _relation_order = [("None", "trans")]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py` around
lines 468 - 475, The current loop sets _relation_order to [] for non-"tabddpm"
so noise_dimension is never assigned and later causes UnboundLocalError; add an
explicit guard that validates model_type before attempting to probe the
checkpoint (e.g., check model_type == "tabddpm" and raise a clear exception like
ValueError("unsupported model_type: ...") otherwise) so you never open
first_model_path/_parent_child ckpt for unsupported types; update the logic
around _relation_order, the checkpoint probe using CustomUnpickler, and ensure
get_score()’s error path remains the single source of truth for unsupported
models by raising the clearer error early.
| # Load pre-fitted label encoders from pkl if provided, otherwise fit on current data | ||
| preloaded_encoders: dict[str, LabelEncoder] | None = None | ||
| if label_encoders_path is not None: | ||
| _pkl_path = Path(label_encoders_path) | ||
| if _pkl_path.exists(): | ||
| with open(_pkl_path, "rb") as _f: | ||
| preloaded_encoders = pickle.load(_f) | ||
|
|
||
| categorical_data_encoded = [] | ||
| label_encoders = {} | ||
| for column in range(all_categorical_data.shape[1]): | ||
| label_encoder = LabelEncoder() | ||
| encoded_labels = label_encoder.fit_transform(all_categorical_data[:, column]).astype(float) | ||
| col_name = categorical_column_names[column] if categorical_column_names is not None else None | ||
| if preloaded_encoders is not None and col_name is not None and col_name in preloaded_encoders: | ||
| # Use pre-fitted encoder from full dataset (e.g. 101K rows) | ||
| label_encoder = preloaded_encoders[col_name] | ||
| encoded_labels = label_encoder.transform(all_categorical_data[:, column]).astype(float) | ||
| else: | ||
| # Fallback: fit on current data | ||
| label_encoder = LabelEncoder() | ||
| encoded_labels = label_encoder.fit_transform(all_categorical_data[:, column]).astype(float) |
There was a problem hiding this comment.
Don't silently mix cached and freshly fit encoders.
If label_encoders_path is stale or points at the wrong file, missing columns fall through to fit_transform() and you end up with a mixed encoder set that no longer matches the checkpoint you meant to reuse. Validate the full categorical_column_names set up front and fail fast instead of partially re-fitting.
🔧 Suggested guard
if label_encoders_path is not None:
_pkl_path = Path(label_encoders_path)
if _pkl_path.exists():
with open(_pkl_path, "rb") as _f:
preloaded_encoders = pickle.load(_f)
+ if categorical_column_names is not None:
+ missing = set(categorical_column_names) - set(preloaded_encoders)
+ if missing:
+ raise ValueError(
+ f"Missing label encoders for categorical columns: {sorted(missing)}"
+ )
@@
- else:
+ elif preloaded_encoders is None:
# Fallback: fit on current data
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(all_categorical_data[:, column]).astype(float)
+ else:
+ raise KeyError(f"No cached encoder found for categorical column: {col_name}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Load pre-fitted label encoders from pkl if provided, otherwise fit on current data | |
| preloaded_encoders: dict[str, LabelEncoder] | None = None | |
| if label_encoders_path is not None: | |
| _pkl_path = Path(label_encoders_path) | |
| if _pkl_path.exists(): | |
| with open(_pkl_path, "rb") as _f: | |
| preloaded_encoders = pickle.load(_f) | |
| categorical_data_encoded = [] | |
| label_encoders = {} | |
| for column in range(all_categorical_data.shape[1]): | |
| label_encoder = LabelEncoder() | |
| encoded_labels = label_encoder.fit_transform(all_categorical_data[:, column]).astype(float) | |
| col_name = categorical_column_names[column] if categorical_column_names is not None else None | |
| if preloaded_encoders is not None and col_name is not None and col_name in preloaded_encoders: | |
| # Use pre-fitted encoder from full dataset (e.g. 101K rows) | |
| label_encoder = preloaded_encoders[col_name] | |
| encoded_labels = label_encoder.transform(all_categorical_data[:, column]).astype(float) | |
| else: | |
| # Fallback: fit on current data | |
| label_encoder = LabelEncoder() | |
| encoded_labels = label_encoder.fit_transform(all_categorical_data[:, column]).astype(float) | |
| # Load pre-fitted label encoders from pkl if provided, otherwise fit on current data | |
| preloaded_encoders: dict[str, LabelEncoder] | None = None | |
| if label_encoders_path is not None: | |
| _pkl_path = Path(label_encoders_path) | |
| if _pkl_path.exists(): | |
| with open(_pkl_path, "rb") as _f: | |
| preloaded_encoders = pickle.load(_f) | |
| if categorical_column_names is not None: | |
| missing = set(categorical_column_names) - set(preloaded_encoders) | |
| if missing: | |
| raise ValueError( | |
| f"Missing label encoders for categorical columns: {sorted(missing)}" | |
| ) | |
| categorical_data_encoded = [] | |
| label_encoders = {} | |
| for column in range(all_categorical_data.shape[1]): | |
| col_name = categorical_column_names[column] if categorical_column_names is not None else None | |
| if preloaded_encoders is not None and col_name is not None and col_name in preloaded_encoders: | |
| # Use pre-fitted encoder from full dataset (e.g. 101K rows) | |
| label_encoder = preloaded_encoders[col_name] | |
| encoded_labels = label_encoder.transform(all_categorical_data[:, column]).astype(float) | |
| elif preloaded_encoders is None: | |
| # Fallback: fit on current data | |
| label_encoder = LabelEncoder() | |
| encoded_labels = label_encoder.fit_transform(all_categorical_data[:, column]).astype(float) | |
| else: | |
| raise KeyError(f"No cached encoder found for categorical column: {col_name}") |
🧰 Tools
🪛 OpenGrep (1.22.0)
[ERROR] 109-109: pickle.load/loads deserializes arbitrary Python objects and can execute arbitrary code. Use a safe format like JSON instead.
(coderabbit.deserialization.python-pickle)
🪛 Ruff (0.15.14)
[error] 109-109: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/midst_toolkit/models/clavaddpm/dataset_utils.py` around lines 103 - 122,
When a label_encoders_path is supplied you must fail fast instead of mixing
preloaded and newly-fitted encoders: after loading preloaded_encoders (from
label_encoders_path) validate that categorical_column_names is not None and that
every name in categorical_column_names exists as a key in preloaded_encoders; if
any are missing, raise a clear error (or return/raise ValueError) rather than
falling back to fitting per-column. Update the loop that currently checks
preloaded_encoders and conditionally fits (the block using preloaded_encoders,
label_encoder, encoded_labels and the fallback LabelEncoder()) to assume
encoders are present when label_encoders_path was provided and only fit new
encoders when no path was provided; include the check up front so you never mix
cached and freshly-fit encoders.
| # Look for pre-fitted label encoders in the parent directories of the data | ||
| import os as _os | ||
|
|
||
| _le_path = None | ||
| for _parent in [ | ||
| _os.path.join("whitebox_single_table_DI", "label_encoders.pkl"), | ||
| _os.path.join("whitebox_single_table_70", "label_encoders.pkl"), | ||
| ]: | ||
| if _os.path.exists(_parent): | ||
| _le_path = _parent | ||
| break |
There was a problem hiding this comment.
Remove the cwd-dependent encoder discovery from Dataset.from_df.
This generic constructor now changes behavior based on whether two attack-specific relative paths happen to exist in the current working directory. That makes preprocessing non-reproducible across launch locations and can silently load an unrelated encoder set. Please pass the encoder path in explicitly from the caller instead of probing whitebox_single_table_* here.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/midst_toolkit/models/clavaddpm/dataset.py` around lines 380 - 390, The
Dataset.from_df constructor currently probes the CWD for attack-specific
relative files using the local _le_path loop (import os as _os and the for
_parent ... if _os.path.exists ...), which must be removed; instead add an
explicit optional parameter (e.g., encoder_path=None) to Dataset.from_df and use
that value as the label-encoder path (leave None if not provided) rather than
auto-discovering whitebox_single_table_* files, remove the os import and the
_parent loop, and update callers to pass the encoder_path from their context so
encoder discovery is deterministic and not CWD-dependent.
PR Type
[Feature | Fix | Documentation | Other ]
Fix
Short Description
This pull request is doing three things:
Tests