Updated TF attack by bzamanlooy · Pull Request #142 · VectorInstitute/midst-toolkit

bzamanlooy · 2026-05-27T16:20:59Z

PR Type

[Feature | Fix | Documentation | Other ]
Fix

Short Description

This pull request is doing three things:

Modifying the Clavaddpm setup to allow for profited label encoders to be passed during training.
Modifying the TF attack to figure out the Gaussian noise dimension from the data.
Modifying the TF attack to not refit transformations when loading tables.

Tests

Updated there TF attack tests to reflect a more realistic setup

coderabbitai · 2026-05-27T16:28:21Z

📝 Walkthrough

Walkthrough

This PR refines the Tartan Federer membership inference attack implementation by introducing label encoder reuse, improving dataset preprocessing, and refining noise dimension inference. The changes add optional support in encode_and_merge_features to load pre-fitted label encoders from cached pickle files, update the attack's dataset preparation to apply pre-fitted numerical transforms directly and disable normalization in the final output, and replace static noise dimension calculation with dynamic probing of the diffusion model's numerical feature count at runtime. Integration tests are updated with new hyperparameters and expected performance metrics.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title 'Updated TF attack' is vague and generic, failing to convey what the actual changes accomplish.	Use a more descriptive title such as 'Allow pretrained label encoders and fix noise dimension in TF attack' to clearly communicate the main changes.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description check	✅ Passed	The description covers the main changes and includes the required PR Type and Short Description sections, though it lacks detail in some areas.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch diabetes-tf-attack

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py (1)
127-145: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Align categorical dtype with checkpointed LabelEncoder expectations.

src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py builds categorical_features with to_numpy() (no dtype enforcement), but src/midst_toolkit/models/clavaddpm/dataset.py stringifies categorical columns with to_numpy(dtype=np.str_) when constructing the training categorical arrays used for LabelEncoder fitting/saving. This dtype mismatch can cause LabelEncoder.transform() to fail for int/bool categories treated as unseen labels at attack time.
🔧 Minimal fix
-    categorical_features = {DataSplit.TRAIN.value: data[categorical_column_names].to_numpy()}
+    categorical_features = {DataSplit.TRAIN.value: data[categorical_column_names].to_numpy(dtype=np.str_)}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py` around
lines 127 - 145, The categorical features are created with
data[categorical_column_names].to_numpy() which can yield non-string dtypes and
mismatch the stringified categories used when fitting/saving LabelEncoder in
clavaddpm.dataset; change the construction of categorical_features (and the
local all_categorical_features used before encoding) to explicitly convert to
strings (e.g., to_numpy(dtype=np.str_) or .astype(str)) so
label_encoders[column_index].transform receives the same dtype it was trained
on; keep the rest of the loop (noise_scale handling and encoding steps)
unchanged and use the existing symbols categorical_features,
all_categorical_features, label_encoders,
get_categorical_and_numerical_column_names, and DataSplit.TRAIN to locate the
change.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py`:
- Around line 468-475: The current loop sets _relation_order to [] for
non-"tabddpm" so noise_dimension is never assigned and later causes
UnboundLocalError; add an explicit guard that validates model_type before
attempting to probe the checkpoint (e.g., check model_type == "tabddpm" and
raise a clear exception like ValueError("unsupported model_type: ...")
otherwise) so you never open first_model_path/_parent_child ckpt for unsupported
types; update the logic around _relation_order, the checkpoint probe using
CustomUnpickler, and ensure get_score()’s error path remains the single source
of truth for unsupported models by raising the clearer error early.

In `@src/midst_toolkit/models/clavaddpm/dataset_utils.py`:
- Around line 103-122: When a label_encoders_path is supplied you must fail fast
instead of mixing preloaded and newly-fitted encoders: after loading
preloaded_encoders (from label_encoders_path) validate that
categorical_column_names is not None and that every name in
categorical_column_names exists as a key in preloaded_encoders; if any are
missing, raise a clear error (or return/raise ValueError) rather than falling
back to fitting per-column. Update the loop that currently checks
preloaded_encoders and conditionally fits (the block using preloaded_encoders,
label_encoder, encoded_labels and the fallback LabelEncoder()) to assume
encoders are present when label_encoders_path was provided and only fit new
encoders when no path was provided; include the check up front so you never mix
cached and freshly-fit encoders.

In `@src/midst_toolkit/models/clavaddpm/dataset.py`:
- Around line 380-390: The Dataset.from_df constructor currently probes the CWD
for attack-specific relative files using the local _le_path loop (import os as
_os and the for _parent ... if _os.path.exists ...), which must be removed;
instead add an explicit optional parameter (e.g., encoder_path=None) to
Dataset.from_df and use that value as the label-encoder path (leave None if not
provided) rather than auto-discovering whitebox_single_table_* files, remove the
os import and the _parent loop, and update callers to pass the encoder_path from
their context so encoder discovery is deterministic and not CWD-dependent.

---

Outside diff comments:
In `@src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py`:
- Around line 127-145: The categorical features are created with
data[categorical_column_names].to_numpy() which can yield non-string dtypes and
mismatch the stringified categories used when fitting/saving LabelEncoder in
clavaddpm.dataset; change the construction of categorical_features (and the
local all_categorical_features used before encoding) to explicitly convert to
strings (e.g., to_numpy(dtype=np.str_) or .astype(str)) so
label_encoders[column_index].transform receives the same dtype it was trained
on; keep the rest of the loop (noise_scale handling and encoding steps)
unchanged and use the existing symbols categorical_features,
all_categorical_features, label_encoders,
get_categorical_and_numerical_column_names, and DataSplit.TRAIN to locate the
change.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bb57c420-9245-46a5-8738-a40f7dda137e

📥 Commits

Reviewing files that changed from the base of the PR and between f56d16e and 8b9415a.

📒 Files selected for processing (4)

src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py
src/midst_toolkit/models/clavaddpm/dataset.py
src/midst_toolkit/models/clavaddpm/dataset_utils.py
tests/integration/attacks/tartan_federer/test_tartan_federer_attack.py

coderabbitai · 2026-05-27T16:28:27Z

+    _relation_order = [("None", "trans")] if model_type == "tabddpm" else []
+    for _parent, _child in _relation_order:
+        _ckpt_path = first_model_path / f"{_parent}_{_child}_ckpt.pkl"
+        with open(_ckpt_path, "rb") as _f:
+            _probe_model = CustomUnpickler(_f).load()
+        noise_dimension = _probe_model.diffusion.num_numerical_features
+        log(INFO, f"Noise dimension read from diffusion model: {noise_dimension}")
+        break


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject unsupported model_type before probing the checkpoint.

For anything other than tabddpm, _relation_order is empty and noise_dimension is never assigned, so Line 477 fails with UnboundLocalError instead of the clearer unsupported-model error used later in get_score().

🔧 Suggested guard

- _relation_order = [("None", "trans")] if model_type == "tabddpm" else [] + if model_type != "tabddpm": + raise ValueError(f"Unsupported model_type for Tartan Federer attack training: {model_type}") + _relation_order = [("None", "trans")]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/midst_toolkit/attacks/tartan_federer/tartan_federer_attack.py` around lines 468 - 475, The current loop sets _relation_order to [] for non-"tabddpm" so noise_dimension is never assigned and later causes UnboundLocalError; add an explicit guard that validates model_type before attempting to probe the checkpoint (e.g., check model_type == "tabddpm" and raise a clear exception like ValueError("unsupported model_type: ...") otherwise) so you never open first_model_path/_parent_child ckpt for unsupported types; update the logic around _relation_order, the checkpoint probe using CustomUnpickler, and ensure get_score()’s error path remains the single source of truth for unsupported models by raising the clearer error early.

coderabbitai · 2026-05-27T16:28:27Z

+    # Load pre-fitted label encoders from pkl if provided, otherwise fit on current data
+    preloaded_encoders: dict[str, LabelEncoder] | None = None
+    if label_encoders_path is not None:
+        _pkl_path = Path(label_encoders_path)
+        if _pkl_path.exists():
+            with open(_pkl_path, "rb") as _f:
+                preloaded_encoders = pickle.load(_f)
+
    categorical_data_encoded = []
    label_encoders = {}
    for column in range(all_categorical_data.shape[1]):
-        label_encoder = LabelEncoder()
-        encoded_labels = label_encoder.fit_transform(all_categorical_data[:, column]).astype(float)
+        col_name = categorical_column_names[column] if categorical_column_names is not None else None
+        if preloaded_encoders is not None and col_name is not None and col_name in preloaded_encoders:
+            # Use pre-fitted encoder from full dataset (e.g. 101K rows)
+            label_encoder = preloaded_encoders[col_name]
+            encoded_labels = label_encoder.transform(all_categorical_data[:, column]).astype(float)
+        else:
+            # Fallback: fit on current data
+            label_encoder = LabelEncoder()
+            encoded_labels = label_encoder.fit_transform(all_categorical_data[:, column]).astype(float)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't silently mix cached and freshly fit encoders.

If label_encoders_path is stale or points at the wrong file, missing columns fall through to fit_transform() and you end up with a mixed encoder set that no longer matches the checkpoint you meant to reuse. Validate the full categorical_column_names set up front and fail fast instead of partially re-fitting.

🔧 Suggested guard

if label_encoders_path is not None: _pkl_path = Path(label_encoders_path) if _pkl_path.exists(): with open(_pkl_path, "rb") as _f: preloaded_encoders = pickle.load(_f) + if categorical_column_names is not None: + missing = set(categorical_column_names) - set(preloaded_encoders) + if missing: + raise ValueError( + f"Missing label encoders for categorical columns: {sorted(missing)}" + ) @@ - else: + elif preloaded_encoders is None: # Fallback: fit on current data label_encoder = LabelEncoder() encoded_labels = label_encoder.fit_transform(all_categorical_data[:, column]).astype(float) + else: + raise KeyError(f"No cached encoder found for categorical column: {col_name}")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Load pre-fitted label encoders from pkl if provided, otherwise fit on current data

preloaded_encoders: dict[str, LabelEncoder] | None = None

if label_encoders_path is not None:

_pkl_path = Path(label_encoders_path)

if _pkl_path.exists():

with open(_pkl_path, "rb") as _f:

preloaded_encoders = pickle.load(_f)

categorical_data_encoded = []

label_encoders = {}

for column in range(all_categorical_data.shape[1]):

label_encoder = LabelEncoder()

encoded_labels = label_encoder.fit_transform(all_categorical_data[:, column]).astype(float)

col_name = categorical_column_names[column] if categorical_column_names is not None else None

if preloaded_encoders is not None and col_name is not None and col_name in preloaded_encoders:

# Use pre-fitted encoder from full dataset (e.g. 101K rows)

label_encoder = preloaded_encoders[col_name]

encoded_labels = label_encoder.transform(all_categorical_data[:, column]).astype(float)

else:

# Fallback: fit on current data

label_encoder = LabelEncoder()

encoded_labels = label_encoder.fit_transform(all_categorical_data[:, column]).astype(float)

# Load pre-fitted label encoders from pkl if provided, otherwise fit on current data

preloaded_encoders: dict[str, LabelEncoder] | None = None

if label_encoders_path is not None:

_pkl_path = Path(label_encoders_path)

if _pkl_path.exists():

with open(_pkl_path, "rb") as _f:

preloaded_encoders = pickle.load(_f)

if categorical_column_names is not None:

missing = set(categorical_column_names) - set(preloaded_encoders)

if missing:

raise ValueError(

f"Missing label encoders for categorical columns: {sorted(missing)}"

)

categorical_data_encoded = []

label_encoders = {}

for column in range(all_categorical_data.shape[1]):

col_name = categorical_column_names[column] if categorical_column_names is not None else None

if preloaded_encoders is not None and col_name is not None and col_name in preloaded_encoders:

# Use pre-fitted encoder from full dataset (e.g. 101K rows)

label_encoder = preloaded_encoders[col_name]

encoded_labels = label_encoder.transform(all_categorical_data[:, column]).astype(float)

elif preloaded_encoders is None:

# Fallback: fit on current data

label_encoder = LabelEncoder()

encoded_labels = label_encoder.fit_transform(all_categorical_data[:, column]).astype(float)

else:

raise KeyError(f"No cached encoder found for categorical column: {col_name}")

🧰 Tools

🪛 OpenGrep (1.22.0)

[ERROR] 109-109: pickle.load/loads deserializes arbitrary Python objects and can execute arbitrary code. Use a safe format like JSON instead.

(coderabbit.deserialization.python-pickle)

🪛 Ruff (0.15.14)

[error] 109-109: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/midst_toolkit/models/clavaddpm/dataset_utils.py` around lines 103 - 122, When a label_encoders_path is supplied you must fail fast instead of mixing preloaded and newly-fitted encoders: after loading preloaded_encoders (from label_encoders_path) validate that categorical_column_names is not None and that every name in categorical_column_names exists as a key in preloaded_encoders; if any are missing, raise a clear error (or return/raise ValueError) rather than falling back to fitting per-column. Update the loop that currently checks preloaded_encoders and conditionally fits (the block using preloaded_encoders, label_encoder, encoded_labels and the fallback LabelEncoder()) to assume encoders are present when label_encoders_path was provided and only fit new encoders when no path was provided; include the check up front so you never mix cached and freshly-fit encoders.

coderabbitai · 2026-05-27T16:28:28Z

+        # Look for pre-fitted label encoders in the parent directories of the data
+        import os as _os
+
+        _le_path = None
+        for _parent in [
+            _os.path.join("whitebox_single_table_DI", "label_encoders.pkl"),
+            _os.path.join("whitebox_single_table_70", "label_encoders.pkl"),
+        ]:
+            if _os.path.exists(_parent):
+                _le_path = _parent
+                break


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Remove the cwd-dependent encoder discovery from Dataset.from_df.

This generic constructor now changes behavior based on whether two attack-specific relative paths happen to exist in the current working directory. That makes preprocessing non-reproducible across launch locations and can silently load an unrelated encoder set. Please pass the encoder path in explicitly from the caller instead of probing whitebox_single_table_* here.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/midst_toolkit/models/clavaddpm/dataset.py` around lines 380 - 390, The Dataset.from_df constructor currently probes the CWD for attack-specific relative files using the local _le_path loop (import os as _os and the for _parent ... if _os.path.exists ...), which must be removed; instead add an explicit optional parameter (e.g., encoder_path=None) to Dataset.from_df and use that value as the label-encoder path (leave None if not provided) rather than auto-discovering whitebox_single_table_* files, remove the os import and the _parent loop, and update callers to pass the encoder_path from their context so encoder discovery is deterministic and not CWD-dependent.

bzamanlooy added 3 commits May 26, 2026 18:06

diabetes adapeted TF attack

dda5dc8

Adapt Tartan Federer attack for diabetes

a5b00ed

Updated test

8b9415a

bzamanlooy requested review from emersodb, fatemetkl and lotif and removed request for lotif May 27, 2026 16:21

coderabbitai Bot reviewed May 27, 2026

View reviewed changes

bzamanlooy added 4 commits May 27, 2026 12:30

cleaning up and ruff check comment

616c96d

ruff

0b698b1

changed atack numbers with a cpu run to make more stable

200eb88

addressed coderabbit comments

c01f3a9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updated TF attack#142

Updated TF attack#142
bzamanlooy wants to merge 7 commits into
mainfrom
diabetes-tf-attack

bzamanlooy commented May 27, 2026

Uh oh!

coderabbitai Bot commented May 27, 2026

Walkthrough

Estimated code review effort

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 27, 2026

Uh oh!

coderabbitai Bot May 27, 2026

Uh oh!

coderabbitai Bot May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bzamanlooy commented May 27, 2026

PR Type

Short Description

Tests

Uh oh!

coderabbitai Bot commented May 27, 2026

Walkthrough

Estimated code review effort

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant