[Q&A] PyTorch backend compression of converted TF `se_e2_a` models causes unstable LAMMPS MD #5438

Dm0216 · 2026-05-08T06:40:15Z

Question

Title: PyTorch backend compression of converted TF se_e2_a models causes unstable LAMMPS MD

System Information:

OS: Kubuntu / Linux
Hardware: NVIDIA RTX A4000
DeePMD-kit Version: v3.1.4.dev69+g0a481dede (PyTorch backend)
PyTorch Version: v2.10.0+cu130
LAMMPS Version: 22 Jul 2025 (Update 2)
Descriptor Type: se_e2_a (Model: 5 elements Pb I C H N)

Description:
Compressing a legacy TensorFlow se_e2_a potential that has been converted to the PyTorch backend consistently produces a model that is unstable in LAMMPS: minimization stalls with large residual forces, and MD then fails almost immediately with a "Lost atoms" error. The uncompressed .pth models run without problems and reproduce the energies and forces of the original TF model.

Steps to Reproduce:

Approach 1: Direct Conversion and Compression

dp convert-backend model5.pb model5.pth
dp --pt compress -i model5.pth -o model5_compressed.pth

Result: MD test runs are completely stable for the uncompressed model5.pth. MD test runs immediately explode for model5_compressed.pth.

Approach 2: Conversion + 0-Step Initialization + Compression
To ensure the compression tables had the correct spatial boundaries (d_low), a 0-step initialization was performed using the full dataset.

dp --pt train convert-pt.json --init-frz-model model5.pb
dp --pt freeze -o model5.1.pth
dp --pt compress -i model5.1.pth -o model5.1_compressed.pth

Result: MD test runs are completely stable for the uncompressed model5.1.pth. MD test runs immediately explode for model5.1_compressed.pth.

LAMMPS Failure Output (Compressed Models):

Minimization stats:
  Stopping criterion = linesearch alpha is zero
  Energy initial, next-to-last, final = 
     -1362.52296661576  -1362.53777500403  -1362.53777500403
  Force max component initial, final = 75.44272 75.401807
...
ERROR: Lost atoms: original 348 current 347 (src/thermo.cpp:526)

Workaround Limitation:
Attempting to force mathematically smooth splines to bypass the CG minimizer failure by increasing the grid resolution (-s 0.001 -e 10) generates a 7.6 GB .pth file. Loading this file into LAMMPS immediately triggers a C++ libtorch deserialization crash:
ERROR on proc 0: DeePMD-kit C API Error: PK (/home/dm/deepmd-kit-v3.1.4/source/lmp/pair_deepmd.cpp:572)

wanghan-iapcm · 2026-05-17T07:52:20Z

Maintainer

Hi @Dm0216, thanks for the very detailed report. I tried to reproduce on a fresh fp64 se_e2_a model (TF type_one_side=False, 2 types, 100-step training on the example water dataset) — dp convert-backend → dp --pt compress works cleanly there, with compressed vs uncompressed forces matching to ~3e-14 across configurations whose minimum pair distance is 0.9 Å. So the pt compression code is not categorically broken; the failure looks specific to either your 5-type weight set or to a runtime configuration LAMMPS exercises that pure-Python eval does not.

Suggested quick workaround first: please verify that your model is type_one_side=True, and if it isn't, re-train (or fine-tune) with type_one_side=True. Convert-backend and compression both work much more reliably on the type_one_side=True path — there is one embedding network per neighbor type instead of one per (center, neighbor) pair, so the tabulation builds fewer, larger-coverage tables that are less sensitive to per-pair boundary effects. Many TF legacy models default to type_one_side=False; flipping it for the new train usually has negligible accuracy impact for energy/force prediction. This may be the cleanest fix in your situation.

If you'd like to understand the root cause regardless, the most likely structural cause for the unstable MD is runtime ss exceeding the table's trained upper bound:

During compression, the table upper bound is computed from the model's stored min_nbor_dist (set at training time):
s_upper = ((1/min_nbor_dist) * sw - davg_s) / dstd_s
At runtime, ss = (1/r * sw - davg_s) / dstd_s. If LAMMPS hands the model an atom pair with r < min_nbor_dist (initial config, bad starting structure, or a thermal spike at the very first step), then ss > s_upper and the lookup falls into the extrapolation region.
The polynomial spline in the extrapolation region is fit to embedding-network output at xx values the network never saw during training, so both the spline values and their derivatives are essentially unconstrained — forces of arbitrary magnitude are possible. The 75 eV/Å your LAMMPS log shows is consistent with this failure mode but doesn't by itself prove it; the Python diagnostic below distinguishes.
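To make the bound argument concrete, here is a minimal sketch of the two formulas above. The quintic switch function is an assumption (the form commonly documented for DeePMD smooth descriptors), and every numeric value (rcut_smth, rcut, davg_s, dstd_s, min_nbor_dist) is hypothetical and not taken from the reported model. It only illustrates that any r below min_nbor_dist gives ss above s_upper.

```python
def spline5_switch(r, rcut_smth, rcut):
    """Quintic switch (assumed form): 1 below rcut_smth, 0 beyond rcut."""
    if r < rcut_smth:
        return 1.0
    if r < rcut:
        uu = (r - rcut_smth) / (rcut - rcut_smth)
        return uu**3 * (-6.0 * uu**2 + 15.0 * uu - 10.0) + 1.0
    return 0.0


def s_value(r, rcut_smth, rcut, davg_s, dstd_s):
    """Normalized radial component: (1/r * sw - davg_s) / dstd_s."""
    return (spline5_switch(r, rcut_smth, rcut) / r - davg_s) / dstd_s


# Hypothetical values, for illustration only.
RCUT_SMTH, RCUT = 0.5, 6.0
DAVG_S, DSTD_S = 0.1, 0.2
MIN_NBOR_DIST = 0.9

# Table upper bound: the same expression evaluated at min_nbor_dist.
s_upper = s_value(MIN_NBOR_DIST, RCUT_SMTH, RCUT, DAVG_S, DSTD_S)

for r in (1.0, 0.9, 0.8):
    ss = s_value(r, RCUT_SMTH, RCUT, DAVG_S, DSTD_S)
    status = "extrapolation" if ss > s_upper else "inside table"
    print(f"r = {r:.2f}  ss = {ss:.4f}  s_upper = {s_upper:.4f}  -> {status}")
```

Because sw/r decreases monotonically with r, the comparison ss > s_upper is equivalent to r < min_nbor_dist, which is why the diagnostic only needs the minimum pair distance.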

The 7.6 GB file from -s 0.001 -e 10 failing to load is an unrelated LAMMPS-side libtorch deserialization issue (file too large for the legacy load path); the bound issue is what's causing the unstable MD with the default-stride compressed model.

Quick diagnostic to confirm and localize: please run this before LAMMPS, in pure Python, against the exact starting structure LAMMPS reads. Important: if your LAMMPS run uses periodic boundary conditions (boundary p p p, the LAMMPS default), pass the cell to DeepPot.eval and compute the minimum distance with PBC accounted for; otherwise omit the box and use plain distances. Could you also confirm whether the run is PBC or non-periodic?

import numpy as np
from ase import Atoms
from deepmd.infer import DeepPot

# coord (natoms, 3), atype (natoms,), box (3, 3) from your LAMMPS data file
# Set USE_PBC to match your LAMMPS `boundary` setting (True for `p p p`).
USE_PBC = True

dp_un = DeepPot("model5.1.pth")
dp_co = DeepPot("model5.1_compressed.pth")

box_arg = box.reshape(1, 9) if USE_PBC else None
e_un, f_un, _ = dp_un.eval(coord.reshape(1, -1), box_arg, atype)
e_co, f_co, _ = dp_co.eval(coord.reshape(1, -1), box_arg, atype)

print("uncomp |F| max:", float(np.max(np.abs(f_un))))
print("comp   |F| max:", float(np.max(np.abs(f_co))))
print("max force diff:", float(np.max(np.abs(f_un - f_co))))

# PBC-aware minimum-image pair distance (ase handles the cell properly)
atoms = Atoms(
    positions=coord.reshape(-1, 3),
    cell=box.reshape(3, 3) if USE_PBC else None,
    pbc=USE_PBC,
)
d = atoms.get_all_distances(mic=USE_PBC)
np.fill_diagonal(d, np.inf)  # mask self-distances
print("min pair distance (mic):", float(d.min()))

# And the min_nbor_dist baked into the .pth itself
import torch
m = torch.jit.load("model5.1.pth", map_location="cpu")
print("model min_nbor_dist:    ", float(m.min_nbor_dist))
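The diagnostic above assumes coord, atype, and box are already in memory. One possible way to get them from the data file is sketched below. The helper parse_lammps_data is hypothetical, not part of DeePMD-kit or LAMMPS. It assumes atom_style atomic (columns "id type x y z"), an orthogonal box, and that LAMMPS type i corresponds to model type index i-1, so please check that mapping against your pair_coeff line and type_map.

```python
import numpy as np


def parse_lammps_data(text):
    """Return (coord, atype, box) from the text of a LAMMPS data file.

    Assumes atom_style atomic and an orthogonal box (no tilt factors).
    """
    # Drop trailing comments and skip the free-form title on the first line.
    lines = [ln.split("#")[0].strip() for ln in text.splitlines()][1:]
    bounds = {}
    natoms = None
    atoms_start = None
    for i, ln in enumerate(lines):
        tok = ln.split()
        if len(tok) == 2 and tok[1] == "atoms":
            natoms = int(tok[0])
        elif len(tok) == 4 and tok[2] in ("xlo", "ylo", "zlo"):
            bounds[tok[2][0]] = (float(tok[0]), float(tok[1]))
        elif tok == ["Atoms"]:
            atoms_start = i + 1
            break
    if natoms is None or atoms_start is None or len(bounds) != 3:
        raise ValueError("missing atom count, box bounds, or Atoms section")

    # The first natoms non-empty lines after the "Atoms" header are atom rows.
    rows = [ln.split() for ln in lines[atoms_start:] if ln][:natoms]
    rows.sort(key=lambda r: int(r[0]))  # order by atom id
    # Assumed mapping: LAMMPS type i -> model type index i - 1.
    atype = np.array([int(r[1]) - 1 for r in rows])
    coord = np.array([[float(v) for v in r[2:5]] for r in rows])
    box = np.diag([bounds[a][1] - bounds[a][0] for a in "xyz"])
    return coord, atype, box


# Usage (the file name is a placeholder):
# with open("your_structure.data") as fh:
#     coord, atype, box = parse_lammps_data(fh.read())
```

A triclinic box or a different atom_style (for example full or charge) needs a different column layout, so this parser would have to be adapted in those cases.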

Three outcomes tell us where to look:

comp |F| max matches the 75 eV/Å number and is much larger than uncomp |F| max → the bug is reproducible from Python eval on this single frame. If min pair distance (mic) is below model min_nbor_dist, the over-extrapolation hypothesis above is the cause. Workarounds: pre-relax with the uncompressed model, or freeze the model with a smaller min_nbor_dist (use dp neighbor-stat on a wider training set or set it manually before compress).
Diffs are O(1e-3) or larger but forces are reasonable → genuine compression accuracy regression for your specific weights. In that case it would be very useful if you could attach model5.1.pth and the read_data data file so we can reproduce locally; the compressed model is fully self-contained and reproduces deterministically.
Diffs are O(1e-10) or smaller → the bug is LAMMPS-side (pair_deepmd.cpp interpretation of compressed PT models), not in the compression itself. We'd then investigate the C++ dispatch path.
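The three outcomes can be turned into a small decision helper. This is only a sketch: classify_diagnostic is a hypothetical name, the 1e-3 and 1e-10 thresholds come from the list above, and blowup_ratio (how much larger the compressed forces must be to count as "much larger") is an arbitrary illustrative choice. Differences between the two thresholds are reported as inconclusive because the list above does not cover them.

```python
import numpy as np


def classify_diagnostic(f_un, f_co, d_min, min_nbor_dist,
                        large_diff=1e-3, small_diff=1e-10, blowup_ratio=10.0):
    """Map the diagnostic numbers onto the outcomes described above."""
    f_un = np.asarray(f_un, dtype=float)
    f_co = np.asarray(f_co, dtype=float)
    max_diff = float(np.max(np.abs(f_un - f_co)))
    un_max = float(np.max(np.abs(f_un)))
    co_max = float(np.max(np.abs(f_co)))

    if max_diff <= small_diff:
        # Compressed and uncompressed agree in Python: look at the LAMMPS side.
        return "lammps-side"
    if max_diff >= large_diff:
        if co_max > blowup_ratio * un_max:
            # Blow-up reproduced in Python; check the table bound.
            if d_min < min_nbor_dist:
                return "over-extrapolation"
            return "blow-up, cause unclear"
        # Forces are reasonable but differ noticeably.
        return "compression accuracy regression"
    return "inconclusive"
```

With the arrays from the diagnostic, a call would look like classify_diagnostic(f_un, f_co, d.min(), float(m.min_nbor_dist)).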

For reference, the internal indexing/bound calculations I traced in pt-side compression appear correct (table net-name ↔ embedding_net_nodes mapping, runtime (center, neighbor) ↔ embedding_idx mapping, and _get_env_mat_range extracting the s-component bound after the min/max collapse). The cosmetic over-extension of upper to the sx/sy/sz axis bounds (visible in the compression log as a 13.13 upper instead of the s-component's ~7.7) does not actually corrupt runtime values, because the table polynomials at queried indices are still computed from in-distribution vv/dd/d2.

Dm0216 May 18, 2026
Author

Hi. Thank you so much for the response. My model was set to "type_one_side": false, and I also think that this looks specific to my 5-type weight set. Either way, I'm retraining the model in PyTorch now.
