Skip to content

[tensilelite] Fix --use-cache SIGSEGV from cached library-path drift#7735

Open
Alex-Vasile wants to merge 4 commits into
developfrom
users/alvasile/fix/use-cache-library-path-sigsegv
Open

[tensilelite] Fix --use-cache SIGSEGV from cached library-path drift#7735
Alex-Vasile wants to merge 4 commits into
developfrom
users/alvasile/fix/use-cache-library-path-sigsegv

Conversation

@Alex-Vasile
Copy link
Copy Markdown
Contributor

Motivation

--use-cache SIGSEGVs on single-arch builds (gfx1201). The crash is in LLVMLoadLibraryFilegetFileAsStream returns an error, the code derefs the null unique_ptr anyway.

Root cause is Python-side. _benchmarkProblemType recomputed the library-file path via libraryDir() and assumed it matched what writeBenchmarkFiles had written. They diverged: single-arch builds use per-arch subdirs, the cached read assumed flat library/. The .ini pointed at a missing file.

Technical Details

Four commits:

  1. fix(loader) — null-check ErrorOr in LLVMLoadLibraryFile. Defense-in-depth: turns any future wrong-path bug into a one-line error instead of a crash.

  2. fix(cache) — principal fix. writeBenchmarkFiles returns the relative library-file path it wrote, the build persists it in cache.yaml, --use-cache reads it back. No recomputation, no drift. Legacy cache.yaml files hit the existing KeyError handler and trigger a clean recompile.

  3. fix(TensileClientConfig) — two tangential bugs found while testing. Stale imports (Tensile.Common no longer re-exports those names) and a positional None for libraryFile that emitted library-file= into the .ini.

  4. refactor(ClientWriter)libraryFile is now required on both writeClientConfigIni and the writeClientConfig wrapper (kw-only). The default was the failure mode behind this bug; closing it at both frames stops a future caller from re-introducing it. Updates the remaining call site and adds signature + static call-site tests.

(2) is the load-bearing change; (1)(3)(4) stand on their own.

Test Plan

  • Unit tests cover the cache round-trip, the required-parameter invariants, and that every writeClientConfigIni call site passes libraryFile=.
  • gfx1201 build with --use-cache against a populated cache: no SIGSEGV, same results as a from-scratch run.

Test Result

Submission Checklist

…ryFile

When llvm::MemoryBuffer::getFileAsStream fails (e.g. library-file in
ClientParameters.ini points at a missing path), the returned ErrorOr
wraps a null unique_ptr. The previous code dereferenced it
unconditionally, producing a SIGSEGV with no diagnostic.

This is defense-in-depth on top of the cache.yaml LibraryFile fix: the
principled fix makes the wrong-path scenario unreachable from the Python
side, this fix makes any future wrong-path bug a clean error with the
filename in it, not a crash.
Root cause of the gfx1201 SIGSEGV: the --use-cache path reconstructed
the library file location via libraryDir(sourcePath, cmdLineArchs) and
assumed it matched whatever writeBenchmarkFiles had actually written.
When the layouts diverged (single-arch builds use per-arch subdirs via
libraryDir(); the cached read assumed the flat library/ layout), the
.ini pointed at a missing file and the C++ client crashed loading it.

Fix: writeBenchmarkFiles now returns the relative path it wrote, the
build phase persists it as LibraryFile in cache.yaml alongside
CodeObjectFiles, and the --use-cache path reads it back instead of
recomputing. No reconstruction = no chance of cached read diverging
from what the build wrote.

_readCacheIfValid now returns a dict {CodeObjectFiles, LibraryFile}
rather than a bare list. Legacy cache.yaml files written before
LibraryFile was persisted are caught by the existing KeyError handler,
treated as invalid, and trigger a clean recompile (one-time cost, no
harm). The defensive "cachedLibraryFile is None" branch the earlier
draft carried at the call site is unreachable through this path and
omitted — if the invariant ever breaks via hand-edited cache.yaml,
os.path.join raises TypeError at the bug site with a clean trace.

The on-disk existence check at the call site stays — it guards a real,
user-reproducible scenario (cache dir partially deleted between runs)
and produces a friendlier message than an LLVM open failure.
…ry-file

Two tangential bugs in the standalone bin/TensileClientConfig entry
point, both surfaced while testing the cache.yaml fix:

1. Imports of globalParameters, assignGlobalParameters, restoreDefault-
   GlobalParameters, and __version__ from Tensile.Common were stale —
   those names now live in Tensile.Common.GlobalParameters (and
   __version__ in Tensile/__init__.py). The module wouldn't even
   import. Updated to the canonical paths used by every other module
   in the package.

2. The call to writeClientConfigIni passed a None positional for
   libraryFile, which emitted a literal `library-file=` line (empty
   value) into the generated .ini once the inner function's None-
   fallback was removed — an unusable .ini. Synthesize a libraryFile
   path anchored to the directory of args.OutputConfig (required arg,
   so always available). Mirrors the prior implicit default of
   "library/TensileLibrary.yaml" but pointed at a real directory.
Removes the libraryFile=None default from writeClientConfigIni (inner)
and the libraryFile=None default from writeClientConfig (wrapper, one
frame up), making the parameter required at both. The wrapper variant
is made keyword-only since all current callers already pass it as a
kwarg and the parameter sits behind configBase in the positional list.

Why both: the inner-function default was the silent-wrong-path failure
mode that produced the gfx1201 SIGSEGV — when libraryFile was omitted,
it fell back to library/TensileLibrary.yaml (flat layout), which is
wrong whenever the build used a per-arch subdir. Removing the default
at the wrapper too closes the parallel hole one frame up: a future
caller that omits the kwarg there would propagate None positionally
into writeClientConfigIni and produce library-file=None in the .ini —
the same bug class.

Also updates CreateBenchmarkClientParametersForSizes to pass an
explicit libraryFile= built from the libraryPath it already computed.

Tests added:
  - libraryFile is a required parameter on both writeClientConfigIni
    and writeClientConfig (signature-level invariant)
  - every writeClientConfigIni call site in the codebase mentions
    libraryFile= in its arglist (static check via inspect.getsource +
    a hand-rolled paren matcher; the matcher's limitations are
    documented in its docstring)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant