Replace core parsing/resolution engine with Rubydex#447
Open
Replace core parsing/resolution engine with Rubydex#447
Conversation
Replace the entire AST walking, constant extraction, and resolution pipeline with Rubydex, a high-performance Ruby indexer written in Rust. The key architectural change is that Packwerk no longer parses files individually and walks AST nodes to find constant references. Instead, Rubydex indexes all workspace files in a single batch call, resolves all constants, and provides resolved references with direct links to their target declarations. Core changes: - RunContext: rewritten to create a Rubydex::Graph, index the full workspace for resolution, and iterate resolved constant references to detect cross-package violations - ParseRun: simplified to two phases (index_and_resolve + find_offenses) instead of per-file parallel processing - Association detection: rewritten using Prism native AST (no longer needs the parser gem translation layer), runs as a supplementary pass since Rubydex doesn't understand ActiveRecord associations - ERB support: simplified to extract_ruby_source which feeds into Rubydex's index_source API - ApplicationValidator: uses Rubydex::Graph instead of ConstantResolver Removed dependencies: constant_resolver, parallel, ast, parser (direct) Added dependency: rubydex Kept: prism (for association detection), better_html (for ERB) Deleted 16 source files, 11 test files, 3 RBI files (~3,400 lines removed). Net change: +403 / -3,391 lines.
ad87dca to
7fa6be2
Compare
Update tapioca require file to remove constant_resolver, parallel, spring, and minitest/autorun; add prism and rubydex requires. Regenerate all gem RBIs, cleaning up stale files from older gem versions. Add ostruct gem for Ruby 4.0 compatibility with yard/tapioca.
7fa6be2 to
46eea11
Compare
Contributor
|
Cool stuff, Ufuk. Would this mean packwerk doesn't depend on the interrogated codebase using zeitwerk anymore? |
Member
Author
That's correct. Rubydex can do proper Ruby constant resolution, so we don't need to use Zeitwerk heuristics to figure out what constant references resolve to based on their filename. At least, we shouldn't, and, if there are any problems with the resolution, then we should fix them in Rubydex. |
ERB files fed to graph.index_source need a file:// URI, not a bare path. Also add location_to_relative_path helper that catches NotFileUriError for any edge cases where a location doesn't have a file:// URI.
The post-graph work (especially the association detection pass that re-parses all files with Prism) is a significant portion of total runtime. On the Shopify monolith, the association pass takes ~39s single-threaded. Parallelize the Prism parsing phase of association detection using the parallel gem. The resolution and violation checking phases remain sequential since they use shared state (graph + package_set). The parallel flag flows from Configuration -> ParseRun -> RunContext as before.
Split collect_constant_reference_offenses into two phases: 1. Extract: iterate Rubydex's resolved references and pull all needed data into plain Ruby hashes grouped by source file. This must be sequential since it crosses the Rust FFI boundary. 2. Check: process each file's references for dependency violations in parallel using forked workers. Only plain Ruby objects (strings, integers, hashes) cross the fork boundary -- no Rust FFI objects. On the Shopify monolith, the post-graph reference iteration + violation checking was the biggest bottleneck at ~134s single-threaded. The extraction phase remains sequential but the violation checking across ~57k cross-package references is now parallelized.
- Update gemspec to require rubydex >= 0.1.0.beta12 - Point Gemfile to beta12 from RubyGems - Restore accidentally commented-out offense formatter calls in CheckCommand#run - Move parallel flag to RunContext constructor instead of passing through find_offenses kwargs (avoids Mocha/Sorbet interaction issues)
Shared namespaces like GraphApi and Checkouts are defined in dozens of packages across a monolith. Rubydex resolves a constant reference to a Declaration, but the first definition might be in a different package even though the namespace is also defined locally. Before: used declaration.definitions.first to get the target path, causing 219k false violations on Shopify Core. After: check ALL definitions of the target constant. If any definition lives in the same package as the source file, the reference is local and is skipped. Only references where no definition exists in the source package are reported as cross-package violations.
Iterating all definitions of a constant for every one of 9M references
was O(refs * defs_per_constant), which caused a 15+ minute hang on
the Shopify monolith where shared namespaces have hundreds of defs.
Pre-compute a hash of constant_name → {packages, target_path} by
iterating graph.declarations once upfront. The per-reference check
is now a single Set#include? call -- O(1).
The old pipeline's ConstantDiscovery always produced fully qualified constant names with :: prefix (e.g. ::Order, ::Foo::Bar). The package_todo.yml files use this format for listed violations. Rubydex's declaration.name returns the qualified name without the :: prefix, causing all ~11k listed violations to appear as new offenses (50k total on Core) since PackageTodo#listed? couldn't match them. Fix: prepend :: to const_name in both the constant reference extraction and association detection paths.
location.to_file_path does URI parsing and validation on every call, which costs 57s for 2.7M references on Core. location.uri returns the raw string with no parsing, taking only 11s for the same work. Also precompute a file→package hash map to avoid repeated path prefix searches in PackageSet#package_from_path during the hot loops.
When a constant has multiple definitions (canonical file + class reopenings), prefer the definition whose file path matches Zeitwerk naming conventions (e.g. ApiClient → api_client.rb). Before: used the first definition found by iterating graph.declarations, which could be a class reopening like api_client/extension.rb. This caused false violations when the canonical definition was in a different package than the reopening. After: convert constant name to Zeitwerk path suffix using ActiveSupport::Inflector.underscore, and prefer definitions matching that suffix. Falls back to the first definition if no Zeitwerk match.
…iles Instead of reparsing all 57k Ruby files with Prism to find association calls, use Rubydex's method_references to identify which files contain has_many/belongs_to/etc calls. On Shopify Core, only 1,211 files (1.3%) have associations. Parsing just those takes 0.4s vs 29s for all files.
Rubydex represents singleton classes as Foo::<Foo> and singleton methods as Foo::<Foo>#bar(). When ERB code calls LocalizedDocsHelper.docs_url, the reference resolves to the metaclass declaration LocalizedDocsHelper::<LocalizedDocsHelper>, producing a constant name that doesn't match the package_todo.yml entry ::LocalizedDocsHelper. Strip the ::<ClassName> metaclass suffix from declaration names so they match the plain Ruby constant names used in package_todo.yml. Also merge cache entries for a constant and its metaclass since they represent the same Ruby constant.
Benchmarks on Shopify Core show parallel is 2-3x SLOWER than serial: check violations: 5.1s serial vs 10.2s parallel association pass: 2.5s serial vs 8.0s parallel The heavy lifting is now in Rust (Rubydex indexing + resolution). The remaining Ruby work (violation checking, association parsing of ~1,200 files) is too lightweight for fork-based parallelism to help. The fork/IPC overhead dominates.
Three optimizations targeting the top bottlenecks from profiling: 1. Eliminate redundant Dir.glob (8.3s → 0s): reuse the file list from FilesForProcessing instead of re-globbing in index_and_resolve. For scoped checks, compute the full workspace set in UsesParseRun. 2. Inline URI→path conversion in hot loops: replace method calls to location_to_relative_path with direct byteslice on location.uri, avoiding method dispatch overhead on 2.7M+ iterations. 3. Cache per-file package lookups: avoid redundant package_for calls when multiple refs come from the same file. Hoist Zeitwerk suffix computation outside the inner definition loop.
eda8221 to
3342d3d
Compare
The old ConstantResolver needed Rails/Zeitwerk autoload paths to map constants to file paths. Rubydex resolves constants from actual parsed definitions without needing any autoload path information. Changes: - Remove RailsLoadPaths module and its usage in Configuration - Configuration#load_paths now returns an empty hash (kept for API compat) - Remove RAILS_ENV=test setting in executable - Remove load_paths parameter from RunContext interface - Remove zeitwerk from gemspec dependencies Saves ~13s of Rails boot time on CI on Shopify Core. Also enables packwerk to work on non-Rails Ruby projects.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
parsergem as a direct dependencyArchitecture change
Before: For each file → parse with Prism → walk AST nodes → extract constant references via
ConstNodeInspector→ resolve viaconstant_resolvergem → check package violationsAfter: Index all workspace files in one
Rubydex::Graphbatch call → resolve all constants → iterateResolvedConstantReferenceobjects with direct links to target declarations → check package violations. Association detection (has_many,belongs_to, etc.) runs as a supplementary Prism-based pass since Rubydex doesn't understand ActiveRecord semantics.Dependencies
constant_resolverrubydexparallelast(direct)parser(direct)Kept:
prism(for association detection),better_html(for ERB),activesupport,sorbet-runtime,zeitwerk,bundler.Deleted files (16 source + 11 test + 3 RBI)
The entire per-file parsing pipeline:
file_processor,node_processor,node_processor_factory,node_visitor,node_helpers,const_node_inspector,constant_name_inspector,constant_discovery,parsed_constant_definitions,reference_extractor,unresolved_reference,association_inspector,cache,parsers/ruby,parsers/factory,parsers/parser_interface.Key design decisions
index_and_resolveindexes ALL Ruby files in the workspace (not just the scoped check set) so Rubydex can resolve cross-package references. Only the scoped files are checked for violations.has_many :ordersas a method call, not a constant reference. Usesgraph.resolve_constantto resolve the inferred constant name.Parsers::Erbsimplified toextract_ruby_sourcewhich feeds intograph.index_source.parallelflag is accepted but ignored.Test results
All 136 tests pass (excluding
spring_command_testandautoload_testwhich have pre-existing Ruby 4.0 incompatibilities unrelated to this change).