# CPG Optimization Strategies
As ElixirScope projects grow, managing the performance and memory footprint of Code Property Graphs (CPGs) becomes critical. This document outlines strategies for optimizing CPG generation, storage, querying, and incremental updates. These optimizations will be implemented across modules such as `CPGBuilder`, `EnhancedRepository`, `MemoryManager`, and potentially a new `CPGOptimizer`.
The CPG data will be stored across multiple ETS tables managed by EnhancedRepository:
- `@cpg_nodes_table`: `{{module_name, function_key, cpg_node_id}, serialized_CPGNode.t()}`
- `@cpg_edges_table`: `{{from_node_id, to_node_id, edge_type}, serialized_CPGEdge.t()}`
- `@cpg_analysis_cache`: `{{cpg_unit_key, algorithm_name, version}, serialized_results}`
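A minimal sketch of how `EnhancedRepository` might create these tables. The table names match the attributes above, but the concrete options (`:ordered_set`, concurrency flags) are assumptions, not the final schema:

```elixir
defmodule CPGTables do
  @moduledoc "Illustrative ETS setup for the CPG tables; options are assumptions."

  def create_tables do
    # ordered_set keeps keys sorted, enabling cheap range scans over all
    # nodes of one module/function via the {module, function_key, _} prefix.
    :ets.new(:cpg_nodes_table, [:ordered_set, :named_table, :public, read_concurrency: true])
    :ets.new(:cpg_edges_table, [:ordered_set, :named_table, :public, read_concurrency: true])
    # The analysis cache is point-lookup only, so a plain set suffices.
    :ets.new(:cpg_analysis_cache, [:set, :named_table, :public, read_concurrency: true])
    :ok
  end
end
```

A node insert then looks like `:ets.insert(:cpg_nodes_table, {{MyMod, {:foo, 1}, "n1"}, node})`.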
Indexing Strategy:
- Adjacency list indexes for efficient graph traversal
- Type-based indexes for filtering by node/edge types
- Community membership indexes for architectural analysis
Query Optimization:
- Leverage the existing `QueryIndexes` pattern for CPG-specific indexes
- Use ETS `select` with match specifications for complex graph queries
- Cache frequently accessed subgraphs in memory
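As a concrete illustration of the `select`-with-match-spec approach, the helper below fetches all outgoing edges of one type from a node, assuming the `{{from_id, to_id, edge_type}, edge}` layout of `@cpg_edges_table` given above (the module and function names are illustrative):

```elixir
defmodule CPGEdgeQueries do
  @doc """
  Returns `{to_id, edge}` pairs for all edges of `edge_type` leaving `from_id`.
  The match spec binds from_id and edge_type as constants, so ETS filters
  without copying non-matching rows to the caller.
  """
  def outgoing_edges(table, from_id, edge_type) do
    match_spec = [
      {{{from_id, :"$1", edge_type}, :"$2"}, [], [{{:"$1", :"$2"}}]}
    ]

    :ets.select(table, match_spec)
  end
end
```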
- PRD Link: FR5.1
- Goal: Avoid full CPG rebuilds for minor code changes; update only affected CPG portions.
- Responsibility: `CPGBuilder`, `EnhancedRepository`, `Synchronizer`.
- Strategy:
  - Change Detection (`FileWatcher`, `Synchronizer`):
    - When a file changes, the `Synchronizer` receives the event.
    - It determines the scope of the change (e.g., specific functions modified, module-level changes).
  - Granular Re-analysis (`CPGBuilder`, `ASTAnalyzer`, graph generators):
    - Instead of re-analyzing the whole module, re-analyze only the AST of modified functions.
    - Regenerate the CFG and DFG for affected functions.
  - Targeted CPG Updates (`CPGBuilder`):
    - Identify CPG nodes and edges corresponding to the changed code parts using AST Node IDs.
    - Node Updates: Update properties of existing CPG nodes.
    - Edge Updates: Add/remove/update CPG edges connected to modified nodes.
    - Structural Changes: If function signatures change or functions are added/deleted, more significant CPG surgery is needed (adding/removing subgraphs).
  - Dependency Propagation (`CPGBuilder`, `CPGSemantics`):
    - After local CPG updates, determine whether these changes affect inter-functional CPG edges (e.g., call graph edges, inter-procedural data flow edges).
    - Update these connecting edges.
  - Algorithmic Result Invalidation/Recomputation (`CPGMath`, `CPGSemantics`, `MemoryManager`):
    - Invalidate cached algorithmic results (centrality, communities, paths) affected by the CPG change.
    - Trigger partial/focused recomputation of these algorithms. For instance, if a node's connectivity changes, its centrality and the centrality of its neighbors may need re-evaluation; community structures might shift locally.
- Challenges:
- Precisely mapping source code changes to CPG diffs.
- Efficiently updating inter-procedural dependencies.
- Minimizing recomputation of global graph algorithms.
- Detailed Strategy:
  - Change Identification (`Synchronizer`, `FileWatcher`):
    - `FileWatcher` detects a file change. `Synchronizer` receives the changed `file_path` and invokes `ProjectPopulator.parse_and_analyze_file(file_path)` to get the new `EnhancedModuleData` (which includes the new AST, function list, etc.).
    - The `Synchronizer` then signals to the `CPGBuilder` or `EnhancedRepository` that a specific module's CPG needs an update, providing the old and new `EnhancedModuleData`, or just the new one plus the module name.
  - Delta Calculation (conceptual; within `CPGBuilder` or a dedicated diffing utility):
    - Compare the old `EnhancedModuleData.ast` (or its function list) with the new one to identify added, deleted, or modified functions.
    - For modified functions, a more granular AST diff might be performed to pinpoint exact changes.
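The function-list comparison can be sketched as follows. Here functions are reduced to `{name_arity, ast_hash}` pairs purely for illustration; the real `EnhancedModuleData` structs carry much richer data:

```elixir
defmodule FunctionDelta do
  @moduledoc "Illustrative delta calculation over {name_arity, ast_hash} pairs."

  @doc "Returns added, deleted, and modified function keys between two versions."
  def diff(old_funs, new_funs) do
    old = Map.new(old_funs)
    new = Map.new(new_funs)

    %{
      # Present only in the new version.
      added: Map.keys(new) -- Map.keys(old),
      # Present only in the old version.
      deleted: Map.keys(old) -- Map.keys(new),
      # Present in both, but with a different AST hash.
      modified:
        for {key, hash} <- new, Map.has_key?(old, key), old[key] != hash, do: key
    }
  end
end
```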
  - Targeted CPG Regeneration/Update (`CPGBuilder`):
    - Deleted Functions: Remove the corresponding CPG nodes and edges, including their internal CFG/DFG representations within the CPG and any inter-procedural CPG edges (calls, data flows) connected to them.
    - Added Functions: Generate new CPG subgraphs for these functions and integrate them into the main CPG, adding the necessary inter-procedural edges.
    - Modified Functions:
      - Option A (simpler): Treat as delete + add. Remove the old CPG subgraph and generate a new one.
      - Option B (more complex, more efficient): Attempt to patch the existing CPG subgraph:
        - Regenerate the CFG/DFG for the modified function's AST.
        - Diff the new CFG/DFG against the old one embedded in the CPG.
        - Update CPG nodes/edges based on this diff. This is highly complex.
  - Inter-Procedural Edge Updates (`CPGBuilder`):
    - After local CPG changes, re-evaluate call graph edges and inter-procedural data flow edges connected to the modified/added/deleted functions.
    - If a function's signature changed, update all call sites.
  - Algorithmic Result Invalidation (`CPGBuilder` notifying `MemoryManager`, or directly updating `CPGData`):
    - Crucially, any change to the CPG structure necessitates invalidating cached results from `CPGMath` and `CPGSemantics` (centrality, communities, paths, etc.) that depend on the modified parts of the graph.
    - A simple strategy is to invalidate all algorithmic results for the entire CPG containing the change.
    - A more advanced strategy is to identify the "blast radius" of the CPG change and invalidate/recompute algorithmic results only for the affected subgraph and its dependencies.
- Implementation in `EnhancedRepository`/`CPGBuilder`:
  - `EnhancedRepository` might expose an `update_cpg_for_module(module_name, new_module_ast_or_data)` function.
  - `CPGBuilder` would contain the logic to perform the delta analysis and targeted updates on a `CPGData.t()` struct.
  - The `CPGData.t()` struct should perhaps include a version or checksum to help with cache invalidation.
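The whole incremental-update flow described above could be composed as a pipeline. This is a skeleton only: every helper body is a stub, and the plain-map CPG shape (`:metadata`, `:version`) is an assumption standing in for `CPGData.t()`:

```elixir
defmodule IncrementalCPGUpdate do
  @moduledoc "Sketch of the incremental update pipeline; all names and shapes are illustrative."

  def update_cpg_for_module(cpg, old_data, new_data) do
    delta = compute_delta(old_data, new_data)

    cpg
    |> remove_subgraphs(delta.deleted)
    |> add_subgraphs(delta.added, new_data)
    |> patch_functions(delta.modified, new_data)
    |> refresh_interprocedural_edges(delta)
    # Any structural change invalidates cached algorithm results...
    |> invalidate_algorithm_caches()
    # ...and bumps the version so external caches keyed on it go stale.
    |> bump_version()
  end

  defp compute_delta(_old, _new), do: %{added: [], deleted: [], modified: []}
  defp remove_subgraphs(cpg, _deleted), do: cpg
  defp add_subgraphs(cpg, _added, _new_data), do: cpg
  defp patch_functions(cpg, _modified, _new_data), do: cpg
  defp refresh_interprocedural_edges(cpg, _delta), do: cpg
  defp invalidate_algorithm_caches(cpg), do: %{cpg | metadata: %{}}
  defp bump_version(cpg), do: Map.update!(cpg, :version, &(&1 + 1))
end
```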
- PRD Link: FR5.2
- Goal: Analyze CPG query specifications and choose an optimal execution strategy.
- Responsibility: A new `CPGOptimizer` module, used by `QueryBuilder` or `QueryExecutor`.
- Strategy:
  - Query Parsing & Analysis (`QueryBuilder` parses the query spec):
    - Input: a `QueryBuilder.query_t()` struct.
    - Identify target entities (nodes, edges, types of nodes/edges).
    - Identify filter conditions and their selectivity.
    - Identify requested graph traversals or algorithmic computations (e.g., "find path", "get centrality > X").
  - Plan Generation (`CPGOptimizer`):
    - Identify the CPG entities involved (nodes, edges, specific types).
    - Index First: Prioritize available ETS indexes in `EnhancedRepository` or `QueryIndexes` within `CPGData` (e.g., indexes on CPG node types, AST Node IDs, pre-computed high-centrality nodes).
    - Filter Ordering: Apply the most selective filters first to prune the working set of nodes/edges early.
    - Traversal Strategy: For pathfinding or neighborhood queries, estimate the scope of the traversal and choose an appropriate algorithm (BFS for shortest paths, DFS for reachability/all paths, Dijkstra's or A* for weighted paths), considering graph characteristics.
    - Algorithm Offloading: If a query asks for a metric like "nodes with PageRank > 0.1", check whether PageRank is pre-computed and cached. If not, decide whether to compute it for the whole graph or use an approximation when the query scope is limited.
    - Join Optimization (when querying across CPGs or joining with runtime data): Apply standard database join techniques (e.g., hash join or sort-merge join, chosen by estimated cardinalities).
  - Cost Estimation:
    - Refine the cost estimate provided by `QueryBuilder` using the CPG's specific structure (e.g., number of nodes of a certain type, average degree) and the chosen execution plan.
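The cost-based choice between candidate plans reduces to picking the cheapest estimate. The sketch below uses a deliberately toy cost model (scan cost linear in candidate-set size, index probes at unit cost per expected match); the real `CPGOptimizer` estimates would be far richer:

```elixir
defmodule PlanSelection do
  @moduledoc "Illustrative cost-based plan choice; the cost model is an assumption."

  @doc "Each candidate plan carries an estimated cost; pick the cheapest."
  def choose_plan(plans), do: Enum.min_by(plans, & &1.estimated_cost)

  # Toy model: scanning touches every candidate once per filter applied.
  def estimate_scan_cost(node_count, filters), do: node_count * max(length(filters), 1)

  # Toy model: an index probe costs one unit per expected result row.
  def estimate_index_cost(expected_matches), do: expected_matches
end
```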
- Example 1:
  - Query: "Find all functions (CPG nodes of type `:function_def`) with betweenness centrality > 0.5 that call `Ecto.Repo.all/2`."
  - Strategy 1 (no CPG algorithm index): Iterate all function CPG nodes, check for an `Ecto.Repo.all/2` call (fast via an index), then compute/look up betweenness centrality for the matching nodes.
  - Strategy 2 (with a CPG algorithm index): Look up nodes with betweenness centrality > 0.5 (fast if such an index exists), then filter those for type `:function_def` and check for the `Ecto.Repo.all/2` call.
  - `CPGOptimizer` chooses the strategy with the lower estimated cost.
- Example 2:
  - Query: "Find functions (CPG node type `:function_def`) in module `MyMod` that call `External.API.call/0` and have betweenness centrality > 0.5."
  - `CPGOptimizer` plan:
    1. Fetch the CPG for `MyMod`.
    2. Filter CPG nodes for `ast_type == :function_def` (uses an index on CPG node properties if available).
    3. For the remaining nodes, check outgoing `:call_graph` CPG edges for `External.API.call/0` (uses a CPG edge index if available).
    4. For nodes passing step 3, retrieve/compute betweenness centrality (check `CPGData.metadata.cached_centrality_betweenness` or call `CPGMath`).
    5. Filter by centrality > 0.5.
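The Example 2 plan can be expressed as a filter pipeline. The CPG shape here (`:nodes`, `:call_edges`, `:centrality` maps) is an assumption made for the sketch, not the `CPGData.t()` layout:

```elixir
defmodule ExamplePlan do
  @moduledoc "Illustrative execution of the Example 2 plan over a toy CPG shape."

  @doc "Returns the IDs of :function_def nodes that call `callee` and exceed `threshold`."
  def run(cpg, callee, threshold) do
    cpg.nodes
    # Step 2: keep only function-definition nodes.
    |> Enum.filter(fn {_id, node} -> node.ast_type == :function_def end)
    # Step 3: keep nodes with an outgoing call edge to the target callee.
    |> Enum.filter(fn {id, _} -> callee in Map.get(cpg.call_edges, id, []) end)
    # Steps 4-5: keep nodes whose (cached) betweenness centrality passes the threshold.
    |> Enum.filter(fn {id, _} -> Map.get(cpg.centrality, id, 0.0) > threshold end)
    |> Enum.map(fn {id, _} -> id end)
  end
end
```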
- PRD Link: FR5.3
- Goal: Reduce the in-memory and on-disk footprint of CPGs.
- Strategies:
  - String Interning (`EnhancedRepository`, CPG data structures):
    - Store common strings found in CPG node/edge properties (e.g., variable names, function names, literal strings from the AST) in a shared string pool (e.g., an ETS table mapping strings to integer IDs).
    - Nodes/edges store integer IDs instead of full strings.
  - Selective Property Storage (`CPGData`):
    - Not all properties are needed for all nodes/edges. Use sparse maps, or different node/edge structs for different types, to avoid storing many `nil` fields.
  - Data Compression (`MemoryManager`, `EnhancedRepository`):
    - For CPGs of modules/functions not recently accessed, compress their serialized `CPGData` (or parts of it, such as detailed AST snippets within nodes) when persisted or held in a lower-priority memory cache, e.g., via `:erlang.term_to_binary(data, [:compressed])`.
  - Lazy Loading of CPG Components (conceptual; `EnhancedRepository`):
    - When loading a `CPGData` for a module, initially load only a summary or the essential graph structure.
    - Load detailed node properties (e.g., full AST snippets, detailed DFG information within a CPG node) on demand when a query specifically requires them.
  - (Future) Off-Heap/Disk Storage for Large CPGs:
    - For extremely large projects, investigate storing parts of the CPG (e.g., less frequently accessed nodes/edges, or large property values) off the BEAM heap or on disk, managed by a system like RocksDB or a custom ETS-backed paging mechanism.
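The compression path is a thin wrapper over the BEAM's built-in term serialization, as the strategies above note:

```elixir
defmodule CPGCompression do
  @moduledoc "Compress a CPG term for cold storage; restore it on access."

  # :compressed applies zlib to the external-term payload, which works
  # well on CPGs because node/edge structures are highly repetitive.
  def compress(cpg), do: :erlang.term_to_binary(cpg, [:compressed])

  def decompress(binary), do: :erlang.binary_to_term(binary)
end
```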
- Strategies & Responsibility:
  - String Interning (`CPGBuilder`, `EnhancedRepository`):
    - When `CPGBuilder` creates `CPGNode.t()` and `CPGEdge.t()` structs, common strings (function names, variable names, literal values from the AST, node/edge types/subtypes) should be interned.
    - `EnhancedRepository` can maintain a global (per-repository instance) or per-CPG intern pool (e.g., an ETS table mapping strings to integer IDs, or Elixir atoms if the set is bounded and known).
    - CPG nodes/edges store these integer IDs.
    - Functions retrieving CPG data for display/analysis de-intern these IDs.
  - Selective Property Storage (`CPGNode.t()`, `CPGEdge.t()` design):
    - The `CPGNode.t()` and `CPGEdge.t()` structs might have many optional fields (`control_flow_info`, `data_flow_info`, `unified_properties`). Use maps for these fields so only present data consumes memory, rather than fixed struct fields that would often be `nil`.
    - Alternatively, use different specialized structs for different conceptual CPG node/edge types if properties vary significantly (though this increases type complexity).
  - Data Compression (`MemoryManager`, `EnhancedRepository`):
    - `MemoryManager`, during its `compress_old_analysis` cycle or when handling memory pressure, can identify CPGs (or parts of CPGs, such as large AST snippets within nodes) for modules that are infrequently accessed.
    - It can then request that `EnhancedRepository` serialize these `CPGData.t()` objects (or their large sub-components) using `:erlang.term_to_binary(data, [:compressed])`.
    - `EnhancedRepository` stores the compressed binary and marks the in-memory version as eligible for GC, or replaces it with a "lazy-load" stub.
    - When accessed again, `EnhancedRepository` decompresses the data.
  - Lazy Loading of CPG Components (`EnhancedRepository`, `CPGBuilder`):
    - When `EnhancedRepository.get_enhanced_module/1` loads a module, its `CPGData.t()` might initially be a "summary" CPG.
    - Detailed information within CPG nodes (e.g., the full DFG structure for a function node, detailed AST snippets) or expensive-to-load parts of the CPG (e.g., full inter-procedural data flow edges) are loaded on demand by `CPGBuilder` or specialized functions when a query explicitly needs them.
    - This requires `CPGData.t()` to support partial loading and `CPGBuilder` to provide functions like `CPGBuilder.load_detailed_node_info(cpg_summary, node_id)`.
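An ETS-backed intern pool of the kind described above might look like this. Everything here (names, the two-table layout, the `:counters`-based ID allocation) is illustrative, and the sketch is not hardened against concurrent interning of the same string:

```elixir
defmodule InternPool do
  @moduledoc "Illustrative string intern pool: string -> id and id -> string tables."

  def new do
    %{
      fwd: :ets.new(:intern_fwd, [:set, :public]),
      rev: :ets.new(:intern_rev, [:set, :public]),
      counter: :counters.new(1, [])
    }
  end

  @doc "Returns the existing ID for `string`, allocating a new one on first sight."
  def intern(pool, string) do
    case :ets.lookup(pool.fwd, string) do
      [{^string, id}] ->
        id

      [] ->
        :counters.add(pool.counter, 1, 1)
        id = :counters.get(pool.counter, 1)
        :ets.insert(pool.fwd, {string, id})
        :ets.insert(pool.rev, {id, string})
        id
    end
  end

  @doc "De-interns an ID back to its string, for display/analysis paths."
  def resolve(pool, id) do
    [{^id, string}] = :ets.lookup(pool.rev, id)
    string
  end
end
```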
- `EnhancedRepository.get_performance_metrics/0` can expose aggregated CPG operation timings.
- PRD Link: FR5.4
- Goal: Cache results of computationally expensive graph algorithms to speed up subsequent queries.
- Strategy:
  - Cache Location:
    - Within `CPGData.t()`: Add fields like `metadata: %{cached_centrality_pagerank: %{...}, cached_communities_louvain: %{...}}`. This ties cached results directly to a specific CPG version.
    - `MemoryManager` caches: Use dedicated ETS tables managed by `MemoryManager` (e.g., `@cpg_analysis_cache_table`) to store results keyed by `{cpg_checksum, algorithm_name, params}`.
  - Cache Key Generation:
    - Include a CPG identifier (e.g., module name + file hash to represent the CPG version).
    - Include the algorithm name and its specific parameters (e.g., `{:pagerank, alpha: 0.85}`).
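A key built along those lines might look like the following; the exact hashing choices (SHA-256 for source, `:erlang.phash2/1` for parameters) are assumptions:

```elixir
defmodule CPGCacheKey do
  @moduledoc "Illustrative cache key: CPG version identifier + algorithm + params hash."

  def build(module, source, algorithm, params) do
    # The source hash stands in for the CPG version: same source, same key.
    source_hash = :crypto.hash(:sha256, source) |> Base.encode16(case: :lower)
    # phash2 gives a compact, deterministic hash of the parameter term.
    params_hash = :erlang.phash2(params)
    {module, source_hash, algorithm, params_hash}
  end
end
```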
  - Invalidation:
    - CPG Structural Change: If the CPG structure for a module/function is updated (due to code changes), all cached algorithmic results for that CPG must be invalidated. The `Synchronizer` or `CPGBuilder` signals this.
    - TTL: Standard time-to-live policies managed by `MemoryManager`.
    - LRU/LFU Eviction: When cache limits are reached, evict the least relevant results.
- Granularity of Caching:
- Cache entire result sets (e.g., all centrality scores for a CPG).
- Cache results for specific queries (e.g., "top 10 nodes by betweenness centrality").
- Example Workflow (centrality):
  1. A query requests "nodes with PageRank > 0.01".
  2. `QueryExecutor` checks the `MemoryManager` cache for `{:pagerank, cpg_id, %{threshold: 0.01}}`. Miss.
  3. `QueryExecutor` checks whether `cpg.metadata.cached_centrality_pagerank` exists. Miss.
  4. `CPGSemantics` (via `CPGMath`) computes all PageRank scores for `cpg`.
  5. The result is stored in `cpg.metadata.cached_centrality_pagerank` and potentially in `MemoryManager`'s cache.
  6. The query is satisfied from the freshly computed (and now cached) scores.
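Steps 3-5 of the workflow reduce to a lookup-then-compute pattern. In this sketch `compute_fun` stands in for the `CPGMath` PageRank computation, and the plain-map `:metadata` field is an assumption about the `CPGData` shape:

```elixir
defmodule CentralityCache do
  @moduledoc "Illustrative check-cache, compute-on-miss, store-back flow."

  @doc "Returns {scores, possibly-updated cpg}; computes only on a cache miss."
  def pagerank(cpg, compute_fun) do
    case cpg.metadata[:cached_centrality_pagerank] do
      nil ->
        # Miss: compute all scores and cache them on the CPG itself.
        scores = compute_fun.(cpg)
        {scores, put_in(cpg.metadata[:cached_centrality_pagerank], scores)}

      scores ->
        # Hit: serve from the per-CPG metadata cache.
        {scores, cpg}
    end
  end
end
```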
- Strategy & Responsibility:
  - Cache Storage (`CPGData.metadata`, `MemoryManager`):
    - Lightweight, per-CPG results: Store directly in `CPGData.t().metadata` (e.g., `%{cached_centrality_pagerank: %{"nodeA" => 0.1, ...}}`). When the `CPGData` is serialized/compressed, these caches go with it.
    - Heavier, cross-CPG, or query-specific results: Use `MemoryManager`'s ETS-based caches (e.g., `@cpg_cache_table`, `@analysis_cache_table`). The key could be `{cpg_version_checksum, :algorithm_name, algorithm_params_hash}`.
  - Cache Population (`CPGMath`, `CPGSemantics`, `QueryExecutor`):
    - After a `CPGMath` or `CPGSemantics` function computes an expensive result (e.g., `community_louvain`), it (or the calling `QueryExecutor`) should offer it to the appropriate cache.
  - Cache Lookup (`CPGMath`, `CPGSemantics`, `QueryExecutor`):
    - Before computing, these functions first check the relevant cache.
  - Invalidation (`CPGBuilder`, `Synchronizer`, `MemoryManager`):
    - Structural CPG Changes: When `CPGBuilder` (triggered by `Synchronizer`) performs an incremental update on a `CPGData.t()`, it must invalidate the relevant cached algorithmic results. This can be done by:
      - Deleting specific keys from `CPGData.t().metadata`.
      - Updating a version/checksum on `CPGData.t()`, which automatically invalidates `MemoryManager` cache entries keyed with the old version/checksum.
      - Notifying `MemoryManager` to evict entries related to the modified CPG.
    - TTL/LRU: `MemoryManager` handles standard TTL and LRU eviction for its caches.
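The checksum-based invalidation described above works without touching old entries at all: bumping the checksum makes them unreachable, and TTL/LRU sweeps reclaim them later. A minimal sketch over a plain map (the real store would be `MemoryManager`'s ETS tables):

```elixir
defmodule VersionedCache do
  @moduledoc "Illustrative checksum-keyed cache: stale entries become unreachable."

  # Entries are keyed {cpg_checksum, algorithm, params_hash}, as above.
  def put(cache, checksum, algorithm, params_hash, value),
    do: Map.put(cache, {checksum, algorithm, params_hash}, value)

  def get(cache, checksum, algorithm, params_hash),
    do: Map.get(cache, {checksum, algorithm, params_hash})
end
```

After a structural change the caller simply queries with the new checksum; the old entry never matches, so no explicit delete is required on the hot path.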
- Responsibility: `EnhancedRepository`, `CPGBuilder`, `CPGMath`, `CPGSemantics`.
- Strategy:
  - Use `ElixirScope.Utils.measure/1` to wrap key operations:
    - `CPGBuilder.build_cpg/2` (overall time).
    - `CPGBuilder` incremental update steps.
    - Individual algorithm executions in `CPGMath` and `CPGSemantics` (e.g., `strongly_connected_components`, `dependency_impact_analysis`).
    - CPG query execution phases in `QueryExecutor`.
  - Report these metrics to a central collector, possibly `EnhancedRepository`'s stats or `MemoryManager`.
- The `EnhancedRepository` and `CPGBuilder` should use `ElixirScope.Utils.measure/1` to track the durations of key CPG operations:
  - CPG generation time (full and incremental).
  - Specific graph algorithm execution times (e.g., centrality calculation).
  - CPG query execution time.
- These metrics can be reported to `MemoryManager` or a dedicated performance-tracking system to identify bottlenecks in the CPG layer itself.
A monitoring sketch (thresholds, table names, and the scheduling interval are illustrative):

```elixir
defmodule CPGMonitoring do
  use GenServer
  require Logger

  # Illustrative values; tune per deployment.
  @cpg_nodes_table :cpg_nodes_table
  @cpg_edges_table :cpg_edges_table
  @cpg_analysis_cache :cpg_analysis_cache
  @memory_alert_threshold 100 * 1024 * 1024

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, %{}, opts)

  @impl true
  def init(state) do
    # Re-check ETS health once a minute.
    :timer.send_interval(60_000, :monitor_ets_health)
    {:ok, state}
  end

  # Monitor ETS table sizes and performance
  @impl true
  def handle_info(:monitor_ets_health, state) do
    tables = [@cpg_nodes_table, @cpg_edges_table, @cpg_analysis_cache]

    health_metrics =
      Enum.map(tables, fn table ->
        info = :ets.info(table)

        %{
          table: table,
          size: info[:size],
          memory: info[:memory] * :erlang.system_info(:wordsize),
          type: info[:type]
        }
      end)

    # Report to telemetry; measurements must be a map, so emit one event per table.
    Enum.each(health_metrics, fn m ->
      :telemetry.execute(
        [:elixir_scope, :cpg, :ets_health],
        %{size: m.size, memory: m.memory},
        %{table: m.table, type: m.type}
      )
    end)

    # Check for alert conditions
    check_alert_conditions(health_metrics)
    {:noreply, state}
  end

  defp check_alert_conditions(metrics) do
    Enum.each(metrics, fn %{table: table, memory: memory} ->
      if memory > @memory_alert_threshold do
        Logger.warning("CPG ETS table #{table} memory usage: #{memory} bytes")
        # Trigger cleanup or alerting
      end
    end)
  end
end
```

Query-level instrumentation wraps each CPG query and emits duration and error telemetry (the threshold and `classify_query_type/1` stub are illustrative):

```elixir
defmodule CPGQueryInstrumentation do
  require Logger

  @slow_query_threshold 100

  def monitor_cpg_query_performance(query_spec, fun) do
    start_time = System.monotonic_time()

    try do
      result = fun.()
      duration = System.monotonic_time() - start_time
      duration_ms = System.convert_time_unit(duration, :native, :millisecond)

      :telemetry.execute(
        [:elixir_scope, :cpg, :query, :duration],
        %{duration_ms: duration_ms},
        %{query_type: classify_query_type(query_spec)}
      )

      if duration_ms > @slow_query_threshold do
        Logger.warning("Slow CPG query detected: #{duration_ms}ms - #{inspect(query_spec)}")
      end

      result
    rescue
      error ->
        :telemetry.execute(
          [:elixir_scope, :cpg, :query, :error],
          %{},
          %{error: inspect(error), query_type: classify_query_type(query_spec)}
        )

        reraise error, __STACKTRACE__
    end
  end

  # Placeholder; the real implementation would inspect the query spec.
  defp classify_query_type(_query_spec), do: :unknown
end
```