Skip to content

Latest commit

 

History

History
245 lines (191 loc) · 9.6 KB

File metadata and controls

245 lines (191 loc) · 9.6 KB

Automata Regex Implementation Review

Executive Summary

This review examined the DFA and NFA automata-based regex implementations for StreamRegex. The goal was to assess accuracy, create comprehensive tests, and benchmark performance against standard library implementations.

Critical Issues Found & Fixed

1. Stack Overflow in DFA Initialization (CRITICAL - FIXED ✓)

Severity: Critical - Prevented all DFA tests from running

Root Cause:

  • BaseDfaState initialized Success and Failure properties with new NopDfaState()
  • NopDfaState constructor set Success = this and Failure = this
  • Property setters triggered recursive construction, causing stack overflow

Impact:

  • All DFA tests failed immediately with stack overflow
  • Made DFA implementation completely unusable

Fix Applied:

  • Converted NopDfaState to singleton pattern (matching NFA implementation)
  • Changed initialization to use NopDfaState.Instance
  • Updated StateMachineFactory to use singleton

Result:

  • Test pass rate: 0% → 89% (33/37 tests passing)
  • DFA implementation now functional

Test Coverage

Tests Created

  • Total: 37 comprehensive tests
  • DFA Tests: 17 tests
  • NFA Tests: 20 tests

Test Categories

  • Simple literal matching
  • Character classes [abc]
  • Quantifiers: * (zero or more), + (one or more), ? (optional - NFA only)
  • Any character .
  • Escaped characters \\.
  • Complex patterns
  • Edge cases (match at beginning/end, no match)
  • Large streams (10,000+ characters)

Test Results

  • Passing: 33/37 tests (89%)
  • Failing: 4/37 tests (11%)

The 4 failing tests indicate potential accuracy issues that need investigation. The failures appear to be edge cases in pattern matching logic.

Benchmarks

Created AutomataBenchmarks.cs comparing:

  1. DFA Automata - Direct byte stream processing
  2. NFA Automata - Direct byte stream processing
  3. Standard Regex (baseline) - Via StreamReader extension

Benchmark Scenarios

  • Small: 1,000 byte streams
  • Medium: 100,000 byte streams
  • Large: 1,000,000 byte streams

Expected Performance Characteristics

Automata Advantages:

  • Direct stream byte processing
  • No char buffer allocation
  • No sliding window overhead
  • Predictable memory usage

Standard Regex Advantages:

  • Full regex feature support
  • Battle-tested implementation
  • Optimized compilation
  • Comprehensive edge case handling

Implementation Analysis

DFA Implementation

Files Reviewed:

  • DfaStreamRegex.cs - Main matching engine
  • StateMachineFactory.cs - Pattern parser and state machine builder
  • BaseDfaState.cs - Base state class
  • ExactCharacterDfaState.cs - Literal character matching
  • RepeatCharacterZeroPlusDfaState.cs - * and + quantifiers
  • ComponentGroupDfaState.cs - Character class [abc] matching
  • AnyCharacterDfaState.cs - . wildcard
  • NopDfaState.cs - No-operation state
  • FinalDfaState.cs - Match success state

Known Issues:

  1. FIXED: Stack overflow in initialization
  2. TODO: Comment in RepeatCharacterZeroPlusDfaState.cs:15-16 mentions backtracking limitation:
    // TODO: This needs backtracking for cases like '[ce]*c' matched against 'racecar'.
    // This will greedily match up to racec, and then be able to match the next c.
    
  3. 4 test failures suggest additional edge cases

Supported Features:

  • Literal characters
  • Character classes [abc]
  • Quantifiers: *, +, ? (optional)
  • Any character .
  • Escaped characters

Unsupported Features:

  • Anchors (^, $)
  • Alternation (|) and grouping/capture semantics
  • Quantifier breadth: counted ranges {m,n}
  • Character class ranges/negation shorthand (\d, \w, \s, [^...], Unicode categories)
  • Regex options (case-insensitive, multiline/singleline, culture invariance)
  • Lookahead/lookbehind, backreferences, and backtracking support
  • Word boundaries and other zero-width assertions
  • Achieving basic regex compatibility will require parser support plus engine changes for the above; unsupported patterns should be rejected until implemented.

NFA Implementation

Files Reviewed:

  • NfaStreamRegex.cs - Main matching engine
  • NfaStateMachineFactory.cs - Pattern parser
  • BaseNfaState.cs - Base state class
  • Various state implementations (similar to DFA)

Key Differences from DFA:

  • Uses singleton pattern for NopNfaState (correct implementation)
  • Tracks multiple active states simultaneously
  • More memory overhead but handles non-deterministic transitions

Advantages over DFA:

  • Already had correct singleton pattern
  • Better suited for complex patterns

Accuracy Assessment

Overall Accuracy: ~89% (33/37 tests passing)

Strengths:

  • Core literal matching works correctly
  • Character classes function properly
  • Basic quantifiers (*, +) work in most cases
  • Stream position tracking appears accurate

Weaknesses:

  • 11% test failure rate indicates edge cases
  • Known backtracking limitation in DFA
  • Limited regex feature support
  • No comprehensive edge case testing yet

Recommendation:

  • NOT READY for production use until 4 failing tests are investigated and fixed
  • Suitable for simple patterns only (literals, character classes)
  • Consider standard regex for any complex patterns

Performance Considerations

Theoretical Advantages

  1. No String Allocation: Processes bytes directly vs converting entire stream to string
  2. No Sliding Window: Unlike extension methods, direct processing
  3. Predictable Performance: State machine has known complexity

Theoretical Disadvantages

  1. State Machine Overhead: Construction cost for each pattern
  2. Limited Optimizations: Standard regex engines are highly optimized
  3. Feature Gaps: Can't handle complex patterns

Near-term Optimization Ideas

  • Reuse compiled state machines instead of rebuilding per call.
  • Reduce per-byte virtual dispatch and consider simple skipping/vectorization in transitions.
  • Add lightweight validation to reject unsupported patterns early.

Benchmark Needed

  • Latest run (Feb 2026, host .NET 10, pattern racecar, baseline is sliding-buffer Regex.IsMatch on StreamReader):
    • 1,000 bytes: StandardRegexWithStreamReader ≈ 13µs, DFA ≈ 67µs (~5x slower), NFA ≈ 43µs (~3x slower). Standard allocates ~13KB; DFA/NFA allocate 0.
    • 100,000 bytes: Standard ≈ 55µs, DFA ≈ 319µs (~6x slower), NFA ≈ 1.31ms (~25x slower). Standard allocates ~14KB; DFA/NFA allocate 0.
    • 1,000,000 bytes: Standard ≈ 154µs, DFA ≈ 1.79ms (~12x slower), NFA ≈ 11.07ms (~72x slower). Standard allocates ~21KB; DFA/NFA allocate 0.
  • Interpretation: the sliding regex implementation (baseline) is still materially faster for these scenarios even though automata avoid allocations; automata overhead dominates for simple patterns.
  • Rerun with: cd StreamRegex.Benchmarks && NBGV_GitEngine=Disabled dotnet run -c Release -- --filter "*AutomataBenchmarks*"

Recommendations

Immediate Actions (Before Production)

  1. CRITICAL: Investigate and fix 4 failing tests
  2. HIGH: Address backtracking TODO in RepeatCharacterZeroPlusDfaState
  3. HIGH: Add more edge case tests
  4. MEDIUM: Run performance benchmarks to quantify gains
  5. MEDIUM: Document supported vs unsupported regex features

Long-term Considerations

  1. Pattern Validation: Add validation to reject unsupported patterns
  2. Feature Expansion: Consider adding more regex features if beneficial
  3. Optimization: Profile and optimize hot paths
  4. Documentation: Clear usage guidelines and limitations
  5. Integration: Consider when to use automata vs standard regex

Usage Guidance

Use Automata For:

  • Simple literal patterns: "hello world"
  • Character classes: "[abc]+"
  • Basic quantifiers on simple patterns: "a*b+"
  • Performance-critical simple pattern matching on large streams

Use Standard Regex For:

  • Complex patterns requiring full regex features
  • Production-critical code (until automata proven reliable)
  • Patterns with anchors, lookahead, alternation, etc.
  • When unsure about pattern complexity

Files Modified/Created

Created

  • StreamRegex.Automata.Tests/StreamRegex.Automata.Tests.csproj - Test project
  • StreamRegex.Automata.Tests/StateMachineTests.cs - 37 comprehensive tests
  • StreamRegex.Benchmarks/AutomataBenchmarks.cs - Performance benchmarks
  • AUTOMATA_REVIEW.md - This review document

Modified

  • StreamRegex.Automata/DFA/BaseDfaState.cs - Fixed initialization
  • StreamRegex.Automata/DFA/NopDfaState.cs - Added singleton pattern
  • StreamRegex.Automata/DFA/StateMachineFactory.cs - Use singleton
  • StreamRegex.sln - Added test project

Conclusion

The automata regex implementations show promise for simple pattern matching on streams with potential performance benefits from direct byte processing. However, critical bugs were found and fixed, and ~11% of tests still fail, indicating the implementation is not yet production-ready.

Key Achievements:

  • ✓ Fixed critical stack overflow bug
  • ✓ Created comprehensive test suite (37 tests)
  • ✓ Added performance benchmarks
  • ✓ Identified accuracy limitations

Remaining Work:

  • Investigate and fix 4 failing tests
  • Address known backtracking limitation
  • Run performance benchmarks
  • Expand test coverage for edge cases
  • Document feature support clearly

The direct byte stream processing approach is sound, but the implementation needs additional work to achieve production reliability. The standard StreamRegex extension methods remain the recommended approach for general use until automata implementation reaches 100% test pass rate and demonstrates clear performance advantages through benchmarking.