Performance Optimization Guide for PrivacyChatBoX

This document outlines the performance optimizations implemented in the PrivacyChatBoX application, focusing on improving scanning speed, memory efficiency, and overall application responsiveness.

1. Regex Engine Optimization

Pre-compiled Regex Patterns

The original implementation compiled regex patterns on every match operation, which was inefficient. We've improved this by pre-compiling all regex patterns at module load time:

# Before
for pattern_name, pattern in patterns.items():
    matches = re.findall(pattern, text)  # Compiles pattern every time

# After
# At module load time
COMPILED_PATTERNS = {}
for pattern in DEFAULT_PATTERNS:
    COMPILED_PATTERNS[pattern["name"]] = {
        "regex": re.compile(pattern["pattern"]),
        "level": pattern["level"],
        "confidence": pattern["confidence"]
    }

# During scanning
for pattern_name, pattern_info in compiled_patterns.items():
    matches = pattern_info["regex"].findall(text)  # Uses pre-compiled pattern

Confidence Scoring System

We've added a confidence scoring system to each pattern to reduce false positives:

DEFAULT_PATTERNS = [
    {"name": "credit_card", "pattern": r"...", "level": "standard", "confidence": 0.95},
    {"name": "email", "pattern": r"...", "level": "standard", "confidence": 0.9},
    # ...
]

The scan_text function now accepts a minimum_confidence parameter (default: 0.7) that filters out patterns with low confidence scores, reducing false positives while maintaining detection accuracy.

2. File Processing Optimization

Chunked File Reading

Large files are now processed in chunks to prevent memory issues:

file_processor.py provides specialized extractors for different file types
Each extractor yields text in manageable chunks
Supported file types include:
- PDF files (using pdfminer.six)
- Word documents (using python-docx)
- Excel spreadsheets (using openpyxl)
- CSV files
- Plain text files

Example usage:

from file_processor import scan_file_chunks

# Scan a large PDF file
sensitive, patterns, time_taken = scan_file_chunks(
    file_path="large_document.pdf",
    file_type="pdf",
    scanner_func=scan_function,
    chunk_size=2000,
    max_workers=4
)

3. Parallel Processing with ThreadPoolExecutor

File scanning now uses parallel processing to improve performance:

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    # Submit all scan tasks
    future_to_chunk = {executor.submit(scanner_func, chunk): i for i, chunk in enumerate(chunks)}
    
    # Process results as they complete
    for future in future_to_chunk:
        chunk_sensitive, chunk_detected = future.result()
        # Merge results...

Benefits:

Faster processing of large files
Better utilization of multi-core CPUs
Increased throughput for batch operations

4. Performance Metrics and Logging

Added performance metrics logging to identify bottlenecks:

# At the end of scan_text function
scan_time = time.time() - start_time
print(f"Privacy scan completed in {scan_time:.4f}s: found {len(detected)} pattern types")

This helps identify slow patterns or processing bottlenecks.

5. Optional Dependencies

Document processing libraries are loaded conditionally:

try:
    from pdfminer.high_level import extract_text_to_fp
    PDFMINER_AVAILABLE = True
except ImportError:
    PDFMINER_AVAILABLE = False

This ensures the application functions even when some document processing libraries are not available, with graceful fallbacks.

6. How to Enable These Optimizations

The optimizations are enabled by default in the latest version. To make use of the chunked file processing:

For direct file scanning from disk, use the new scan_file_path function:

from privacy_scanner import scan_file_path

sensitive, detected, processing_time = scan_file_path(
    user_id=current_user_id,
    file_path="/path/to/file.pdf", 
    file_name="file.pdf",
    file_type="pdf"
)

For in-memory content, continue using the existing scan_file_content function:

from privacy_scanner import scan_file_content

sensitive, detected = scan_file_content(
    user_id=current_user_id,
    file_content=content_string,
    file_name="example.txt"
)

7. Future Optimization Opportunities

NLP-assisted pattern detection: Using spaCy to pre-filter text chunks before applying regex patterns
Pattern caching: Implement caching of scan results for repeated text chunks
GPU acceleration: For local LLM document processing
Adaptive chunk sizing: Adjust chunk size based on available system memory and file characteristics
Distributed scanning: For enterprise deployments with very high scanning volume

8. Benchmarking

To benchmark the performance improvements, compare the processing times for different file types:

import time
from privacy_scanner import scan_text, scan_file_path

def benchmark_scan(file_path, file_type, user_id):
    # Benchmark old method (load entire file)
    with open(file_path, 'r', errors='ignore') as f:
        content = f.read()
    
    start_time = time.time()
    sensitive1, detected1 = scan_text(user_id, content)
    old_time = time.time() - start_time
    
    # Benchmark new method (chunked processing)
    start_time = time.time()
    sensitive2, detected2, _ = scan_file_path(user_id, file_path, file_path, file_type)
    new_time = time.time() - start_time
    
    print(f"File: {file_path}")
    print(f"Old method: {old_time:.4f}s, detected: {len(detected1)} pattern types")
    print(f"New method: {new_time:.4f}s, detected: {len(detected2)} pattern types")
    print(f"Speedup: {old_time/new_time:.2f}x")

9. Requirements

These optimizations require the following additional dependencies:

pdfminer.six: For PDF text extraction
python-docx: For Word document processing
openpyxl: For Excel spreadsheet processing
concurrent-log-handler: For thread-safe logging

These dependencies are included in the updated pyproject.toml file.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance Optimization Guide for PrivacyChatBoX

1. Regex Engine Optimization

Pre-compiled Regex Patterns

Confidence Scoring System

2. File Processing Optimization

Chunked File Reading

3. Parallel Processing with ThreadPoolExecutor

4. Performance Metrics and Logging

5. Optional Dependencies

6. How to Enable These Optimizations

7. Future Optimization Opportunities

8. Benchmarking

9. Requirements

FilesExpand file tree

Optimization_Guide.md

Latest commit

History

Optimization_Guide.md

File metadata and controls

Performance Optimization Guide for PrivacyChatBoX

1. Regex Engine Optimization

Pre-compiled Regex Patterns

Confidence Scoring System

2. File Processing Optimization

Chunked File Reading

3. Parallel Processing with ThreadPoolExecutor

4. Performance Metrics and Logging

5. Optional Dependencies

6. How to Enable These Optimizations

7. Future Optimization Opportunities

8. Benchmarking

9. Requirements