A comprehensive command-line tool for file type identification with advanced cybersecurity and digital forensics capabilities.
## Features

- Magic Byte Detection: Advanced signature-based file type identification
- Entropy Analysis: Statistical analysis for detecting encryption, compression, and obfuscation
- Heuristic Classification: Pattern-based identification for unknown formats
- Deep Structural Parsing: Format-specific metadata extraction (optional)
- Security Analysis: Obfuscation, packing, and malware detection (optional)
- Machine Learning: AI-powered classification for unknown formats (optional)
- YARA Integration: Threat detection using YARA rules (optional)
### Output Formats

- Human-readable text with color coding
- JSON for machine processing
- XML for enterprise integration
- YAML for configuration management
- CSV for batch analysis
### Analysis Levels

- Basic: Quick signature detection only
- Standard: Include entropy and heuristic analysis (default)
- Deep: Full structural parsing and metadata extraction
- Forensic: Complete security analysis with anomaly detection
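
The four analysis levels can be read as cumulative tiers, each adding stages on top of the previous one. The sketch below illustrates that reading only: the `AnalysisLevel` enum, the stage names, and the assumption that levels are strictly cumulative are hypothetical and are not taken from the tool's source.

```python
from enum import IntEnum


class AnalysisLevel(IntEnum):
    """Hypothetical ordering of the levels listed above."""
    BASIC = 1
    STANDARD = 2
    DEEP = 3
    FORENSIC = 4


# Hypothetical stage names, each tagged with the lowest level that enables it.
STAGES = [
    ("signature_detection", AnalysisLevel.BASIC),
    ("entropy_analysis", AnalysisLevel.STANDARD),
    ("heuristic_classification", AnalysisLevel.STANDARD),
    ("structural_parsing", AnalysisLevel.DEEP),
    ("metadata_extraction", AnalysisLevel.DEEP),
    ("security_analysis", AnalysisLevel.FORENSIC),
    ("anomaly_detection", AnalysisLevel.FORENSIC),
]


def stages_for(level):
    """Return the stage names that run at the given level (assumed cumulative)."""
    return [name for name, minimum in STAGES if level >= minimum]


if __name__ == "__main__":
    for level in AnalysisLevel:
        print(level.name.lower(), "->", ", ".join(stages_for(level)))
```

Using an ordered enum keeps the "does this stage run?" check to a single comparison, which is one simple way to model tiered analysis.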
## Installation

### Prerequisites

- Python 3.8 or higher
- pip package manager
```bash
# Clone or download the project
cd file_type_identification_tool

# Install basic dependencies (optional - tool works without them)
pip install PyYAML colorama tabulate

# Or install all optional dependencies for full functionality
pip install -r requirements.txt
```

For full functionality, install these optional packages:
```bash
# For YARA rule support
pip install yara-python

# For enhanced machine learning
pip install scikit-learn joblib

# For PE/ELF parsing
pip install pefile pyelftools

# For image analysis
pip install Pillow

# For advanced output formatting
pip install tabulate colorama
```

## Usage

Quick Demo (No Dependencies Required):
```bash
# Create sample files and run demonstration
python3 demo.py --create-samples
python3 demo.py demo_samples/sample.pdf
python3 demo.py demo_samples/* --format json
```

Full Tool Usage:
```bash
# Analyze a single file
python3 src/main.py document.pdf

# Analyze multiple files
python3 src/main.py file1.exe file2.bin file3.doc

# Recursive directory analysis
python3 src/main.py /path/to/directory --recursive
```

Quick signature detection only:
```bash
python3 src/main.py suspicious_file.bin --analysis-level basic
```

Deep structural analysis:
```bash
python3 src/main.py document.pdf --analysis-level deep
```

Complete forensic analysis:
```bash
python3 src/main.py malware.exe --analysis-level forensic
```

Enable obfuscation detection:
```bash
python3 src/main.py packed_executable.exe --security-analysis
```

Enable YARA scanning:
```bash
python3 src/main.py suspicious_file.bin --yara-scan
```

Enable machine learning classification:
```bash
python3 src/main.py unknown_file.dat --ml-analysis
```

JSON output for automation:
```bash
python3 src/main.py file.exe --output-format json
```

Save results to file:
```bash
python3 src/main.py file.exe --save-output results.json --output-format json
```

Table format for multiple files:
```bash
python3 src/main.py *.exe --table-format
```

Disable colors for scripting:
```bash
python3 src/main.py file.exe --no-colors
```

Custom configuration file:
```bash
python3 src/main.py file.exe --config custom_config.yaml
```

Adjust file size limits:
```bash
python3 src/main.py large_file.bin --max-file-size 2048
```

Custom signature detection:
```bash
python3 src/main.py file.exe --signature-bytes 64
```

Verbose logging:
```bash
python3 src/main.py file.exe --verbose
```

Debug mode:
```bash
python3 src/main.py file.exe --debug
```

## Configuration

The tool uses a YAML configuration file (`config.yaml`) that can be customized:
```yaml
# General settings
general:
  signature_bytes: 32
  max_file_size: 1024  # MB
  verbose: false

# Analysis settings
analysis:
  calculate_entropy: true
  extract_strings: true
  calculate_hashes: true
  hash_algorithms:
    - md5
    - sha1
    - sha256

# Security analysis
security:
  detect_obfuscation: false
  detect_packers: false
  entropy_threshold: 7.5

# YARA integration
yara:
  enabled: false
  rules_path: "data/yara_rules/"
  timeout: 30

# Output settings
output:
  format: "text"
  use_colors: true
  show_confidence: true
```

## Examples

### Example 1: Basic Analysis
```bash
$ python3 demo.py demo_samples/sample.pdf
================================================================================
FILE TYPE IDENTIFICATION ANALYSIS
================================================================================
FILE INFORMATION
----------------------------------------
Path: demo_samples/sample.pdf
Name: sample.pdf
Size: 330 bytes
Extension: .pdf
IDENTIFICATION RESULTS
----------------------------------------
File Type: PDF Document
Category: document
MIME Type: application/pdf
Confidence: 95.0%
ANALYSIS DETAILS
----------------------------------------
Entropy: 4.73
Bytes Analyzed: 330
Printable Ratio: 88.2%
```
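
The `Entropy` and `Printable Ratio` values in the output above are standard byte-level statistics. The sketch below shows one conventional way to compute them with the standard library. It is not the tool's actual implementation: the function names and the definition of "printable" (ASCII 32 to 126 plus tab, newline, and carriage return) are assumptions. The 7.5 default mirrors `entropy_threshold` from the sample configuration.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return Shannon entropy in bits per byte, from 0.0 to 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # byte value -> number of occurrences
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def printable_ratio(data: bytes) -> float:
    """Fraction of bytes that are printable ASCII or common whitespace (assumed definition)."""
    if not data:
        return 0.0
    printable = sum(1 for b in data if 32 <= b <= 126 or b in (9, 10, 13))
    return printable / len(data)


def looks_high_entropy(data: bytes, threshold: float = 7.5) -> bool:
    """Flag data whose entropy meets the threshold (7.5 as in the sample config)."""
    return shannon_entropy(data) >= threshold


if __name__ == "__main__":
    import sys

    # Usage: python3 entropy_sketch.py <file> [<file> ...]
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            blob = f.read()
        print(f"{path}: entropy={shannon_entropy(blob):.2f} "
              f"printable={printable_ratio(blob):.1%}")
```

Values near 8.0 bits per byte usually indicate compressed or encrypted content, while plain text tends to fall well below that, which is why a threshold check like this is a common first-pass heuristic.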
### Example 2: Security Analysis
```bash
$ python3 demo.py demo_samples/suspicious.sh
SECURITY INDICATORS
----------------------------------------
[HIGH] Suspicious Code Pattern
Contains potentially dangerous pattern: eval(
[HIGH] Suspicious Code Pattern
Contains potentially dangerous pattern: system(
FILE HASHES
----------------------------------------
MD5: ec03738ce00020150f26fc9b8be97ab8
SHA1: 43a9689cef2d1c1b43abf089eb82fab3fe9a1253
SHA256: 2a8a046fc97242c78186faed9e5fe04b388176dff9c8d209bc24a002507613b0
```

### Example 3: Batch Analysis
```bash
$ python3 demo.py demo_samples/* --format json
SUMMARY
==============================
Files analyzed: 6
Successful: 6
Failed: 0
Total time: 0.001 seconds
Average time: 0.000 seconds
```

## Supported File Types

The tool supports identification of:
- Windows PE (.exe, .dll, .sys)
- Linux ELF (.elf, .so)
- macOS Mach-O (.dylib, .bundle)
- ZIP (.zip, .jar, .docx, .xlsx)
- RAR (.rar)
- 7-Zip (.7z)
- TAR (.tar, .tar.gz)
- JPEG (.jpg, .jpeg)
- PNG (.png)
- GIF (.gif)
- BMP (.bmp)
- TIFF (.tif, .tiff)
- PDF (.pdf)
- Microsoft Office (.doc, .xls, .ppt)
- RTF (.rtf)
- XML/HTML (.xml, .html)
- MP3 (.mp3)
- WAV (.wav)
- AVI (.avi)
- MP4 (.mp4)
- Python (.py)
- JavaScript (.js)
- PowerShell (.ps1)
- Shell Scripts (.sh, .bash)
- JSON (.json)
- CSV (.csv)
- SQLite (.sqlite, .db)
- Registry Hives (.reg)
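
Several of the formats listed above can be recognized from well-known magic bytes at fixed offsets. The following is a minimal sketch of that technique, not the tool's own signature database: the `SIGNATURES` table, `identify`, and `identify_file` are illustrative names, and only a small subset of formats is covered. Note that the TAR marker sits at offset 257, so this sketch reads 512 bytes rather than the 32 listed as `signature_bytes` in the sample configuration.

```python
# Illustrative subset of well-known signatures: (offset, magic bytes, label).
SIGNATURES = [
    (0, b"%PDF-", "PDF"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG"),
    (0, b"\xff\xd8\xff", "JPEG"),
    (0, b"GIF87a", "GIF"),
    (0, b"GIF89a", "GIF"),
    (0, b"PK\x03\x04", "ZIP"),          # also matches .jar, .docx, .xlsx containers
    (0, b"Rar!\x1a\x07", "RAR"),
    (0, b"7z\xbc\xaf\x27\x1c", "7-Zip"),
    (0, b"\x7fELF", "Linux ELF"),
    (0, b"MZ", "Windows PE"),           # DOS/PE stub only; a full check needs the PE header
    (0, b"SQLite format 3\x00", "SQLite"),
    (257, b"ustar", "TAR"),
]


def identify(header: bytes):
    """Return the label of the first matching signature, or None if nothing matches."""
    for offset, magic, label in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return label
    return None


def identify_file(path, read_bytes=512):
    """Read the first bytes of a file and match them against SIGNATURES."""
    with open(path, "rb") as f:
        return identify(f.read(read_bytes))


if __name__ == "__main__":
    import sys

    # Usage: python3 magic_sketch.py <file> [<file> ...]
    for path in sys.argv[1:]:
        print(f"{path}: {identify_file(path) or 'unknown'}")
```

Because ZIP-based formats share one signature, telling a `.docx` from a `.jar` requires looking inside the archive, which is where deeper structural parsing comes in.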
## Security Features

- Entropy analysis for encryption detection
- String pattern analysis for code obfuscation
- Packer identification (UPX, Themida, VMProtect)
- YARA rule integration
- Known malware signature detection
- Suspicious pattern identification
- Hash calculation (MD5, SHA-1, SHA-256)
- Metadata extraction
- Embedded file detection
- Anomaly identification
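
As an illustration of the hash calculation item above, MD5, SHA-1, and SHA-256 digests can be computed in a single pass over a file with Python's standard `hashlib` module. This is a sketch under assumptions, not the tool's code: the function name `file_hashes` and the chunk size are hypothetical.

```python
import hashlib


def file_hashes(path, algorithms=("md5", "sha1", "sha256"), chunk_size=65536):
    """Compute several digests in one pass, reading the file in chunks."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


if __name__ == "__main__":
    import sys

    # Usage: python3 hash_sketch.py <file> [<file> ...]
    for path in sys.argv[1:]:
        for name, digest in file_hashes(path).items():
            print(f"{path} {name.upper()}: {digest}")
```

Reading in chunks keeps memory use flat for large files, and updating all hashers from the same chunk avoids reading the file once per algorithm.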
The tool can also be used as a Python library:
```python
# Simple demonstration - works without dependencies
import sys
sys.path.append('.')
from demo import SimpleFileAnalyzer

# Initialize analyzer
analyzer = SimpleFileAnalyzer()

# Analyze file
result = analyzer.analyze_file("demo_samples/sample.pdf")

# Access results
if result['success']:
    ident = result['identification']
    print(f"File type: {ident['type']}")
    print(f"Confidence: {ident['confidence']:.1%}")
    print(f"Entropy: {result['analysis']['entropy']:.2f}")

    # Check for security indicators
    for indicator in result['security']['indicators']:
        print(f"Security: {indicator['type']} ({indicator['severity']})")
```

## Contributing

- Fork the repository
- Create a feature branch (`git checkout -b feature/new-analyzer`)
- Make your changes
- Add tests for new functionality
- Run the test suite (`python -m pytest tests/`)
- Submit a pull request
To add a new analyzer:
- Create a new class inheriting from `BaseAnalyzer`
- Implement required methods: `can_analyze()` and `analyze()`
- Register the analyzer in the main engine
- Add appropriate tests
```python
from src.analyzers.base import BaseAnalyzer, AnalysisResult


class MyCustomAnalyzer(BaseAnalyzer):
    def can_analyze(self, file_data, file_path=None):
        # Return True if this analyzer can handle the file
        return True

    def analyze(self, file_data, file_path=None, analysis_level=None, **kwargs):
        result = AnalysisResult(self.name)
        # Perform analysis and populate result
        return result
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Support

For issues, questions, or contributions:
- Create an issue on GitHub
- Check the documentation in the `docs/` directory
- Review the example configurations in `examples/`

## Quick Start

- Download/clone the project
- Run the demo: `python3 demo.py --create-samples`
- Analyze files: `python3 demo.py demo_samples/sample.pdf`
- Try JSON output: `python3 demo.py demo_samples/* --format json`

## Testing

Run the included tests:
```bash
# Basic functionality test
python3 simple_test.py

# Full CLI test (requires demo samples)
python3 test_cli.py
```

## Changelog

- Initial release
- Magic byte detection with 50+ signatures
- Shannon entropy analysis
- Security pattern detection
- Multi-format output (text/JSON)
- Hash calculation (MD5/SHA1/SHA256)
- Extensible architecture
- Working demonstration tool