
DebdootManna/file_type_identification_tool


File Type Identification Tool

A command-line tool that identifies file types from their content, with optional security and digital forensics analysis.

Features

Core Capabilities

  • Magic Byte Detection: Advanced signature-based file type identification
  • Entropy Analysis: Statistical analysis for detecting encryption, compression, and obfuscation
  • Heuristic Classification: Pattern-based identification for unknown formats
  • Deep Structural Parsing: Format-specific metadata extraction (optional)
  • Security Analysis: Obfuscation, packing, and malware detection (optional)
  • Machine Learning: AI-powered classification for unknown formats (optional)
  • YARA Integration: Threat detection using YARA rules (optional)
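
As a rough illustration of the magic byte detection listed above, here is a minimal sketch that matches the first bytes of a file against a small signature table. The table, the labels, and the function names `identify_by_magic` and `identify_file` are assumptions made for this example and are not taken from the tool's source; the default of 32 bytes mirrors the `signature_bytes` setting shown in the configuration section.

```python
# Illustrative subset of well-known file signatures (assumed table, not the tool's own).
SIGNATURES = [
    (b"%PDF-", "PDF Document"),
    (b"\x89PNG\r\n\x1a\n", "PNG Image"),
    (b"\xff\xd8\xff", "JPEG Image"),
    (b"PK\x03\x04", "ZIP Archive"),
    (b"\x7fELF", "ELF Executable"),
    (b"MZ", "Windows PE Executable (MZ header)"),
]


def identify_by_magic(header: bytes) -> str:
    """Return the label of the first signature that the header starts with."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "Unknown"


def identify_file(path: str, signature_bytes: int = 32) -> str:
    """Read only the leading bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return identify_by_magic(fh.read(signature_bytes))
```

A signature match only says what the file claims to be at its first bytes; container formats such as ZIP (used by .jar, .docx, and .xlsx) need further structural checks to tell them apart.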

Output Formats

  • Human-readable text with color coding
  • JSON for machine processing
  • XML for enterprise integration
  • YAML for configuration management
  • CSV for batch analysis

Analysis Levels

  • Basic: Quick signature detection only
  • Standard: Include entropy and heuristic analysis (default)
  • Deep: Full structural parsing and metadata extraction
  • Forensic: Complete security analysis with anomaly detection

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Quick Install

# Clone or download the project
cd file_type_identification_tool

# Install basic dependencies (optional - tool works without them)
pip install PyYAML colorama tabulate

# Or install all optional dependencies for full functionality
pip install -r requirements.txt

Optional Dependencies

For full functionality, install these optional packages:

# For YARA rule support
pip install yara-python

# For enhanced machine learning
pip install scikit-learn joblib

# For PE/ELF parsing
pip install pefile pyelftools

# For image analysis
pip install Pillow

# For advanced output formatting
pip install tabulate colorama

Usage

Basic Usage

Quick Demo (No Dependencies Required):

# Create sample files and run demonstration
python3 demo.py --create-samples
python3 demo.py demo_samples/sample.pdf
python3 demo.py demo_samples/* --format json

Full Tool Usage:

# Analyze a single file
python3 src/main.py document.pdf

# Analyze multiple files  
python3 src/main.py file1.exe file2.bin file3.doc

# Recursive directory analysis
python3 src/main.py /path/to/directory --recursive

Analysis Levels

Quick signature detection only:

python3 src/main.py suspicious_file.bin --analysis-level basic

Deep structural analysis:

python3 src/main.py document.pdf --analysis-level deep

Complete forensic analysis:

python3 src/main.py malware.exe --analysis-level forensic

Security Analysis

Enable obfuscation detection:

python3 src/main.py packed_executable.exe --security-analysis

Enable YARA scanning:

python3 src/main.py suspicious_file.bin --yara-scan

Enable machine learning classification:

python3 src/main.py unknown_file.dat --ml-analysis

Output Options

JSON output for automation:

python3 src/main.py file.exe --output-format json

Save results to file:

python3 src/main.py file.exe --save-output results.json --output-format json

Table format for multiple files:

python3 src/main.py *.exe --table-format

Disable colors for scripting:

python3 src/main.py file.exe --no-colors

Advanced Options

Custom configuration file:

python3 src/main.py file.exe --config custom_config.yaml

Adjust the maximum file size (in MB):

python3 src/main.py large_file.bin --max-file-size 2048

Change the number of signature bytes read:

python3 src/main.py file.exe --signature-bytes 64

Verbose logging:

python3 src/main.py file.exe --verbose

Debug mode:

python3 src/main.py file.exe --debug

Configuration

The tool uses a YAML configuration file (config.yaml) that can be customized:

# General settings
general:
  signature_bytes: 32
  max_file_size: 1024  # MB
  verbose: false

# Analysis settings
analysis:
  calculate_entropy: true
  extract_strings: true
  calculate_hashes: true
  hash_algorithms:
    - md5
    - sha1
    - sha256

# Security analysis
security:
  detect_obfuscation: false
  detect_packers: false
  entropy_threshold: 7.5

# YARA integration
yara:
  enabled: false
  rules_path: "data/yara_rules/"
  timeout: 30

# Output settings
output:
  format: "text"
  use_colors: true
  show_confidence: true

Examples

Example 1: Basic File Analysis

$ python3 demo.py demo_samples/sample.pdf

================================================================================
                          FILE TYPE IDENTIFICATION ANALYSIS
================================================================================

📁 FILE INFORMATION
----------------------------------------
Path: demo_samples/sample.pdf
Name: sample.pdf
Size: 330 bytes
Extension: .pdf

🔍 IDENTIFICATION RESULTS
----------------------------------------
File Type: PDF Document
Category: document
MIME Type: application/pdf
Confidence: 95.0%

📊 ANALYSIS DETAILS
----------------------------------------
Entropy: 4.73
Bytes Analyzed: 330
Printable Ratio: 88.2%

Example 2: Security Analysis

$ python3 demo.py demo_samples/suspicious.sh

🔒 SECURITY INDICATORS
----------------------------------------
[HIGH] Suspicious Code Pattern
  Contains potentially dangerous pattern: eval(
[HIGH] Suspicious Code Pattern  
  Contains potentially dangerous pattern: system(

πŸ” FILE HASHES
----------------------------------------
MD5: ec03738ce00020150f26fc9b8be97ab8
SHA1: 43a9689cef2d1c1b43abf089eb82fab3fe9a1253
SHA256: 2a8a046fc97242c78186faed9e5fe04b388176dff9c8d209bc24a002507613b0

Example 3: Batch Analysis

$ python3 demo.py demo_samples/* --format json

📊 SUMMARY
==============================
Files analyzed: 6
Successful: 6
Failed: 0
Total time: 0.001 seconds
Average time: 0.000 seconds

File Type Support

The tool supports identification of:

Executables

  • Windows PE (.exe, .dll, .sys)
  • Linux ELF (.elf, .so)
  • macOS Mach-O (.dylib, .bundle)

Archives

  • ZIP (.zip, .jar, .docx, .xlsx)
  • RAR (.rar)
  • 7-Zip (.7z)
  • TAR (.tar, .tar.gz)

Images

  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • GIF (.gif)
  • BMP (.bmp)
  • TIFF (.tif, .tiff)

Documents

  • PDF (.pdf)
  • Microsoft Office (.doc, .xls, .ppt)
  • RTF (.rtf)
  • XML/HTML (.xml, .html)

Media

  • MP3 (.mp3)
  • WAV (.wav)
  • AVI (.avi)
  • MP4 (.mp4)

Scripts

  • Python (.py)
  • JavaScript (.js)
  • PowerShell (.ps1)
  • Shell Scripts (.sh, .bash)

Data Formats

  • JSON (.json)
  • CSV (.csv)
  • SQLite (.sqlite, .db)
  • Registry Hives (.reg)

Security Features

Obfuscation Detection

  • Entropy analysis for encryption detection
  • String pattern analysis for code obfuscation
  • Packer identification (UPX, Themida, VMProtect)
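
To make the entropy check concrete, here is a minimal sketch of Shannon entropy over byte values, measured in bits per byte (0.0 for constant data, up to 8.0 for uniformly distributed bytes). The function names are assumptions for this example; the 7.5 default mirrors the `entropy_threshold` value shown in the configuration section, and how the tool itself applies that threshold is not documented here.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    # Sum -p * log2(p) over every byte value that occurs in the data.
    return sum(-(n / total) * math.log2(n / total) for n in Counter(data).values())


def looks_encrypted_or_packed(data: bytes, threshold: float = 7.5) -> bool:
    """Flag data whose entropy reaches the threshold (assumed decision rule)."""
    return shannon_entropy(data) >= threshold
```

High entropy alone does not prove encryption: compressed archives and many media formats also score close to 8.0, so the result is best read as one indicator among several.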

Threat Intelligence

  • YARA rule integration
  • Known malware signature detection
  • Suspicious pattern identification
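
Suspicious pattern identification can be pictured as a substring scan, in the spirit of the `eval(` and `system(` findings shown in Example 2. The sketch below is an assumption about the general approach, not the tool's implementation: the pattern table, the `find_suspicious_patterns` name, and the `description` key are hypothetical, while the `type` and `severity` keys follow the API usage example.

```python
# Hypothetical pattern table: byte pattern -> severity label.
SUSPICIOUS_PATTERNS = {
    b"eval(": "HIGH",
    b"system(": "HIGH",
}


def find_suspicious_patterns(data: bytes) -> list:
    """Return one indicator dict per pattern that occurs anywhere in the data."""
    indicators = []
    for pattern, severity in SUSPICIOUS_PATTERNS.items():
        if pattern in data:
            indicators.append({
                "type": "Suspicious Code Pattern",
                "severity": severity,
                "description": "Contains potentially dangerous pattern: "
                               + pattern.decode("ascii"),
            })
    return indicators
```

A plain substring match is deliberately naive: it also fires on harmless text such as `filesystem(`, so real rules usually add context or rely on YARA conditions.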

Forensic Analysis

  • Hash calculation (MD5, SHA-1, SHA-256)
  • Metadata extraction
  • Embedded file detection
  • Anomaly identification
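
Hash calculation for the three listed algorithms can be done in a single pass with Python's standard `hashlib` module. This is a minimal sketch; the helper names `hash_chunks` and `file_hashes` and the 64 KiB chunk size are assumptions for the example rather than the tool's own code.

```python
import hashlib


def hash_chunks(chunks, algorithms=("md5", "sha1", "sha256")):
    """Feed an iterable of byte chunks to several hash objects at once."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    for chunk in chunks:
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def file_hashes(path, chunk_size=65536):
    """Hash a file in fixed-size chunks so large files are never fully in memory."""
    with open(path, "rb") as fh:
        return hash_chunks(iter(lambda: fh.read(chunk_size), b""))
```

Reading in chunks keeps memory use flat regardless of file size, which matters when the size limit is raised for large inputs.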

API Usage

The tool can also be used as a Python library:

# Simple demonstration - works without dependencies
import sys
sys.path.append('.')
from demo import SimpleFileAnalyzer

# Initialize analyzer
analyzer = SimpleFileAnalyzer()

# Analyze file
result = analyzer.analyze_file("demo_samples/sample.pdf")

# Access results
if result['success']:
    ident = result['identification']
    print(f"File type: {ident['type']}")
    print(f"Confidence: {ident['confidence']:.1%}")
    print(f"Entropy: {result['analysis']['entropy']:.2f}")
    
    # Check for security indicators
    for indicator in result['security']['indicators']:
        print(f"Security: {indicator['type']} ({indicator['severity']})")
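
When many files are analyzed, result dictionaries of the shape used above can be filtered in plain Python. The sketch below assumes each result carries the keys shown in the API example (`success`, `identification`, `security`); the helper name `high_severity_findings` and the sample data are hypothetical.

```python
def high_severity_findings(results):
    """Collect (file type, indicator type) pairs for HIGH severity indicators."""
    findings = []
    for result in results:
        if not result.get("success"):
            continue  # skip files that could not be analyzed
        indicators = result.get("security", {}).get("indicators", [])
        for indicator in indicators:
            if str(indicator.get("severity", "")).upper() == "HIGH":
                findings.append((result["identification"]["type"], indicator["type"]))
    return findings


# Hypothetical results, shaped like the dictionaries in the example above.
sample_results = [
    {"success": True,
     "identification": {"type": "Shell Script", "confidence": 0.9},
     "security": {"indicators": [
         {"type": "Suspicious Code Pattern", "severity": "high"}]}},
    {"success": True,
     "identification": {"type": "PDF Document", "confidence": 0.95},
     "security": {"indicators": []}},
    {"success": False},
]

for file_type, indicator_type in high_severity_findings(sample_results):
    print(f"{file_type}: {indicator_type}")
```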

Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-analyzer)
  3. Make your changes
  4. Add tests for new functionality
  5. Run the test suite (python -m pytest tests/)
  6. Submit a pull request

Adding New Analyzers

To add a new analyzer:

  1. Create a new class inheriting from BaseAnalyzer
  2. Implement required methods: can_analyze() and analyze()
  3. Register the analyzer in the main engine
  4. Add appropriate tests

from src.analyzers.base import BaseAnalyzer, AnalysisResult

class MyCustomAnalyzer(BaseAnalyzer):
    def can_analyze(self, file_data, file_path=None):
        # Return True if this analyzer can handle the file
        return True
    
    def analyze(self, file_data, file_path=None, analysis_level=None, **kwargs):
        result = AnalysisResult(self.name)
        # Perform analysis and populate result
        return result

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For issues, questions, or contributions:

  • Create an issue on GitHub
  • Check the documentation in the docs/ directory
  • Review the example configurations in examples/

Quick Start

  1. Download/clone the project
  2. Run the demo: python3 demo.py --create-samples
  3. Analyze files: python3 demo.py demo_samples/sample.pdf
  4. Try JSON output: python3 demo.py demo_samples/* --format json

Testing

Run the included tests:

# Basic functionality test
python3 simple_test.py

# Full CLI test (requires demo samples)
python3 test_cli.py

Changelog

Version 1.0.0

  • Initial release
  • Magic byte detection with 50+ signatures
  • Shannon entropy analysis
  • Security pattern detection
  • Multi-format output (text/JSON)
  • Hash calculation (MD5/SHA1/SHA256)
  • Extensible architecture
  • Working demonstration tool
