Microstructure

Microstructure is a Limit Order Book (LOB) research pipeline for modeling high-frequency price dynamics. It ingests 10-level Kraken WebSocket book feeds asynchronously, preprocesses them in a C++ extension bound with pybind11, and feeds the resulting windows to a PyTorch Lightning DeepLOB (CNN-LSTM) model of short-horizon volume-weighted mid-price returns.

Production Infrastructure

The ingestion daemon subscribes to Kraken L2 WebSocket book streams and maintains local 10-level bid and ask books per symbol. Each normalized L2Snapshot is written as append-only JSONL so training jobs can consume deterministic, replayable records without depending on a running exchange connection.
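The append-only JSONL pattern can be sketched as follows. This is a minimal illustration, not the project's `L2JsonlWriter`: the helper names `append_snapshot` and `replay` and the record fields (`symbol`, `ts`, `seq`, `bids`, `asks`) are assumptions made for the example.

```python
import json
import os
import tempfile


def append_snapshot(path, snapshot):
    """Append one snapshot as a single JSON line (hypothetical helper)."""
    # Mode "a" never rewrites earlier records, so the file stays replayable.
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(snapshot, sort_keys=True) + "\n")


def replay(path):
    """Yield snapshots back in the order they were written."""
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


# Hypothetical record layout; the real L2Snapshot fields may differ.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "BTC.jsonl")
append_snapshot(demo_path, {"symbol": "BTC", "ts": 1, "seq": 1,
                            "bids": [[100.0, 2.0]], "asks": [[101.0, 1.5]]})
append_snapshot(demo_path, {"symbol": "BTC", "ts": 2, "seq": 2,
                            "bids": [[100.5, 1.0]], "asks": [[101.0, 1.5]]})
records = list(replay(demo_path))
```

Because each record occupies exactly one line, a training job can stream the file without loading it whole, and a partially written final line can be detected and skipped.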

Polyglot Pipeline Architecture

flowchart LR
    A[Kraken L2 WebSocket] -->|Async JSON| B(CryptoFeedAgent)
    B -->|Normalized L2Snapshot| C[L2Snapshot Queue]
    C --> D[L2JsonlWriter]
    C --> E[C++ Pybind11 Engine]
    E -->|Tensor Windows| F[PyTorch LOBDataModule]
    F --> G(DeepLOB CNN-LSTM)

The containerized daemon is designed for long-running collection. Docker owns the runtime environment, .env controls symbols and persistence settings, and mounted volumes retain distributed snapshot shards under data/l2.

Rigorous Research Methodology

Market microstructure data is noisy, asynchronous, and frequently incomplete. The pipeline handles this by sorting snapshots by symbol, timestamp, and sequence; padding missing book levels with neutral values; transforming price levels relative to the current volume-weighted mid; and preserving raw event order for offline audits.
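The cleaning steps above can be sketched in a few lines of Python. Everything here is illustrative: the snapshot layout, the helper names, the choice of `(0.0, 0.0)` as the neutral padding value, and the particular volume-weighted mid formula are assumptions, since the project's exact definitions are not shown in this README.

```python
def pad_levels(levels, depth=10, neutral=(0.0, 0.0)):
    """Truncate or pad a list of (price, volume) levels to `depth` entries."""
    levels = [tuple(lv) for lv in levels[:depth]]
    return levels + [neutral] * (depth - len(levels))


def volume_weighted_mid(best_bid, best_ask):
    """One common volume-weighted mid (assumed definition, not the project's).

    Each side's price is weighted by the opposite side's volume.
    """
    (bp, bv), (ap, av) = best_bid, best_ask
    if bv + av == 0:
        return (bp + ap) / 2.0
    return (bp * av + ap * bv) / (bv + av)


def relative_features(snapshot, depth=10):
    """Express each level's price relative to the volume-weighted mid.

    Assumes each side of the book has at least one level.
    """
    mid = volume_weighted_mid(snapshot["bids"][0], snapshot["asks"][0])
    feats = []
    for side in ("bids", "asks"):
        for price, vol in pad_levels(snapshot[side], depth):
            # Padded levels keep neutral values instead of a large negative offset.
            rel = price / mid - 1.0 if vol > 0 else 0.0
            feats.extend([rel, vol])
    return feats


snapshots = [
    {"symbol": "ETH", "ts": 5, "seq": 1, "bids": [(10.0, 1.0)], "asks": [(11.0, 1.0)]},
    {"symbol": "BTC", "ts": 7, "seq": 2, "bids": [(99.0, 3.0)], "asks": [(101.0, 1.0)]},
    {"symbol": "BTC", "ts": 7, "seq": 1, "bids": [(99.0, 1.0)], "asks": [(101.0, 1.0)]},
]
# Sort by symbol, then timestamp, then sequence number.
ordered = sorted(snapshots, key=lambda s: (s["symbol"], s["ts"], s["seq"]))
features = [relative_features(s, depth=2) for s in ordered]
```

Sorting on the full `(symbol, timestamp, sequence)` key matters when two updates share a timestamp: the sequence number then decides which book state came first.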

Research code must avoid look-ahead bias. Normalization for experiments uses backward-looking rolling windows only, and validation uses Purged K-Fold or Walk-Forward splits instead of random train/test splits, which would leak future information into training.
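As a rough illustration of both rules, the sketch below z-scores each value against strictly earlier observations and generates walk-forward splits with an optional gap. The function names and the `gap` parameter are hypothetical; this is a simpler stand-in for Purged K-Fold, not the project's validation code.

```python
from statistics import mean, pstdev


def rolling_zscore(values, window):
    """Z-score each value using only the `window` observations before it."""
    out = []
    for i, x in enumerate(values):
        past = values[max(0, i - window):i]  # excludes values[i] and later
        if len(past) < 2:
            out.append(0.0)  # not enough history yet
            continue
        sd = pstdev(past)
        out.append((x - mean(past)) / sd if sd > 0 else 0.0)
    return out


def walk_forward_splits(n, train_size, test_size, gap=0):
    """Yield (train_idx, test_idx) ranges where test always follows train.

    `gap` drops samples between the two sets, e.g. the target horizon k,
    so that training labels do not overlap the test period.
    """
    start = 0
    while start + train_size + gap + test_size <= n:
        train = range(start, start + train_size)
        test_start = start + train_size + gap
        yield train, range(test_start, test_start + test_size)
        start += test_size


z = rolling_zscore([1.0, 2.0, 3.0, 10.0], window=3)
splits = [(list(tr), list(te)) for tr, te in walk_forward_splits(10, 4, 2, gap=1)]
```

The statistics for position `i` never include `values[i]` itself, so a jump in the current tick cannot influence its own normalization.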

Target Formulation

Tensor creation flows through the PyTorch Lightning LOBDataModule, which builds rolling LOB windows and calculates the prediction target. The target is the volume-weighted mid-price return over the next $k$ ticks, where $\mathrm{VWAP}_{t}$ denotes the volume-weighted mid-price at tick $t$:

$$ y_{t,k} = \frac{\mathrm{VWAP}_{t+k}}{\mathrm{VWAP}_{t}} - 1 $$
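A small sketch of this target on a plain list of volume-weighted mid-prices. The function name `forward_returns` is hypothetical, and the input series is made up for the example.

```python
def forward_returns(vw_mid, k):
    """Return y[t] = vw_mid[t + k] / vw_mid[t] - 1 for every t with a future value."""
    if k <= 0:
        raise ValueError("k must be a positive number of ticks")
    # The last k ticks have no t + k observation, so they get no target.
    return [vw_mid[t + k] / vw_mid[t] - 1.0 for t in range(len(vw_mid) - k)]


series = [100.0, 101.0, 102.0, 100.0]
targets = forward_returns(series, k=2)
```

With four ticks and `k=2`, only the first two ticks receive a target. The trailing `k` samples must be dropped from training rather than filled in.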

C++ Acceleration Layer

The research DataModule delegates CPU-bound LOB preprocessing to a required C++17/pybind11 extension. The goal is to bypass the Python GIL so that high-frequency tick reconstruction does not bottleneck the deep learning training loop. That reconstruction covers building relative price/volume features, applying backward-looking rolling normalization, and constructing per-symbol rolling windows. The Python interface (LOBDataModule) stays stable, while the heavy computation runs in C++.

Local Compilation

Build the extension locally using the provided virtual environment configuration:

uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
cmake -S . -B build
cmake --build build

Microsecond Event-Driven C++ Backtester

Standard vectorized Python backtesting has been replaced with an event-driven C++ backtesting engine. The engine processes Limit Order Book (LOB) updates as discrete events ordered by microsecond timestamp, simulating queue positions and order latency.

stateDiagram-v2
    [*] --> EventQueue: Push Event
    
    state EventQueue {
        [*] --> PriorityQueue: Prioritized by Microsecond Timestamp
    }
    
    EventQueue --> DataHandler: MARKET_DATA Event
    DataHandler --> UpdateLOB: Update Best Bid/Ask & Volume
    UpdateLOB --> EventQueue
    
    EventQueue --> ExecutionHandler: ORDER Event
    ExecutionHandler --> SimulateLatency: Add 5ms Delay
    SimulateLatency --> ScheduleFill: Push FILL Event
    ScheduleFill --> EventQueue
    
    EventQueue --> ProcessFill: FILL Event
    ProcessFill --> UpdateLedger: Execute Trade & Calc Slippage/PnL
    UpdateLedger --> [*]

This engine is exposed to Python via pybind11, allowing research scripts to easily instantiate the backtester, pass in historical LOB datasets, and retrieve the final trade ledger and PnL.
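The event loop in the diagram can be approximated in pure Python to show the control flow. This is a sketch under stated assumptions, not the C++ engine or its pybind11 API: the event tuple layout, the `run_backtest` name, and the slippage definition are hypothetical, and queue-position simulation is left out.

```python
import heapq
import itertools

LATENCY_US = 5_000  # 5 ms order-to-fill delay, expressed in microseconds


def run_backtest(events):
    """Process (timestamp_us, kind, payload) events in timestamp order.

    Assumes at least one MARKET_DATA event precedes the first ORDER.
    """
    counter = itertools.count()  # tie-breaker keeps equal timestamps FIFO
    queue = [(ts, next(counter), kind, payload) for ts, kind, payload in events]
    heapq.heapify(queue)
    best_bid = best_ask = None
    ledger = []
    while queue:
        ts, _, kind, payload = heapq.heappop(queue)
        if kind == "MARKET_DATA":
            best_bid, best_ask = payload["bid"], payload["ask"]
        elif kind == "ORDER":
            # Remember the price seen at decision time to measure slippage later.
            ref = best_ask if payload["side"] == "buy" else best_bid
            fill = dict(payload, ref_price=ref)
            heapq.heappush(queue, (ts + LATENCY_US, next(counter), "FILL", fill))
        elif kind == "FILL":
            # Fill against the book as it stands after the latency has elapsed.
            price = best_ask if payload["side"] == "buy" else best_bid
            sign = 1 if payload["side"] == "buy" else -1
            ledger.append({"ts": ts, "side": payload["side"], "qty": payload["qty"],
                           "price": price,
                           "slippage": sign * (price - payload["ref_price"])})
    return ledger


events = [
    (0, "MARKET_DATA", {"bid": 99.0, "ask": 100.0}),
    (1_000, "ORDER", {"side": "buy", "qty": 1.0}),
    (3_000, "MARKET_DATA", {"bid": 99.5, "ask": 100.5}),
]
ledger = run_backtest(events)
```

In this example the buy order is decided at an ask of 100.0 but fills 5 ms later, after the book has moved, so the ledger records a fill at 100.5. A vectorized backtest that fills at the decision-time price would miss this effect.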

Quick Start

Clone the repository:

git clone https://github.com/nxd914/microstructure.git
cd microstructure

Start the ingestion daemon with Docker:

cp .env.example .env 2>/dev/null || true
docker compose -f deploy/docker-compose.yml up --build

Environment Settings

| Variable | Default | Description |
| --- | --- | --- |
| `KRAKEN_SYMBOLS` | `BTC,ETH` | Target pairs for WebSocket ingestion. |
| `KRAKEN_BOOK_DEPTH` | `10` | Order book depth levels to track. |
| `SNAPSHOT_QUEUE_SIZE` | `5000` | Max buffered snapshots. |
| `L2_PERSIST_JSONL` | `true` | Toggle disk writing. |
| `L2_JSONL_OUTPUT_DIR` | `data/l2` | Local storage directory. |
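For reference, a sketch of how these variables might be parsed with their documented defaults. The `IngestSettings` class and `load_settings` function are hypothetical names for illustration; the daemon's actual config loader may differ.

```python
import os
from dataclasses import dataclass


def _as_bool(text):
    """Interpret common truthy strings; anything else counts as false."""
    return text.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class IngestSettings:
    symbols: tuple
    book_depth: int
    queue_size: int
    persist_jsonl: bool
    output_dir: str


def load_settings(env=None):
    """Build settings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    raw_symbols = env.get("KRAKEN_SYMBOLS", "BTC,ETH")
    return IngestSettings(
        symbols=tuple(s.strip() for s in raw_symbols.split(",") if s.strip()),
        book_depth=int(env.get("KRAKEN_BOOK_DEPTH", "10")),
        queue_size=int(env.get("SNAPSHOT_QUEUE_SIZE", "5000")),
        persist_jsonl=_as_bool(env.get("L2_PERSIST_JSONL", "true")),
        output_dir=env.get("L2_JSONL_OUTPUT_DIR", "data/l2"),
    )


defaults = load_settings({})
custom = load_settings({"KRAKEN_SYMBOLS": "SOL, BTC", "L2_PERSIST_JSONL": "false"})
```

Passing a plain dict instead of `os.environ` keeps the parser easy to test without touching the process environment.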

Run the local research and regression suite:

source .venv/bin/activate
pytest

Repository Layout

microstructure/
├── config/                                # YAML/JSON hyperparameters and env settings
├── core/                                  # Shared runtime utilities
├── deploy/                                # Container and service files
├── docs/                                  # Deep architecture notes & methodology
├── scripts/                               # Execution scripts (e.g., train_model.py, run_backtest.py)
├── strategies/crypto/
│   ├── daemon.py                          # Ingestion daemon entry point
│   ├── agents/                            # Kraken L2 feed agent
│   ├── core/                              # Config, L2 models, JSONL writer, logging
│   └── research/                          # DataModule, targets, DeepLOB model scaffold
└── tests/                                 # Feed, storage, target, data, and model tests
