Teaching Neural Networks the Art of Forgetting
"The serpent that devours itself to be reborn — features consumed, transformed, and emerged anew."
Standard residual networks can only add information. They lack the ability to erase, forget, or reflect, which leads to residual accumulation where noisy features persist indefinitely. OUROBOROS replaces the purely additive skip with a geometric residual connection whose transformation is learned from the input.
OUROBOROS enables neural networks to:
| Capability | Description |
|---|---|
| ✨ Selective Forgetting | Surgically erase outdated or noisy information |
| 🔄 Feature Reflection | Model oscillatory and oppositional dynamics |
| 🎯 Spectral Control | Shape layer-wise transitions with precision |
| ⚡ Gradient Stability | Maintain gradient flow with gated identity |
At the heart of OUROBOROS lies the Delta Operator, a generalized Householder transformation A(X) = I − β(X) · k(X) · k(X)ᵀ, built from the following quantities:
| Symbol | Name | Description | Range |
|---|---|---|---|
| k(X) | Reflection Direction | Unit vector defining transformation axis | ‖k‖ = 1 |
| β(X) | Scalar Gate | Controls transformation intensity | [0, 2] |
| v(X) | Value Vector | New information to inject | ℝᵈᵛ |
A single learnable scalar dynamically interpolates between three geometric operations:
| β Value | Transformation | Eigenvalue | Effect |
|---|---|---|---|
| β → 0 | Identity | λ = 1 | Pass through unchanged |
| β → 1 | Projection | λ = 0 | Erase component along k |
| β → 2 | Reflection | λ = -1 | Flip direction along k |
Figure: a vector v transformed by the Delta Operator. P(v) is the projection (β = 1), R(v) is the reflection (β = 2), and k is the hyperplane normal.
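The three regimes in the table can be checked numerically. The snippet below is a minimal sketch in plain Python, not the repository's implementation; the helper name `delta_apply` is hypothetical and the operator is applied to a single vector rather than a batch of features.

```python
import math

def delta_apply(x, k, beta):
    """Apply A = I - beta * k k^T to vector x (k is normalized first)."""
    norm = math.sqrt(sum(c * c for c in k))
    k = [c / norm for c in k]
    coeff = sum(kc * xc for kc, xc in zip(k, x))  # k^T x, the component along k
    return [xc - beta * coeff * kc for xc, kc in zip(x, k)]

x = [3.0, 4.0]
k = [1.0, 0.0]  # reflection direction along the first axis

identity = delta_apply(x, k, 0.0)    # beta = 0: x passes through unchanged
projection = delta_apply(x, k, 1.0)  # beta = 1: component along k is erased
reflection = delta_apply(x, k, 2.0)  # beta = 2: component along k is flipped

print(identity, projection, reflection)
# [3.0, 4.0] [0.0, 4.0] [-3.0, 4.0]
```

Intermediate values of β interpolate between these cases, for example β = 0.5 halves the component along k.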
The input X splits into three learnable branches that compute k, β, and v, which combine through the Delta operation with a skip connection.
Each OuroborosBlock contains:
- RMSNorm → Attention → Ouroboros Residual
- RMSNorm → MLP → Ouroboros Residual
X_{l+1} = X_l + β · k · (vᵀ − kᵀ · X_l)

Here vᵀ is the target (what to write) and kᵀ · X_l is the current content along k (what exists).
This unifies three operations with a single gate:
| Operation | Formula | Effect |
|---|---|---|
| Erasure | −β · k · (kᵀ · X) | Removes component along k |
| Writing | +β · k · vᵀ | Injects new information |
| Sync | Same β | Both scale together |
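The update rule can be traced on a small example. This is an illustrative sketch in plain Python using nested lists; the function name `ouroboros_residual` and the shapes (X as a d × d_v matrix, v of length d_v) are assumptions made for the example and may not match the layout used in `model/ouroboros.py`.

```python
def ouroboros_residual(X, k, v, beta):
    """Compute X_next = X + beta * k (v^T - k^T X).

    X: d x d_v matrix (list of rows), k: unit vector of length d,
    v: vector of length d_v, beta: scalar gate in [0, 2].
    """
    d, dv = len(X), len(X[0])
    # k^T X: what X currently stores along direction k
    current = [sum(k[i] * X[i][j] for i in range(d)) for j in range(dv)]
    # v^T - k^T X: gap between the target and the current content
    gap = [v[j] - current[j] for j in range(dv)]
    return [[X[i][j] + beta * k[i] * gap[j] for j in range(dv)] for i in range(d)]

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
k = [0.0, 1.0, 0.0]   # selects the second row
v = [10.0, 20.0]

X_full = ouroboros_residual(X, k, v, beta=1.0)   # erase and overwrite along k
X_half = ouroboros_residual(X, k, v, beta=0.5)   # move halfway toward v

print(X_full)  # [[1.0, 2.0], [10.0, 20.0], [5.0, 6.0]]
print(X_half)  # [[1.0, 2.0], [6.5, 12.0], [5.0, 6.0]]
```

Because erasure and writing share the same β, the content along k after the update is (1 − β) · kᵀX + β · vᵀ, so β = 1 replaces it with v exactly while everything orthogonal to k is untouched.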
Theorem: For A = I − β·k·kᵀ where ‖k‖ = 1, the spectrum is

σ(A) = { 1, 1, ..., 1, (1−β) }

with the eigenvalue 1 repeated (d−1) times.
| Property | Formula | Notes |
|---|---|---|
| Eigenvalue along k | λ_k = 1 − β | Controlled by gate |
| Eigenvalues in k⊥ | λ = 1 | Multiplicity: d−1 |
| Determinant | det(A) = 1 − β | Zero at β=1 |
| Orthogonality | AᵀA = I | When β ∈ {0, 2} |
| Involution | A² = I | When β = 2 |
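The spectral properties above can be spot-checked on a 2 × 2 case. The following is a small sketch in plain Python; the helper names are hypothetical, and the checks use a floating-point tolerance rather than exact equality.

```python
import math

def delta_matrix(k, beta):
    """Build A = I - beta * k k^T for a unit vector k."""
    d = len(k)
    return [[(1.0 if i == j else 0.0) - beta * k[i] * k[j] for j in range(d)]
            for i in range(d)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def allclose(u, w, tol=1e-12):
    return all(math.isclose(a, b, abs_tol=tol) for a, b in zip(u, w))

k = [0.6, 0.8]    # unit vector
u = [-0.8, 0.6]   # orthogonal to k
beta = 0.5
A = delta_matrix(k, beta)

# Eigenvalue 1 - beta along k, eigenvalue 1 on the orthogonal complement
along_k_scaled = allclose(matvec(A, k), [(1 - beta) * c for c in k])
orthogonal_fixed = allclose(matvec(A, u), u)

# Determinant of the 2 x 2 operator should equal 1 - beta
det_A = A[0][0] * A[1][1] - A[0][1] * A[1][0]

# At beta = 2 the operator is a reflection, so applying it twice is the identity
R = delta_matrix(k, 2.0)
RR = matmul(R, R)
is_involution = all(allclose(row, target)
                    for row, target in zip(RR, [[1.0, 0.0], [0.0, 1.0]]))
```

Since A is symmetric, the involution check at β = 2 also confirms orthogonality there, because AᵀA = A².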
| Property | ResNet | OUROBOROS |
|---|---|---|
| Eigenvalues | ≈ 1 + ε | ∈ [-1, 1] |
| Negative λ | ❌ No | ✅ Yes |
| Singular | ❌ No | ✅ Yes (β=1) |
| Data-dependent | ❌ Fixed | ✅ Learnable |
Key Insight: The geometric coherence term k_i · k_j enables learned feature interactions without explicit cross-attention.
# Clone the repository
git clone https://github.com/DivyamTalwar/OUROBOROS.git
cd OUROBOROS
# Install dependencies (the quotes keep the shell from treating ">" as a redirect)
pip install "torch>=2.0" transformers einops
# Install in development mode
pip install -e .

| Package | Version | Purpose |
|---|---|---|
| Python | ≥ 3.8 | Runtime |
| PyTorch | ≥ 2.0 | Deep learning |
| Transformers | ≥ 4.30 | HuggingFace |
| einops | ≥ 0.6 | Tensor ops |
from model.ouroboros import OuroborosModel, OuroborosConfig
import torch

# Configure the model
config = OuroborosConfig(
    vocab_size=50304,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=6,
    head_dim=128,
)

# Initialize model
model = OuroborosModel(config)
print(f"Parameters: {model.get_num_params():,}")

# Forward pass
input_ids = torch.randint(0, 50304, (2, 512))
labels = input_ids.clone()
logits, loss = model(input_ids, targets=labels)
print(f"Loss: {loss.item():.4f}")

from torch.optim import AdamW
model = OuroborosModel(config).cuda()
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

# `dataloader` is assumed to yield dicts with 'input_ids' and 'labels' tensors
for batch in dataloader:
    input_ids, labels = batch['input_ids'].cuda(), batch['labels'].cuda()
    logits, loss = model(input_ids, targets=labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"Loss: {loss.item():.4f}")

| Parameter | Type | Default | Description |
|---|---|---|---|
| `vocab_size` | int | 50304 | Vocabulary size |
| `hidden_size` | int | 768 | Model dimension |
| `num_hidden_layers` | int | 12 | Number of blocks |
| `num_attention_heads` | int | 6 | Attention heads |
| `head_dim` | int | 128 | Dimension per head |
| `block_size` | int | 1024 | Max sequence length |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `ouroboros_k_eps` | float | 1e-6 | k normalization ε |
| `ouroboros_beta_init` | float | 1.0 | Initial β, in [0, 2] |
| `ouroboros_v_sigmoid` | bool | True | Sigmoid on v |
| `ouroboros_v_sigmoid_scale` | float | 4.0 | v scale factor |
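To make the roles of `ouroboros_k_eps` and `ouroboros_beta_init` more concrete, here is one plausible parameterization in plain Python. It is an assumption for illustration only: the repository may normalize k or constrain β differently, and the helper names `normalize_k`, `beta_gate`, and `logit_for_beta` are hypothetical.

```python
import math

def normalize_k(raw_k, eps=1e-6):
    """Scale raw_k to roughly unit length; eps guards against division by zero.

    Assumption: eps is added to the norm, which is one common convention.
    """
    norm = math.sqrt(sum(c * c for c in raw_k))
    return [c / (norm + eps) for c in raw_k]

def beta_gate(logit):
    """Map an unconstrained logit into (0, 2) using 2 * sigmoid (an assumed choice)."""
    return 2.0 / (1.0 + math.exp(-logit))

def logit_for_beta(beta_init):
    """Inverse of beta_gate, valid for 0 < beta_init < 2."""
    return math.log(beta_init / (2.0 - beta_init))

k = normalize_k([3.0, 4.0])
beta0 = beta_gate(logit_for_beta(1.0))  # initial gate for beta_init = 1.0

print(k)      # close to [0.6, 0.8]
print(beta0)  # 1.0
```

Under this assumed scheme, an initial β of 1.0 corresponds to a zero logit, so each layer would start at the projection setting and can move toward identity or reflection during training.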
OUROBOROS/
├── 📄 README.md # Documentation
├── assets/ # Images
│ ├── banner.png
│ ├── architecture.png
│ ├── beta_spectrum.png
│ ├── geometric_transform.png
│ ├── dataflow.png
│ ├── model_architecture.png
│ └── feature_coupling.png
└── 📁 model/
└── ouroboros.py # Core implementation
| Challenge | ResNet | OUROBOROS |
|---|---|---|
| Noisy features accumulate | ❌ Can only add | ✅ Can erase |
| Oscillatory patterns | ❌ No negative λ | ✅ λ ∈ [-1, 1] |
| Feature interference | ❌ No filter | ✅ Projection |
| Gradient stability | ✅ Identity | ✅ Gated identity |
OUROBOROS is the depth-wise dual of time-wise recurrence:
Time (DeltaNet): S_t = A · S_{t-1} + β · k · vᵀ
Depth (OUROBOROS): X_{l+1} = A · X_l + β · k · vᵀ
When β ≠ 1, the Delta Operator is invertible:
A⁻¹ = I + (β / (1−β)) · k · kᵀ
At β = 2: A = A⁻¹ (orthogonal involution).
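The inverse formula can be verified with a round trip. This is a minimal sketch in plain Python with hypothetical helper names; both functions apply the rank-one update directly to a vector instead of forming the matrix.

```python
def delta_apply(x, k, beta):
    """Compute A x with A = I - beta * k k^T (k assumed to be a unit vector)."""
    coeff = sum(kc * xc for kc, xc in zip(k, x))
    return [xc - beta * coeff * kc for xc, kc in zip(x, k)]

def delta_inverse_apply(y, k, beta):
    """Compute A^{-1} y = y + (beta / (1 - beta)) * k (k^T y), defined for beta != 1."""
    if beta == 1.0:
        raise ValueError("beta = 1 is a projection and has no inverse")
    coeff = sum(kc * yc for kc, yc in zip(k, y))
    scale = beta / (1.0 - beta)
    return [yc + scale * coeff * kc for yc, kc in zip(y, k)]

k = [0.6, 0.8]
x = [1.0, 2.0]

# Forward then inverse should recover x (up to rounding)
recovered = delta_inverse_apply(delta_apply(x, k, 0.5), k, 0.5)

# At beta = 2 the inverse coincides with the operator itself
y = delta_apply(x, k, 2.0)
same_as_forward = delta_inverse_apply(y, k, 2.0) == delta_apply(y, k, 2.0)
```

At β = 1 the component along k is discarded, so no function of the output can restore it, which is why the sketch raises an error in that case.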
@software{ouroboros2025,
title = {OUROBOROS: Optimal Unified Residual Operations with
Bounded Orthogonal Reflection and Spectral Control},
year = {2025},
url = {https://github.com/DivyamTalwar/OUROBOROS}
}

- Fork the repository
- Create feature branch: `git checkout -b feature/amazing`
- Commit changes: `git commit -m 'Add feature'`
- Push: `git push origin feature/amazing`
- Open a Pull Request

Creative Commons Attribution 4.0 International (CC-BY-4.0)
The ancient serpent eating its own tail — a symbol of cyclical transformation.
Features are consumed, transformed, and reborn through each layer.
Built with 💜 for the ML community






