Oprel SDK

Production-ready local LLM inference that beats Ollama in performance

Python 3.9+ · MIT License · Available on PyPI

Oprel is a high-performance Python library for running large language models and multimodal AI locally. It provides a production-ready runtime with advanced memory management, hybrid offloading, and intelligent optimization.

🚀 Key Features

  • Multi-Backend Architecture:

    • llama.cpp: Text generation & vision (GGUF models)
    • ComfyUI Integration: Image & video generation (Diffusion models)
    • Hybrid GPU/CPU: Smart layer distribution for low VRAM
  • Smart Hardware Optimization:

    • Hybrid Offloading: Run 13B models on 4GB GPUs by splitting layers between GPU/CPU
    • Auto-Quantization: Automatically selects best quality quantization based on available VRAM
    • CPU Acceleration: AVX2/AVX512 optimization (30-50% faster than Ollama's defaults)
    • KV-Cache Aware: Precise memory planning prevents OOM crashes
  • Production Reliability:

    • Memory Pressure Monitor: Proactive warnings before crashes
    • Idle Cleanup: Automatically frees GPU/CPU resources when inactive (15min timeout)
    • Zero-Latency: Server mode keeps models cached for instant response
    • Robust Error Handling: Clear error messages, no silent failures
  • Oprel Studio: Premium Web UI for chat, model management, and real-time hardware monitoring

  • Ollama Compatibility: Drop-in replacement for Ollama API
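The KV-cache-aware memory planning mentioned above amounts to budgeting cache memory before loading. A minimal sketch of the standard size estimate (the function and the model dimensions are illustrative, not Oprel's API):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV-cache size: keys + values for every layer and position."""
    # 2x for the key tensor and the value tensor
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Example: an 8B-class model (32 layers, 8 KV heads, head_dim 128) at 8k context, fp16
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      context_len=8192, bytes_per_elem=2)
print(f"{size / 1024**3:.2f} GiB")  # → 1.00 GiB
```

Reserving this amount up front, in addition to the weights, is what prevents the mid-generation OOM crashes the feature list refers to.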

📦 Installation

pip install oprel
# For server mode
pip install oprel[server]

⚡ Quick Start

CLI Usage

# Chat with a model (auto-downloaded)
oprel run qwencoder "Explain recursion in one sentence"

# Interactive chat mode
oprel run llama3.1

# Server mode for persistent caching
oprel serve
oprel run llama3.1 "Hello"  # Instant response!

# Vision models
oprel vision qwen3-vl-7b "What's in this image?" --images photo.jpg

# Start Oprel Studio (Web UI)
oprel start

Python API

from oprel import Model

# Auto-optimized loading
model = Model("qwencoder") 
print(model.generate("Write a binary search in Python"))

🌐 Oprel Studio: The Ultimate Local AI Workspace

Oprel Studio is a premium, browser-based command center for your local AI models. Designed for engineers and researchers, it provides a state-of-the-art interface that transforms raw inference into a productive workspace.

✨ Immersive Chat Experience

  • Fluid Streaming: Ultra-fast Server-Sent Events (SSE) for instant, typewriter-style responses.
  • Thinking Process Visualization: DeepSeek-R1 and other reasoning models show their internal "chain of thought" in a beautiful, expandable workspace.
  • Rich Markdown & Code: Full GFM support with syntax highlighting for 50+ languages.
  • Artifacts Canvas: Generate Mermaid diagrams or HTML/Tailwind previews and view them in a dedicated side-panel next to your chat.
  • Multi-modal Support: Drag and drop images for vision-capable models (Qwen-VL, Llama-3.2 Vision).

🔌 Beyond Local: External Cloud Providers

Manage your local models alongside industry-leading cloud APIs in one unified interface:

  • Google Gemini: Full support for 2.0 Flash/Pro with free-tier quota integration.
  • NVIDIA NIM: High-performance inference via NVIDIA's accelerated cloud.
  • Groq: Record-breaking inference speeds via LPU™ technology.
  • OpenRouter: Access 200+ models from a single API key.
  • Custom OpenAI: Connect any internal or third-party OpenAI-compatible server.

🏛️ Visual Model Registry

  • One-Click Deployment: Pull, load, and switch between models without ever touching the terminal.
  • Quantization Intelligence: See available quants (Q4_K, Q8_0, etc.) and their memory footprint before loading.
  • Smart Status: Real-time indicators show which model is currently taking up VRAM/RAM.

📊 Real-time Hardware Analytics

Monitor your system's performance as the model generates:

  • Tokens per Second (TPS): Live tracking of inference performance.
  • VRAM & RAM: Precise graphs showing memory consumption across CPU and GPU.
  • CPU/GPU Utilization: Monitor load to ensure your system is running optimally.

🚀 Usage

Start Oprel Studio and it will automatically open in your default browser:

oprel start

The interface is hosted at http://localhost:11435/gui/.

🎨 Image & Video Generation

ComfyUI is embedded: it installs itself and downloads the required models automatically on first use.

Usage

# Specify model in command
oprel gen-image sdxl-turbo "a cyberpunk city at night"

# High quality with FLUX
oprel gen-image flux-1-schnell "a majestic dragon" --width 1024 --height 1024 --steps 30

# With negative prompt
oprel gen-image sdxl-turbo "a cute cat" --negative "blurry, low quality"

# First time downloads model automatically
oprel gen-image flux-1-dev "stunning landscape"  # Auto-downloads 23GB

Download Models

# List available image models
oprel list-models --category text-to-image

# Pre-download model
oprel pull flux-1-schnell

# Pull video model
oprel pull svd-xt

🔍 Text Embeddings

Generate embeddings for semantic search and RAG applications:

CLI Usage

# Single text embedding
oprel embed nomic-embed-text "Hello world"

# Process files (PDF, DOCX, TXT, JSON)
oprel embed nomic-embed-text --files document.pdf report.docx notes.txt

# Batch processing from file (one text per line)
oprel embed nomic-embed-text --batch texts.txt --output embeddings.json

# JSON output format
oprel embed nomic-embed-text "Machine learning" --format json

Python API

from oprel import embed

# Single embedding
vector = embed("Hello world", model="nomic-embed-text")
print(f"Dimensions: {len(vector)}")

# Batch embeddings
vectors = embed(
    ["Document 1", "Document 2", "Document 3"],
    model="nomic-embed-text"
)

# Semantic search
import math

def cosine_similarity(a, b):
    dot = sum(x*y for x,y in zip(a,b))
    mag_a = math.sqrt(sum(x*x for x in a))
    mag_b = math.sqrt(sum(x*x for x in b))
    return dot / (mag_a * mag_b)

query = embed("machine learning topic", model="nomic-embed-text")
docs = embed(["AI concepts", "cooking recipes", "ML algorithms"], model="nomic-embed-text")
similarities = [cosine_similarity(query, doc) for doc in docs]
best_match = similarities.index(max(similarities))
print(f"Best match: Document {best_match}")

Available Embedding Models

  • nomic-embed-text: General-purpose (768 dims)
  • bge-m3: Multilingual support (1024 dims)
  • all-minilm-l6-v2: Lightweight & fast (384 dims)
  • snowflake-arctic: Optimized for RAG (1024 dims)

# List all embedding models
oprel list-models --category embeddings

Available Image Generation Models:

  • sdxl-turbo - Fastest (1-4 steps, 7GB) ⚡
  • flux-1-schnell - Fast + quality (4 steps, 23GB)
  • flux-1-dev - Best quality (28 steps, 23GB)
  • sd-1.5 - Lightweight (4GB)

Vision Models

# Ask about an image
oprel vision qwen3-vl-7b "What's in this image?" --images photo.jpg

# Multi-image analysis
oprel vision qwen3-vl-14b "Compare these images" --images img1.jpg img2.jpg img3.jpg

🛠️ Advanced Features

Hybrid GPU/CPU Offloading

Run larger models on limited VRAM by intelligently splitting layers.

# Automatically calculated during load
# Example: "20/40 layers on GPU, 20 on CPU"
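The split is derived from the memory budget at load time. A simplified sketch of the arithmetic (the function name, per-layer cost model, and reserve size are assumptions for illustration, not Oprel's internals):

```python
def plan_layer_split(total_layers: int, layer_bytes: int, free_vram_bytes: int,
                     reserve_bytes: int = 512 * 1024**2) -> tuple:
    """Return (gpu_layers, cpu_layers): put as many layers on the GPU as fit,
    keeping a reserve for the KV-cache and scratch buffers."""
    budget = max(0, free_vram_bytes - reserve_bytes)
    gpu_layers = min(total_layers, budget // layer_bytes)
    return gpu_layers, total_layers - gpu_layers

# Example: 40 layers of ~350 MB each on a 4 GB card
gpu, cpu = plan_layer_split(40, 350 * 1024**2, 4 * 1024**3)
print(f"{gpu}/40 layers on GPU, {cpu} on CPU")  # → 10/40 layers on GPU, 30 on CPU
```

The remaining layers run on the CPU, trading some speed for the ability to load models that would otherwise not fit at all.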

Smart Quantization

Auto-selects the best quantization that fits your hardware.

oprel run llama3.1 --quantization auto  # Default
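Conceptually, auto-selection walks a quality-ordered list of quantizations and picks the first one whose weights fit the available memory. A sketch under assumed file sizes (the size table and function are illustrative, not Oprel's registry):

```python
# Hypothetical weight sizes for an 8B-class model, best quality first (GiB)
QUANT_SIZES = [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7), ("Q4_K_M", 4.9)]

def pick_quant(available_gib: float, sizes=QUANT_SIZES) -> str:
    """Return the first (highest-quality) quant that fits, else the smallest."""
    for name, gib in sizes:
        if gib <= available_gib:
            return name
    return sizes[-1][0]  # nothing fits comfortably; fall back to the smallest

print(pick_quant(6.0))  # with these sizes: Q5_K_M
```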

OpenAI & Ollama Compatible Server

Production-ready API server with smart model management

Start the server:

oprel serve --host 127.0.0.1 --port 11435

The server provides:

  • OpenAI API compatibility: /v1/chat/completions, /v1/completions, /v1/models
  • Ollama API compatibility: /api/chat, /api/generate, /api/tags
  • Smart Model Management:
    • Models stay loaded for 15 minutes after last use
    • Automatic unloading and reloading when you request a different model
    • Zero manual load/unload needed
  • Fast SSE Streaming: Server-Sent Events for instant token delivery
  • CORS Support: Use from web applications
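Streamed tokens arrive as SSE frames of the form `data: {json}`, terminated by `data: [DONE]` (the standard OpenAI wire format; the frames below are illustrative, not captured Oprel output). A minimal parser sketch:

```python
import json

def parse_sse(raw: str) -> str:
    """Concatenate message deltas from an OpenAI-style SSE stream."""
    deltas = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        deltas.append(chunk["choices"][0]["delta"].get("content", ""))
    return "".join(deltas)

# Illustrative frames
raw = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n'
    'data: [DONE]\n'
)
print(parse_sse(raw))  # → Hello
```

In practice the OpenAI and Ollama SDKs (shown below) handle this parsing for you; the sketch only shows what travels over the wire.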

OpenAI API Examples

Python (using OpenAI SDK):

from openai import OpenAI

# Point to local Oprel server
client = OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="not-needed"  # Oprel doesn't require API keys
)

# Chat completion
response = client.chat.completions.create(
    model="qwen3-14b",
    messages=[
        {"role": "user", "content": "Write a Python function to reverse a string"}
    ],
    stream=True  # Enable streaming for fast responses
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

cURL:

# Chat completions (streaming)
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-0.5b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# Text Completions
curl http://localhost:11435/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-0.5b",
    "prompt": "Once upon a time",
    "max_tokens": 50
  }'

# List Models
curl http://localhost:11435/v1/models

Ollama API Examples

Python (using Ollama SDK):

import ollama

# Works directly with Ollama SDK
client = ollama.Client(host='http://localhost:11435')
response = client.chat(
    model='llama3', 
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True
)

for chunk in response:
    print(chunk['message']['content'], end='')

cURL:

# Ollama-style chat
curl http://localhost:11435/api/chat \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# List models (Ollama format)
curl http://localhost:11435/api/tags

Model Management Behavior

The server automatically manages models with these rules:

  1. First Request: Model is loaded (takes ~5-30s depending on size)
  2. Subsequent Requests: Model is already loaded (instant response)
  3. Model Switch: Old model unloads, new model loads automatically
  4. Idle Timeout: After 15 minutes of no requests, model is unloaded to free memory
  5. No Manual Management: You never need to call load/unload - it's automatic!
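The lifecycle rules above can be sketched as a small single-slot manager that tracks the loaded model and its last-use time (the 15-minute constant mirrors the text; the class itself is illustrative, not Oprel's implementation):

```python
import time

IDLE_TIMEOUT_S = 15 * 60  # rule 4: unload after 15 idle minutes

class ModelManager:
    """Single-slot model cache implementing the rules above."""
    def __init__(self):
        self.current = None      # name of the loaded model, if any
        self.last_used = 0.0

    def request(self, model: str) -> str:
        if self.current != model:
            # rules 1 and 3: load on first use, swap on model change
            self.current = model
        self.last_used = time.monotonic()
        return f"served by {model}"

    def tick(self) -> None:
        # rule 4: idle eviction, checked periodically
        if self.current and time.monotonic() - self.last_used > IDLE_TIMEOUT_S:
            self.current = None

mgr = ModelManager()
mgr.request("qwen3-14b")   # first request loads the model
mgr.request("qwen3-14b")   # already loaded, served immediately
mgr.request("llama3.1")    # swaps models automatically
print(mgr.current)         # → llama3.1
```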

Example workflow:

# Start server
oprel serve

# In another terminal:
# First request - loads qwen3-14b (~10s load time)
curl http://localhost:11435/v1/chat/completions -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"Hi"}]}'

# Second request - instant! Model already loaded
curl http://localhost:11435/v1/chat/completions -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"Tell me a joke"}]}'

# Switch to different model - automatically unloads qwen3-14b and loads llama3.1
curl http://localhost:11435/v1/chat/completions -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hi"}]}'

# After 15 minutes of inactivity, llama3.1 is automatically unloaded

Health Check

curl http://localhost:11435/health
# Returns: {"status":"healthy","timestamp":1234567890,"current_model":"qwen3-14b"}

📊 Benchmarks vs Ollama

Feature           | Ollama        | Oprel SDK
Model Discovery   | 10-30s        | Instant (<100ms)
Memory Planning   | Basic         | Precise (KV-Cache aware)
Low VRAM Support  | Fails/Slow    | Hybrid Offloading
CPU Speed         | Standard      | 30-50% Faster (AVX)
Vision Models     | Partial       | Full Support
Image/Video Gen   | No            | ComfyUI Integration
Crash Safety      | Frequent OOM  | Proactive Warnings
Auto-Optimization | Manual config | Fully Automatic

🧩 Supported Models

Text Generation Models (GGUF - llama.cpp backend)

  • Qwen 3 / 2.5: Best all-around models (32B, 14B, 8B, 3B)
  • Qwen 3 Coder: SOTA for code generation (32B, 14B, 8B)
  • DeepSeek R1: Advanced reasoning (14B, 8B, 7B, 1.5B)
  • Llama 3.3 / 3.1: Meta's flagship (70B, 8B)
  • Gemma 3 / 2: Google's efficient models (27B, 12B, 9B, 4B)
  • Phi-4: Microsoft's compact powerhouse (14B)

Vision Models (VLMs) - GGUF + mmproj

  • Qwen3-VL: Multi-image understanding (32B, 14B, 7B - supports up to 8 images)
  • Qwen2.5-VL: Proven vision model (7B, 3B)
  • Llama 3.2 Vision: Meta's VLM (11B)
  • MiniCPM-V: Efficient mobile-ready VLM (2.6B)
  • Moondream 2: Lightweight vision (1.8B)

Image Generation (Safetensors - ComfyUI backend)

Requires ComfyUI running:

  • FLUX.1-dev: Best quality
  • FLUX.1-schnell: Fast generation
  • SDXL Turbo: Fastest (1-4 steps)

Video Generation (ComfyUI + AnimateDiff)

Requires ComfyUI with video nodes:

  • AnimateDiff
  • Stable Video Diffusion (SVD)
  • Custom workflows

View all available GGUF models:

oprel list-models --category text-generation
oprel list-models --category vision
oprel list-models --category coding
oprel list-models --category reasoning

License

MIT License. Made with ❤️ for local AI developers.
