ruby_llm-contract

Contracts for LLM quality. Know which model to use, what it costs, and when accuracy drops.

Companion gem for ruby_llm.

The problem

Which model should you use? The expensive one is accurate but costs 4x more. The cheap one is fast but hallucinates on edge cases. You tweak a prompt — did accuracy improve or drop? You have no data. Just gut feeling.

The fix

class ClassifyTicket < RubyLLM::Contract::Step::Base
  prompt do
    system "You are a support ticket classifier."
    rule "Return valid JSON only, no markdown."
    rule "Use exactly one priority: low, medium, high, urgent."
    example input: "My invoice is wrong", output: '{"priority": "high"}'
    user "{input}"
  end

  output_schema do
    string :priority, enum: %w[low medium high urgent]
    string :category
  end

  validate("urgent needs justification") { |o, input| o[:priority] != "urgent" || input.length > 20 }
  retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
end

result = ClassifyTicket.run("I was charged twice")
result.ok?               # => true
result.parsed_output     # => {priority: "high", category: "billing"}
result.trace[:cost]      # => 0.000032
result.trace[:model]     # => "gpt-4.1-nano"

Bad JSON? Auto-retry. Wrong value? Escalate to a smarter model. Schema violated? Caught client-side even if the provider ignores it. All with cost tracking.
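The escalation mechanic can be sketched in plain Ruby. This is an illustration of the idea, not the gem's internals; `classify_with_escalation` and the canned `responses` hash are hypothetical stand-ins for the real provider calls:

```ruby
require "json"

# Try each model in order, cheapest first. Parse and validate the output;
# on any failure, fall through to the next (smarter) model.
MODELS = %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]

def classify_with_escalation(input, responses)
  MODELS.each do |model|
    raw = responses[model] # stand-in for the real API call
    begin
      parsed = JSON.parse(raw, symbolize_names: true)
    rescue JSON::ParserError, TypeError
      next # bad JSON: escalate to the next model
    end
    # schema check: priority must be one of the allowed enum values
    next unless %w[low medium high urgent].include?(parsed[:priority])
    return { model: model, output: parsed }
  end
  nil # every model failed
end

# nano returns broken JSON, mini returns a valid classification
responses = {
  "gpt-4.1-nano" => "```json not valid",
  "gpt-4.1-mini" => '{"priority": "high", "category": "billing"}'
}
result = classify_with_escalation("I was charged twice", responses)
# result[:model] => "gpt-4.1-mini"
```

The cheap model gets first shot; the expensive one is only billed when the cheap one fails the contract.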

Which model should I use?

Define test cases. Compare models. Get data.

ClassifyTicket.define_eval("regression") do
  add_case "billing", input: "I was charged twice", expected: { priority: "high" }
  add_case "feature", input: "Add dark mode please", expected: { priority: "low" }
  add_case "outage",  input: "Database is down",    expected: { priority: "urgent" }
end

comparison = ClassifyTicket.compare_models("regression",
  models: %w[gpt-4.1-nano gpt-4.1-mini])

Real output from real API calls:

Model                      Score       Cost  Avg Latency
---------------------------------------------------------
gpt-4.1-nano                0.67    $0.000032      687ms
gpt-4.1-mini                1.00    $0.000102     1070ms

Cheapest at 100%: gpt-4.1-mini

comparison.best_for(min_score: 0.95)  # => "gpt-4.1-mini"

# Inspect failures
comparison.reports["gpt-4.1-nano"].failures.each do |f|
  puts "#{f.name}: expected #{f.expected}, got #{f.output}"
  puts "  mismatches: #{f.mismatches}"
  # => outage: expected {priority: "urgent"}, got {priority: "high"}
  #      mismatches: {priority: {expected: "urgent", got: "high"}}
end
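The scoring behind those numbers is simple to reason about. A minimal sketch, assuming a case passes when every expected key matches the model output and the score is the fraction of passing cases (the `mismatches` helper here is illustrative, not the gem's API):

```ruby
# Collect per-key differences between expected and actual output.
def mismatches(expected, got)
  expected.each_with_object({}) do |(key, want), acc|
    acc[key] = { expected: want, got: got[key] } unless got[key] == want
  end
end

cases = [
  { name: "billing", expected: { priority: "high" },   got: { priority: "high" } },
  { name: "feature", expected: { priority: "low" },    got: { priority: "low" } },
  { name: "outage",  expected: { priority: "urgent" }, got: { priority: "high" } }
]

failures = cases.reject { |c| mismatches(c[:expected], c[:got]).empty? }
score = (cases.size - failures.size).to_f / cases.size
# 2 of 3 cases pass => 0.67, matching the gpt-4.1-nano row above
```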

Pipeline

Chain steps with fail-fast semantics: a hallucination in step 1 stops the pipeline before step 2 spends tokens.

class TicketPipeline < RubyLLM::Contract::Pipeline::Base
  step ClassifyTicket,  as: :classify
  step RouteToTeam,     as: :route
  step DraftResponse,   as: :draft
end

result = TicketPipeline.run("I was charged twice")
result.ok?                          # => true
result.outputs_by_step[:classify]   # => {priority: "high", category: "billing"}
result.trace.total_cost             # => 0.000128
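The fail-fast behavior can be sketched with plain lambdas. This is a conceptual sketch, not the Pipeline class itself; the step bodies are canned stand-ins for LLM calls:

```ruby
# Each step receives the previous step's output; a nil result stops the
# chain so later steps never run (and never spend tokens).
steps = {
  classify: ->(_input) { { priority: "high", category: "billing" } },
  route:    ->(prev)   { prev && { team: "payments" } },
  draft:    ->(prev)   { prev && { reply: "Looking into the duplicate charge." } }
}

outputs = {}
steps.each do |name, step|
  result = step.call(outputs.values.last || "I was charged twice")
  break if result.nil? # fail fast: skip all remaining steps
  outputs[name] = result
end
```

If `classify` returned nil (a contract violation), `route` and `draft` would never be invoked.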

CI gate

# RSpec — block merge if accuracy drops or cost spikes
expect(ClassifyTicket).to pass_eval("regression")
  .with_context(model: "gpt-4.1-mini")
  .with_minimum_score(0.8)
  .with_maximum_cost(0.01)

# Rake — run all evals across all steps
require "ruby_llm/contract/rake_task"
RubyLLM::Contract::RakeTask.new do |t|
  t.minimum_score = 0.8
  t.maximum_cost = 0.05
end
# bundle exec rake ruby_llm_contract:eval

Detect quality drops

Save a baseline. Next run, see what regressed.

report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
report.save_baseline!(model: "gpt-4.1-nano")

# Later — after prompt change, model update, or provider weight shift:
report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
diff = report.compare_with_baseline(model: "gpt-4.1-nano")

diff.regressed?    # => true
diff.regressions   # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
diff.score_delta   # => -0.33
# CI: block merge if any previously-passing case now fails
expect(ClassifyTicket).to pass_eval("regression")
  .with_context(model: "gpt-4.1-nano")
  .without_regressions
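The diff itself reduces to comparing per-case pass/fail between two runs. A minimal sketch under that assumption (the hashes below are illustrative data, not the gem's storage format):

```ruby
# A regression is a case that passed in the baseline but fails now.
baseline = { "billing" => true, "feature" => true, "outage" => true }
current  = { "billing" => true, "feature" => true, "outage" => false }

regressions = baseline.keys.select { |c| baseline[c] && !current[c] }
score_delta = (current.values.count(true) - baseline.values.count(true)).to_f /
              baseline.size
# one newly failing case out of three => delta of -0.33
```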

Track quality over time

# Save every eval run
report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
report.save_history!(model: "gpt-4.1-nano")

# View trend
history = report.eval_history(model: "gpt-4.1-nano")
history.score_trend   # => :stable_or_improving | :declining
history.drift?        # => true (score dropped > 10%)
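The trend and drift checks amount to comparing scores across saved runs. A sketch, assuming drift means the latest score fell more than 10% below the earliest (the exact thresholds and reference point are assumptions, not the gem's definition):

```ruby
# Scores from three saved eval runs, oldest first.
history = [1.0, 1.0, 0.67]

trend = history.last < history.first ? :declining : :stable_or_improving
drift = (history.first - history.last) / history.first > 0.10
```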

Run evals fast

# Roughly 4x faster with 4 concurrent workers
report = ClassifyTicket.run_eval("regression",
  context: { model: "gpt-4.1-nano" },
  concurrency: 4)
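Conceptually this is a worker pool draining a queue of eval cases. A plain-Ruby sketch of that pattern (the `run_parallel` helper and the `upcase` stand-in for "run one case" are illustrative, not the gem's implementation):

```ruby
# Split cases across N worker threads; each thread pops cases until the
# queue is empty. Non-blocking pop raises ThreadError when drained.
def run_parallel(cases, concurrency:)
  queue = Queue.new
  cases.each { |c| queue << c }
  results = Queue.new
  workers = concurrency.times.map do
    Thread.new do
      while (c = queue.pop(true) rescue nil)
        results << [c, c.upcase] # stand-in for running one eval case
      end
    end
  end
  workers.each(&:join)
  Array.new(results.size) { results.pop }.to_h
end

out = run_parallel(%w[a b c], concurrency: 4)
```

This works because eval cases are independent: each makes its own API call, so wall-clock time is bounded by the slowest batch rather than the sum of all latencies.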

Predict cost before running

ClassifyTicket.estimate_eval_cost("regression", models: %w[gpt-4.1-nano gpt-4.1-mini])
# => { "gpt-4.1-nano" => 0.000024, "gpt-4.1-mini" => 0.000096 }
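The arithmetic behind an estimate like this is token count times per-token price. A sketch with illustrative numbers; the prices and token counts below are placeholder assumptions, not live provider pricing:

```ruby
# Assumed input prices in USD per 1M tokens (placeholders, verify current rates).
PRICES_PER_1M_INPUT = { "gpt-4.1-nano" => 0.10, "gpt-4.1-mini" => 0.40 }

def estimate(models, cases:, tokens_per_case:)
  models.to_h do |m|
    [m, cases * tokens_per_case * PRICES_PER_1M_INPUT[m] / 1_000_000.0]
  end
end

est = estimate(%w[gpt-4.1-nano gpt-4.1-mini], cases: 3, tokens_per_case: 80)
```

Knowing the eval costs fractions of a cent before you run it makes it cheap to compare many models at once.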

Install

gem "ruby_llm-contract"
RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }

Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc.).

Docs

Getting Started: features walkthrough, model escalation, evals
Best Practices: 6 patterns for bulletproof validate blocks
Output Schema: full schema reference and constraints
Pipeline: multi-step composition, timeouts, fail-fast
Testing: test adapter, RSpec matchers
Migration: adopting the gem in existing Rails apps

Roadmap

v0.3: Baseline regression detection, migration guide, production hardening.

v0.4 (current): Observability & scale. Eval history with trending, batch eval with concurrency, pipeline per-step eval, Minitest support, structured logging.

v0.5 (next): Prompt A/B testing. compare_with(OtherStep) for data-driven prompt engineering with regression safety. Cross-provider comparison docs.

License

MIT
