BeyondQuestions (BeQu)

An open knowledge evaluation benchmark for Large Language Models.

BeQu evaluates what LLMs actually know by prompting them to freely surface structured knowledge about entities, then verifying every generated statement against a reference corpus built from Wikipedia and the web. Unlike fixed Q&A benchmarks, BeQu measures both precision (are the elicited triples correct?) and recall (how much of the reference knowledge does the model cover?).

Paper under peer review at ARR May 2026. All data, code, and elicited triples are available in this repository.

Overview

Traditional NLP benchmarks rely on fixed question-answering pairs, which introduces availability bias: facts not explicitly queried are invisible to evaluation, even when a model possesses them. BeQu goes further by prompting LLMs to freely emit (subject, predicate, object) triples about Wikipedia entities, then measuring coverage and correctness against a per-entity reference corpus.

Key capabilities:

Prompt LLMs to generate knowledge triples for any set of Wikipedia entities
Build reference corpora from Wikipedia articles and web search results
Evaluate elicited triples via RAG-based LLM verification (precision) and coverage checking (recall)
Compare results across 20 models, multiple prompt formats, reasoning effort levels, entity popularity, and knowledge domains

Website

The repository includes a self-contained interactive website (index.html) with six sections:

Leaderboard

The primary view. Shows all 20 models ranked by Entailment F1, with sortable columns for F1, Precision, Recall, and Contradiction rate. Results can be filtered by commercial vs. open-weight models and searched by name. The view includes:

A Precision–Recall scatter plot for all 20 models (no model achieves both high precision and high recall — a clean frontier emerges)
A Key Findings panel summarising the four main experimental takeaways
A F1 by knowledge domain bar chart for three models (GPT-5.4, DeepSeek V3.2, Llama 4 Scout 17B) across 10 domains
Six experiment tabs: Overall ranking · By reasoning effort · By domain · By prompt format · By triple range · By entity popularity

Methodology

Explains the four-step pipeline: entity selection → knowledge elicitation → reference corpus construction → two-way verification. Covers evaluation scale details: 500 sampled triples per model per direction, top-10 RAG retrieval, text-embedding-3-small embeddings, Llama 4 Scout as the NLI judge (90% agreement with human labels).

Datasets

Describes the four entity lists:

#	Name	Size	Purpose
01	Random entities	10,000	Primary benchmark; hard-filtered + LLM-judged
02	Domain-balanced	10 × 100	Per-domain F1 analysis across 10 Wikipedia categories
03	Non-existent entities	10	Hallucination probe (hand-crafted fictional entities)
App.	Popularity tiers	3 × 200	Effect of entity popularity on model performance

Also shows a hallucination bar chart for the non-existent entity experiment: GPT-5.4 fully abstains (0 triples); Llama 4 Scout generates 32 triples over 7 entities; DeepSeek V3.2 generates 131 triples over 4 entities.

About

Background on the benchmark motivation and design philosophy.

Entity Lists

An interactive browser for the four Wikipedia entity sets used across BeQu experiments. Clicking a dataset card fetches the corresponding JSON from GitHub and renders a searchable, paginated table (50 rows per page). Four datasets are available:

Card	Dataset	Columns
Dataset 01	Random entities (10,000)	Entity
Dataset 02	By domain (10 × 100)	Entity · Wikidata ID
Appendix E	By popularity (3 × ~200)	Entity · Popularity · Wikidata statements · Wikidata ID
Dataset 03	Non-existent entities (10)	Entity

The Domains dataset exposes subset chips for each of the 10 domains (Person, Organization, Location, …). The Popularity dataset exposes chips for each tier (High 66–100%, Mid 33–66%, Low 0–33%). Entity titles link to Wikipedia; Wikidata IDs link to Wikidata.

Elicited Triples

An interactive data browser that loads the raw elicited_triples.csv files directly from the repository via the GitHub raw content API. Select any model from the sidebar, then choose an experiment or subset to browse the full triple table (subject · predicate · object) with live search and pagination (50 rows per page). All 20 models and their respective experiment subsets are available.

To use the website, open index.html in any modern browser. The Entity Lists and Elicited Triples sections fetch data from GitHub and therefore require an internet connection.

Architecture

BeyondQuestions/
├── BeQu.py                          # Main CLI entry point
├── experiment_tracker.py            # Experiment deduplication
├── combine_results.py               # Result aggregation
│
├── ELICITATION/                     # Knowledge elicitation pipeline
│   ├── main.py                      # OpenAI Batch API orchestration
│   ├── local.py                     # OpenRouter / local API wrapper
│   ├── gpt_kbc.py                   # GPT KBC runner (workspace isolation)
│   ├── template_utils.py            # Jinja2 prompt template loading
│   └── templates/prompts/           # Prompt templates
│       ├── GPTKB.jinja              # Direct triple elicitation (default)
│       ├── GPTKB_2x_repetition.jinja
│       ├── GPTKB_3x_repetition.jinja
│       ├── wikidata_schema.jinja
│       ├── schemaorg_schema.jinja
│       └── LMCRAWL/                 # Two-step elicitation
│           ├── predicates.jinja     # Step 1: elicit predicates
│           └── objects.jinja        # Step 2: elicit objects
│
├── EVALUATION/                      # Triple evaluation pipeline
│   ├── main.py                      # Evaluation coordinator
│   ├── request.py                   # LLM verification (JSON schema output)
│   ├── process_request.py           # Batch evaluation
│   ├── wikipedia_triple_extractor.py # Reference construction
│   ├── rag_retriever.py             # Dense embedding retrieval
│   ├── retrieve_passages.py         # Passage utilities
│   ├── wikipedia_utils.py           # Wikipedia API wrapper
│   ├── wikidata_utils.py            # Wikidata API wrapper
│   └── network_utils.py             # Retry decorator
│
├── ENTITY LISTS/                    # Test entity sets
│   ├── RANDOM/                      # 10K random Wikipedia entities
│   ├── DOMAINS/                     # Entities grouped by domain
│   └── POPULARITY/                  # Entities grouped by page-view popularity
│
├── REFERENCE CORPUS/
│   ├── random/GT.json.gz
│   ├── domains/GT.json.gz
│   └── popularity/GT.json.gz
│
├── ELICITED TRIPLES/                # Model outputs (20 models × experiments)
├── RESULTS/                         # Evaluation results
│   └── combined_results_manual.csv
└── index.html                       # Interactive website (single file, no build step)

Elicitation Pipeline

LLMs are prompted to generate (subject, predicate, object) triples for each entity using configurable Jinja2 templates. The default prompt is the GPTKB style; seven additional templates cover schema-constrained, repeated, and two-step elicitation variants.

Reference Corpus Construction

For each entity, the pipeline:

Fetches the full Wikipedia article text
Retrieves the top-20 web search results via Brave Search API
Uses an LLM to extract triples from all sources
Caches everything (text, URLs, triples) in a JSON file

Evaluation Pipeline

Precision (RAG-based):

For each elicited triple, concatenate subject + predicate + object as a query
Retrieve the top-10 most similar passages from the reference corpus (text-embedding-3-small)
A judge LLM (Llama 4 Scout) classifies: entailment / contradiction / neutral
Precision = fraction of triples classified as entailment

Recall (Coverage Check):

For each ground truth triple, check if any elicited triple covers it
Verify with the judge LLM using all ground truth sources
Recall = fraction of ground truth triples covered

NLI label meanings:

Label	Meaning
Entailment	Verified as factually correct
Contradiction	Contradicts reference corpus
Neutral	Truth cannot be determined from reference

Experimental Settings

BeQu covers five experimental configurations:

#	Setting	Description
01	`random`	200-entity sample from the 10K random list — the primary BeQu ranking
02	`reasoning_effort`	Low / medium / high reasoning effort on Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro
03	`domains`	100 entities per domain; includes a non-existent entity hallucination probe
04	`prompts`	Eight prompt templates (GPTKB, LMCRAWL, Schema.org, Wikidata schema, and repetition variants) run on Kimi K2.5
05	`ranges`	Vary the number of expected triples per entity (six range buckets)
App.	`popularity`	Three popularity tiers (0–33%, 33–66%, 66–100% by Wikidata statement count)

Datasets

Random entities (10,000)

Randomly sampled Wikipedia article titles passing hard filters: minimum article length (≥ 2,000 chars), minimum Wikidata statement count (≥ 10), non-disambiguation page. Survivors are then judged by an LLM for informativeness, ambiguity, and suitability (Likert ≥ 3 on all axes). This is the primary benchmark dataset.

Domain-balanced (10 × 100)

100 entities per domain across: person, organisation, location, event, work of art, artifact, scientific concept, cultural concept, animal, plant. Single-domain entities only, enabling per-domain F1 analysis.

Non-existent entities (10)

Hand-crafted plausible and absurd fictional entities — e.g. Valdora Strait, U-Bahn Dresden, iPhone 19 Pro, Helios Prize for Digital Arts, Gulf of Varennes. No reference corpus; the test is whether models abstain from generating triples at all.

Popularity tiers (3 × 200)

The random subset partitioned into low / mid / high popularity buckets by Wikidata statement count. Used in a supplementary experiment.

Tested Models

BeQu has been evaluated on 20 models:

Commercial:

Provider	Model	ID
OpenAI	GPT-5.4	`gpt-5.4-2026-03-05`
OpenAI	GPT-5 Mini	`gpt-5-mini-2025-08-07`
OpenAI	GPT-5 Nano	`gpt-5-nano-2025-08-07`
OpenAI	GPT-OSS-120B	`openai_gpt-oss-120b`
Anthropic	Claude Opus 4.6	`anthropic_claude-opus-4.6`
Anthropic	Claude Sonnet 4.6	`anthropic_claude-sonnet-4.6`
Anthropic	Claude Haiku 4.5	`anthropic_claude-haiku-4.5`
Google DeepMind	Gemini 3.1 Pro	`google_gemini-3.1-pro-preview`
Google DeepMind	Gemini 3 Flash	`google_gemini-3-flash-preview`
Google DeepMind	Gemini 3.1 Flash Lite	`google_gemini-3.1-flash-lite-preview`
xAI	Grok 4.1 Fast	`x-ai_grok-4.1-fast`
MiniMax	MiniMax M2.5	`MiniMaxAI_MiniMax-M2.5`

Open / API-Served:

Provider	Model	ID
Google	Gemma 3 4B	`google_gemma-3-4b-it`
Google	Gemma 3 12B	`google_gemma-3-12b-it`
Google	Gemma 3 27B	`google_gemma-3-27b-it`
Meta AI	Llama 4 Scout 17B	`meta-llama_Llama-4-Scout-17B-16E-Instruct`
Mistral AI	Mistral Large	`mistralai_mistral-large-2512`
DeepSeek	DeepSeek V3.2	`deepseek-ai_DeepSeek-V3.2`
Alibaba	Qwen3.5 27B	`qwen_qwen3.5-27b`
Moonshot AI	Kimi K2.5	`moonshotai_kimi-k2.5`

Key Findings

Finding 01 — Benchmark durability. All models are far from saturating open-ended knowledge generation. Entailment F1 spans 0.171–0.473 across the 20 models. Open-source models are competitive: Kimi K2.5 ranks 2nd overall.

Finding 02 — Reasoning has negligible effect. Unlike most NLP tasks, reasoning effort makes almost no difference for open-ended knowledge expression. All nine model–effort combinations cluster within F1 0.400–0.484.

Finding 03 — Schemas vs. creativity. Schema enforcement boosts precision (Schema.org reaches 77.2%) but significantly lowers recall to only 14.0%. Open-ended prompts win on F1.

Finding 04 — Hard-wired operating points. Explicit precision–recall steering is limited. Prompt repetition (3×) gives a modest recall gain (27.2 → 31.6%), but further repetitions degrade performance sharply.

Setup

Prerequisites

Python 3.8+

Installation

git clone [ANONYMIZED]
cd BeyondQuestions

# Pull large files (reference corpus, detailed results)
git lfs pull

pip install openai anthropic requests pandas loguru tqdm \
            sentence-transformers torch jinja2 fire

Usage

All workflows are driven through the BeQu.py CLI.

1 — Build Reference Corpus Only

python BeQu.py \
  --entities_file_path "path/to/json/file" \
  --ground_truth_dir_path "path/to/output" \
  --build_ground_truth_only

2 — Elicit and Evaluate a Single Model

python BeQu.py \
  --entities_file_path "path/to/json/file" \
  --api [YOUR_API] \
  --model_elicitation [MODEL] \
  --prompt_template_dir_elicitation ELICITATION/templates/prompts/ \
  --reasoning_effort_elicitation [EFFORT] \
  --elicited_triples_dir "ELICITED TRIPLES" \
  --ground_truth_dir_path "path/to/reference" \
  --results_dir_path RESULTS \
  --llm_judge [JUDGE] \
  --sample_size 500 \
  --seed 42

3 — Skip Elicitation (Evaluate Pre-existing Triples)

python BeQu.py \
  --skip_elicitation \
  --elicited_triples_dir "ELICITED TRIPLES" \
  --ground_truth_dir_path "path/to/reference" \
  --results_dir_path RESULTS

4 — Batch Evaluate All Models

python BeQu.py \
  --evaluate_all_models \
  --elicited_triples_dir "ELICITED TRIPLES" \
  --results_dir_path RESULTS

5 — Combine Results into a Single CSV

python combine_results.py --results_dir RESULTS

Data Formats

Entity List (input JSON)

[
  {"title": "Albert Einstein", "length": 5974},
  {"title": "Marie Curie", "length": 4801}
]

Elicited Triples (output CSV)

subject,predicate,object,subject_name
Albert_Einstein,instanceOf,physicist,Albert Einstein
Albert_Einstein,birthPlace,Ulm,Albert Einstein
Albert_Einstein,fieldOfWork,theoretical physics,Albert Einstein

Evaluation Results (per-model CSV)

Model,Metric,Setting,Total #Triples,Entailment,Contradiction,Neutral
claude-sonnet-4-6,Precision (RAG),random,500,320,15,165

Combined Results (`combined_results_manual.csv`)

Model,Setting,Domain,Entailment_F1,Entailment_Precision,Entailment_Recall
claude-sonnet-4-6,random,,0.4316,0.646,0.324

Results

Aggregated results are in RESULTS/combined_results_manual.csv. Individual per-model outputs are under RESULTS/{model}/{setting}/:

File	Contents
`results.csv`	Aggregate precision / recall metrics
`results_by_category.csv`	Metrics broken down by Wikipedia domain
`results_by_popularity.csv`	Metrics broken down by popularity bucket
`results_detailed.csv`	Per-triple entailment labels (Git LFS)

Overall leaderboard (Entailment F1, random setting, 200 entities):

Rank	Model	F1	Precision	Recall
1	Claude Opus 4.6	0.473	59.6%	39.2%
2	Kimi K2.5	0.450	54.4%	38.4%
3	Gemini 3 Flash	0.444	47.8%	41.4%
4	Claude Sonnet 4.6	0.432	64.6%	32.4%
5	GPT-5.4	0.421	76.6%	29.0%
6	Gemini 3.1 Pro	0.401	79.2%	26.8%
7	Mistral Large	0.396	53.4%	31.4%
8	Gemini 3.1 Flash Lite	0.383	64.6%	27.2%
9	DeepSeek V3.2	0.372	46.0%	31.2%
10	GPT-5 Mini	0.353	62.6%	24.6%
11	Grok 4.1 Fast	0.338	53.2%	24.8%
12	MiniMax M2.5	0.323	60.8%	22.0%
13	GPT-OSS-120B	0.318	41.8%	25.6%
14	Claude Haiku 4.5	0.299	61.4%	19.8%
15	Qwen3.5 27B	0.296	49.0%	21.2%
16	Gemma 3 27B	0.282	33.4%	24.4%
17	GPT-5 Nano	0.271	88.0%	16.0%
18	Gemma 3 12B	0.254	38.4%	19.0%
19	Llama 4 Scout 17B	0.253	63.2%	15.8%
20	Gemma 3 4B	0.171	36.4%	11.2%

Design Notes

Experiment deduplication: Configuration hashes (SHA-256) are stored in .experiment_tracking.json so runs are never repeated accidentally.
Memory efficiency: Evaluation results are streamed to CSV incrementally rather than held in memory.
Reproducibility: Fix --seed and --sample_size to reproduce any published result exactly.

License

This project is released under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. You are free to share and adapt the material for any purpose, provided appropriate credit is given.

Please cite our work: [TBD upon publication]

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
ELICITATION		ELICITATION
ELICITED TRIPLES		ELICITED TRIPLES
ENTITY LISTS		ENTITY LISTS
EVALUATION		EVALUATION
REFERENCE CORPUS		REFERENCE CORPUS
RESULTS		RESULTS
.gitattributes		.gitattributes
BeQu.py		BeQu.py
LICENSE		LICENSE
README.md		README.md
combine_results.py		combine_results.py
experiment_tracker.py		experiment_tracker.py
index.html		index.html

Folders and files

Latest commit

History

Repository files navigation

BeyondQuestions (BeQu)

Table of Contents

Overview

Website

Leaderboard

Methodology

Datasets

About

Entity Lists

Elicited Triples

Architecture

Elicitation Pipeline

Reference Corpus Construction

Evaluation Pipeline

Experimental Settings

Datasets

Random entities (10,000)

Domain-balanced (10 × 100)

Non-existent entities (10)

Popularity tiers (3 × 200)

Tested Models

Key Findings

Setup

Prerequisites

Installation

Usage

1 — Build Reference Corpus Only

2 — Elicit and Evaluate a Single Model

3 — Skip Elicitation (Evaluate Pre-existing Triples)

4 — Batch Evaluate All Models

5 — Combine Results into a Single CSV

Data Formats

Entity List (input JSON)

Elicited Triples (output CSV)

Evaluation Results (per-model CSV)

Combined Results (combined_results_manual.csv)

Results

Design Notes

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages

Combined Results (`combined_results_manual.csv`)