Multi-session "all" recall at 12.5% is a fundamental retrieval challenge, not just an assembly/selection problem. Research into claude-mem, QMD, and supermemory reveals that no tool has solved multi-document aggregation well. The core issue is that embedding similarity concentrates on a single semantic cluster rather than spreading across diverse relevant sessions.
LongMemEval questions contain:
- `question_type`: "multi-session", "single-session-assistant", "temporal-reasoning", etc.
- `answer_session_ids`: list of session IDs containing the answer (this is the key field!)
- `haystack_sessions`: full conversation context per session
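For concreteness, a multi-session record has roughly this shape (field values invented for illustration, not copied from the dataset):

```python
# Illustrative shape of one multi-session LongMemEval record (values invented).
example_question = {
    "question_type": "multi-session",
    "question": "How many different doctors did I visit?",
    "answer": "3",
    "answer_session_ids": ["s12", "s47", "s81"],  # all three must be recalled
    "haystack_sessions": [],  # full conversation turns per session (elided)
}
```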
Total multi-session questions: 133
Question patterns by frequency:
| Pattern | Count | Example |
|---|---|---|
| "how many" | 67 | "How many different doctors did I visit?" |
| "total" | 36 | "What is the total amount I spent on luxury items?" |
| "all" | 10 | "List all the camping trips I went on" |
| "different" | 7 | "How many different projects have I led?" |
| "each" | 3 | "What did I buy at each store?" |
- Aggregation requires COMPLETE recall - Missing 1 of 5 doctor visits = wrong answer
- Sessions are semantically similar - All doctor visits embed close together
- Vector search concentrates - Top-k returns variations of the SAME best match
- No diversity mechanism - RRF reinforces top matches, doesn't spread
Query: "How many items of clothing do I need to pick up?" Expected: 3 items from 3 different sessions
- Session 1: "pick up dry cleaning for navy blue blazer"
- Session 2: "return the red dress at Nordstrom"
- Session 3: "collect the altered suit"
Problem: Vector search finds "navy blue blazer" first, then returns related blazer/suit results, missing the dress entirely.
Repository: https://github.com/thedotmack/claude-mem
Claude-mem is a session-based memory system for Claude Code, not a general retrieval system:
- SQLite storage with observations and session summaries
- No vector search - uses SQL queries with type/concept filtering
- Timeline-based assembly (chronological ordering)
- Per-project scoping
Key components:
- ObservationCompiler (`src/services/context/ObservationCompiler.ts`)
  - Queries observations by type and concept tags
  - No semantic search - relies on explicit categorization
  - `LIMIT ?` for result count control
- Session Summaries
  - Separate table for session-level summaries
  - Chronological retrieval, not relevance-based
| Feature | Claude-Mem Approach | CEMS Relevance |
|---|---|---|
| Session boundaries | Explicit session_id tracking | Our source_ref is similar |
| Diversity | None - chronological | N/A |
| Aggregation | SQL DISTINCT/COUNT | Not applicable to vector search |
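For contrast with vector search, exact aggregation is trivial once data is structured, because every matching row participates and nothing can be "missed" the way top-k can miss. A minimal sqlite3 sketch (the schema is illustrative, not claude-mem's actual tables):

```python
import sqlite3

# Hypothetical schema: aggregation is exact, but only because the data
# was already categorized at write time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (session_id TEXT, concept TEXT, entity TEXT)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [("s1", "doctor_visit", "Dr. Smith"),
     ("s2", "doctor_visit", "Dr. Johnson"),
     ("s3", "doctor_visit", "Dr. Chen")],
)
(count,) = conn.execute(
    "SELECT COUNT(DISTINCT entity) FROM observations WHERE concept = ?",
    ("doctor_visit",),
).fetchone()
print(count)  # 3 -- complete recall, no top-k truncation involved
```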
Claude-mem does NOT solve multi-session aggregation - it's designed for chronological context injection, not semantic retrieval. No techniques are transferable.
Repository: https://github.com/tobi/qmd
QMD implements a sophisticated hybrid search pipeline:
```
User Query
    |
    v
Query Expansion (fine-tuned model)
    |
    +---> Original Query (2x weight)
    +---> Expanded Query 1
    +---> Expanded Query 2
    |
    v
For Each Query:
    +---> BM25 (FTS5)
    +---> Vector Search
    |
    v
RRF Fusion + Top-Rank Bonus
    |
    v
LLM Reranking (Qwen3-Reranker)
    |
    v
Position-Aware Blending
    |
    v
Final Results
```
```typescript
// store.ts:2118-2161 (excerpt; fusion loop elided)
export function reciprocalRankFusion(
  resultLists: RankedResult[][],
  weights: number[] = [],
  k: number = 60
): RankedResult[] {
  // RRF formula: score = sum(weight / (k + rank + 1))
  // ...
  // Top-rank bonus (critical for preserving exact matches)
  if (entry.topRank === 0) {
    entry.rrfScore += 0.05; // Rank 1 bonus
  } else if (entry.topRank <= 2) {
    entry.rrfScore += 0.02; // Rank 2-3 bonus
  }
}
```

Key insight: the top-rank bonus protects documents that score #1 for the original query from being diluted by expansion queries.
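For reference, a self-contained Python reconstruction of the same fusion logic (our own sketch, not QMD's code; the bonus is applied across all lists here, whereas QMD tracks the original query's list specifically):

```python
def reciprocal_rank_fusion(result_lists, weights=None, k=60):
    """Fuse ranked lists of doc ids: score = sum(weight / (k + rank + 1))."""
    weights = weights or [1.0] * len(result_lists)
    entries = {}  # doc_id -> {"score": float, "top_rank": int}
    for docs, weight in zip(result_lists, weights):
        for rank, doc_id in enumerate(docs):
            e = entries.setdefault(doc_id, {"score": 0.0, "top_rank": rank})
            e["score"] += weight / (k + rank + 1)
            e["top_rank"] = min(e["top_rank"], rank)
    for e in entries.values():
        # Top-rank bonus: protect docs that ranked #1 (or top-3) in any list
        if e["top_rank"] == 0:
            e["score"] += 0.05  # rank 1 bonus
        elif e["top_rank"] <= 2:
            e["score"] += 0.02  # rank 2-3 bonus
    return sorted(entries, key=lambda d: entries[d]["score"], reverse=True)
```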
```typescript
// qmd.ts:2250-2259
let rrfWeight: number;
if (rrfRank <= 3) {
  rrfWeight = 0.75; // Trust retrieval for top 3
} else if (rrfRank <= 10) {
  rrfWeight = 0.60; // Balanced
} else {
  rrfWeight = 0.40; // Trust reranker for lower ranks
}
const blendedScore = rrfWeight * rrfScore + (1 - rrfWeight) * rerankScore;
```

Key insight: this prevents the reranker from destroying high-confidence retrieval results while allowing it to improve uncertain ones.
QMD uses a custom fine-tuned model (qmd-query-expansion-1.7B) that outputs structured expansion:
- `lex`: keyword-based search terms
- `vec`: semantic search terms
- `hyde`: a hypothetical document
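Concretely, the structured output might look like this (an invented example, not actual model output), with each field feeding a different leg of the hybrid pipeline:

```python
# Invented example of one query's structured expansion.
expansion = {
    "lex": ["dry cleaning pickup", "clothing return"],  # fed to BM25/FTS5
    "vec": ["clothes I need to collect from stores"],   # fed to vector search
    "hyde": "I still need to pick up the blazer, the dress, and the suit.",
}
# Each expansion string is searched independently, then RRF-fused with the
# original query weighted 2x.
```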
| Technique | Implementation Difficulty | Expected Impact |
|---|---|---|
| Top-rank bonus in RRF | Easy (10 lines) | Low - we already have this |
| Position-aware reranker blending | Medium | ABANDONED - reranker hurts us |
| Query type-aware expansion | Medium | Already doing for temporal/preference |
QMD's reranker works because:
- They search markdown documents (clear relevance signals)
- They use a locally fine-tuned model
- Their queries are "information retrieval" style
Our reranker fails because:
- We search conversation snippets (fuzzy relevance)
- We use generic models (Qwen3-Reranker disagrees with our labels)
- Our queries need "user memory" understanding
QMD has excellent retrieval engineering but no explicit diversity/multi-document handling. Their approach improves single-best-match quality, not multi-session recall.
Repository: https://github.com/supermemoryai/supermemory
Supermemory is a hosted SaaS memory service:
- API-based (no local retrieval code to analyze)
- Profile-based memory organization
- MCP server for Claude integration
- Profile Structure (`src/tools-shared.ts`)

```typescript
interface ProfileWithMemories {
  static?: Array<MemoryItem | string>      // Stable preferences
  dynamic?: Array<MemoryItem | string>     // Recent activity
  searchResults?: Array<MemoryItem | string>
}
```

- Deduplication (`deduplicateMemories()`) - see the sketch after this list
  - Priority: Static > Dynamic > Search Results
  - Simple string-based dedup
- API Endpoints
  - `/v4/profile` - get user profile + optional query search
  - Combines profile data with semantic search results
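A minimal sketch of what priority-based string dedup could look like under those rules (our reconstruction over plain strings; this is not supermemory's implementation):

```python
def deduplicate_memories(static, dynamic, search_results):
    """First occurrence wins, honoring priority: static > dynamic > search."""
    seen, merged = set(), []
    for tier in (static, dynamic, search_results):
        for memory in tier:
            key = memory.strip().lower()  # simple string-based identity
            if key not in seen:
                seen.add(key)
                merged.append(memory)
    return merged
```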
Could not find evidence of 81% LongMemEval claim in their repository. Their README mentions:
- "AI second brain for saving and organizing"
- No benchmark numbers visible
Likely explanation: 81% may be "any" recall, not "all" recall. Or measured on a different dataset/subset.
| Feature | Supermemory Approach | CEMS Relevance |
|---|---|---|
| Profile separation | Static vs Dynamic memories | Could help with "stable preferences" |
| Deduplication | Priority-based string matching | We use memory_id dedup |
| Three-tier retrieval | Profile + Dynamic + Search | Similar to our profile probe |
Supermemory is a SaaS wrapper, not a retrieval innovation. Their profile structure is interesting but doesn't solve multi-session aggregation.
Paper: "The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries" (Carbonell & Goldstein, 1998)
Formula:
MMR = argmax_{d in R \ S} [ lambda * Sim(d, Q) - (1 - lambda) * max_{s in S} Sim(d, s) ]
Where:
- `Sim(d, Q)` = relevance of candidate d to the query
- `max_{s in S} Sim(d, s)` = similarity to the most similar already-selected document
- `lambda` = relevance/diversity tradeoff (0.5-0.7 typical)
- `R` = candidate set, `S` = already-selected documents
Application to CEMS:
- Apply MMR during assembly (after RRF fusion)
- Penalize documents similar to already-selected ones
- Should naturally spread across sessions
Paper: "MDR: Retrieve, Read, Retrieve" (Xiong et al., 2021)
Approach:
- Initial retrieval with query
- Read top results
- Generate follow-up query based on what's missing
- Retrieve again
- Combine results
Application to CEMS: For "how many different doctors?":
- Retrieve with query -> Find Dr. Smith
- Generate: "doctor visits NOT Dr. Smith"
- Retrieve again -> Find Dr. Johnson
- Repeat until no new doctors found
Complexity: High - requires iterative LLM calls
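A sketch of how that loop could look in CEMS; `retrieve()` and `llm_extract_entities()` are hypothetical stand-ins for our retriever and an LLM extraction call:

```python
def iterative_entity_retrieval(query, max_hops=4):
    """Multi-hop retrieval: re-query while excluding entities already found."""
    found = set()
    for _ in range(max_hops):
        # Steer the follow-up query away from known entities,
        # e.g. 'doctor visits NOT "Dr. Smith"'.
        exclusions = " ".join(f'NOT "{e}"' for e in sorted(found))
        hits = retrieve(f"{query} {exclusions}".strip())  # hypothetical retriever
        new = llm_extract_entities(query, hits) - found   # hypothetical; returns a set
        if not new:
            break  # converged: this hop produced no new entities
        found |= new
    return found
```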
Paper: "Decomposed Prompting: A Modular Approach" (Khot et al., 2023)
Approach:
- Break "How many X in total?" into sub-queries
- "What X did I do in January?" + "February?" + ...
- Aggregate results
Application to CEMS:
- Detect aggregation queries
- Generate temporal decomposition
- Run multiple retrieval rounds
- Dedupe and count
Paper: "Approximate Nearest Neighbor Negative Contrastive Learning" (Xiong et al., 2020)
Key insight: Dense retrieval inherently clusters similar documents. Solutions:
- Clustering-based sampling - Sample from different embedding clusters
- Negative feedback - Downweight vectors near already-selected items
- Sparse-dense hybrid - BM25 provides natural diversity
- MMR at Assembly Time
  - Modify `assemble_context_diverse()` to use the MMR formula
  - Penalize documents with high similarity to already-selected ones
  - Expected improvement: +5-10% multi-session "all" recall
- Cluster-Based Selection (see the sketch after this list)
  - Cluster candidate embeddings into k groups
  - Select the top result from each cluster
  - Already partially doing this with session-based grouping
- Query Decomposition for Aggregation
  - For "how many X?" queries, generate sub-queries
  - "List all X mentions" -> multiple retrievals
  - LLM to dedupe and count
- Negative Feedback Retrieval
  - After finding the top result, run: "X NOT {top_result}"
  - Iteratively discover new relevant items
- Multi-Hop Retrieval
  - Full iterative retrieval with LLM-generated follow-ups
  - Requires significant architecture change
- Fine-Tuned Reranker
  - Train on LongMemEval-style data
  - Learn CEMS-specific relevance signals
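A sketch of the cluster-based selection item, assuming numpy arrays for candidate embeddings and scikit-learn k-means; `select_across_clusters` is illustrative, not existing CEMS code:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_across_clusters(candidates, embeddings, k=5):
    """Pick the highest-scored candidate from each embedding cluster.

    candidates: list of (doc, score) pairs; embeddings: (n, d) array, row-aligned.
    """
    k = min(k, len(candidates))
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(np.asarray(embeddings))
    best = {}
    for (doc, score), label in zip(candidates, labels):
        if label not in best or score > best[label][1]:
            best[label] = (doc, score)
    # One representative per cluster, strongest clusters first
    return [doc for doc, _ in sorted(best.values(), key=lambda t: -t[1])]
```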
Mathematical analysis:
- Average expected sessions per query: 3-5
- Vector search per-session recall@50: ~40-60%
- Probability that all of ~4 relevant sessions land in the top 50 (assuming independence at ~0.5 each): 0.5^4 = 6.25%
- Our 12.5% is actually above this independence baseline
Vector similarity is UNIMODAL - it finds the single best semantic cluster.
For "doctor visits", the embedding space looks like:
"doctor appointment"
|
Dr. Smith visit ------ X ------ Dr. Johnson visit
|
Dr. Chen visit
All doctor visits cluster together. Top-k returns variations of the same visit, not different visits.
What could pull the visits apart in embedding space:
- Different surface forms - "appointment", "checkup", "examination"
- Temporal anchors - "January doctor", "February doctor"
- Explicit enumeration - the user mentioned "first doctor", "second doctor"
These tools are designed for document retrieval, not conversation memory:
- Documents have distinct titles/topics
- Documents are longer (more semantic differentiation)
- Users don't ask "how many documents about X?"
MMR selection sketch:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(candidates, selected, lambda_param=0.6):
    """Select the next candidate using Maximal Marginal Relevance."""
    best_score, best_candidate = float("-inf"), None
    for c in candidates:
        relevance = c.score
        # Penalty: similarity to the MOST similar already-selected document
        # (max, per the MMR formula); defaults to 0.0 when nothing is selected yet.
        penalty = max((cosine_sim(c.embedding, s.embedding) for s in selected),
                      default=0.0)
        mmr_score = lambda_param * relevance - (1 - lambda_param) * penalty
        if mmr_score > best_score:
            best_score, best_candidate = mmr_score, c
    return best_candidate
```

Content truncation:
- Current: Full memory content
- Proposed: First 200 chars + "..."
- Benefit: Fit 10x more sessions in same budget
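The truncation itself is a one-liner; the win is the budget math (200-char snippets in place of full memories fit roughly 10x more sessions into the same token budget):

```python
def snippet(content, limit=200):
    """Truncate a memory to its first `limit` chars for context assembly."""
    return content if len(content) <= limit else content[:limit].rstrip() + "..."
```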
Aggregation query handling:
- Detect: "how many", "total", "count"
- Decompose: Generate temporal/categorical sub-queries
- Aggregate: Dedupe by entity, count unique
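A sketch of that pipeline; the cue regex and month fan-out are naive placeholders, and `retrieve()`/`llm_extract_entities()` are the same hypothetical helpers as in the multi-hop sketch:

```python
import re

AGGREGATION_CUE = re.compile(r"\b(how many|total|count|all|each|different)\b", re.I)

def answer_aggregation(query):
    """Detect an aggregation query, fan out sub-queries, count unique entities."""
    if not AGGREGATION_CUE.search(query):
        return None  # not an aggregation query; use the normal retrieval path
    months = ["January", "February", "March", "April", "May", "June", "July",
              "August", "September", "October", "November", "December"]
    sub_queries = [query] + [f"{query} in {m}" for m in months]  # temporal fan-out
    entities = set()
    for sq in sub_queries:
        entities |= llm_extract_entities(sq, retrieve(sq))  # hypothetical helpers
    return len(entities)  # dedupe by entity, count unique
```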
Multi-session "all" recall is a fundamental limitation of dense retrieval, not a bug in CEMS. The tools analyzed (claude-mem, QMD, supermemory) do not solve this problem - they're optimized for different use cases.
The most promising approaches are:
- MMR diversity (immediate, moderate impact)
- Content truncation (immediate, enables more sessions)
- Query decomposition (medium effort, high impact for aggregation queries)
Reranking consistently hurts our performance and should not be pursued.