This plugin provides Ollama integration for Fess's RAG (Retrieval-Augmented Generation) features. It enables Fess to use locally hosted Ollama models for AI-powered search capabilities including intent detection, answer generation, document summarization, and FAQ handling.
See Maven Repository.
- Fess 15.x or later
- Java 21 or later
- Ollama server running locally or accessible via network
- Download the plugin JAR from the Maven Repository
- Place it in your Fess plugin directory
- Restart Fess
For detailed instructions, see the Plugin Administration Guide.
Configure the following properties in `fess_config.properties`:
| Property | Default | Description |
|---|---|---|
| `rag.llm.name` | - | Set to `ollama` to use this plugin |
| `rag.chat.enabled` | `false` | Enable RAG chat feature |
| `rag.llm.ollama.api.url` | `http://localhost:11434` | Ollama server root URL. The plugin appends `/api/chat` and `/api/tags`, so a trailing `/` or `/api` (the form shown in the Ollama docs, e.g. `http://localhost:11434/api` or `https://ollama.com/api`) is stripped automatically. |
| `rag.llm.ollama.answer.context.max.chars` | `10000` | Maximum characters for document context in answer generation |
| `rag.llm.ollama.availability.check.interval` | `60` | Interval (seconds) for checking Ollama server availability |
| `rag.llm.ollama.chat.evaluation.max.relevant.docs` | `3` | Maximum number of relevant documents for evaluation |
| `rag.llm.ollama.connect.timeout` | `5000` | TCP connect timeout (ms). Separate from `rag.llm.ollama.timeout` (read/response). |
| `rag.llm.ollama.default.max.tokens` | (unset) | Fallback when `<type>.max.tokens` is not set. |
| `rag.llm.ollama.default.temperature` | (unset) | Fallback when `<type>.temperature` is not set. |
| `rag.llm.ollama.default.thinking.budget` | (unset) | Fallback when `<type>.thinking.budget` is not set. |
| `rag.llm.ollama.faq.context.max.chars` | `6000` | Maximum characters for document context in FAQ generation |
| `rag.llm.ollama.model` | `gemma4:e4b` | Model name (e.g., `llama3:latest`, `mistral`) |
| `rag.llm.ollama.retry.base.delay.ms` | `2000` | Base delay (ms) for exponential backoff with ±20% jitter. |
| `rag.llm.ollama.retry.max` | `3` | Maximum total attempts on retryable HTTP errors (429/500/502/503/504) and connect-time IOExceptions. |
| `rag.llm.ollama.summary.context.max.chars` | `10000` | Maximum characters for document context in summary generation |
| `rag.llm.ollama.timeout` | `60000` | Response/read timeout (ms). For the TCP connect timeout, see `rag.llm.ollama.connect.timeout`. |
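A minimal setup touches only a handful of these properties. The values below (other than `rag.llm.name=ollama`) are illustrative; point the URL and model at your own Ollama deployment:

```properties
# Minimal fess_config.properties for this plugin (illustrative values)
rag.llm.name=ollama
rag.chat.enabled=true
rag.llm.ollama.api.url=http://localhost:11434
rag.llm.ollama.model=gemma4:e4b
```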
For `gemma4:e4b` with a 16GB GPU, set:

```properties
rag.llm.ollama.default.num.ctx=8192
```

You can configure `top_p` and `top_k` sampling parameters for each prompt type:
| Property | Description |
|---|---|
| `rag.llm.ollama.<promptType>.top.p` | Top-p (nucleus) sampling parameter |
| `rag.llm.ollama.<promptType>.top.k` | Top-k sampling parameter |
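For example, to tighten sampling for answer generation (`answer` is a prompt type used elsewhere in this configuration; the numeric values are illustrative, not tuned recommendations):

```properties
rag.llm.ollama.answer.top.p=0.9
rag.llm.ollama.answer.top.k=40
```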
Both `chat()` and `streamChat()` retry on:

- HTTP `429` (Too Many Requests; Ollama Cloud and rate-limited proxies)
- HTTP `500`, `502`, `503` (Ollama queue overload via `OLLAMA_MAX_QUEUE`), and `504`
- `IOException` raised before a response is received (DNS, TCP, TLS, idle-socket failures)
Other 4xx errors are surfaced as `LlmException` immediately.
Streaming retries only the initial HTTP request. Once NDJSON bytes start flowing,
in-stream errors (HTTP transport failures or NDJSON `{"error": "..."}` payloads)
propagate immediately to `LlmStreamCallback.onError(...)` with no replay.
The retry status set tracks the documented Ollama errors.
Defaults can be overridden via `rag.llm.ollama.retry.max` and
`rag.llm.ollama.retry.base.delay.ms`.
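As a sketch of the policy described above, retries on 429/500/502/503/504 and connect-time IOExceptions could look like the following. This is an illustrative reimplementation, not the plugin's actual source: the class and method names are hypothetical, and doubling per attempt is an assumption, since the README only states "exponential backoff with ±20% jitter".

```java
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of the documented retry policy; names are hypothetical.
public final class OllamaRetrySketch {

    private static final Set<Integer> RETRYABLE_STATUS = Set.of(429, 500, 502, 503, 504);

    // Exponential backoff with ±20% jitter, assuming the delay doubles per attempt.
    static long backoffDelayMs(long baseDelayMs, int attempt) {
        long delay = baseDelayMs << (attempt - 1);
        double jitter = 1.0 + ThreadLocalRandom.current().nextDouble(-0.2, 0.2);
        return (long) (delay * jitter);
    }

    static int chatWithRetry(int maxAttempts, long baseDelayMs)
            throws IOException, InterruptedException {
        int status = -1;
        IOException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                status = postChat(); // hypothetical HTTP call to /api/chat
                if (!RETRYABLE_STATUS.contains(status)) {
                    return status; // success, or a non-retryable 4xx surfaced immediately
                }
                lastFailure = null;
            } catch (IOException e) {
                lastFailure = e; // connect-time failure (DNS, TCP, TLS): retryable
            }
            if (attempt < maxAttempts) {
                Thread.sleep(backoffDelayMs(baseDelayMs, attempt));
            }
        }
        if (lastFailure != null) {
            throw lastFailure; // attempts exhausted on connect-time failures
        }
        return status; // retryable status persisted through all attempts
    }

    private static int postChat() throws IOException {
        return 200; // placeholder; a real call would POST to the Ollama server
    }
}
```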
A single INFO line is emitted per `streamChat()` call:

```
[LLM:OLLAMA] Stream completed. chunkCount=N, objectCount=N, firstChunkMs=N,
elapsedTime=Nms, doneReason=stop, totalDurationMs=N, loadDurationMs=N,
promptEvalDurationMs=N, evalDurationMs=N, promptEvalCount=N, evalCount=N,
tokensPerSecond=N.NN, parseErrorCount=0
```

A sibling WARN line is emitted when `done_reason` is anything other than `stop`,
`load`, or `unload`, most commonly `length` (context window truncation):

```
[LLM:OLLAMA] Stream finished abnormally. doneReason=length, evalCount=N, ...
```
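For scale: a line reporting `evalCount=256` and `evalDurationMs=8000` corresponds to 256 tokens generated over 8 seconds, i.e. `tokensPerSecond=32.00`. This relationship is inferred from Ollama's standard `eval_count`/`eval_duration` metrics and is not confirmed from the plugin source.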
Reasoning models like qwen3.5 use internal thinking tokens that improve answer quality
but consume output tokens. Configure thinking per prompt type for optimal results.
```properties
rag.llm.ollama.model=qwen3.5:35b
rag.llm.ollama.timeout=120000

# Structured output / short responses - disable thinking
rag.llm.ollama.intent.thinking.budget=0
rag.llm.ollama.evaluation.thinking.budget=0
rag.llm.ollama.unclear.thinking.budget=0
rag.llm.ollama.noresults.thinking.budget=0
rag.llm.ollama.docnotfound.thinking.budget=0

# Answer generation - enable thinking with increased token limit
rag.llm.ollama.answer.thinking.budget=1
rag.llm.ollama.answer.max.tokens=16384
rag.llm.ollama.summary.thinking.budget=1
rag.llm.ollama.summary.max.tokens=16384
rag.llm.ollama.direct.thinking.budget=1
rag.llm.ollama.direct.max.tokens=8192
rag.llm.ollama.faq.thinking.budget=1
rag.llm.ollama.faq.max.tokens=8192
```

The `thinking.budget` parameter controls the Ollama `think` flag as a boolean:
- `0` disables thinking (`think: false`)
- Any positive value enables thinking (`think: true`)
- Not set: the model default is used (reasoning models default to thinking enabled)
When thinking is enabled, increase `max.tokens` to accommodate both thinking and content tokens.
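For reference, the boolean lands in the `think` field of Ollama's `/api/chat` request body; an abbreviated sketch of such a request (the message content is a placeholder):

```json
{
  "model": "qwen3.5:35b",
  "messages": [{ "role": "user", "content": "..." }],
  "think": false
}
```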
Per Ollama's thinking docs, the `think`
field also accepts the string values `high`, `medium`, and `low`. GPT-OSS models in
particular ignore the boolean form. Use `rag.llm.ollama.<promptType>.thinking.level`
(or `rag.llm.ollama.default.thinking.level`) to send a string instead of a boolean:
```properties
rag.llm.ollama.model=gpt-oss:20b
rag.llm.ollama.answer.thinking.level=high
rag.llm.ollama.intent.thinking.level=low
```

When `thinking.level` is set, it overrides the boolean derived from `thinking.budget`
for that prompt type. Allowed values: `high`, `medium`, `low` (case-insensitive).
Invalid values are ignored with a WARN log and fall back to `thinking.budget`.
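With a level configured, the same `think` field carries the string form instead (again an abbreviated sketch based on Ollama's thinking docs):

```json
{
  "model": "gpt-oss:20b",
  "messages": [{ "role": "user", "content": "..." }],
  "think": "high"
}
```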
- Intent Detection - Determines user intent (search, summary, FAQ, unclear) and generates Lucene queries
- Answer Generation - Generates answers based on search results with citation support
- Document Summarization - Summarizes specific documents
- FAQ Handling - Provides direct, concise answers to FAQ-type questions
- Relevance Evaluation - Identifies the most relevant documents for answer generation
- Streaming Support - Real-time response streaming via NDJSON format
- Availability Checking - Validates Ollama server and model availability at configurable intervals
- `GET /api/tags` - Lists available models for availability checking
- `POST /api/chat` - Performs chat completion (supports both standard and streaming modes)
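Both endpoints can be exercised directly with curl to verify connectivity before involving Fess; these are standard Ollama API calls, with the model name and prompt as placeholders:

```sh
curl http://localhost:11434/api/tags

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:e4b",
  "messages": [{ "role": "user", "content": "Hello" }],
  "stream": false
}'
```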
Build:

```sh
mvn clean package
```

Run tests:

```sh
mvn test
```

Licensed under the Apache License 2.0.