An agentic AI product backlog generator in six progressive versions. Each version introduces one new pattern, building on the previous, until the system is fully autonomous.
This project is intentionally $0. No OpenAI API key, no cloud credits, no free tiers that quietly start charging when you cross a token limit. If the fear of cloud costs has been stopping you from learning agentic AI - it doesn't have to.
This project is loosely inspired by Niklaus Wirth's classic "Program Development by Stepwise Refinement" - the idea that you build understanding by starting simple and adding one concept at a time. It works just as well for agentic AI as it did for structured programming. In this case, each version introduces exactly one new agentic pattern, building on the previous, until the system is fully autonomous.
Given a plain-English product requirement, the system generates:
- 1 Epic with business-level acceptance criteria
- 3 Features derived from the Epic, each with acceptance criteria
The requirement is passed via the command line:
python backlog_gen_v6.py "Grandma has a car and wants to know when she should refill fuel"- Python 3.10+
- ollama installed and running locally
pip install ollama
ollama pull gemma3:1b
ollama pull qwen3-coder:480b-cloudNote on models: v1, v2, and v3 use
gemma3:1bfor simplicity. v4 onwards useqwen3-coder:480b-cloudfor better instruction-following. If you want to use a different model, change theMODELconstant at the top of each file.mistralandllama3.2:3bare good alternatives.
A .vscode/settings.json is included that configures pytest as the test runner. Tests are discoverable via the Test Explorer (beaker icon in the sidebar) without any additional configuration.
If tests are not appearing in Test Explorer, open the Command Palette (Ctrl+Shift+P) and run:
Python: Select Interpreter
Make sure the selected interpreter is the same Python where you installed pytest and ollama.
| Version | Pattern | What's Introduced |
|---|---|---|
| v1-pipeline | Sequential Pipeline | ask_llm(), prompt templates, LLM chaining |
| v2-shared-state | Shared State | Single state dict passed through all functions |
| v3-planner-agent | Planner Agent | LLM decides which steps to run at runtime |
| v4-critic-loop | Critic Loop | Per-artifact critique and revision cycle |
| v5-multi-agent | Multi-Agent | Self-contained agent classes, Coordinator |
| v6-autonomous | Autonomous | Global critic, LLM-driven routing, selective re-execution |
Run each version with the same requirement so you can compare how the output and the console behavior evolve:
python backlog_gen_v1.py "Grandma has a car and wants to know when she should refill fuel"
python backlog_gen_v2.py "Grandma has a car and wants to know when she should refill fuel"
python backlog_gen_v3.py "Grandma has a car and wants to know when she should refill fuel"
python backlog_gen_v4.py "Grandma has a car and wants to know when she should refill fuel"
python backlog_gen_v5.py "Grandma has a car and wants to know when she should refill fuel"
python backlog_gen_v6.py "Grandma has a car and wants to know when she should refill fuel"v1-pipeline The output appears immediately with no intermediate steps visible. Notice how bare it is - no planning, no checking, no structure.
v2-shared-state
The output looks the same as v1, but the code is fundamentally different. Study how state flows through the functions. This is the scaffolding everything else is built on.
v3-planner-agent
Watch for the Planning workflow... and Confirmed Plan: lines. The LLM is now deciding what to do before doing it.
v4-critic-loop
This is where the console gets interesting. Watch the --- Critic Input --- block to see exactly what the critic receives. Notice whether it approves immediately or requests revisions. Try a vague requirement to force revisions.
v5-multi-agent
Watch for the [EpicAgent] Starting... and [FeaturesAgent] Starting... labels. The agents are now self-contained units. Notice that the Planner is still driving the sequence - the Coordinator doesn't hardcode what to run.
v6-autonomous
Watch for the [Global Critic] Verdict: line after both agents complete. This is the system evaluating its own output holistically. If it returns REDO: features, watch the FeaturesAgent re-run while the Epic is preserved. This is selective re-execution in action.
| Pattern | First appears in |
|---|---|
| LLM chaining | v1 |
| Shared state | v2 |
| LLM-driven control flow | v3 |
| Self-correction (critic loop) | v4 |
| Multi-agent encapsulation | v5 |
| Holistic quality review | v6 |
| LLM-driven routing | v6 |
| Selective re-execution | v6 |
| This project | Real-world equivalent |
|---|---|
| Planner | Router / Supervisor in LangGraph |
| Shared state | AgentState in LangGraph |
| Critic loop | Reflection pattern in LangChain |
| Coordinator | Orchestrator in AutoGen / CrewAI |
| Global critic + routing | Conditional edges in LangGraph |
pip install pytest pytest-covpython -m pytest -vpython -m pytest v4-critic-loop/test_backlog_gen_v4.py -vpython -m pytest --cov=. --cov-report=term-missingThe tests mock ollama entirely - no real LLM calls are made. This keeps tests fast, deterministic, and free. The principle is:
Mock the LLM. Test the orchestration.
The tests verify that our agents, planner, critic loop, and coordinator behave correctly regardless of what any LLM says. This is the standard approach used in production agentic systems.
Unit tests and evals are complementary but distinct:
| Unit Tests | Evals | |
|---|---|---|
| LLM calls | Mocked | Real |
| Speed | Milliseconds | Seconds |
| Cost | $0 | Tokens consumed |
| Deterministic | Yes | No |
| What they test | Orchestration logic | LLM output quality |
| Run frequency | Every commit | Periodically |
The tests in this repo are unit tests. If we want to verify that our prompts actually produce good epics and features from a real LLM, that is an eval - a separate concern covered in a future version.
Each version's tests cover only the patterns introduced in that version:
| Version | Key things tested |
|---|---|
| v1 | ask_llm, pipeline call order, response stripping |
| v2 | State read/write, state flows through pipeline |
| v3 | Planner parsing, TASK_MAP routing, invalid plan handling |
| v4 | Critic loop approval, revision, MAX_REVISIONS ceiling |
| v5 | Agent class interfaces, AGENT_MAP routing, lowercase normalization |
| v6 | _parse_redo routing, global critic input, selective re-execution |
README.md
v1-pipeline/
backlog_gen_v1.py
test_backlog_gen_v1.py
Readme.md
v2-shared-state/
backlog_gen_v2.py
test_backlog_gen_v2.py
Readme.md
v3-planner-agent/
backlog_gen_v3.py
test_backlog_gen_v3.py
Readme.md
v4-critic-loop/
backlog_gen_v4.py
test_backlog_gen_v4.py
Readme.md
v5-multi-agent/
backlog_gen_v5.py
test_backlog_gen_v5.py
Readme.md
v6-autonomous/
backlog_gen_v6.py
test_backlog_gen_v6.py
Readme.md