This repository contains the official implementation of the paper:
Self-Adaptive Prompt Engineering for Cost-Efficient Language Models: Switching between Chain-of-Thought and Direct Answer
We propose Self-Adaptive Prompt Engineering (SAPE), a simple yet effective prompting method that enables instruction-tuned language models to dynamically adjust the level of reasoning depending on input complexity. SAPE consistently reduces token usage compared to Chain-of-Thought (CoT) prompting while maintaining comparable accuracy.
Modern prompting strategies like CoT improve accuracy through step-by-step reasoning, but at the cost of longer outputs. Direct-answer (DA) prompts, by contrast, are efficient but often less accurate. SAPE bridges the two: it lets the model choose whether to reason step by step or answer directly.
- DA Prompt: “Only write the final answer.”
- CoT Prompt: “Think step-by-step, then give the final answer.”
- SAPE Prompt: “Use step-by-step reasoning only if it helps. Otherwise, answer directly.”
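The prompt wording above can be wired up as simple templates. The sketch below is only an illustration of that idea; the template strings and the `build_prompt` helper are hypothetical, not the exact prompts or code used in the experiments.

```python
# Minimal sketch of the three prompting modes (illustrative wording;
# the exact prompt strings used in the paper may differ).
PROMPTS = {
    "da":   "Only write the final answer.",
    "cot":  "Think step-by-step, then give the final answer.",
    "sape": "Use step-by-step reasoning only if it helps. Otherwise, answer directly.",
}

def build_prompt(question: str, mode: str = "sape") -> str:
    """Prepend the chosen instruction to a benchmark question."""
    return f"{PROMPTS[mode]}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What is 17 * 24?", mode="sape"))
```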
We evaluate SAPE across 3 instruction-tuned models:
- Llama-3.2-3B-Instruct
- Phi-4-mini-Instruct
- Mistral-7B-Instruct-v0.3
Using 8 benchmark datasets:
- GSM8K / GSM8K-Hard / MATH-500
- CommonSenseQA / HellaSwag / SimpleQA
- GPQA-Extended / MMLU-Pro
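For a single query, the evaluation boils down to sending one of the prompts above to an instruction-tuned model. The following is a rough sketch assuming the standard Hugging Face `transformers` chat API; the generation settings are illustrative, and the repository's own run scripts (`LLM_eval/scripts/run_experiments.sh`) handle this end to end.

```python
# Rough sketch: querying one evaluated model with a SAPE-style prompt
# (settings are illustrative, not the paper's exact configuration).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = ("Use step-by-step reasoning only if it helps. Otherwise, answer directly.\n\n"
          "Question: What is 17 * 24?\nAnswer:")
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```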
We measure:
- Accuracy
- Token length
- Averages of both metrics across benchmarks
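As a rough sketch of how these metrics are computed (the actual implementation lives in `LLM_metrics/`; the record field names `prediction`, `gold`, and `output` here are hypothetical):

```python
# Illustrative metric helpers; field names are assumptions, not the repo's schema.
def accuracy(records):
    """Fraction of examples where the extracted answer matches the gold answer."""
    return sum(r["prediction"] == r["gold"] for r in records) / len(records)

def avg_token_length(records, tokenizer):
    """Mean number of generated tokens per model response."""
    return sum(len(tokenizer.encode(r["output"])) for r in records) / len(records)
```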
- SAPE reduces average token usage to nearly one-third of CoT with minimal accuracy loss.
- Models show distinct prompt-following behaviors:
  - Llama aligns well with the intended prompts.
  - Phi ignores the prompt structure.
  - Mistral shows intermediate performance.
- SAPE adapts well to question complexity, providing flexible and cost-effective generation.
⚠️ One limitation is that models cannot reliably identify or report their own reasoning mode. This black-box behavior warrants further investigation.
📝 The evaluation framework is adapted from Evalchemy.
```
├── LLM_eval/
│   ├── benchmarks
│   ├── run scripts
│   └── evaluation scripts (.py)
├── LLM_metrics/
│   ├── run results
│   └── visualizations
└── README.md
```
```bash
# Run experiments
bash LLM_eval/scripts/run_experiments.sh

# Visualize results
python LLM_metrics/viz.py
```

Requires Python 3.8+ and access to Hugging Face models or locally downloaded checkpoints.
If you use this work, please cite:
```bibtex
@misc{kim2024sape,
  title={Self-Adaptive Prompt Engineering for Cost-Efficient Language Models},
  author={Yongjin Kim},
  year={2024},
  url={https://github.com/yjK199905/Self-Adaptive-Prompt-Engineering}
}
```

If you have questions or feedback, feel free to open an issue or contact Yongjin Kim.

