Self-Adaptive Prompt Engineering (SAPE)

This repository contains the official implementation of the paper:

Self-Adaptive Prompt Engineering for Cost-Efficient Language Models: Switching between Chain-of-Thought and Direct Answer

We propose Self-Adaptive Prompt Engineering (SAPE), a simple yet effective prompting method that enables instruction-tuned language models to dynamically adjust the level of reasoning depending on input complexity. SAPE consistently reduces token usage compared to Chain-of-Thought (CoT) prompting while maintaining comparable accuracy.


🔍 Overview

Modern prompting strategies like CoT improve accuracy through step-by-step reasoning, but at the cost of longer outputs. Direct Answer (DA) prompts, on the other hand, are efficient but often less accurate. SAPE bridges the two: it lets the model choose whether to reason step-by-step or answer directly.

  • DA Prompt: “Only write the final answer.”
  • CoT Prompt: “Think step-by-step, then give the final answer.”
  • SAPE Prompt: “Use step-by-step reasoning only if it helps. Otherwise, answer directly.”
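
As a concrete illustration, the three prompt styles can be written as simple templates. The sketch below is minimal and illustrative; the exact wording used in the experiments lives under LLM_eval, and the `build_prompt` helper here is hypothetical.

```python
# Illustrative prompt templates for the three styles; the exact wording used
# in the paper's experiments is defined under LLM_eval.
PROMPT_TEMPLATES = {
    "DA": "Only write the final answer.",
    "CoT": "Think step-by-step, then give the final answer.",
    "SAPE": "Use step-by-step reasoning only if it helps. Otherwise, answer directly.",
}

def build_prompt(question: str, style: str = "SAPE") -> str:
    """Attach a question to one of the prompting instructions (hypothetical helper)."""
    return f"{PROMPT_TEMPLATES[style]}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("What is 17 * 24?"))
```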

🧪 Experiments

We evaluate SAPE across three instruction-tuned models:

  • Llama-3.2-3B-Instruct
  • Phi-4-mini-Instruct
  • Mistral-7B-Instruct-v0.3

on eight benchmark datasets:

  • GSM8K / GSM8K-Hard / MATH-500
  • CommonSenseQA / HellaSwag / SimpleQA
  • GPQA-Extended / MMLU-Pro

We measure:

  • Accuracy
  • Token length
  • Averages of both metrics across datasets

(Figures: grouped-bar accuracy comparison across prompt styles and token-length vs. accuracy plots; see LLM_metrics/visualizations.)
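
For reference, the token-length and accuracy metrics can be computed along the lines of the following minimal sketch. It assumes generation records stored as dicts with output, prediction, and answer fields, which is a hypothetical schema; the actual metric code lives in LLM_metrics.

```python
from transformers import AutoTokenizer

# Illustrative metric computation (not the repository's exact code):
# exact-match accuracy and mean generated-token length per prompt style.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

def evaluate(records):
    """records: list of dicts with 'output', 'prediction', and 'answer' keys (hypothetical schema)."""
    correct = sum(r["prediction"].strip() == r["answer"].strip() for r in records)
    token_lengths = [len(tokenizer.encode(r["output"])) for r in records]
    return {
        "accuracy": correct / len(records),
        "avg_tokens": sum(token_lengths) / len(token_lengths),
    }
```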


📌 Key Findings

  • SAPE reduces average token usage to nearly one-third of CoT with minimal accuracy loss.
  • Models show distinct prompt-following behaviors.
    • Llama aligns well with intended prompts.
    • Phi ignores prompt structure.
    • Mistral shows intermediate prompt-following behavior.
  • SAPE adapts well to question complexity, providing flexible and cost-effective generation.

⚠️ One limitation is that models cannot reliably identify or report their own reasoning mode. This black-box behavior warrants further investigation.

📝 The evaluation framework is adapted from Evalchemy.


💾 Repository Structure

├── LLM_eval/
│   ├── benchmarks
│   ├── run scripts
│   └── evaluation code (.py)
├── LLM_metrics/
│   ├── run results
│   └── visualizations
└── README.md

🚀 Getting Started

# Run experiments
bash LLM_eval/scripts/run_experiments.sh

# Visualize results
python LLM_metrics/viz.py

Requires Python 3.8+ and access to Hugging Face models or locally downloaded checkpoints.
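
For a quick sanity check outside the provided scripts, a single SAPE-style query can be issued directly with the transformers chat pipeline. This is a minimal sketch under assumed defaults (model choice, decoding length); it is not the repository's run code and requires a recent transformers release.

```python
from transformers import pipeline

# Minimal SAPE-style query (illustrative; not the repository's run script).
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")

messages = [
    {"role": "system", "content": "Use step-by-step reasoning only if it helps. Otherwise, answer directly."},
    {"role": "user", "content": "Natalia sold 48 clips in April and half as many in May. How many clips did she sell in total?"},
]

result = generator(messages, max_new_tokens=256)
# With chat-style input, the pipeline returns the full message list;
# the last entry is the newly generated assistant turn.
print(result[0]["generated_text"][-1]["content"])
```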


📄 Citation

If you use this work, please cite:

@misc{kim2024sape,
  title={Self-Adaptive Prompt Engineering for Cost-Efficient Language Models: Switching between Chain-of-Thought and Direct Answer},
  author={Yongjin Kim},
  year={2024},
  url={https://github.com/yjK199905/Self-Adaptive-Prompt-Engineering}
}

📬 Contact

If you have questions or feedback, feel free to open an issue or contact Yongjin Kim.
