Prompt Engineering Techniques: Chain-of-Thought & Few-Shot in 2026
Prompt engineering has evolved from simple “give me a summary” requests into a discipline for getting more accurate, consistent, and safe output from large language models (LLMs). Below you will find three families of advanced techniques (chain-of-thought reasoning, few-shot scaffolding, and systematic prompt optimization), with concrete patterns, code snippets, and trade-offs you can apply in production systems.
Chain-of-Thought: Teaching the Model to Reason Step-by-Step
The core idea is to elicit a trace of intermediate reasoning before the final answer. This mimics how humans solve multi-step problems and has been shown to improve accuracy on arithmetic, logic, and scientific reasoning tasks.
Zero-Shot CoT
No examples are required; you simply add an instruction, such as “Let's think step by step,” that prompts the model to reason aloud before it answers.
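# Uses the openai>=1.0 client; the legacy openai.ChatCompletion.create call was removed in 1.0.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that explains your reasoning before answering."},
        {"role": "user", "content": "A train leaves Chicago heading west at 60 mph. Two hours later a second train leaves Chicago heading east at 45 mph. When will they be 500 miles apart?"},
    ],
)
print(response.choices[0].message.content)  # reasoning trace followed by the answer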
A well-crafted system message or user prompt can trigger CoT even without examples:
Please solve the following problem by showing each step and then give the final answer in bold.
Few-Shot CoT
When you supply hand-crafted demonstrations, the model tends to follow the same reasoning pattern.
Q: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day using four eggs. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the market?
A: Janet starts with 16 eggs.
- She eats 3, so 16 - 3 = 13 remain.
- She uses 4 for muffins, so 13 - 4 = 9 remain.
- She sells 9 eggs at $2 each ⇒ 9 × $2 = $18 every day at the market.
Q: A train leaves Chicago heading west at 60 mph. Two hours later a second train leaves Chicago heading east at 45 mph. When will they be 500 miles apart?
A:
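One way to assemble such a prompt programmatically is sketched below. The helper name `build_few_shot_cot_prompt`, the tuple layout of the demonstrations, and the "Final answer:" line are assumptions made for illustration, not part of any library.

```python
def build_few_shot_cot_prompt(demos, question):
    """Join worked demonstrations and a new question into one prompt string.

    demos: list of (question, reasoning_steps, final_answer) tuples.
    """
    blocks = []
    for demo_question, steps, answer in demos:
        lines = [f"Q: {demo_question}", "A:"]
        lines += [f"- {step}" for step in steps]
        lines.append(f"Final answer: {answer}")
        blocks.append("\n".join(lines))
    # End with an open "A:" so the model continues in the demonstrated format.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# A shortened version of the duck-egg demonstration.
demos = [(
    "Janet has 16 eggs, eats 3, bakes with 4, and sells the rest at $2 each. "
    "How much does she make?",
    [
        "She eats 3, so 16 - 3 = 13 remain.",
        "She uses 4 for muffins, so 13 - 4 = 9 remain.",
        "She sells 9 eggs at $2 each, so 9 x $2 = $18.",
    ],
    "$18",
)]
prompt = build_few_shot_cot_prompt(demos, "When will the two trains be 500 miles apart?")
print(prompt)
```

Keeping demonstrations as data rather than as one long string makes it easier to swap, reorder, or retrieve them later.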
Automatic CoT (Auto-CoT)
Instead of writing examples by hand, you cluster the problems, pick a representative question from each cluster, and let the model generate a rationale for it with zero-shot CoT. Keeping one demonstration per cluster reduces prompt-engineering labor while maintaining coverage of reasoning styles.
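The diversity step can be approximated without any clustering library. The sketch below is a simplified stand-in, not the clustering procedure itself: it swaps embedding-based clustering for greedy farthest-point selection over word-set Jaccard distance, and it leaves rationale generation out. The function names are hypothetical.

```python
def jaccard_distance(a: str, b: str) -> float:
    """Return 1 - |A and B| / |A or B| over lowercase word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def select_diverse(questions: list[str], k: int) -> list[str]:
    """Greedily pick up to k questions, each far from those already chosen."""
    unique = list(dict.fromkeys(questions))  # drop duplicates, keep order
    if not unique or k <= 0:
        return []
    chosen = [unique[0]]  # seed with the first question
    while len(chosen) < min(k, len(unique)):
        remaining = [q for q in unique if q not in chosen]
        # Pick the question whose nearest chosen neighbour is farthest away.
        best = max(
            remaining,
            key=lambda q: min(jaccard_distance(q, c) for c in chosen),
        )
        chosen.append(best)
    return chosen


pool = [
    "What is 12 plus 30?",
    "What is 15 plus 40?",
    "If all cats are mammals and Tom is a cat, is Tom a mammal?",
]
picked = select_diverse(pool, k=2)
print(picked)  # the first arithmetic question and the logic question
```

In a fuller pipeline each selected question would then be sent through a zero-shot CoT prompt, and the resulting rationale stored as a demonstration.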
Tools & Tips
- Delimiters: Use triple back-ticks or XML tags (<reasoning>...</reasoning>) to isolate the trace; the final <answer> tag tells the model when to stop.
- Length control: Add “Keep your reasoning under 5 sentences” to avoid verbose traces.
- Consistency: For arithmetic, enforce a fixed step layout with a template: “Step 1: … → Step 2: … → Final: …”.
- Failure modes: CoT helps only when the task is decomposable; for open-ended creativity it can hurt performance.
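Once the trace is delimited, separating it from the answer is a small parsing job. The following is a minimal sketch that assumes the model was instructed to emit `<reasoning>` and `<answer>` tags; `split_trace` is a hypothetical helper name.

```python
import re


def split_trace(completion: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from tagged model output."""
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", completion, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if answer is None:
        # Treat a missing answer tag as a failed generation.
        raise ValueError("no <answer> tag found in completion")
    reasoning_text = reasoning.group(1).strip() if reasoning else ""
    return reasoning_text, answer.group(1).strip()


sample = (
    "<reasoning>16 - 3 = 13; 13 - 4 = 9; 9 * 2 = 18</reasoning>"
    "<answer>$18</answer>"
)
reasoning, answer = split_trace(sample)
print(answer)
```

Raising on a missing answer tag, rather than returning an empty string, makes truncated or malformed generations visible in logs.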
Few-Shot Scaffolding: Structured Demonstrations with Roles and Constraints
Few-shot prompting becomes more reliable when each example is not just a question-answer pair but a miniature “workflow” that teaches the LLM how to behave.
Role Assignment
Assign a persona or role that the model must inhabit for the duration of the conversation.
You are Dr. Lee, a board-certified cardiologist reviewing patient echocardiogram reports.
Your task is to grade diastolic dysfunction on a 0-3 scale and write a one-sentence summary.
Report 1: ...
Grade: 1
Summary: Mild diastolic dysfunction with preserved EF.
Report 2: ...
Grade: 3
Summary: Severe restrictive pattern with elevated LVEDP.
Constraint Injection
Add explicit formatting rules so the model’s output is parseable later.
- Output format: JSON with keys: {"grade": int, "summary": str, "actionable": bool}
- grade must be 0, 1, 2, or 3
- actionable is true only if the summary contains the word "follow-up"
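Constraints stated in the prompt are only useful if the output is checked against them. The sketch below validates a response against the three rules above; `validate_grade_output` is a hypothetical helper, and rejecting extra keys is an added assumption.

```python
import json


def validate_grade_output(raw: str) -> dict:
    """Parse model output and enforce the format rules; raise ValueError on violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"grade", "summary", "actionable"}:
        raise ValueError("expected exactly the keys grade, summary, actionable")
    grade, summary, actionable = data["grade"], data["summary"], data["actionable"]
    # type(...) is int rejects booleans and floats such as 1.0.
    if type(grade) is not int or grade not in (0, 1, 2, 3):
        raise ValueError("grade must be 0, 1, 2, or 3")
    if not isinstance(summary, str):
        raise ValueError("summary must be a string")
    if not isinstance(actionable, bool):
        raise ValueError("actionable must be a boolean")
    if actionable and "follow-up" not in summary:
        raise ValueError('actionable may be true only if summary contains "follow-up"')
    return data


checked = validate_grade_output(
    '{"grade": 1, "summary": "Mild dysfunction; follow-up in 6 months.", "actionable": true}'
)
print(checked["grade"])
```

A failed validation can be used to trigger a retry or to route the case to manual review.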
Contrastive Examples
Pairs of correct vs. incorrect traces can steer the model away from common mistakes.
Good: "Ejection fraction 55% → Grade 1 diastolic dysfunction → Summary: Mild diastolic dysfunction with preserved EF."
Bad: "Ejection fraction 55% → Grade 4 diastolic dysfunction → Summary: Severe systolic impairment."
Dynamic Few-Shot via Embeddings
Instead of hard-coding examples in the prompt, retrieve the most semantically similar demonstrations at runtime using a vector store.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def top_k_examples(user_question, candidates, k=3):
    """Return the k (example, score) pairs most similar to the question.

    candidates: pre-loaded examples, each a dict with a precomputed "emb" vector.
    """
    query_emb = model.encode(user_question)
    scores = cosine_similarity([query_emb], [ex["emb"] for ex in candidates])[0]
    ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    return ranked[:k]
Practical Checklist
| Aspect | Guidance | Notes |
|---|---|---|
| Number of shots | Start with 3–5 | More can make the model over-imitate the examples or waste tokens |
| Diversity | Cover edge cases (ambiguous phrasing, missing data) | Ensures robustness across inputs |
| Order | Place the most representative example first | Position ambiguous or rare cases later |
| Validation | Log the model’s intermediate generations | Detect “copy-paste” from few-shot examples |
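The validation row can be partly automated. The sketch below flags outputs that are nearly identical to a few-shot example using the standard library's `difflib`; the helper name and the 0.9 threshold are assumptions to tune on your own logs.

```python
from difflib import SequenceMatcher


def copied_from_examples(
    output: str, example_outputs: list[str], threshold: float = 0.9
) -> bool:
    """Return True if the output is nearly identical to any few-shot example output."""
    return any(
        SequenceMatcher(None, output.lower(), example.lower()).ratio() >= threshold
        for example in example_outputs
    )


examples = [
    "Mild diastolic dysfunction with preserved EF.",
    "Severe restrictive pattern with elevated LVEDP.",
]
is_copy = copied_from_examples("Mild diastolic dysfunction with preserved EF.", examples)
is_fresh = copied_from_examples("Normal filling pressures; no dysfunction seen.", examples)
print(is_copy, is_fresh)
```

A high rate of flagged outputs suggests the demonstrations are too similar to the inputs or too few in number.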
Systematic Prompt Optimization: Turning Heuristics into Experiments
Prompt engineering is no longer a game of guessing; it is an optimization loop that can be automated with LLMs themselves.
Prompt as Code
Treat the prompt string as a parameterizable function.
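def build_prompt(task: str, style: str = "concise", max_tokens: int = 512) -> str:
    """Build the prompt from parameters so each variant can be generated and tested."""
    return (
        f"Act as an expert {task}.\n"
        f"Style: {style}.\n"
        f"Keep the answer under {max_tokens} tokens.\n"
        "Be factual and cite sources when possible."
    )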
Automatic Prompt Refinement (APR)
Use the same LLM to iteratively improve a prompt.
- Start with an initial prompt.
- Generate 50–100 candidate refinements.
- Score each candidate with a held-out evaluation set.
- Select the prompt with the highest score.
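def refine_prompt(prompt: str, eval_set: list[tuple], n_candidates: int = 50) -> str:
    """Return the best-scoring refinement of `prompt` on a held-out evaluation set."""
    # `llm.generate_refinement` and `evaluate` are placeholders for your own
    # LLM client and scoring function; they are not defined here.
    candidates = [llm.generate_refinement(prompt, i) for i in range(n_candidates)]
    scores = [evaluate(cand, eval_set) for cand in candidates]
    return candidates[scores.index(max(scores))]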
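To make the loop concrete, here is a self-contained sketch that runs several refinement rounds and keeps a candidate only if it beats the current best. The proposer and scorer are toy stand-ins (assumptions for illustration): in practice the proposer would call an LLM and the scorer would run the prompt against held-out examples.

```python
from typing import Callable


def refine_prompt_loop(
    prompt: str,
    eval_set: list,
    propose: Callable[[str, int], str],
    score: Callable[[str, list], float],
    n_candidates: int = 5,
    rounds: int = 3,
) -> tuple[str, float]:
    """Hill-climb over prompt refinements; return the best prompt and its score."""
    best_prompt, best_score = prompt, score(prompt, eval_set)
    for _ in range(rounds):
        candidates = [propose(best_prompt, i) for i in range(n_candidates)]
        for candidate in candidates:
            candidate_score = score(candidate, eval_set)
            if candidate_score > best_score:  # keep strict improvements only
                best_prompt, best_score = candidate, candidate_score
    return best_prompt, best_score


# Toy stand-ins: the proposer appends a fixed suffix, and the scorer counts
# how many required phrases the prompt contains.
SUFFIXES = [" Show each step.", " Answer in JSON.", " Be brief.",
            " Cite sources.", " Use bold for the final answer."]


def toy_propose(prompt: str, i: int) -> str:
    return prompt + SUFFIXES[i % len(SUFFIXES)]


def toy_score(prompt: str, required_phrases: list) -> float:
    return sum(phrase in prompt for phrase in required_phrases) / len(required_phrases)


best, best_score = refine_prompt_loop(
    "Solve the problem.", ["Show each step.", "Answer in JSON."],
    toy_propose, toy_score,
)
print(best, best_score)
```

Scoring the starting prompt first means a round of poor refinements can never make the result worse than the baseline.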
Human-in-the-Loop (HITL) Tuning
Even with automation, human annotators can judge nuanced qualities such as tone, safety, or brand voice. A lightweight HITL dashboard surfaces the top 10 prompts and lets reviewers up-vote or down-vote outputs.
Metrics That Matter
| Metric | What it measures | Example Tool |
|---|---|---|
| Accuracy | Exact match, F1, or domain-specific metrics | Custom evaluator |
| Consistency | Output variance over 10 identical queries | Variance score |
| Latency | Time to first token and end-to-end response time (tokens per second for throughput) | Open-source profiler |
| Safety | Toxicity score | Perspective API |
| Cost | Dollar per thousand queries | Cloud billing API |
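Two of these metrics are easy to compute directly. The sketch below makes two assumptions worth stating: for categorical outputs it uses an agreement rate (share of runs matching the most common output) in place of a numeric variance, and the per-token prices are made-up placeholders, not real pricing.

```python
from collections import Counter


def consistency(outputs: list[str]) -> float:
    """Share of outputs equal to the most common output (1.0 means fully consistent)."""
    if not outputs:
        raise ValueError("need at least one output")
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


def cost_per_thousand_queries(
    prompt_tokens: int,
    completion_tokens: int,
    usd_per_1k_prompt_tokens: float,
    usd_per_1k_completion_tokens: float,
) -> float:
    """Estimated cost in USD of 1,000 queries with the given average token counts."""
    per_query = (
        prompt_tokens / 1000 * usd_per_1k_prompt_tokens
        + completion_tokens / 1000 * usd_per_1k_completion_tokens
    )
    return per_query * 1000


# Ten runs of the same query: eight agree on grade "1", two say "2".
runs = ["1"] * 8 + ["2"] * 2
agreement = consistency(runs)

# Placeholder prices; substitute your provider's actual rates.
cost = cost_per_thousand_queries(500, 200, 0.01, 0.03)
print(agreement, round(cost, 2))
```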
A/B Testing Infrastructure
Wrap your prompt in a lightweight A/B framework so you can roll out new variants to a small percentage of traffic and compare conversion or error rates.
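# `abtesting` and `Experiment` are illustrative names standing in for your own
# experiment framework, not a specific published package.
from abtesting import Experiment

exp = Experiment(
    name="diag_grade_v6",
    variants=["baseline", "cot_v1", "cot_v2"],
    metric=lambda logs: logs["accuracy"],
    traffic_split=0.05,  # share of traffic routed to the new variants
)
selected_variant = exp.serve()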
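If no experiment framework is available, the routing itself can be sketched with the standard library. The version below assumes that the first variant is the baseline and that `traffic_split` is the share of users sent to the remaining variants; `assign_variant` is a hypothetical helper.

```python
import hashlib


def assign_variant(
    user_id: str, experiment: str, variants: list[str], traffic_split: float
) -> str:
    """Deterministically route a user; variants[0] is the baseline."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    if len(variants) == 1 or bucket >= traffic_split:
        return variants[0]
    treatments = variants[1:]
    # Spread the treated share evenly across the non-baseline variants.
    index = int(bucket / traffic_split * len(treatments))
    return treatments[min(index, len(treatments) - 1)]


variants = ["baseline", "cot_v1", "cot_v2"]
first = assign_variant("user-42", "diag_grade_v6", variants, 0.05)
second = assign_variant("user-42", "diag_grade_v6", variants, 0.05)
print(first == second)  # the same user always gets the same variant
```

Hashing the experiment name together with the user ID keeps assignments stable within one experiment while decorrelating them across experiments.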
Prompt Versioning & Rollback
Store every prompt variant in Git, add a semantic commit message (“feat: add contrastive examples for grade 3”), and tag each release. If a new variant causes a regression, roll back in seconds.
Putting It All Together: A Production Pipeline
- Decomposition: Break the task into subtasks (extract → reason → summarize).
- CoT Design: For reasoning-heavy subtasks, use few-shot CoT examples.
- Scaffolding: For structured outputs, inject role, format constraints, and contrastive pairs.
- Optimization: Run APR with a small evaluation set; iterate 3–5 times.
- Deployment: A/B test the final prompt in production, monitor metrics daily, and roll back on regression.
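The decomposition step can be expressed as a chain of named stages whose intermediate outputs are kept for logging. This is a minimal sketch: `run_pipeline` is a hypothetical helper, and the three stages are trivial string functions standing in for real LLM calls.

```python
from typing import Callable

Stage = Callable[[str], str]


def run_pipeline(text: str, stages: list[tuple[str, Stage]]) -> dict[str, str]:
    """Run stages in order, feeding each output to the next; return every intermediate result."""
    results: dict[str, str] = {}
    current = text
    for name, stage in stages:
        current = stage(current)
        results[name] = current  # keep each stage's output for later inspection
    return results


# Toy stages; in production each would wrap a prompt and a model call.
stages = [
    ("extract", lambda text: text.strip()),
    ("reason", lambda text: f"Findings: {text}"),
    ("summarize", lambda text: text.upper()),
]
results = run_pipeline("  EF 55%, E/A ratio 0.7  ", stages)
print(results["summarize"])
```

Keeping every intermediate result makes it possible to evaluate and A/B test each subtask's prompt separately.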
Remember that prompt engineering is not a one-time setup but a continuous loop. As models evolve, so must your prompts; treat them as living artifacts that grow with your product and user expectations.
