
When To Fine-Tune vs Prompt Engineer: A Decision Framework

Tallha Waheed · January 2, 2025 · 3 min read

A practical rubric for choosing among prompt engineering, adapters/LoRA, and full fine-tuning.

Teams often jump to fine-tuning when better prompt and retrieval design would have been cheaper. Here’s the rubric I use with product and engineering leads to choose between prompting, adapters, and full fine-tunes.

Start With Constraints

  • Latency & cost: Can you afford larger contexts or multiple calls? If not, consider a smaller model + adapters or distillation.
  • Data sensitivity: Can data leave your VPC? If not, lean on open-source models you can host; avoid SaaS-only fine-tunes.
  • Change velocity: How often will instructions or policies shift? Frequent change favors prompts + retrieval over touching weights.
  • Evaluation readiness: Do you have a test harness (groundedness, safety, task success) to catch regressions? If not, start there.

Level 0: Prompt + Retrieval First

  • Use system prompts with clear policies, style, refusal rules, and persona (tone, voice, jurisdiction).
  • Add retrieval for facts instead of stuffing the prompt. Use filters + rerankers to keep contexts tight and relevant.
  • Enforce output schemas (JSON/markdown) with validators; reject malformed outputs early (see the sketch after this list).
  • Add tooling where possible: calculators, policy lookups, canonical phrase libraries—reduce the model’s creative surface area.
  • Measure hallucination and grounding; if your hallucination rate is above 3–5%, fix retrieval and prompting before touching weights.
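
To make the schema point concrete, here’s a minimal sketch using pydantic, assuming a JSON output contract; the `SupportAnswer` fields, the `parse_or_reject` helper, and the sample payload are hypothetical stand-ins for your own pipeline.

```python
from pydantic import BaseModel, ValidationError


# Hypothetical output contract: adapt the fields to your own task.
class SupportAnswer(BaseModel):
    answer: str
    sources: list[str]   # citation IDs from retrieval, so groundedness is checkable
    refused: bool = False


def parse_or_reject(raw_output: str) -> SupportAnswer | None:
    """Reject malformed model output early instead of passing it downstream."""
    try:
        return SupportAnswer.model_validate_json(raw_output)
    except ValidationError:
        return None  # log and retry, or fall back to a safe template


# Usage: validate before anything reaches the user.
raw = '{"answer": "Reset it under Settings.", "sources": ["doc-12"]}'
parsed = parse_or_reject(raw)
if parsed is None:
    print("malformed output, retrying")
else:
    print(parsed.answer)
```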

Level 1: Adapters / LoRA / Prefix Tuning

  • Use when the base model already knows the domain but needs persona/style control or shorter contexts (a minimal adapter setup is sketched after this list).
  • Great for multi-tenant setups: swap adapters per customer or product line without retraining the base.
  • Keep adapters small; track latency impact on your serving stack. Benchmark P50/P95 and cost before/after.
  • Version adapters and prompts together. Roll back with feature flags if safety or compliance checks fail.
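
For instance, a minimal LoRA setup with Hugging Face’s peft library looks roughly like this; the base model name and the hyperparameters are illustrative placeholders, not recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; substitute whatever you host in your own VPC.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Small, cheap adapter: only the attention projections get trainable deltas.
config = LoraConfig(
    r=8,                    # low rank keeps the adapter small
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # sanity-check: typically well under 1% of weights
```

In a multi-tenant setup, you’d train one such adapter per customer or product line and load the right one at serving time, leaving the base weights untouched.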

Level 2: Full Fine-Tune

  • Justified when you need new capabilities not present in the base model, or must reduce context dependence drastically.
  • Requires high-quality, diverse, and counter-example-rich data; otherwise you overfit and regress.
  • Add an evaluation harness covering groundedness, safety, refusal correctness, and task success before shipping (a toy version is sketched after this list).
  • Plan for continuous training: refresh with new data and invalidations; monitor drift and refusal rates.
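
As a sketch of what that harness might look like for two of those metrics, here’s a toy loop over labeled cases; `run_model`, the refusal check, and the substring grader are hypothetical stand-ins for a real grading stack.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    should_refuse: bool           # ground truth: must the model refuse this?
    expected_substring: str = ""  # crude success check; use real graders in production


def evaluate(run_model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Score refusal correctness and task success; gate shipping on the results."""
    refusal_correct = answerable = success = 0
    for case in cases:
        output = run_model(case.prompt)
        refused = "cannot help" in output.lower()  # placeholder refusal detector
        refusal_correct += int(refused == case.should_refuse)
        if not case.should_refuse:
            answerable += 1
            success += int(case.expected_substring in output)
    return {
        "refusal_correctness": refusal_correct / max(len(cases), 1),
        "task_success": success / max(answerable, 1),
    }
```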

Decision Checklist I Run With Teams

1) Can retrieval + prompting + tools solve 80% of the problem? Start there.
2) Need tighter style control or lower latency? Use adapters/LoRA.
3) Need new capabilities, plus high-quality evals and data? Fine-tune.
4) If you can’t evaluate it, you can’t ship it: pause and build the test harness.
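
To make the rubric executable, here’s a toy encoding of those four questions; every input is a self-assessed judgment call, not a measured quantity.

```python
def choose_approach(prompting_solves_80_percent: bool,
                    needs_style_or_latency_control: bool,
                    needs_new_capability: bool,
                    has_quality_evals_and_data: bool) -> str:
    """Toy encoding of the checklist above; inputs are judgment calls."""
    if not has_quality_evals_and_data:
        return "pause: build the test harness first"  # item 4 trumps everything
    if prompting_solves_80_percent:
        return "prompt + retrieval + tools"           # item 1
    if needs_new_capability:
        return "full fine-tune"                       # item 3
    if needs_style_or_latency_control:
        return "adapters / LoRA"                      # item 2
    return "prompt + retrieval + tools"               # default to the cheapest lever
```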

Data Strategy For Fine-Tuning

  • Collect user traces with labels: success/failure, refusal correctness, groundedness, safety violations.
  • Create hard negatives: adversarial instructions, conflicting policies, long contexts, multilingual variants.
  • Balance the dataset: mix short and long prompts; include edge jurisdictions/products.
  • Keep data lineage: record which product version, policy version, and annotator produced each example (see the sketch after this list).
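
One lightweight way to enforce that lineage requirement is to bake it into the training record itself; the field names below are illustrative.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class TrainingExample:
    prompt: str
    completion: str
    label: str             # e.g. "success", "failure", "unsafe"
    product_version: str   # lineage: which product build produced the trace
    policy_version: str    # lineage: which policy the label was judged against
    annotator: str         # lineage: who (or what) labeled it


example = TrainingExample(
    prompt="How do I close my account?",
    completion="You can close it under Settings > Account.",
    label="success",
    product_version="app-2.4.1",
    policy_version="policy-2025-01",
    annotator="rater-017",
)
print(json.dumps(asdict(example)))  # one JSONL row per example keeps lineage queryable
```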

Deployment Notes

  • Version everything: prompts, adapters, datasets, retrievers, rerankers, policies.
  • Canary new weights; keep a rollback path to the last good model.
  • Monitor token cost, latency, refusal rates, and safety continuously. Alert on P95 blowups and hallucination spikes (a minimal alert check is sketched after this list).
  • For regulated flows, keep source-citation checks and safety filters even after fine-tuning—weights don’t remove the need for guardrails.
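
Here’s a minimal sketch of that P95 alert, assuming you already collect per-request latencies; the budget and the `page_oncall` hook are placeholders for your own SLO and paging system.

```python
import statistics

P95_BUDGET_MS = 1200.0  # placeholder SLO; set this from your own benchmarks


def page_oncall(message: str) -> None:
    # Placeholder: wire this to your paging/alerting system.
    print("ALERT:", message)


def check_latency(window_ms: list[float]) -> None:
    """Alert when the rolling P95 blows past budget; pair with hallucination checks."""
    if len(window_ms) < 20:
        return  # too few samples to trust the tail
    p95 = statistics.quantiles(window_ms, n=20)[-1]  # 95th percentile cut point
    if p95 > P95_BUDGET_MS:
        page_oncall(f"P95 latency {p95:.0f}ms exceeds {P95_BUDGET_MS:.0f}ms budget")
```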

Use this framework to avoid premature fine-tuning while still giving your product the control, efficiency, and safety it needs.