Instruction tuning
For many application-layer builders, this is the very first step. Today's big LLMs are already proficient across a wide variety of topics, so the work comes down to instructing them correctly to do the job. Instructions need to be detailed but also crisp. Current LLMs do a great job of instruction following, but they can still fall short in handling long context. Some researchers suggest that anything above 4,000 tokens leads to degraded performance; however, this is such an important area that models are getting better at it by the day. The most common roadmap includes the following steps:
- Build a baseline prompt: specify role, scope, guardrails, and escalation. Typically includes up to 5 canonical examples per task (see the prompt sketch after this list).
- Add programmatic constraints: include a checklist and refusal rules, such as: “Always include: vitals, allergies, pregnancy status…” and when to say “need labs / escalate.”
- Optional — add retrieval-grounded instructions: “Use ONLY the provided context; cite a numeric reference,” and instruct what guidelines to prefer, such as: “Prefer guidelines effective ≤24 months old.”
- Add tools: introduce a plan for tool use and specific instructions for when a certain tool needs to be called, as well as how to handle fallbacks.
- Force decomposition for different scenarios: for instance, for patient triage, a model or agent can be instructed to run through DIAGNOSIS → RISKS → TESTS → TREATMENT at all times.
- Auto-instructors: evolve a pool of instruction variants and use a multi-armed bandit to route each query to the best-performing instructor, with automatic validation for quality and promotion (a minimal bandit sketch follows this list).
- Advanced auto-instructors: use an orchestrator to route a query to the correct instructor, passing additional parameter weights based on the user’s query.
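To make the baseline-prompt and decomposition steps concrete, here is a minimal sketch of a system prompt that fixes role, scope, guardrails, escalation, and the DIAGNOSIS → RISKS → TESTS → TREATMENT order. The wording, guardrails, and helper function are illustrative only, not clinical guidance.

```python
# A minimal baseline system prompt for a patient-triage assistant.
# Section names and guardrails are illustrative, not a clinical standard.
TRIAGE_SYSTEM_PROMPT = """\
ROLE: You are a clinical triage assistant supporting licensed clinicians.
SCOPE: Answer only questions about symptom triage and next steps.
GUARDRAILS:
- Always include: vitals, allergies, pregnancy status when relevant.
- If information is missing, say "need labs / escalate" instead of guessing.
- Never recommend prescription doses without clinician confirmation.
ESCALATION: For chest pain, stroke signs, or anaphylaxis, instruct the user
to contact emergency services immediately.

Work through every case in this fixed order:
DIAGNOSIS -> RISKS -> TESTS -> TREATMENT
Label each section explicitly in your answer.
"""

def build_messages(patient_summary: str) -> list[dict]:
    """Assemble the chat payload: system prompt plus the patient case."""
    return [
        {"role": "system", "content": TRIAGE_SYSTEM_PROMPT},
        {"role": "user", "content": patient_summary},
    ]
```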
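For the auto-instructor step, an epsilon-greedy multi-armed bandit over a pool of instruction variants is one simple starting point. In the sketch below, `call_llm` and `auto_validate` are hypothetical stand-ins for your model call and your quality validator.

```python
import random

class InstructionBandit:
    """Epsilon-greedy bandit that routes queries to the instruction variant
    (prompt "instructor") with the best observed quality score."""

    def __init__(self, variants: list[str], epsilon: float = 0.1):
        self.variants = variants
        self.epsilon = epsilon
        self.counts = [0] * len(variants)
        self.values = [0.0] * len(variants)  # running mean quality per variant

    def select(self) -> int:
        # Explore with probability epsilon, otherwise exploit the best arm.
        if random.random() < self.epsilon:
            return random.randrange(len(self.variants))
        return max(range(len(self.variants)), key=lambda i: self.values[i])

    def update(self, arm: int, reward: float) -> None:
        # Incremental mean update; the reward comes from an auto-validator
        # (for example, a judge model scoring the answer between 0 and 1).
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Usage: pick a prompt variant, run the query, then feed back the quality score.
bandit = InstructionBandit(["prompt_v1 ...", "prompt_v2 ...", "prompt_v3 ..."])
arm = bandit.select()
# answer = call_llm(bandit.variants[arm], query)   # hypothetical LLM call
# bandit.update(arm, auto_validate(answer))        # hypothetical validator
```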
SFT
Supervised fine-tuning (SFT) uses pairs of inputs (prompts) and outputs (answers). By curating data and post-training a model with SFT, we expect it to learn domain-specific terminology and patterns. For instance, a general big LLM, after oncology-specific SFT, will be much better at understanding that ER stands for the Estrogen Receptor biomarker and not an Emergency Room. While the model gains domain-specific knowledge, it may also forget other things, so we need to be mindful and make sure it does not forget what is truly important for a given case. An example SFT pair for medicine:
- Prompt: What is ibuprofen?
- Target: Ibuprofen is a non-steroidal anti-inflammatory drug (NSAID). It reduces pain, inflammation, and fever by blocking COX enzymes that make prostaglandins (the chemicals that drive swelling and pain).
SFT comes in many flavors. Four common buckets are supervision focus (granularity), interaction, data source, and parameter update method.
Granularity
- Direct SFT — demo pairs: input → target output. Classic instruction tuning, single or multi-turn (example records after this list).
- Process-supervised SFT: train on intermediate steps — reasoning traces, tool plans, decomposition. Great for reliability on hard tasks.
- Critique-and-revise SFT: the target consists of an initial answer, a critique, and a revised answer. Teaches self-correction loops.
- Edit or patch SFT: bad output plus edit instructions → corrected output. Excellent for “make this safer, shorter, structured.”
- Outcome-only SFT: no rationale; aligns to the final clinical answer or summary only — use when rationales are risky to expose.
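As a rough illustration of the difference between direct and process-supervised targets, the records below sketch what the two could look like in JSONL. The field names and the clinical content are illustrative only.

```python
import json

# Direct SFT: prompt -> final answer only.
direct_record = {
    "prompt": "What is ibuprofen?",
    "target": "Ibuprofen is a non-steroidal anti-inflammatory drug (NSAID) ...",
}

# Process-supervised SFT: the target also carries the intermediate steps
# the model should learn to produce (field names are illustrative).
process_record = {
    "prompt": "55-year-old with chest pain radiating to the left arm ...",
    "target": {
        "diagnosis": "Suspected acute coronary syndrome",
        "risks": "Myocardial infarction, cardiac arrest",
        "tests": "ECG, troponin, chest X-ray",
        "treatment": "Escalate immediately; follow local protocol",
    },
}

with open("sft_train.jsonl", "w") as f:
    for record in (direct_record, process_record):
        f.write(json.dumps(record) + "\n")
```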
Interaction
- Single-turn instruction SFT: Q→A or task→completion — for example, write a SOAP note.
- Multi-turn conversational SFT: dialogue histories with system constraints and assistant style.
- Tool-use SFT: learn to call functions or APIs — EHR queries, calculators — via JSON or function-call traces (a sample record follows this list).
- Retrieval-grounded SFT: input includes context — guidelines, chart excerpts; targets include grounded answers plus citations.
- Agent or trajectory SFT: sequences of state, thought, tool action, observation, final — for agents operating in environments.
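For tool-use SFT, a training example is typically a chat trace that includes the function call and its result. Below is a sketch in an OpenAI-style message format; the `ehr_lookup` tool and its schema are made up for illustration.

```python
import json

# One tool-use SFT example as a chat trace with a function call.
# The tool name (ehr_lookup) and its arguments are invented for this sketch.
tool_use_example = {
    "messages": [
        {"role": "system", "content": "You can call ehr_lookup(patient_id, fields)."},
        {"role": "user", "content": "Does patient 1234 have any drug allergies?"},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "name": "ehr_lookup",
                "arguments": json.dumps({"patient_id": "1234", "fields": ["allergies"]}),
            }],
        },
        {"role": "tool", "name": "ehr_lookup", "content": '{"allergies": ["penicillin"]}'},
        {"role": "assistant", "content": "Patient 1234 has a recorded penicillin allergy."},
    ]
}

print(json.dumps(tool_use_example, indent=2))
```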
The industry started with human-curated SFT, but several new approaches have emerged:
- Human demonstration SFT: expert-written targets.
- Teacher-distilled SFT: targets from a stronger model, optionally filtered by judges.
- Self-instruct or synthetic demo SFT: the model generates tasks and completions, then you filter or repair.
- Best-of-N distillation SFT: sample multiple responses, keep the judged best as the target (a minimal loop follows this list).
- Lumos Precise SFT: the model is tested in realistic scenarios; mistakes are collected automatically and distilled into reproducible zero-shot scenarios, focusing on known bad behaviors we want to fix.
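A minimal best-of-N distillation loop could look like the following; `generate` and `judge_score` are placeholders for your sampling call and your judge model, not real library functions.

```python
def best_of_n_target(prompt: str, generate, judge_score, n: int = 8) -> dict:
    """Sample n candidate answers, keep the judge's top pick as the SFT target.

    `generate(prompt)` and `judge_score(prompt, answer)` are stand-ins for
    your sampling call and your judge model; both are assumptions here.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(judge_score(prompt, c), c) for c in candidates]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return {"prompt": prompt, "target": best_answer, "judge_score": best_score}
```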
Finally, it comes down to how you want to use SFT for training purposes:
- Full-parameter SFT: update all weights — costly, best quality ceiling.
- PEFT:
- LoRA or QLoRA — rank-limited adapters on attention or MLP layers (a minimal LoRA config follows this list).
- Prefix or prompt tuning — learned soft prompts.
- Adapters — bottleneck modules inserted between layers.
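A minimal LoRA setup with the Hugging Face `peft` library might look like this; the checkpoint name and hyperparameters are illustrative and should be tuned for your task.

```python
# Minimal LoRA sketch with Hugging Face transformers + peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative; any causal LM checkpoint
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

lora_config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
# Train with your usual SFT loop or trainer on the curated pairs.
```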
RLHF
Reinforcement learning from human feedback (RLHF) was a dominant training strategy for big LLMs in 2023–2024. It is a threefold process:
- First, select the metrics (rubrics) to score a model with. Bottom-line metrics are accuracy, safety, reasoning, instruction following, and communication. Each metric is scored on a Likert scale. More on that here.
- Second, human experts chat with a model or are presented with generated conversations. They are asked to read each conversation and score every parameter.
- Finally, the scores and justifications are used to train the model:
- A reward model is trained on the scores and used to update the main model's policy via PPO, or
- The preference data is fed directly back into training with the same objective, updating the policy without a separate reward model (DPO; a loss sketch follows this list).
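For intuition, here is a sketch of the DPO objective over a batch of preference pairs, assuming you have already computed the summed token log-probabilities of each answer under the trained policy and a frozen reference model. Libraries such as TRL package this up, so treat this as an illustration rather than a production implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Direct Preference Optimization loss for a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities of the chosen
    or rejected answer under the policy being trained and the frozen reference.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Push the policy to widen the gap between chosen and rejected answers.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```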
Preference Ranking
Preference ranking is a project setup where human experts are shown two outputs for the same prompt and must pick the one they prefer. To achieve statistically significant results, this method requires the largest dataset and takes the most time to compile. Given the time constraints, it comes down to how we collect the data:
- Human pairs: clinicians choose A vs B (or A≈B) for accuracy, safety, and grounding.
- RLAIF: preference ranking done with AI judges; humans audit only edge cases.
- Hybrid: AI pre-screens → humans validate the top K (a routing sketch follows this list).
- Implicit (in production): show options to end users and ask for their preference.
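One way to wire up the hybrid route is to let an AI judge label the confident cases and queue the uncertain ones for clinicians. In the sketch below, `ai_judge` and the confidence threshold are assumptions, not a prescribed setup.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    answer_a: str
    answer_b: str
    preferred: str | None = None   # "A", "B", or None (tie / unlabeled)
    source: str = "unlabeled"      # "ai", "human", or "unlabeled"

def prescreen(pairs, ai_judge, min_confidence: float = 0.8):
    """Route confident AI verdicts to the dataset, queue the rest for humans.

    `ai_judge(prompt, a, b)` is a stand-in for your judge model; it is assumed
    to return a verdict ("A" or "B") and a confidence in [0, 1].
    """
    auto_labeled, human_queue = [], []
    for pair in pairs:
        verdict, confidence = ai_judge(pair.prompt, pair.answer_a, pair.answer_b)
        if confidence >= min_confidence:
            pair.preferred, pair.source = verdict, "ai"
            auto_labeled.append(pair)
        else:
            human_queue.append(pair)   # clinicians validate these
    return auto_labeled, human_queue
```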
On the optimization side, the options include:
- DPO — Direct Preference Optimization: fit policy to prefer chosen outputs given the same prompt — simple, stable.
- IPO or KTO: DPO-style variants with modified loss formulations; often more stable for long outputs.
- RRHF: rank responses by a reference scorer and pull the policy toward higher-ranked ones — works well with AI judges.
- Best-of-N distillation: sample N, keep the judged best as SFT targets — cheap and strong baseline.
- PPO or RLHF: classic — strong but engineering-heavy.
- ReMax or GRPO: simpler gradient estimators for text.
- Constraint-aware RL: add penalties for JSON invalidity, unsafe dose suggestions, and hallucinated citations (a penalty sketch follows below).
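A constraint-aware penalty can be as simple as a function subtracted from the scalar reward. The weights below are arbitrary, and a real system would add domain-specific checks (for example, a dose validator), which are out of scope for this sketch.

```python
import json
import re

def constraint_penalties(output: str, allowed_citation_ids: set[str]) -> float:
    """Sum of penalties subtracted from the RL reward; weights are illustrative."""
    penalty = 0.0
    # Penalize invalid JSON when a structured answer is required.
    try:
        json.loads(output)
    except json.JSONDecodeError:
        penalty += 1.0
    # Penalize numeric citations ([1], [2], ...) that map to no retrieved source.
    cited = set(re.findall(r"\[(\d+)\]", output))
    penalty += 0.5 * len(cited - allowed_citation_ids)
    return penalty

# Usage (judge_score is a placeholder for your reward or judge model):
# reward = judge_score(answer) - constraint_penalties(answer, {"1", "2", "3"})
```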
