The majority of time and effort is dedicated to improving the trained model and fitting it to the tasks we care about. Almost all application-layer players focus exclusively on post-training, expecting the big labs to take care of pre-training. At this stage, we mostly want to understand what a model learned and identify the behaviors we want to adjust. There is a default set of tools and approaches used to fix a model.

Instruction tuning

For many application-layer builders, this is the very first step. In the current world, it’s safe to say that big LLMs are proficient enough in a variety of topics. Hence, it comes down to instructing them correctly to do the job. Instructions need to be detailed but also crisp. Current LLMs do a great job of instruction following, but they can still fall short in context handling. Some researchers suggest that anything above 4,000 tokens leads to degradation in performance; however, it’s such an important area that models are getting better at it by the day. The most common roadmap includes the following steps:
  • Build a baseline prompt: specify role, scope, guardrails, and escalation. Typically includes up to 5 canonical examples per task (see the prompt sketch after this list).
  • Add programmatic constraints: include a checklist and refusal rules, such as: “Always include: vitals, allergies, pregnancy status…” and when to say “need labs / escalate.”
  • Optional — add retrieval-grounded instructions: “Use ONLY the provided context; cite a numeric reference,” and instruct what guidelines to prefer, such as: “Prefer guidelines effective ≤24 months old.”
  • Add tools: introduce a plan for tool use and specific instructions for when a certain tool needs to be called, as well as how to handle fallbacks.
  • Force decomposition for different scenarios: for instance, for patient triage, a model or agent can be instructed to run through DIAGNOSIS → RISKS → TESTS → TREATMENT at all times.
  • Auto-instructors: add evolutionary instructors and use a multi-armed bandit to route each query to the best candidate instructor, with automatic validation for quality and promotion.
  • Advanced auto-instructors: use an orchestrator to route a query to the correct instructor, passing additional parameter weights based on the user’s query.
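To make the roadmap concrete, here is a minimal sketch of how the baseline pieces could be assembled into a single system prompt for the patient-triage example. The function name, wording, and rules below are illustrative assumptions, not a prescribed template.

```python
# Minimal sketch of a baseline instruction prompt for the patient-triage example.
# Every name, rule, and piece of wording here is illustrative, not a fixed template.

ROLE_AND_SCOPE = (
    "You are a clinical triage assistant supporting licensed clinicians. "
    "You do not give final diagnoses or prescriptions; escalate when unsure."
)

CHECKLIST = (
    "Always include: vitals, allergies, pregnancy status. "
    "If key labs are missing, say 'need labs' and escalate to a clinician."
)

RETRIEVAL_RULES = (
    "Use ONLY the provided context; cite a numeric reference for every claim. "
    "Prefer guidelines effective within the last 24 months."
)

DECOMPOSITION = "Always work through DIAGNOSIS -> RISKS -> TESTS -> TREATMENT, in that order."


def build_triage_prompt(context_chunks: list[str]) -> str:
    """Assemble the system prompt from the roadmap pieces plus retrieved context."""
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(context_chunks))
    return "\n\n".join(
        [ROLE_AND_SCOPE, CHECKLIST, RETRIEVAL_RULES, DECOMPOSITION, f"Context:\n{context}"]
    )
```

Canonical few-shot examples and tool-use instructions would be appended to the same prompt in the same way.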

SFT

Supervised fine-tuning (SFT) uses pairs of inputs (prompts) and outputs (answers). By curating data and post-training a model with SFT, we expect it to learn domain-specific terminology and patterns. For instance, a general big LLM, after oncology-specific SFT, will be much better at understanding that ER stands for the Estrogen Receptor biomarker and not an Emergency Room. While a model gains domain-specific knowledge, it can also forget other capabilities (catastrophic forgetting), so we need to be mindful that it doesn't lose what is truly important for a given use case. An example SFT pair for medicine is:
  • Prompt: What is ibuprofen?
  • Answer: Ibuprofen is a non-steroidal anti-inflammatory drug (NSAID). It reduces pain, inflammation, and fever by blocking COX enzymes that make prostaglandins (the chemicals that drive swelling and pain).
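As a sketch of how such a pair might be serialized for training, here is the same example in a chat-message JSONL format; the exact schema (keys and role names) varies by training framework and is an assumption here.

```python
import json

# One SFT example serialized in a common chat-message format; the exact schema
# (keys and role names) varies by training framework and is an assumption here.
example = {
    "messages": [
        {"role": "user", "content": "What is ibuprofen?"},
        {
            "role": "assistant",
            "content": (
                "Ibuprofen is a non-steroidal anti-inflammatory drug (NSAID). "
                "It reduces pain, inflammation, and fever by blocking COX enzymes "
                "that make prostaglandins (the chemicals that drive swelling and pain)."
            ),
        },
    ]
}

# SFT datasets are usually stored with one JSON object per line (JSONL).
print(json.dumps(example))
```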
SFT comes in many flavors. Four common buckets are supervision granularity, interaction, data source, and parameter update method.
Granularity
  • Direct SFT — demo pairs: input → target output. Classic instruction tuning, single or multi-turn.
  • Process-supervised SFT: train on intermediate steps — reasoning traces, tool plans, decomposition. Great for reliability on hard tasks.
  • Critique-and-revise SFT: the target contains an initial answer, a critique, and a revised answer. Teaches self-correction loops (a format sketch follows this list).
  • Edit or patch SFT: bad output plus edit instructions → corrected output. Excellent for “make this safer, shorter, structured.”
  • Outcome-only SFT: no rationale; aligns to the final clinical answer or summary only — use when rationales are risky to expose.
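For instance, a critique-and-revise example packs the initial answer, the critique, and the revision into one training target. The sketch below is illustrative; the section markers, field names, and clinical content are assumptions rather than a fixed format.

```python
# Sketch of one critique-and-revise SFT example: the training target packs the
# initial answer, the critique, and the revised answer into a single completion.
# The section markers, field names, and clinical content are illustrative.
critique_and_revise_example = {
    "prompt": "Summarize this discharge note for the patient: <note text>",
    "target": (
        "INITIAL ANSWER:\n"
        "You were treated for pneumonia and can stop all medication now.\n\n"
        "CRITIQUE:\n"
        "The note specifies a 5-day antibiotic course; telling the patient to stop "
        "all medication is unsafe and contradicts the source.\n\n"
        "REVISED ANSWER:\n"
        "You were treated for pneumonia. Finish the full 5-day antibiotic course "
        "and contact your doctor if symptoms return."
    ),
}
```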
Interaction — mostly depends on the application; learn more in our Data Modalities article:
  • Single-turn instruction SFT: Q→A or task→completion — for example, write a SOAP note.
  • Multi-turn conversational SFT: dialogue histories with system constraints and assistant style.
  • Tool-use SFT: learn to call functions or APIs — EHR queries, calculators — via JSON or function-call traces; a function-call sketch follows this list.
  • Retrieval-grounded SFT: input includes context — guidelines, chart excerpts; targets include grounded answers plus citations.
  • Agent or trajectory SFT: sequences of state, thought, tool action, observation, and final answer — for agents operating in environments.
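Below is a sketch of what a tool-use SFT example might look like, with the assistant turn expressed as a function-call trace. The tool name (`ehr_lab_lookup`), its arguments, and the message schema are hypothetical.

```python
import json

# Sketch of a tool-use SFT example: the assistant turn is a function-call trace
# instead of free text. The tool name (ehr_lab_lookup), its arguments, and the
# message schema are hypothetical.
tool_use_example = {
    "messages": [
        {"role": "system", "content": "You may call tools to query the EHR."},
        {"role": "user", "content": "What was the patient's last HbA1c?"},
        {
            "role": "assistant",
            "tool_call": {
                "name": "ehr_lab_lookup",
                "arguments": {"patient_id": "12345", "lab_code": "HbA1c", "latest": True},
            },
        },
        {
            "role": "tool",
            "name": "ehr_lab_lookup",
            "content": json.dumps({"value": 7.2, "unit": "%", "date": "2024-11-02"}),
        },
        {
            "role": "assistant",
            "content": "The most recent HbA1c on file is 7.2% (2024-11-02).",
        },
    ]
}
```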
Data source
The industry started with human-curated SFT, but several new approaches have emerged:
  • Human demonstration SFT: expert-written targets.
  • Teacher-distilled SFT: targets from a stronger model, optionally filtered by judges.
  • Self-instruct or synthetic demo SFT: the model generates tasks and completions, then you filter or repair.
  • Best-of-N distillation SFT: sample multiple responses and keep the judged best as the target (a sketch follows this list).
  • Lumos Precise SFT: the model is tested in realistic scenarios; mistakes are collected automatically and distilled into reproducible zero-shot scenarios, focusing on known bad behaviors we want to fix.
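Here is a minimal sketch of best-of-N distillation under simple assumptions: sample N candidates, score them with a judge, and keep the best one as an SFT target. `generate` and `judge_score` are placeholders for your model and judge; the threshold is an arbitrary example value.

```python
from typing import Callable, Optional

# Sketch of best-of-N distillation: sample N candidates, score them with a judge,
# and keep the best one as an SFT target. `generate` and `judge_score` are
# placeholders for your model and judge; the threshold is an arbitrary example.
def best_of_n_target(
    prompt: str,
    n: int,
    generate: Callable[[str], str],
    judge_score: Callable[[str, str], float],
    min_score: float = 0.7,
) -> Optional[dict]:
    """Return one SFT example, or None if no candidate clears the judge threshold."""
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(judge_score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    if best_score < min_score:
        return None
    return {"prompt": prompt, "target": best}
```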
Parameter updates
Finally, it comes down to how you want to use SFT for training purposes:
  • Full-parameter SFT: update all weights — costly, best quality ceiling.
  • PEFT (parameter-efficient fine-tuning):
    • LoRA or QLoRA — low-rank adapters on attention or MLP layers (see the LoRA sketch after this list).
    • Prefix or prompt tuning — learned soft prompts.
    • Adapters — bottleneck modules inserted between layers.
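A minimal LoRA sketch using the Hugging Face `transformers` and `peft` libraries is shown below; the base model name (`meta-llama/Llama-3.1-8B`, which requires access approval), the rank, and the target modules are illustrative choices, not recommendations.

```python
# Sketch of LoRA-style PEFT with the Hugging Face `transformers` and `peft` libraries.
# The base model name, rank, and target modules are illustrative choices.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,                                 # adapter rank
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the adapter weights are trainable
```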
The most efficient way currently is to focus on known mistakes and help a model correct itself, while also targeting specific layers responsible for the knowledge. We published a research paper using mechanistic interpretability tools to find which layers are worth targeting — you can read it here.

RLHF

Reinforcement learning from human feedback (RLHF) was the dominant training strategy for big LLMs in 2023–2024. It is a three-step process:
  • First — select the metrics — rubrics — to score a model with. Bottom-line metrics are accuracy, safety, reasoning, instruction following, and communication. Each metric is scored using a Likert scale. More on that here.
  • Second — human experts chat with a model or are presented with generated conversations. They are asked to read the conversation and score each parameter.
  • Finally — the scores and justifications are used to train a model:
    • A reward model is trained and used to update the main model's policy via PPO (see the loss sketch after this list), or
    • The data is fed directly back to the model with the same objective to update its policy — DPO.
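For the reward-model route, the reward model is typically trained with a pairwise (Bradley-Terry style) loss that pushes the reward of the chosen response above the rejected one. A minimal sketch, assuming scalar rewards have already been computed for a preferred and a rejected response:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) loss: push chosen rewards above rejected ones."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: scalar rewards for a batch of three preference pairs.
chosen = torch.tensor([1.2, 0.4, 0.9])
rejected = torch.tensor([0.3, 0.5, -0.1])
loss = reward_model_loss(chosen, rejected)
```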
The most challenging part is often designing the rubrics correctly and selecting the pool of experts to grade the responses. More on that here. This approach works well for metrics where there can objectively be right or wrong answers. For instance, when we want to ensure that a model or agent can collect all the necessary information prior to suggesting a diagnosis. This is a so-called verifiable domain or verifiable metric. Recently, the industry has realized that there are also unverifiable domains and metrics, such as communication or ethics.

Preference Ranking

Preference ranking is a project setup where human experts are shown two examples of an output for the same prompt, and they must pick which one they prefer. To achieve statistically significant results, this method requires the largest dataset and takes the most time to compile. Given the time constraints, it comes down to how we collect the data:
  • Human pairs: clinicians choose A vs B — or A≈B — for accuracy, safety, grounding.
  • RLAIF: preference ranking done with AI judges; humans audit only edge cases.
  • Hybrid: AI pre-screens → humans validate top K.
  • Implicit — in production: show options to an end user and ask for their preference.
Reinforcement learning from AI feedback (RLAIF) is gaining popularity. With this approach, human experts design rubrics for a prompt and then refine them after a model generates an output; this was publicly demonstrated in HealthBench by OpenAI. The data is then used either to improve the model directly or to build a separate reward model.
With direct optimization, it can be:
  • DPO — Direct Preference Optimization: fit the policy to prefer chosen outputs given the same prompt — simple and stable; see the loss sketch after this list.
  • IPO or KTO: DPO variants with modified objectives (for example, a squared loss or unpaired feedback) — often more stable for long outputs.
  • RRHF: rank responses by a reference scorer and pull the policy toward higher-ranked ones — works well with AI judges.
  • Best-of-N distillation: sample N, keep the judged best as SFT targets — cheap and strong baseline.
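A minimal sketch of the standard DPO loss, assuming per-sequence log-probabilities (summed over tokens) are precomputed for the policy and a frozen reference model; tensor names and the beta value are illustrative:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: prefer chosen over rejected, relative to the reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```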
And with a reward model, it comes down to:
  • PPO or RLHF: classic — strong but engineering-heavy.
  • ReMax or GRPO: simpler gradient estimators for text.
  • Constraint-aware RL: add penalties for JSON invalidity, unsafe dose suggestions, hallucinated citations.
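As a sketch of constraint-aware shaping, the reward can simply be penalized when an output violates a hard constraint. Only a JSON-validity check is shown below, and the penalty value is arbitrary; dose-safety and citation checks would be additional validators in the same spirit.

```python
import json

# Sketch of constraint-aware reward shaping: subtract a penalty when an output
# violates a hard constraint. Only JSON validity is shown; dose-safety and
# citation checks would be additional validators in the same spirit.
def shaped_reward(base_reward: float, output_text: str, json_penalty: float = 1.0) -> float:
    try:
        json.loads(output_text)
    except json.JSONDecodeError:
        return base_reward - json_penalty
    return base_reward
```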
At Lumos, we have strong opinions on how each method should be implemented based on our experience working with major model providers and industry leaders, while we remain flexible on a pipeline that works best for you. We follow a forward-deployment approach to ensure we co-develop the best solution to improve the bottom-line performance of your model or agent.