During pre-training, a model learns from a large corpus of unstructured data. This is how we teach a model to recognize patterns in natural language. We use a loss function to measure prediction errors and update the model’s weights to improve its capabilities. At this stage, companies often use publicly available information and large data archives, while also adding proprietary data obtained through partnerships with industry players and institutions. This normally doesn’t involve input from human experts, although you still need to curate the data.
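As a minimal sketch of what “measure prediction errors and update the weights” means in practice, the PyTorch-style step below trains on next-token prediction with a cross-entropy loss; `model`, `optimizer`, and `batch` are illustrative placeholders, not part of any specific framework setup.

```python
# Minimal sketch of one pre-training step with a causal language model.
# `model`, `optimizer`, and `batch` are assumed placeholders.
import torch
import torch.nn.functional as F

def pretraining_step(model, optimizer, batch):
    """batch: LongTensor of token ids, shape (batch_size, seq_len)."""
    inputs, targets = batch[:, :-1], batch[:, 1:]   # predict the next token
    logits = model(inputs)                          # (B, T-1, vocab_size)
    loss = F.cross_entropy(                         # prediction error
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()                                 # gradients of the loss
    optimizer.step()                                # update the weights
    return loss.item()
```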
Data curation usually involves four parts: a thoughtful source strategy, data cleaning and normalization, enrichment, and mixture design.
Source Strategy
Source strategy may include pulling data from multiple sources (a small source-registry sketch follows this list), such as:
  • Authoritative corpora: guidelines (CDC, WHO, USP, NICE); textbooks/handbooks; drug labels (FDA SPL); clinical trial registries; PubMed abstracts + OA full text; medical device IFUs; payer rules; billing guides.
  • Practice-proximal text: de-identified clinical notes/SOAP notes; patient education handouts; discharge instructions; call-center scripts; payer denials/appeals (de-identified); care pathways; order sets.
  • Structured knowledge: ontologies and code sets (SNOMED CT, ICD-10/11, CPT/HCPCS, RxNorm, LOINC, MeSH, UMLS) and their mappings.
  • Operations and policy: HIPAA/Part 11/GxP docs, insurer policies, hospital policies; useful for agent behavior and compliance reasoning.
  • Multimodal text anchors: radiology report templates; pathology synoptic reports; de-identified EHR UI text; device readouts; for VLMs, keep captions and alignments.
  • Diversity/coverage: multilingual (ES/FR/DE/zh) and global guidelines; consumer-grade content such as MedlinePlus for layperson alignment.
  • Synthetic but verified: teacher-generated Q&A; synthetic dialogues with AI Patients, gated by AI Judge filters; keep the synthetic ratio bounded and traceable.
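One lightweight way to keep this strategy auditable is a source registry that records, for each corpus, its bucket, license, de-identification status, and whether it is synthetic. The schema and entries below are assumptions for illustration, not a standard.

```python
# Illustrative source registry; field names and entries are assumptions.
from dataclasses import dataclass

@dataclass
class Source:
    name: str           # human-readable corpus name
    bucket: str         # mixture bucket it feeds (see "Mixture Design" below)
    license: str        # license or data-use agreement
    deidentified: bool  # True once de-ID is verified (or no PHI exists)
    synthetic: bool     # flagged so the synthetic ratio stays bounded and traceable

SOURCES = [
    Source("WHO guidelines",       "guidelines",     "public",   True, False),
    Source("PubMed OA full text",  "oa_papers",      "CC-BY",    True, False),
    Source("Partner SOAP notes",   "clinical_notes", "partner",  True, False),
    Source("AI Patient dialogues", "synthetic",      "internal", True, True),
]

# Guardrail: never admit clinical text that has not been de-identified.
assert all(s.deidentified for s in SOURCES if s.bucket == "clinical_notes")
```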
Cleaning and Normalization
During cleaning and normalization (a minimal pipeline sketch follows this list), the goal is to:
  • Remove boilerplate: fix Unicode, normalize punctuation; collapse templates and navigation fluff.
  • Language and script checks: drop off-language text or route to the right bucket.
  • Remove duplicates: use simhash/MinHash/LSH at the chunk and document levels; also de-duplicate paraphrases by an embedding-similarity threshold.
  • Filter length and perplexity: remove ultra-short, gibberish, and very high-perplexity spans; cap pathological repetition.
  • Filter toxic and unsafe content: strip misleading cures, anti-vax spam, and dosage-hallucination patterns; keep a “do-not-train” blacklist of proven unsafe snippets.
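A minimal sketch of such a pass, assuming plain-text records, is below; it covers Unicode normalization, a length filter, a toy “do-not-train” blocklist, and exact-hash deduplication. A production pipeline would add language ID, MinHash/LSH near-dedup, and model-based perplexity filtering.

```python
# Minimal cleaning/dedup pass over text records; a sketch of the ideas above,
# not a production pipeline. Blocklist patterns are illustrative.
import hashlib
import re
import unicodedata

BLOCKLIST = {"miracle cure", "vaccines cause"}  # illustrative "do-not-train" patterns

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)  # fix Unicode
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

def clean_corpus(records):
    seen = set()
    for text in records:
        text = normalize(text)
        if len(text.split()) < 5:                          # drop ultra-short spans
            continue
        if any(p in text.lower() for p in BLOCKLIST):      # safety filter
            continue
        digest = hashlib.sha1(text.lower().encode()).hexdigest()
        if digest in seen:                                 # exact duplicate
            continue
        seen.add(digest)
        yield text
```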
Data Enrichment
With data enrichment (see the weak-supervision sketch after this list), we mainly want to:
  • Use weak supervision: distantly label spans (drugs, labs, diagnoses) via RxNorm/LOINC/SNOMED dictionaries; attach codes as side channels for multitask pre-training.
  • Add section headers and discourse: tag HPI/ROS/Plan/Impression, Indication/Contraindication, and so on; this improves structure learning.
  • Maintain citation and claim links: for guideline or abstract text, preserve references; useful for later retrieval-augmented setups.
  • Add metadata tags: effective dates, guideline versions, trial phases; these allow time-aware sampling and evaluation.
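The weak-supervision idea can be sketched as a dictionary lookup that attaches coded spans to each record as a side channel; the terms and codes below are illustrative entries, not a full RxNorm/LOINC/SNOMED dictionary.

```python
# Toy distant-labeling pass; dictionary entries are for illustration only.
import re

DICTIONARY = {
    "metformin": ("drug", "RxNorm:6809"),
    "hemoglobin a1c": ("lab", "LOINC:4548-4"),
    "type 2 diabetes": ("diagnosis", "SNOMED:44054006"),
}

def label_spans(text: str):
    """Attach (start, end, type, code) tuples as a side channel for pre-training."""
    spans = []
    lowered = text.lower()
    for term, (kind, code) in DICTIONARY.items():
        for m in re.finditer(re.escape(term), lowered):
            spans.append((m.start(), m.end(), kind, code))
    return {"text": text, "spans": sorted(spans)}
```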
Mixture Design
Finally, with mixture design (a ratio-driven sampling sketch follows this list), we need to:
  • Define buckets: Guidelines; Trials; OA papers; Clinical notes (de-identified); Patient education; Ops/Policy; Lay content; Synthetic.
  • Set target ratios: for example, 20% guidelines/labels; 25% clinical notes; 20% PubMed OA; 10% trials; 10% patient ed; 5% ops/policy; 10% synthetic; tune empirically.
  • Curriculum: start with clean/simple expository text → progress to notes/dialogues → end with harder edge cases. Optionally use a UL2-style span-corruption mix.
  • Hard-example up-weighting: dose/duration instructions; med-med interactions; risk/benefit sections; rare diseases; maintain caps to prevent overfitting.
  • Long-context shards: preserve entire reports or multi-note episodes to teach cross-section reasoning.
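A simple way to realize the target ratios is ratio-driven sampling over per-bucket document streams; the sketch below mirrors the example ratios above and assumes each bucket exposes an (effectively endless) iterator of documents.

```python
# Sketch of ratio-driven sampling across buckets; ratios mirror the example
# above, and `bucket_iters` is an assumed mapping of bucket name -> iterator.
import random

TARGET_RATIOS = {
    "guidelines_labels": 0.20, "clinical_notes": 0.25, "pubmed_oa": 0.20,
    "trials": 0.10, "patient_ed": 0.10, "ops_policy": 0.05, "synthetic": 0.10,
}

def sample_batch(bucket_iters, batch_size: int, seed: int = 0):
    """Draw documents so batch composition tracks TARGET_RATIOS in expectation."""
    rng = random.Random(seed)
    buckets, weights = zip(*TARGET_RATIOS.items())
    batch = []
    for _ in range(batch_size):
        bucket = rng.choices(buckets, weights=weights, k=1)[0]
        batch.append(next(bucket_iters[bucket]))  # assumed endless per-bucket iterator
    return batch
```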
These steps may vary from team to team, but once they are complete, we run the training pipeline and move on to the next stage.