Data curation usually involves a thoughtful source strategy; data cleaning and normalization; enrichment; and mixture design.

Source Strategy
Source strategy may include pulling data from multiple sources, such as:
- Authoritative corpora: guidelines (CDC, WHO, USP, NICE); textbooks and handbooks; drug labels (FDA SPL); clinical trial registries; PubMed abstracts plus OA full text; medical device IFUs; payer rules; billing guides.
- Practice-proximal text: de-identified clinical notes and SOAP notes; patient education handouts; discharge instructions; call-center scripts; payer denials and appeals (de-identified); care pathways; order sets.
- Structured knowledge: ontologies and code sets (SNOMED CT, ICD-10/11, CPT/HCPCS, RxNorm, LOINC, MeSH, UMLS) and the mappings between them.
- Operations and policy: HIPAA/Part 11/GxP documents, insurer policies, and hospital policies; useful for agent behavior and compliance reasoning.
- Multimodal text anchors: radiology report templates; pathology synoptic reports; EHR UI text (de-identified); device readouts. For VLMs, keep captions and alignments.
- Diversity/coverage: multilingual text (e.g., ES/FR/DE/zh) and global guidelines; consumer-grade content such as MedlinePlus for layperson alignment.
- Synthetic but verified: teacher-generated Q&A and synthetic dialogues with AI patients, gated by AI-judge filters; keep the synthetic ratio bounded and traceable.
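Keeping the synthetic share "bounded and traceable" is easiest if provenance is recorded per document. The sketch below is a minimal, hypothetical schema (the `SourceRecord` fields and the 10% cap are illustrative choices, not prescribed by any standard):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceRecord:
    """Minimal per-document provenance record (illustrative fields)."""
    doc_id: str
    bucket: str       # e.g. "guidelines", "clinical_notes", "synthetic"
    synthetic: bool   # True for teacher-generated / AI-patient dialogues
    license: str      # keep licensing traceable alongside provenance

def synthetic_ratio(records):
    """Fraction of documents flagged as synthetic."""
    if not records:
        return 0.0
    return sum(r.synthetic for r in records) / len(records)

def check_synthetic_cap(records, cap=0.10):
    """Enforce a bounded synthetic share; the cap is a policy choice."""
    return synthetic_ratio(records) <= cap
```

Because the flag travels with each document, the cap can be re-checked after every curation pass rather than estimated once up front.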
During cleaning and normalization, the goal is to:
- Remove boilerplate: fix Unicode, normalize punctuation, and collapse templates and navigation fluff.
- Language and script checks: drop off-language text or route it to the right bucket.
- Remove duplicates: use simhash/MinHash/LSH at the chunk and document levels; also de-duplicate paraphrases by an embedding-similarity threshold.
- Filter length and perplexity: remove ultra-short, gibberish, and very high-perplexity spans; cap pathological repetition.
- Filter unsafe content: strip misleading cures, anti-vax spam, and dosage-hallucination patterns; keep a "do-not-train" blacklist of proven unsafe snippets.
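The near-duplicate step above can be illustrated with a small, stdlib-only MinHash sketch. This is a toy (64 hash slots, character 5-gram shingles, `blake2b` as the hash family); production pipelines would add LSH banding on top to avoid all-pairs comparison:

```python
import hashlib

def shingles(text, k=5):
    """Character k-gram shingles of a whitespace-normalized, lowercased text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(text, num_hashes=64):
    """MinHash signature: for each seed, keep the minimum shingle hash."""
    sig = []
    for seed in range(num_hashes):
        min_h = min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles(text)
        )
        sig.append(min_h)
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates Jaccard similarity of shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Pairs whose estimated Jaccard exceeds a threshold (e.g., 0.8) are treated as duplicates; paraphrase-level duplicates, which share meaning but few shingles, are the case the embedding-similarity pass is for.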
With data enrichment, we mainly want to:
- Use weak supervision: distantly label spans (drugs, labs, diagnoses) via RxNorm/LOINC/SNOMED dictionaries; attach codes as side channels for multitask pre-training.
- Add section headers and discourse: tag HPI/ROS/Plan/Impression, Indication/Contraindication, and so on; this improves structure learning.
- Maintain citation and claim links: for guideline or abstract text, preserve references; these are useful for later retrieval-augmented setups.
- Add metadata tags: effective dates, guideline versions, trial phases; these allow time-aware sampling and evaluation.
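The distant-labeling step above reduces to dictionary matching over normalized text. A minimal sketch, assuming a tiny hand-built drug dictionary (in practice the term-to-code map would be derived from RxNorm/LOINC/SNOMED releases; the terms and codes below are illustrative):

```python
import re

# Hypothetical mini-dictionary; a real one is built from RxNorm/LOINC/SNOMED.
DRUG_DICT = {
    "metformin": "RxNorm:6809",
    "lisinopril": "RxNorm:29046",
}

def distant_label(text, dictionary):
    """Tag dictionary terms as (start, end, surface_form, code) spans."""
    spans = []
    for term, code in dictionary.items():
        # Word-boundary match so "metformin" does not fire inside other tokens.
        for m in re.finditer(rf"\b{re.escape(term)}\b", text, re.IGNORECASE):
            spans.append((m.start(), m.end(), m.group(0), code))
    return sorted(spans)

text = "Start metformin 500 mg; continue lisinopril."
labels = distant_label(text, DRUG_DICT)
```

The resulting spans can be serialized alongside the text as a side channel, so a multitask objective can predict codes without polluting the raw token stream.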
Finally, with mixture design, we need to:
- Define buckets: Guidelines; Trials; OA papers; Clinical notes (de-identified); Patient education; Ops/Policy; Lay content; Synthetic.
- Set target ratios: for example, 20% guidelines/labels; 25% clinical notes; 20% PubMed OA; 10% trials; 10% patient ed; 5% ops/policy; 10% synthetic; tune empirically.
- Curriculum: start with clean/simple expository text → progress to notes/dialogues → end with harder edge cases. Optionally use a UL2-style span-corruption mix.
- Hard-example up-weighting: dose/duration instructions; med-med interactions; risk/benefit sections; rare diseases; maintain caps to prevent overfitting.
- Long-context shards: preserve entire reports or multi-note episodes to teach cross-section reasoning.
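Target ratios like those above translate directly into weighted sampling over buckets at batch-construction time. A minimal sketch using the example ratios from the text (bucket names are illustrative; real pipelines typically also apply temperature smoothing and track realized vs. target counts):

```python
import random

TARGET_MIX = {  # example ratios from the text; tune empirically
    "guidelines_labels": 0.20,
    "clinical_notes": 0.25,
    "pubmed_oa": 0.20,
    "trials": 0.10,
    "patient_ed": 0.10,
    "ops_policy": 0.05,
    "synthetic": 0.10,
}

def sample_bucket(mix, rng):
    """Draw one bucket with probability proportional to its target ratio."""
    buckets, weights = zip(*mix.items())
    return rng.choices(buckets, weights=weights, k=1)[0]

# Realized counts over many draws should track the target ratios.
rng = random.Random(0)
counts = {b: 0 for b in TARGET_MIX}
for _ in range(10_000):
    counts[sample_bucket(TARGET_MIX, rng)] += 1
```

Sampling per draw (rather than pre-slicing the corpus) makes it cheap to re-tune the mix between training phases, which is what the curriculum bullet above requires.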
