About Lumos
What is Lumos?
Lumos builds evaluation and data infrastructure for health & life-science AI. We stress-test models and agents, pinpoint failure modes, and feed results back as training fuel (SFT, RLHF, and RL).
Who is Lumos for?
Foundational model labs and application teams in healthcare, life sciences, and biotech (clinical, payer, pharma, RWE, etc.).
What problems do you solve?
We make model quality measurable and improvable—turning vague “accuracy/safety” into hundreds of concrete, clinically grounded metrics and actionable data to fix issues.
Core Concepts
What are AI Judges?
LLM-based evaluators calibrated against clinicians. They score model outputs across granular rubrics (e.g., safety flags, guideline adherence, reasoning quality) and align closely with human judgments.
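For illustration only, here is a minimal sketch of how a rubric-scored judge verdict could be represented; the field names (`metric`, `safety_flags`, etc.) are assumptions made for this example, not our production schema:

```python
from dataclasses import dataclass, field

@dataclass
class JudgeScore:
    """Hypothetical shape of one AI Judge verdict on a single rubric item."""
    metric: str     # e.g., "guideline_adherence"
    score: float    # 0.0-1.0, calibrated against clinician ratings
    rationale: str  # short justification produced by the judge

@dataclass
class JudgedOutput:
    """One model response scored across a granular rubric."""
    response_id: str
    scores: list[JudgeScore] = field(default_factory=list)
    safety_flags: list[str] = field(default_factory=list)  # e.g., ["missed red-flag symptom"]

# Example: a single turn scored on two rubric items
verdict = JudgedOutput(
    response_id="turn-07",
    scores=[
        JudgeScore("safety", 0.92, "Escalation advice is present and correct."),
        JudgeScore("guideline_adherence", 0.78, "Dosing range cited, but the source guideline is not named."),
    ],
)
print(f"{verdict.response_id}: mean score = {sum(s.score for s in verdict.scores) / len(verdict.scores):.2f}")
```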
What are AI Patients?
Structured, synthetic patient agents (with FHIR-style dossiers and memory) used to generate realistic multi-turn clinical conversations for evaluation and training.
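As a rough sketch (not our actual format), a FHIR-style dossier for a synthetic patient might pair standard FHIR resources with the memory the agent carries across turns:

```python
# Illustrative only: a synthetic patient dossier built from FHIR-style resources
# plus a simple memory store; the field choices are assumptions, not our schema.
synthetic_patient = {
    "dossier": [
        {
            "resourceType": "Patient",
            "id": "synth-0042",
            "name": [{"family": "Rivera", "given": ["Ana"]}],
            "gender": "female",
            "birthDate": "1979-03-14",
        },
        {
            "resourceType": "Condition",
            "subject": {"reference": "Patient/synth-0042"},
            "code": {"text": "Type 2 diabetes mellitus"},
        },
    ],
    # Memory the AI Patient keeps so multi-turn conversations stay consistent
    "memory": {
        "disclosed_so_far": ["polyuria for two weeks"],
        "withheld_until_asked": ["stopped metformin last month"],
    },
}
```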
How do these differ from generic benchmarks?
They’re domain-specific, rubric-rich, multi-turn, and intervention-oriented: they don’t just rank; they reveal why a model fails and how to fix it.
What Lumos Delivers
What do I get from an evaluation?
It depends on the project setup, but generally:
- A dashboard and/or report with granular metrics and heatmaps of failure modes
- Annotated transcripts and exemplars
- A prioritized “fix plan” (prompt changes, tools to add, data to collect)
- Optional training data packs (SFT/RLHF) or RL environments
Do you offer leaderboards or public reporting?
Optionally. Results are private by default; public options (e.g., challenge tracks or leaderboards) are opt-in.
Can you guarantee improvements?
We never promise magic. We design for measurable, repeatable gains by targeting the exact failure modes surfaced in evaluation, and we bring years of experience helping the largest foundation model labs train their models.
Data & Security
Do you need my PHI?
No. We can evaluate with de-identified/synthetic data. If PHI is required, we support secure workflows under appropriate agreements.
Are you HIPAA compliant?
We operate under HIPAA-aligned controls and sign BAAs when PHI is involved.
Who owns the data and outputs?
You own your data and your model outputs. You also own custom datasets we build for you, unless we both agree to create a shared/public benchmark.
Do you use customer data to train third-party models?
No. We do not train third-party foundation models on your data.
Evaluation Design
What modalities do you support?
Any and all: conversational agents, summarization/generation (e.g., SOAP notes), retrieval/agentic workflows, extraction/structuring, clinical images/video, and voice, including some of the trickiest modalities such as BCI-related 3D imaging.
Single-turn vs multi-turn?
Both. Multi-turn is preferred for agentic and clinical reasoning tasks; single-turn is used for atomic skills and ablations.
How do you choose metrics?
We combine: (1) clinical safety/risk rubrics, (2) guideline and policy checks, (3) task-specific success criteria, and (4) user-defined KPIs (throughput, deflection, autonomy).
Do you evaluate tool use and agentic behavior?
Yes. We evaluate tool selection, grounding, citation fidelity, and error recovery within agentic flows.
Do you compare models head-to-head?
Yes. Side-by-side evaluation and paired tests are standard.
From Evaluation to Improvement
What happens after issues are found?
We convert findings into:
- Prompt & instruction optimizations
- Data recipes (SFT/RLHF sets) with preference ranking (example record sketched after this list)
- RL environment tasks for targeted skill building
- Safety guardrails and refusal criteria
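As referenced in the list above, here is a rough sketch of what one preference-ranked training record could look like; the field names are illustrative, and the exact delivery schema is agreed per project:

```python
# Illustrative preference-ranking record for an RLHF-style data pack;
# the exact delivery schema is agreed per project.
preference_record = {
    "prompt": "Patient reports new-onset chest pain radiating to the left arm.",
    "responses": [
        {"text": "Advise the patient to call emergency services immediately.", "rank": 1},
        {"text": "Suggest rest and a routine follow-up visit next week.", "rank": 2},
    ],
    "rationale": "Response 1 correctly treats this as a possible acute coronary syndrome.",
    "annotator": "cardiology RN, passed seeded QC",
}
```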
Do you run the training?
We can supply clean, labeled data and guidance, or partner with you/your infra to execute SFT/RLHF/RL. Many clients keep training in-house using our packs and RL tasks.
Benchmarks vs. Custom Work
Should we use your off-the-shelf benchmarks?
They're great for fast baselining and external comparisons.
When do we need custom evaluation?
When your product has specific workflows, tools, data constraints, or regulatory targets. Custom evals align metrics with your real success criteria.
Can one benchmark serve multiple customers?
Yes—some thematic benchmarks (e.g., medication safety, triage) can be standardized and licensed to multiple teams. Custom sets remain yours.
Human Experts
Who are your experts?
Clinicians and PhDs verified via multi-signal checks (licenses, credentials, interviews, work trials) across 30+ specialties.
How do you ensure quality and prevent cheating?
Identity verification, behavior fingerprinting, submission analytics, seeded QC tasks, cross-review, and continuous Expert Quality Scoring.
Can our medical team review or co-design rubrics?
Absolutely. We welcome your SMEs for rubric tuning, threshold setting, and sign-off.
Integrations & Deployment
How do we integrate?
- API access to AI Judges and evaluation endpoints (see the sketch below)
- Batch evaluation via secure file drops
- Agent harness for tool-use scenarios
- Dashboard for results exploration
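As one illustration of the API path listed above, here is a minimal sketch of a scoring call, assuming a hypothetical `/v1/evaluations` endpoint and payload shape; the actual endpoints, fields, and authentication come from your integration docs:

```python
import requests

# Hypothetical endpoint, payload, and judge name used only for illustration;
# real URLs, field names, and auth are provided during onboarding.
API_BASE = "https://api.lumos.example"  # placeholder, not a real endpoint

payload = {
    "judge": "clinical-safety-v1",  # assumed judge identifier
    "conversation": [
        {"role": "patient", "text": "I've had chest pain since this morning."},
        {"role": "assistant", "text": "Can you describe the pain and rate it from 1 to 10?"},
    ],
    "metrics": ["safety", "guideline_adherence", "reasoning_quality"],
}

resp = requests.post(
    f"{API_BASE}/v1/evaluations",
    json=payload,
    headers={"Authorization": "Bearer <your-api-key>"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g., per-metric scores and rationales
```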
Can you run inside our VPC?
Yes—via private deployment options.
Which clouds and model providers do you support?
We’re model- and cloud-agnostic, with a slight preference for GCP (in which case you can apply your Google commitment toward our services). We work with major providers and self-hosted models.
Compliance & Governance
Do you align with clinical guidelines and policies?
Yes. Rubrics encode current guidelines where applicable, and we can scope specialty-specific policy packs.
Can you support audit trails and Part 11-style controls?
We maintain provenance for data, prompts, models, judges, and versions, enabling reproducibility and auditing. Enhanced controls are available on request.
Pricing & Commercials
How do you price?
Human experts: a 10% fee during the pilot (first 100 hours), and a flat 15% fee on monthly consumption after the pilot. For example, 200 hours of RNs billed at $90/h (you select the hourly rate) comes to $18,000 toward the experts and $2,700 toward our fees.
Data delivered using AI Patients, AI Judges, and RL environments is custom-priced based on the setup. A typical evaluation project for multi-turn conversations runs at about $1.20 per conversation scored across 40 metrics, with 2+1 human alignment and >90% accuracy (statistically significant).
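To make the fee arithmetic concrete, here is the expert-fee example above as a small calculation (a simplified illustration using the post-pilot rate, not our billing engine):

```python
# Simplified illustration of the expert-fee example above; not a billing engine.
hours = 200            # RN hours in the example
hourly_rate = 90.0     # USD per hour, chosen by you
post_pilot_fee = 0.15  # flat fee on monthly consumption after the pilot

expert_cost = hours * hourly_rate          # $18,000 toward the experts
lumos_fee = expert_cost * post_pilot_fee   # $2,700 toward our fees

print(f"Experts: ${expert_cost:,.0f} | Lumos fee: ${lumos_fee:,.0f}")
```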
Common components include:
- Evaluation scope (scenarios × turns × metrics)
- Expert review volume and specialty
- Data packs (SFT/RLHF) and RL tasks
- Deployment (API/VPC/private)
Do you offer pilots?
Yes—typical pilots baseline current performance and identify 3–5 high-impact fixes with a path to improvement.
Any rev-share or licensing options?
Available for shared/public benchmarks or co-developed assets.
Getting Started
What do you need from us to begin?
- Your target tasks/workflows and success criteria
- Model endpoints or batch outputs
- Any constraints (PHI, tooling, latency, cost)
- Optional: internal rubrics/guidelines to encode
How do we define success?
We agree on metrics, acceptance thresholds, and a remediation plan (prompts, data, RL tasks) at the outset.
For Application Teams (Healthtech)
We don’t have a training pipeline—can you still help?
Yes. Start with eval → prompt/tool improvements → lightweight SFT/RLHF packs you can apply with minimal infra.
Can you evaluate our agent end-to-end?
Yes—conversation → retrieval → tool calls → handoff notes (e.g., SOAP), including safety and documentation quality.
For Foundation Model Labs
Can you provide judge APIs at scale?
Yes—batch and streaming scoring with calibration reports against clinician panels.
Do you support red-teaming and safety suites?
Yes—stress tests for hallucinations, bias, unsafe advice, privacy leaks, and tool-use risks.
Miscellaneous
Do you replace human clinicians?
No. We amplify them—codifying expertise into scalable evaluators and datasets while keeping humans in the loop for calibration.
What’s the difference between “score higher” and “be safer”?
We separate capability (can it do the task?), alignment (does it follow policy/guidelines?), and risk (what happens when it’s wrong?)—and report each clearly.
Can you work under NDAs and data-partner agreements?
Yes. Standard NDAs, DPAs/BAAs, and SOWs are routine.
How do we reach you?
You can reach out via email — feel free to send us a message at [email protected].