About Lumos
What is Lumos?
Lumos builds evaluation and data infrastructure for health & life-science AI. We stress-test models and agents, pinpoint failure modes, and feed results back as training fuel (SFT, RLHF, and RL).
Who is Lumos for?
Foundational model labs and application teams in healthcare, life sciences, and biotech (clinical, payer, pharma, RWE, etc.).
What problems do you solve?
We make model quality measurable and improvable—turning vague “accuracy/safety” into hundreds of concrete, clinically grounded metrics and actionable data to fix issues.
Core Concepts
What are AI Judges?
LLM-based evaluators calibrated against clinicians. They score model outputs across granular rubrics (e.g., safety flags, guideline adherence, reasoning quality) and align closely with human judgments.
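For illustration only, here is a minimal sketch of how a rubric-scored judge verdict could be represented; the field names (`metric`, `safety_flags`, etc.) are assumptions made for this example, not our production schema:

```python
from dataclasses import dataclass, field

@dataclass
class JudgeScore:
    """Hypothetical shape of one AI Judge verdict on a single rubric item."""
    metric: str     # e.g., "guideline_adherence"
    score: float    # 0.0-1.0, calibrated against clinician ratings
    rationale: str  # short justification produced by the judge

@dataclass
class JudgedOutput:
    """One model response scored across a granular rubric."""
    response_id: str
    scores: list[JudgeScore] = field(default_factory=list)
    safety_flags: list[str] = field(default_factory=list)  # e.g., ["missed red-flag symptom"]

# Example: a single turn scored on two rubric items
verdict = JudgedOutput(
    response_id="turn-07",
    scores=[
        JudgeScore("safety", 0.92, "Escalation advice is present and correct."),
        JudgeScore("guideline_adherence", 0.78, "Dosing range cited, but the source guideline is not named."),
    ],
)
print(f"{verdict.response_id}: mean score = {sum(s.score for s in verdict.scores) / len(verdict.scores):.2f}")
```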
What are AI Patients?
Structured, synthetic patient agents (with FHIR-style dossiers and memory) used to generate realistic multi-turn clinical conversations for evaluation and training.
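As a rough sketch (not our actual format), a FHIR-style dossier for a synthetic patient might pair standard FHIR resources with the memory the agent carries across turns:

```python
# Illustrative only: a synthetic patient dossier built from FHIR-style resources
# plus a simple memory store; the field choices are assumptions, not our schema.
synthetic_patient = {
    "dossier": [
        {
            "resourceType": "Patient",
            "id": "synth-0042",
            "name": [{"family": "Rivera", "given": ["Ana"]}],
            "gender": "female",
            "birthDate": "1979-03-14",
        },
        {
            "resourceType": "Condition",
            "subject": {"reference": "Patient/synth-0042"},
            "code": {"text": "Type 2 diabetes mellitus"},
        },
    ],
    # Memory the AI Patient keeps so multi-turn conversations stay consistent
    "memory": {
        "disclosed_so_far": ["polyuria for two weeks"],
        "withheld_until_asked": ["stopped metformin last month"],
    },
}
```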
How do these differ from generic benchmarks?
They’re domain-specific, rubric-rich, multi-turn, and intervention-oriented: they don’t just rank; they reveal why a model fails and how to fix it.
What Lumos Delivers
What do I get from an evaluation?
It depends on the project setup, but generally:
- A dashboard and/or report with granular metrics and heatmaps of failure modes
- Annotated transcripts and exemplars
- A prioritized “fix plan” (prompt changes, tools to add, data to collect)
- Optional training data packs (SFT/RLHF) or RL environments
Do you offer leaderboards or public reporting?
Optionally. Results are private by default; public options (e.g., challenge tracks or leaderboards) are opt-in.
Can you guarantee improvements?
We never promise magic. We design for measurable, repeatable gains by targeting the exact failure modes surfaced in evaluation, and we bring years of experience helping the largest foundation model labs train their models.
Data & Security
Do you need my PHI?
No. We can evaluate with de-identified/synthetic data. If PHI is required, we support secure workflows under appropriate agreements.
Are you HIPAA compliant?
We operate under HIPAA-aligned controls and sign BAAs when PHI is involved.
Who owns the data and outputs?
You own your data and your model outputs. You also own custom datasets we build for you, unless we both agree to create a shared/public benchmark.
Do you use customer data to train third-party models?
No. We do not train third-party foundation models on your data.
Evaluation Design
What modalities do you support?
Any and all: conversational agents, summarization/generation (e.g., SOAP notes), retrieval/agentic workflows, extraction/structuring, clinical images/video, and voice, including some of the trickiest modalities such as BCI-related 3D imaging.
Single-turn vs multi-turn?
Both. Multi-turn is preferred for agentic and clinical reasoning tasks; single-turn is used for atomic skills and ablations.
How do you choose metrics?
We combine: (1) clinical safety/risk rubrics, (2) guideline and policy checks, (3) task-specific success criteria, and (4) user-defined KPIs (throughput, deflection, autonomy).
Do you evaluate tool use and agentic behavior?
Yes. We evaluate tool selection, grounding, citation fidelity, and error recovery within agentic flows.
Do you compare models head-to-head?
Yes. Side-by-side evaluation and paired tests are standard.
From Evaluation to Improvement
What happens after issues are found?
We convert findings into:
- Prompt & instruction optimizations
- Data recipes (SFT/RLHF sets) with preference ranking (example record sketched after this list)
- RL environment tasks for targeted skill building
- Safety guardrails and refusal criteria
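As referenced in the list above, here is a rough sketch of what one preference-ranked training record could look like; the field names are illustrative, and the exact delivery schema is agreed per project:

```python
# Illustrative preference-ranking record for an RLHF-style data pack;
# the exact delivery schema is agreed per project.
preference_record = {
    "prompt": "Patient reports new-onset chest pain radiating to the left arm.",
    "responses": [
        {"text": "Advise the patient to call emergency services immediately.", "rank": 1},
        {"text": "Suggest rest and a routine follow-up visit next week.", "rank": 2},
    ],
    "rationale": "Response 1 correctly treats this as a possible acute coronary syndrome.",
    "annotator": "cardiology RN, passed seeded QC",
}
```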
Do you run the training?
We can supply clean, labeled data and guidance, or partner with you/your infra to execute SFT/RLHF/RL. Many clients keep training in-house using our packs and RL tasks.
Benchmarks vs. Custom Work
Should we use your off-the-shelf benchmarks?
They're great for fast baselining and external comparisons.
When do we need custom evaluation?
When your product has specific workflows, tools, data constraints, or regulatory targets. Custom evals align metrics with your real success criteria.
Can one benchmark serve multiple customers?
Yes—some thematic benchmarks (e.g., medication safety, triage) can be standardized and licensed to multiple teams. Custom sets remain yours.
Human Experts
Who are your experts?
Clinicians and PhDs verified via multi-signal checks (licenses, credentials, interviews, work trials) across 30+ specialties.
How do you ensure quality and prevent cheating?
Identity verification, behavior fingerprinting, submission analytics, seeded QC tasks, cross-review, and continuous Expert Quality Scoring.
Can our medical team review or co-design rubrics?
Absolutely. We welcome your SMEs for rubric tuning, threshold setting, and sign-off.
Integrations & Deployment
How do we integrate?
- API access to AI Judges and evaluation endpoints (see the sketch below)
- Batch evaluation via secure file drops
- Agent harness for tool-use scenarios
- Dashboard for results exploration
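As one illustration of the API path listed above, here is a minimal sketch of a scoring call, assuming a hypothetical `/v1/evaluations` endpoint and payload shape; the actual endpoints, fields, and authentication come from your integration docs:

```python
import requests

# Hypothetical endpoint, payload, and judge name used only for illustration;
# real URLs, field names, and auth are provided during onboarding.
API_BASE = "https://api.lumos.example"  # placeholder, not a real endpoint

payload = {
    "judge": "clinical-safety-v1",  # assumed judge identifier
    "conversation": [
        {"role": "patient", "text": "I've had chest pain since this morning."},
        {"role": "assistant", "text": "Can you describe the pain and rate it from 1 to 10?"},
    ],
    "metrics": ["safety", "guideline_adherence", "reasoning_quality"],
}

resp = requests.post(
    f"{API_BASE}/v1/evaluations",
    json=payload,
    headers={"Authorization": "Bearer <your-api-key>"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # e.g., per-metric scores and rationales
```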
Can you run inside our VPC?
Yes—via private deployment options.
Which clouds and model providers do you support?
We’re model- and cloud-agnostic, with a slight preference for GCP (in which case you can apply your Google commitment toward our services). We work with major providers and self-hosted models.
Compliance & Governance
Do you align with clinical guidelines and policies?
Yes. Rubrics encode current guidelines where applicable, and we can scope specialty-specific policy packs.
Can you support audit trails and Part 11-style controls?
We maintain provenance for data, prompts, models, judges, and versions, enabling reproducibility and auditing. Enhanced controls are available on request.
Pricing & Commercials
How do you price?
Human experts: a 10% fee during the pilot (first 100 hours), and a flat 15% fee on monthly consumption after the pilot. For example, 200 hours of RNs billed at $90/h (you select the hourly rate) comes to $18,000 toward the experts and $2,700 toward our fees.
Data delivered using AI Patients, AI Judges, and RL environments is custom-priced based on the setup. A typical evaluation project for multi-turn conversations runs at about $1.20 per conversation scored across 40 metrics, with 2+1 human alignment and >90% accuracy (statistically significant).
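To make the fee arithmetic concrete, here is the expert-fee example above as a small calculation (a simplified illustration using the post-pilot rate, not our billing engine):

```python
# Simplified illustration of the expert-fee example above; not a billing engine.
hours = 200            # RN hours in the example
hourly_rate = 90.0     # USD per hour, chosen by you
post_pilot_fee = 0.15  # flat fee on monthly consumption after the pilot

expert_cost = hours * hourly_rate          # $18,000 toward the experts
lumos_fee = expert_cost * post_pilot_fee   # $2,700 toward our fees

print(f"Experts: ${expert_cost:,.0f} | Lumos fee: ${lumos_fee:,.0f}")
```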
Common components include:
- Evaluation scope (scenarios × turns × metrics)
- Expert review volume and specialty
- Data packs (SFT/RLHF) and RL tasks
- Deployment (API/VPC/private)
Do you offer pilots?
Yes—typical pilots baseline current performance and identify 3–5 high-impact fixes with a path to improvement.
Any rev-share or licensing options?
Available for shared/public benchmarks or co-developed assets.
Getting Started
What do you need from us to begin?
- Your target tasks/workflows and success criteria
- Model endpoints or batch outputs
- Any constraints (PHI, tooling, latency, cost)
- Optional: internal rubrics/guidelines to encode
How do we define success?
We agree on metrics, acceptance thresholds, and a remediation plan (prompts, data, RL tasks) at the outset.
For Application Teams (Healthtech)
We don’t have a training pipeline—can you still help?
Yes. Start with eval → prompt/tool improvements → lightweight SFT/RLHF packs you can apply with minimal infra.
Can you evaluate our agent end-to-end?
Yes—conversation → retrieval → tool calls → handoff notes (e.g., SOAP), including safety and documentation quality.
For Foundation Model Labs
Can you provide judge APIs at scale?
Yes—batch and streaming scoring with calibration reports against clinician panels.
Do you support red-teaming and safety suites?
Yes—stress tests for hallucinations, bias, unsafe advice, privacy leaks, and tool-use risks.
Miscellaneous
Do you replace human clinicians?
No. We amplify them—codifying expertise into scalable evaluators and datasets while keeping humans in the loop for calibration.
What’s the difference between “score higher” and “be safer”?
We separate capability (can it do the task?), alignment (does it follow policy/guidelines?), and risk (what happens when it’s wrong?)—and report each clearly.
Can you work under NDAs and data-partner agreements?
Yes. Standard NDAs, DPAs/BAAs, and SOWs are routine.
How do we reach you?
You can reach out via email — feel free to send us a message at [email protected].