Autonomous Clinical Safety
Our approach to designing, evaluating, and assuring autonomous clinical AI.
Healthcare needs its own "highway code" for clinical AI.
Just as self-driving cars have a highway code to handle merging and emergency stops safely, clinical AI needs a clear safety framework. We set out this vision in Nature Medicine and we're now building the tools to make it operational in real systems.
Fragmented Evaluation
Clinical consultations are not just about accurate diagnoses or asking the next right question. Safety depends on handling uncertainty, balancing helpfulness with safety, and building a relationship, all within a dynamic conversation with a real person.
Most current benchmarks, however, score isolated sub-tasks, and harm often occurs between those tasks: missed red flags, unsafe reassurance, poor escalation, or advice that sounds fluent but isn't clinically grounded.
Without a whole-consultation view, it's hard to build a system that patients, clinicians and regulators can trust, and difficult to improve systems responsibly.
A holistic view
Evaluating the Whole Consultation
At Ufonia, we've spent years deploying AI that talks to real patients. This experience taught us that safety requires evaluating complete clinical behaviours — not just isolated tasks.
The 'Cure' (Technical Competence)
Does the AI take a correct history, identify red flags, and triage correctly? This is the medical logic.
The 'Care' (Relational Competence)
Does the AI listen actively, show empathy, and explain clearly? This is the human connection.
Why Scalable Evaluation Matters
Evaluating clinical AI's safety has long involved a trade-off between scalability and clinical relevance:
1. Automated metrics: developed for general AI models, these scale easily but fail to capture clinical nuance or real-world clinical safety.
2. Human expert review: clinicians manually checking AI responses. This provides depth and credibility, but it is variable, and prohibitively slow and expensive to run continuously.
Our safety frameworks, ASTRID and MATRIX, align automated evaluation with expert clinical judgement, enabling meaningful, scalable testing without reducing safety to simplistic metrics.
How evaluation approaches compare:
- Generic auto scoring: not clinically relevant
- Ad-hoc checks: "eye-balling" systems
- Clinical human eval: difficult to scale and inconsistent
- ASTRID / MATRIX: automated and clinically validated
Safety depends on multiple behaviours working together.
We evaluate each of these behaviours independently, using purpose-built and clinically validated safety frameworks inspired by assurance methods from other high-stakes industries such as self-driving cars.

Dora
Autonomous Consultations · Voice Understanding
LLM-based evaluators calibrated to align with clinician judgement assess Dora's "hearing" (transcription) behaviour at scale.
ASTRID
Evaluating Context, Refusal, and Faithfulness to prevent hallucinations.
MATRIX
Structured simulation of hazardous scenarios to stress-test clinical history taking.
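Each of these evaluators is trusted at scale only once it aligns with clinician judgement. Below is a minimal sketch of what such a calibration check could look like, assuming a hypothetical `llm_judge` function and a set of clinician-labelled transcripts; it is an illustration, not Ufonia's implementation.

```python
# Hypothetical calibration check: compare an automated LLM judge's verdicts
# against clinician labels on the same transcripts before relying on the
# judge at scale. `llm_judge` and the transcript format are placeholders.
from sklearn.metrics import cohen_kappa_score


def llm_judge(transcript: str) -> bool:
    """Placeholder for an LLM-based evaluator that returns True if the
    consultation transcript is judged safe, False otherwise."""
    raise NotImplementedError


def calibration_agreement(transcripts: list[str],
                          clinician_labels: list[bool]) -> float:
    """Cohen's kappa between the automated judge and clinicians; the judge
    is only used at scale once agreement is acceptably high."""
    judge_labels = [llm_judge(t) for t in transcripts]
    return cohen_kappa_score(clinician_labels, judge_labels)
```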
Safety through large-scale clinical simulation
Clinical consultations are safety-critical. A single missed red flag or misplaced reassurance can cause harm — even when individual answers appear clinically correct.
In autonomous driving, safety is earned through exposure: millions of miles driven across diverse and hazardous conditions. For clinical AI, the equivalent is experience in conversation. MATRIX evaluates dialogue agents across thousands of minutes of simulated clinical dialogue, exposing systems to the situations that matter for patient safety before they interact with real patients.
Hazards → Conversations → Safety
How MATRIX Works
A closed-loop simulation framework that runs thousands of realistic clinical conversations to uncover hazardous behaviours.
Safety Taxonomy
Structured map of patient input types, expected behaviours, and 40+ hazardous scenarios.
PatBot — Simulated Patient
LLM-driven patient persona that can display anxiety or confusion, or even derail the conversation.
BehvJudge — Automated Safety Auditor
Reviews transcripts and flags unsafe behaviours. Matches/exceeds clinician hazard detection.
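To make the loop concrete, here is a simplified sketch of how these three components could fit together. Every class, method, and field below is a hypothetical placeholder rather than the actual MATRIX implementation, which is described in the paper listed further down.

```python
# Illustrative closed loop: for each hazardous scenario, a simulated patient
# converses with the agent under test, and an automated auditor reviews the
# transcript for unsafe behaviour. All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class Scenario:
    """One entry in the safety taxonomy."""
    hazard: str              # e.g. "missed red flag: sudden vision loss"
    patient_persona: str     # e.g. "anxious, tends to downplay symptoms"
    expected_behaviour: str  # e.g. "advise same-day urgent review"


def run_simulations(agent, patbot, behv_judge, scenarios, max_turns=20):
    """Run every scenario and collect the auditor's verdicts."""
    results = []
    for scenario in scenarios:
        transcript = []
        patient_turn = patbot.open(scenario)          # simulated patient speaks first
        for _ in range(max_turns):
            agent_turn = agent.respond(patient_turn)
            transcript.append((patient_turn, agent_turn))
            patient_turn = patbot.respond(scenario, agent_turn)
            if patbot.finished():
                break
        verdict = behv_judge.audit(scenario, transcript)  # e.g. pass / hazard flags
        results.append((scenario.hazard, verdict))
    return results
```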
What MATRIX Reveals
- Frontier Models Fail: GPT-4 and Claude 3 Opus missed 12-15% of critical red-flag emergencies in our benchmarks.
- Misplaced Reassurance: Pure LLMs frequently reassured patients who needed urgent care.
- Dora's Safety: By using the MATRIX feedback loop, Dora achieves >98% pass rate across 40 hazardous scenarios.
MATRIX is used to evaluate Dora over thousands of minutes of conversation. Even where standard models fail these scenarios, Dora's hybrid of LLM and deterministic components allows it to perform safely.
Regulatory Alignment
- Built on ISO 14971 & SaMD safety principles
- Provides traceable, auditable safety evidence regulators expect
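As a purely illustrative example, the kind of record this evidence might be organised into could link each taxonomy hazard to its mitigation and the simulation runs that exercise it. The field names below are ours, not a prescribed ISO 14971 or SaMD format.

```python
# Hypothetical traceability record linking a hazard to its mitigation and the
# simulation evidence behind it. Field names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class HazardEvidence:
    hazard_id: str                  # e.g. "HAZ-014"
    description: str                # e.g. "missed red flag: chest pain"
    potential_harm: str             # e.g. "delayed emergency treatment"
    mitigation: str                 # e.g. "deterministic escalation rule"
    simulation_run_ids: list[str] = field(default_factory=list)
    pass_rate: float = 0.0          # fraction of runs judged safe by the auditor
```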
Clinician-Aligned Safety Auditing
BehvJudge closely matches clinician ratings
Evaluating Clinical Question Answering
Answering patient questions sounds simple, but in healthcare it's one of the hardest things for AI to do safely. Patient questions are open-ended, ambiguous, and deeply dependent on clinical context. A response can sound fluent while being dangerous.
Standard AI metrics fall short here. Designed for general chatbots, they focus on surface-level similarity and miss key risks like hallucinated advice or failure to refuse unsafe queries. While human review is the gold standard, it is too slow and expensive to run at scale.
Patient"just one question I do have a slight shadow in my left eye..."
Agent
Harmful Response
Example of AI hallucinating medical advice against clinical consensus.
Question Answering Agent Architecture
To reduce hallucination risk, we use Retrieval Augmented Generation (RAG): the agent pulls from approved knowledge sources and guidelines before it answers.

Grounding in Action: The model retrieves relevant context from your specific data before generating a single word.
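A minimal retrieve-then-generate sketch of this pattern, assuming a hypothetical `vector_store` of approved guideline passages and an `llm` client; it is a simplified illustration rather than the production agent.

```python
# Minimal retrieval-augmented generation: fetch approved guideline passages
# first, then ask the model to answer strictly from them or decline.
# `vector_store` and `llm` are hypothetical components, not the real agent.


def answer_with_rag(question: str, vector_store, llm, k: int = 4) -> str:
    # 1. Retrieve the most relevant approved passages for this question.
    passages = vector_store.search(question, top_k=k)
    context = "\n\n".join(p.text for p in passages)

    # 2. Generate an answer grounded only in the retrieved context.
    prompt = (
        "Answer the patient's question using ONLY the clinical sources below. "
        "If the sources do not cover the question, say you cannot answer and "
        "advise the patient to contact their care team.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)
```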
Assuring Safe Communication
ASTRID is a safety-driven evaluation framework designed specifically for clinical question-answering, combining RAG with advanced metrics to provide a comprehensive safety assessment.
It goes beyond standard metrics by triangulating safety across three critical dimensions: Refusal Accuracy (knowing when to stay silent), Context Relevance (using the right data), and Conversational Faithfulness (sticking to the source).
These signals are calibrated to align with clinician judgement, so we can evaluate safely at scale.
ASTRID - An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems
Conversational Faithfulness
Is the information actually grounded in approved clinical sources?
Context Relevance
Did the system retrieve the right clinical knowledge for this situation?
Refusal Accuracy
Did it correctly decline to answer when it should?
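To make the triad concrete, here is a simplified sketch of scoring a single question-answer pair on the three dimensions above. The judge functions are hypothetical stand-ins for ASTRID's calibrated metrics, not the published implementation.

```python
# Illustrative triad scoring for one question-answer pair. The three judge
# functions are hypothetical placeholders for ASTRID's calibrated metrics.
from dataclasses import dataclass


@dataclass
class TriadScores:
    refusal_accuracy: float             # declined when it should have?
    context_relevance: float            # retrieved the right clinical knowledge?
    conversational_faithfulness: float  # answer grounded in the retrieved sources?


def evaluate_response(question, retrieved_context, answer,
                      judge_refusal, judge_relevance, judge_faithfulness):
    """Score one response on all three dimensions used to triangulate safety."""
    return TriadScores(
        refusal_accuracy=judge_refusal(question, answer),
        context_relevance=judge_relevance(question, retrieved_context),
        conversational_faithfulness=judge_faithfulness(answer, retrieved_context),
    )
```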
Our Approach to Science
Ufonia invests heavily in research. We believe that clinical AI requires more than just engineering—it demands a rigorous scientific foundation.
A cross-disciplinary team of clinicians, safety scientists, AI research engineers, and regulatory experts come together to ensure our systems are safe, effective, and equitable. We don't just build models; we validate them through prospective studies and real-world deployments.
Our methods are published at top-tier AI research venues such as ACL and NeurIPS, and our real-world clinical impacts have been shared regularly at leading global conferences including ASCRS, ARVO, ESCRS, AECOS, AAO, and the Royal College of Ophthalmologists Annual Congress.

Building the safety layer for autonomous medicine
We publish the evaluation frameworks and safety philosophy that turn autonomous consultations from a demo into something you can trust, audit, and deploy responsibly.
Building a code of conduct for AI-driven clinical consultations
Nature Medicine·2026
WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
arXiv·2025
MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation
arXiv·2025
ASTRID - An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems
ACL Findings·2025
If you're a clinician thinking deeply about safety, or an AI researcher working on evaluation and assurance, we'd love to collaborate.