Autonomous Clinical Safety
Our approach to designing, evaluating, and assuring autonomous clinical AI.
Healthcare needs its own "highway code" for clinical AI.
Just as self-driving cars have a highway code to handle merging and emergency stops safely, clinical AI needs a clear safety framework. We set out this vision in Nature Medicine and we're now building the tools to make it operational in real systems.
Fragmented Evaluation
Clinical consultations are not just about accurate diagnoses or asking the next right question. Safety depends on handling uncertainty, balancing helpfulness with safety, and building a relationship, all within a dynamic conversation with a real person.
Most current benchmarks, however, score isolated sub-tasks, and harm often occurs between those tasks: missed red flags, unsafe reassurance, poor escalation, or advice that sounds fluent but isn't clinically grounded.
Without a whole-consultation view, it's hard to build a system that patients, clinicians and regulators can trust, and difficult to improve systems responsibly.
A holistic view
Evaluating the Whole Consultation
At Ufonia, we've spent years deploying AI that talks to real patients. This experience taught us that safety requires evaluating complete clinical behaviours — not just isolated tasks.
The 'Cure' (Technical Competence)
Does the AI take a correct history, identify red flags, and triage correctly? This is the medical logic.
The 'Care' (Relational Competence)
Does the AI listen actively, show empathy, and explain clearly? This is the human connection.
Why Scalable Evaluation Matters
Evaluating clinical AI's safety has long involved a trade-off between scalability and clinical relevance:
1. Automated metrics: developed for general AI models, these scale easily but fail to capture clinical nuance or real-world clinical safety.
2. Human expert review: clinicians manually checking AI responses. This provides depth and credibility, but it is variable, and prohibitively slow and expensive to run continuously.
Our safety frameworks, ASTRID and MATRIX, align automated evaluation with expert clinical judgement, enabling meaningful, scalable testing without reducing safety to simplistic metrics.
How evaluation approaches compare:
- Generic auto scoring: not clinically relevant
- Ad-hoc checks: "eye-balling" systems
- Clinical human eval: difficult to scale and inconsistent
- ASTRID / MATRIX: automated and clinically validated
Safety depends on multiple behaviours working together.
We evaluate each of these behaviours independently, using purpose-built and clinically validated safety frameworks inspired by assurance methods from other high-stakes industries such as self-driving cars.

Dora
Autonomous Consultations · Voice Understanding
LLM-based evaluators calibrated to align with clinician judgement assess Dora's "hearing" (transcription) behaviour at scale.
ASTRID
Evaluating Context, Refusal, and Faithfulness to prevent hallucinations.
MATRIX
Structured simulation of hazardous scenarios to stress-test clinical history taking.
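Each of these evaluators is trusted at scale only once it aligns with clinician judgement. Below is a minimal sketch of what such a calibration check could look like, assuming a hypothetical `llm_judge` function and a set of clinician-labelled transcripts; it is an illustration, not Ufonia's implementation.

```python
# Hypothetical calibration check: compare an automated LLM judge's verdicts
# against clinician labels on the same transcripts before relying on the
# judge at scale. `llm_judge` and the transcript format are placeholders.
from sklearn.metrics import cohen_kappa_score


def llm_judge(transcript: str) -> bool:
    """Placeholder for an LLM-based evaluator that returns True if the
    consultation transcript is judged safe, False otherwise."""
    raise NotImplementedError


def calibration_agreement(transcripts: list[str],
                          clinician_labels: list[bool]) -> float:
    """Cohen's kappa between the automated judge and clinicians; the judge
    is only used at scale once agreement is acceptably high."""
    judge_labels = [llm_judge(t) for t in transcripts]
    return cohen_kappa_score(clinician_labels, judge_labels)
```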
Safety through large-scale clinical simulation
Clinical consultations are safety-critical. A single missed red flag or misplaced reassurance can cause harm — even when individual answers appear clinically correct.
In autonomous driving, safety is earned through exposure: millions of miles driven across diverse and hazardous conditions. For clinical AI, the equivalent is experience in conversation. MATRIX evaluates dialogue agents across thousands of minutes of simulated clinical dialogue, exposing systems to the situations that matter for patient safety before they interact with real patients.
Hazards → Conversations → Safety
How MATRIX Works
A closed-loop simulation framework that runs thousands of realistic clinical conversations to uncover hazardous behaviours.
Safety Taxonomy
Structured map of patient input types, expected behaviours, and 40+ hazardous scenarios.
PatBot — Simulated Patient
LLM-driven patient persona that can display anxiety or confusion, or even derail the conversation.
BehvJudge — Automated Safety Auditor
Reviews transcripts and flags unsafe behaviours. Matches/exceeds clinician hazard detection.
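To make the loop concrete, here is a simplified sketch of how these three components could fit together. Every class, method, and field below is a hypothetical placeholder rather than the actual MATRIX implementation, which is described in the paper listed further down.

```python
# Illustrative closed loop: for each hazardous scenario, a simulated patient
# converses with the agent under test, and an automated auditor reviews the
# transcript for unsafe behaviour. All names here are hypothetical.
from dataclasses import dataclass


@dataclass
class Scenario:
    """One entry in the safety taxonomy."""
    hazard: str              # e.g. "missed red flag: sudden vision loss"
    patient_persona: str     # e.g. "anxious, tends to downplay symptoms"
    expected_behaviour: str  # e.g. "advise same-day urgent review"


def run_simulations(agent, patbot, behv_judge, scenarios, max_turns=20):
    """Run every scenario and collect the auditor's verdicts."""
    results = []
    for scenario in scenarios:
        transcript = []
        patient_turn = patbot.open(scenario)          # simulated patient speaks first
        for _ in range(max_turns):
            agent_turn = agent.respond(patient_turn)
            transcript.append((patient_turn, agent_turn))
            patient_turn = patbot.respond(scenario, agent_turn)
            if patbot.finished():
                break
        verdict = behv_judge.audit(scenario, transcript)  # e.g. pass / hazard flags
        results.append((scenario.hazard, verdict))
    return results
```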
What MATRIX Reveals
- Frontier Models Fail: GPT-4 and Claude 3 Opus missed 12-15% of critical red-flag emergencies in our benchmarks.
- Misplaced Reassurance: Pure LLMs frequently reassured patients who needed urgent care.
- Dora's Safety: By using the MATRIX feedback loop, Dora achieves >98% pass rate across 40 hazardous scenarios.
MATRIX is used to evaluate Dora over thousands of minutes of conversation. Even where standard models fail these scenarios, Dora's hybrid of LLM and deterministic components allows it to perform safely.
Regulatory Alignment
- Built on ISO 14971 & SaMD safety principles
- Provides traceable, auditable safety evidence regulators expect
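As a purely illustrative example, the kind of record this evidence might be organised into could link each taxonomy hazard to its mitigation and the simulation runs that exercise it. The field names below are ours, not a prescribed ISO 14971 or SaMD format.

```python
# Hypothetical traceability record linking a hazard to its mitigation and the
# simulation evidence behind it. Field names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class HazardEvidence:
    hazard_id: str                  # e.g. "HAZ-014"
    description: str                # e.g. "missed red flag: chest pain"
    potential_harm: str             # e.g. "delayed emergency treatment"
    mitigation: str                 # e.g. "deterministic escalation rule"
    simulation_run_ids: list[str] = field(default_factory=list)
    pass_rate: float = 0.0          # fraction of runs judged safe by the auditor
```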
Clinician-Aligned Safety Auditing
BehvJudge closely matches clinician ratings
Evaluating Clinical Question Answering
Answering patient questions sounds simple, but in healthcare it's one of the hardest things for AI to do safely. Patient questions are open-ended, ambiguous, and deeply dependent on clinical context. A response can sound fluent while being dangerous.
Standard AI metrics fall short here. Designed for general chatbots, they focus on surface-level similarity and miss key risks like hallucinated advice or failure to refuse unsafe queries. While human review is the gold standard, it is too slow and expensive to run at scale.
Patient"just one question I do have a slight shadow in my left eye..."
Agent
Harmful Response
Example of AI hallucinating medical advice against clinical consensus.
Question Answering Agent Architecture
To reduce hallucination risk, we use Retrieval Augmented Generation (RAG): the agent pulls from approved knowledge sources and guidelines before it answers.

Grounding in Action: The model retrieves relevant context from your specific data before generating a single word.
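A minimal retrieve-then-generate sketch of this pattern, assuming a hypothetical `vector_store` of approved guideline passages and an `llm` client; it is a simplified illustration rather than the production agent.

```python
# Minimal retrieval-augmented generation: fetch approved guideline passages
# first, then ask the model to answer strictly from them or decline.
# `vector_store` and `llm` are hypothetical components, not the real agent.


def answer_with_rag(question: str, vector_store, llm, k: int = 4) -> str:
    # 1. Retrieve the most relevant approved passages for this question.
    passages = vector_store.search(question, top_k=k)
    context = "\n\n".join(p.text for p in passages)

    # 2. Generate an answer grounded only in the retrieved context.
    prompt = (
        "Answer the patient's question using ONLY the clinical sources below. "
        "If the sources do not cover the question, say you cannot answer and "
        "advise the patient to contact their care team.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)
```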
Assuring Safe Communication
ASTRID is a safety-driven evaluation framework designed specifically for clinical question-answering, combining RAG with advanced metrics to provide a comprehensive safety assessment.
It goes beyond standard metrics by triangulating safety across three critical dimensions: Refusal Accuracy (knowing when to stay silent), Context Relevance (using the right data), and Conversational Faithfulness (sticking to the source).
These signals are calibrated to align with clinician judgement, so we can evaluate safely at scale.
ASTRID - An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems
Conversational Faithfulness
Is the information actually grounded in approved clinical sources?
Context Relevance
Did the system retrieve the right clinical knowledge for this situation?
Refusal Accuracy
Did it correctly decline to answer when it should?
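To make the triad concrete, here is a simplified sketch of scoring a single question-answer pair on the three dimensions above. The judge functions are hypothetical stand-ins for ASTRID's calibrated metrics, not the published implementation.

```python
# Illustrative triad scoring for one question-answer pair. The three judge
# functions are hypothetical placeholders for ASTRID's calibrated metrics.
from dataclasses import dataclass


@dataclass
class TriadScores:
    refusal_accuracy: float             # declined when it should have?
    context_relevance: float            # retrieved the right clinical knowledge?
    conversational_faithfulness: float  # answer grounded in the retrieved sources?


def evaluate_response(question, retrieved_context, answer,
                      judge_refusal, judge_relevance, judge_faithfulness):
    """Score one response on all three dimensions used to triangulate safety."""
    return TriadScores(
        refusal_accuracy=judge_refusal(question, answer),
        context_relevance=judge_relevance(question, retrieved_context),
        conversational_faithfulness=judge_faithfulness(answer, retrieved_context),
    )
```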
Our Approach to Science
Ufonia invests heavily in research. We believe that clinical AI requires more than just engineering—it demands a rigorous scientific foundation.
A cross-disciplinary team of clinicians, safety scientists, AI research engineers, and regulatory experts come together to ensure our systems are safe, effective, and equitable. We don't just build models; we validate them through prospective studies and real-world deployments.
Our methods are published at top-tier AI research venues such as ACL and NeurIPS, and our real-world clinical impacts have been shared regularly at leading global conferences including ASCRS, ARVO, ESCRS, AECOS, AAO, and the Royal College of Ophthalmologists Annual Congress.

Building the safety layer for autonomous medicine
We publish the evaluation frameworks and safety philosophy that turn autonomous consultations from a demo into something you can trust, audit, and deploy responsibly.
Building a code of conduct for AI-driven clinical consultations
Nature Medicine·2026
WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue
arXiv·2025
MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation
arXiv·2025
ASTRID - An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems
ACL Findings·2025
If you're a clinician thinking deeply about safety, or an AI researcher working on evaluation and assurance, we'd love to collaborate.