Beyond WER

March 2026

We just shipped Universal-3-Pro at AssemblyAI. It's a big step forward for us. But when we ran it against a few open-source benchmark datasets, WER actually went up on some of them. That was surprising.

So we dug in. We started listening to the audio and reading the human reference transcripts side by side. What we found was that our model was beating the human ground truth — getting names right that the humans got wrong, transcribing sections the humans marked as inaudible, capturing speech in other languages that the humans just skipped. WER was penalizing us for being more accurate.

And these aren't obscure datasets. We're talking about widely-used benchmarks like VoxPopuli, Earnings-22, and AfriSpeech — datasets with 25,000+ downloads per month on Hugging Face that the entire industry uses to evaluate and compare models. The errors are hiding in plain sight.

This post is what we found.

First, some background: WER (Word Error Rate) is the metric the industry uses to measure speech-to-text accuracy. Compare a model's transcript word by word against a human-written reference, count the substitutions, insertions, and deletions, and divide by the number of words in the reference.
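To make the arithmetic concrete, here's a minimal sketch of that computation as word-level Levenshtein distance. The example strings are invented for illustration, not drawn from any benchmark:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# A wrong patient name and a cosmetic spelling difference cost exactly the same:
print(wer("how old are you Jen", "how old are you Chen"))      # 0.2
print(wer("ok I will check that", "okay I will check that"))   # 0.2
```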

WER on its own is a blunt measure. You can win WER leaderboards and still fail massively on the things that matter in production: proper nouns, numbers, hallucinations, meaning-changing errors. WER doesn't distinguish between any of these. Getting a patient's name wrong costs the same as "ok" vs. "okay."

Error | WER cost | Real-world cost
Extra "um" (speaker actually said it) | +1 | Zero — LLM ignores it
"gonna" vs. "going to" | +2 | Zero — same meaning
"8" vs. "eight" | +1 | Zero — same number
Patient name "Jen" → "Chen" | +1 | Critical — wrong patient
Drug name "Piriton" → "piratinib" | +1 | Critical — antihistamine ≠ chemo drug
"does make sense" → "doesn't make sense" | +1 | Critical — meaning reversed
Hallucinated sentence never spoken | +many | Critical — LLM treats it as fact

WER treats every row the same. A cosmetic difference costs the same as corrupting a medical record.

This matters most when your transcripts feed into an LLM — for clinical notes, meeting summaries, CRM updates, investor briefs. The LLM trusts the transcript. If the transcript says "I live alone" when the patient actually said "I live here," the LLM writes that into the chart. If you're just doing rough transcription for search indexing, WER is probably fine. But if the content of the transcript drives your product — if it feeds into agents, generates reports, or updates records — WER is measuring the wrong thing.

The "ground truth" isn't true

There's a much bigger problem with WER, though: it assumes the human reference transcript is a perfect record of what was spoken. I looked through 22 popular datasets, and most of these human transcripts are full of gaps, inconsistencies, outright errors, and arbitrary style choices.

Here are some common patterns.

1. Fillers and hesitations are inconsistently transcribed by humans

Some annotators include filler words like "um," "uh," and "ahh." Others strip them all out. Both are valid style choices — but WER doesn't know the difference. If the annotator kept fillers and the model drops them, those count as errors. If the annotator stripped fillers and the model keeps them, those count as errors too. The model gets punished either way, as the toy example below shows.
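To see how much this one style choice moves the number, here's a toy comparison reusing the wer() sketch from earlier. The sentences are hypothetical, not taken from the dataset below:

```python
# Two reference styles for the same audio, and two model behaviors.
ref_verbatim  = "um so I I think maybe 8 days ago"   # annotator kept fillers
ref_cleaned   = "so I think maybe 8 days ago"        # annotator stripped them

model_keeps   = "um so I I think maybe 8 days ago"   # model transcribes faithfully
model_smooths = "so I think maybe 8 days ago"        # model smooths disfluencies

print(wer(ref_verbatim, model_keeps))    # 0.00
print(wer(ref_cleaned,  model_keeps))    # ~0.29  (penalized for being faithful)
print(wer(ref_verbatim, model_smooths))  # ~0.22  (penalized for smoothing)
print(wer(ref_cleaned,  model_smooths))  # 0.00
```

Nothing about the audio or the model changes between these calls; the same output scores anywhere from 0% to roughly 29% WER depending purely on the annotator's style.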

Medical Exam: Medical School Clinical Exam — Chest Pain History

Dataset: Simulated Patient-Physician Medical Interviews

Press play and follow along. This annotator kept fillers — listen for "um," "uh," "ahh" in the audio.

Human Reference:
D: What brings you in today?

P: I so I've just had this pain in my chest for just over a week now and it's caused me to have trouble breathing.

D: OK, um how old are you, Jen?

P: Um 52.

P: Uh, yeah, just about that I I think maybe 8 days ago.

P: Ahh it it is, but it's definitely worse that um, it's the breathing...
AssemblyAI Output:
What brings you in today?

So I've just had this pain in my chest for just over a week now, and it's caused me to have trouble breathing.

Okay, um, how old are you, Jen?

Um, 52.

Uh, yeah, just about that. I think maybe 8 days ago.

Uh, it is, but it's definitely worse that it's the breathing...

This annotator kept fillers and disfluencies ("I I think," "it it is," "Ahh"). The model captured some and smoothed others. Every mismatch counts against WER — but none of these differences change the meaning of what was said. Feed either transcript to an LLM and you'd get the same summary.

Dictation-style speech has its own version of this. In some medical and clinical datasets, the speaker literally says "comma" and "full stop" out loud as punctuation cues:

Clinical Dictation: AfriSpeech — Medical Dictation (Accented English)

Dataset: AfriSpeech-200

Listen carefully — the speaker says "comma" and "full stop" out loud as punctuation.

Human Reference:
In contrast, other conditions produce an increase in urine volume these include the polyuric phase of acute tubular necrosis, diabetes mellitus, and diabetes insipidus.

What the speaker actually said:
In contrast, comma, other conditions produce an increase in urine volume. These include the polyuric phase of acute tubular necrosis, comma, diabetes mellitus, comma, and diabetes insipidus, full stop.

The speaker literally says "comma" and "full stop" — you can hear it in the audio. The human reference just wrote commas and periods, dropping the spoken words entirely. A model that faithfully transcribes what was said — including "comma" and "full stop" as spoken words — gets penalized by WER for every single one.

2. AI beats humans on inaudible sections

Sometimes the audio is genuinely hard to make out. Background noise, crosstalk, mumbling — the human transcriber can't hear it, so they mark it <UNIN/> or <inaudible> and move on. Models will at least take a guess, and often a reasonable one. But WER still treats the human transcript as the gold standard — so a plausible guess at a difficult passage gets penalized the same as a hallucination.

GP Consultation: UK Doctor Visit — Migraine Diagnosis

Dataset: PriMock57

Press play. The doctor introduces themselves at the start — try to catch the name yourself. It's hard to hear.

Human Reference:
D: Hello?
P: Hello.
D: Hello there. It's uh Doctor <UNIN/> here. How can I help you this afternoon?

P: Ohh, I just got a terrible headache since mid-day.

P: Um on the left side. It's just making me feel so ill. <UNIN/> I just feel like I need to vomit.

D: I'm sorry to hear that. Um can you tell me a bit more about the headache?

P: Well you know, I noticed some zig-zag lines in my vision a few minutes before the headache started.
AssemblyAI Output:
Hello?

Hello there, it's Dr. Sood here. How can I help you this afternoon?
[Model took a guess — and it sounds right]

Oh, I've just got a terrible headache since midday, on the left side. It's just making me feel so ill. I just feel like I need to vomit.

I'm sorry to hear that. Can you tell me a bit more about the headache?

Well, you know, I noticed some zigzag lines in my vision a few minutes before the headache started.

Listen to it yourself — the doctor's name is genuinely hard to make out. The human gave up and wrote <UNIN/>. The model guessed "Dr. Sood," which sounds right in context. I can't be 100% sure the model is correct here. But the point is: WER treats a plausible guess at a difficult passage the same as getting a clear word completely wrong. There's no distinction.

Same thing in this Ferrari earnings call:

Earnings Call: Ferrari — 2021 Full Year Results

Dataset: Earnings-22

The operator hands off to the first speaker. Listen for the missing word.

Human Reference:
I would now like to hand the conference over to <inaudible> first speaker today, Nicoletta Russo. Please go ahead.

AssemblyAI Output:
I would now like to hand the conference over to our first speaker today, Nicoletta Russo. Please go ahead.

The word "our" got marked <inaudible> by the human. The model filled it in — and "our first speaker" makes a lot more sense than "<inaudible> first speaker." Hard to say for certain it's right, but it's a reasonable guess at a passage the human punted on. This transcript alone has 207 <inaudible> tags. Every one of them is a spot where WER penalizes the model for trying.

3. Human errors and low-quality reference transcripts

The human reference is sometimes flat out wrong. Wrong names, wrong company identifiers, wrong words. And WER penalizes the model for not matching the human's mistake.

Earnings Call: Aurubis AG — Q1 2021/22 Conference Call (Opening)

Dataset: Earnings-22

Press play. The company is called Aurubis AG. The presenter is Elke Brinkmann. Listen and see what the human transcriber wrote instead.

Human Reference:
Good afternoon, ladies and gentlemen and welcome to Aurubis HE conference call, on the occasion of the publication of the quarterly report, first three months 2021, '22. At this time, all participants have been placed on a listen-only mode. The floor will be open for your questions following the presentation. Now, I hand over to Erica <inaudible>.

AssemblyAI Output:
Good afternoon, ladies and gentlemen, and welcome to the Aurubis AG conference call on the occasion of the publication of the quarterly report first 3 months 2021/22. At this time, all participants have been placed on a listen-only mode. The floor will be open for your questions following the presentation. Now I hand over to Elke Brinkmann.

The company is Aurubis AG. The human wrote "Aurubis HE" — that's wrong. The presenter's name is Elke Brinkmann. The human wrote "Erica" and marked the last name <inaudible> — wrong first name, missing last name. These are factual errors in the reference, not style choices. The model gets penalized by WER for not matching them.

Here's another one from a Japanese earnings call:

Earnings Call: Advantest Corporation — Q3 FY2021 Results

Dataset: Earnings-22

Listen for the company name and the CEO's name in the opening.

Human Reference:
...financial briefing for the third quarter of Y2021 of Advantus Corporation... Today's participants are President and CEO, Mr. Yosheda, CFO Mr. Fujita, Chief Customer Relation Officer...

AssemblyAI Output:
...financial briefing for the third quarter FY 2021 of Advantest Corporation... Today's participants are President and CEO, Mr. Yoshida, CFO, Mr. Fujita, Chief Customer Relations Officer...

The company is Advantest — the human wrote "Advantus." The fiscal year is FY2021 — the human wrote "Y2021." The CEO's name is Yoshida — the human wrote "Yosheda." Three factual errors in the opening paragraph of the reference transcript. WER penalizes the model for getting them right.

4. Multilingual audio and code-switching

This one's less obvious if you only work with English, but it's a huge problem in multilingual audio. Speakers switch languages mid-sentence — it happens all the time in EU Parliament sessions, international earnings calls, multilingual meetings. Human annotators often just drop the parts that aren't in the "expected" language. The model tries to transcribe everything it hears, and gets penalized for it.

EU Parliament: English preamble dropped from German transcript

Dataset: VoxPopuli

Press play — the chair clearly speaks English before the German speaker begins. The human reference pretends this didn't happen.

Human Reference:
Herr Präsident! Ich möchte die Kollegin nur fragen, ob Sie nicht überhaupt die Agenturen in dieser Form für überflüssig hält.

AssemblyAI Output:
Thank you, Mrs. Jensen. There is a blue card for you from Mr. Posselt. Mr. Posselt, you have the floor. Ich möchte die Kollegin nur fragen, ob sie nicht überhaupt die Agenturen in dieser Form für überflüssig hält,

You can hear it yourself — the chair says "Thank you, Mrs. Jensen" in clear English before Mr. Posselt starts speaking German. The human reference just skipped it entirely. Every English word the model correctly transcribed counts as an insertion error in WER.

EU Parliament: Spanish — annotator formalized informal speech

Dataset: VoxPopuli

Listen for the speaker's word choices — the reference "cleans up" what they actually said.

Human Reference:
Nuestro informe se apoya en la creación de un marco regulador de principios éticos que sea de obligado cumplimiento para los programas informáticos, algoritmos y datos incluidos en la inteligencia artificial, la robótica y las tecnologías conexas que se desarrollen, se implementen o se utilicen en la Unión Europea.

AssemblyAI Output:
Nuestro informe se apoya en la creación de un marco regulatorio de principios éticos que sea de obligado cumplimiento para los softwares, algoritmos y datos incluidos en la inteligencia artificial, la robótica y las tecnologías conexas que se desarrollen, se distribuyan o se usen en la Unión Europea.

The annotator "cleaned up" the speaker's word choices. "Softwares" became the more formal "programas informáticos." "Marco regulatorio" became "marco regulador." The simpler verbs "distribuyan" and "usen" became the fancier "implementen" and "utilicen." This is the same formalization problem as English "gonna" → "going to," just in Spanish. WER counts 6+ errors here. Semantic impact: near zero.

Four patterns, same problem: WER treats every mismatch between the model and the reference as the model's fault. It doesn't care if the reference stripped fillers, left gaps, got the facts wrong, or dropped entire languages. It just counts differences.

What to do when your ground truth is wrong

If your human references are unreliable, your WER number is unreliable too. The first step is simply to trust WER less. Beyond that, here are some approaches that actually work.

Build your own ground truth

Most open-source speech datasets are riddled with the errors I showed above — inconsistent filler handling, inaudible gaps, wrong names. If you're evaluating models against these datasets, the signal is weak: you're measuring the reference's quirks as much as the model's accuracy.

Build your own evaluation set instead. Collect audio that actually looks like your production traffic — your domain, your accents, your vocabulary. Then get human transcripts from a service like Rev or HappyScribe. The key part: actually inspect them. Listen to the audio yourself. Read along with the transcript. Flag the spots where the human got it wrong and fix them. This is tedious, but 50-100 files you actually trust is worth more than 10,000 files you inherited from a dataset where you have no idea what the annotators were told to do.

You don't need a massive dataset for evaluation — you need a clean one. A small, high-quality ground truth that you've personally verified will tell you more about your model's real-world accuracy than any leaderboard score computed against noisy references.

Use an LLM as a judge

If your transcripts feed into an agent — a booking agent, a clinical note-taker — what you really want to know is: will the agent get it right?

You can just ask an LLM. Give it the human reference and the outputs from multiple STT models, and ask it to compare. The trick is to tell the LLM that the human reference itself might contain errors. When three or four model outputs agree on a word or phrase that contradicts the reference, it can work out that the reference is probably wrong — not the models. It's essentially triangulating across outputs to find patterns. If every model says "Aurubis AG" and the reference says "Aurubis HE," the LLM can reason about that.

Here's the kind of prompt that works:

You are comparing multiple speech-to-text transcriptions of the same audio against a human reference. These transcripts will be sent to an LLM for downstream tasks like summarization, so accuracy on content that matters for comprehension is critical.

IMPORTANT: The human reference may contain errors — wrong names, missing words, or inconsistent formatting. If multiple model outputs agree on something the reference gets wrong, trust the models. Look for patterns across outputs to identify reference errors.

IMPORTANT: The human reference often OMITS filler words (um, uh, like, you know) that were actually spoken. A transcription that INCLUDES fillers is being MORE accurate, not less. Do NOT penalize for fillers.

Evaluation criteria (in priority order):
1. HALLUCINATIONS (highest) — fabricated content never spoken
2. PROPER NOUNS (high) — names, companies, medical terms, acronyms
3. CONTENT ACCURACY (medium) — missing/substituted words that change meaning
4. FILLER PRESERVATION (medium) — keeping fillers is better, all else equal
5. FORMATTING (lowest) — punctuation, capitalization, number format

This priority order matches what actually breaks things in production. A booking agent that hears "Austin" instead of "Boston" fails completely. A medical note-taker that hears "I live alone" when the patient said "I live here" writes a wrong social history. WER can't distinguish these from cosmetic differences. An LLM judge can — and by looking across multiple outputs, it can even catch errors in your ground truth that you missed.
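Wiring this up doesn't take much code. Here's a rough sketch, assuming the OpenAI Python SDK as the judge backend; the model name is a placeholder, and JUDGE_PROMPT stands in for the prompt above:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = "..."  # the evaluation prompt shown above, verbatim

def judge(reference: str, outputs: dict[str, str], model: str = "gpt-4o") -> str:
    """Ask an LLM judge to pick the best transcript.

    `outputs` maps a model/config name to its transcript. The judge sees the
    (possibly flawed) human reference plus every candidate at once, so it can
    triangulate across them.
    """
    candidates = "\n\n".join(f"=== {name} ===\n{text}" for name, text in outputs.items())
    response = client.chat.completions.create(
        model=model,  # placeholder; use whichever judge model you trust
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"HUMAN REFERENCE (may contain errors):\n{reference}\n\n"
                f"MODEL OUTPUTS:\n{candidates}\n\n"
                "Name the best transcript and list any meaning-changing errors, "
                "including apparent errors in the reference itself."
            )},
        ],
    )
    return response.choices[0].message.content
```

Any SDK works here; the important part is that the judge sees all the candidates together rather than scoring each one against the reference in isolation.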

I ran this across 200+ files comparing two configurations of our model. The LLM judge sometimes picks the transcript with higher WER — because it has fewer meaning-changing errors, even though it has more filler/formatting mismatches that inflate the WER score.

Here are three real examples. Each row shows the same passage three ways: what was actually said, what the lower-WER transcript produced, and what the judge preferred instead (with higher WER).

Hallucination: Medical vitals hallucinated

Dataset: Simulated Patient-Physician Medical Interviews

Listen for the section about the patient's vitals being stable.

Ground Truth:
...we're gonna say vitals stable. He is on 2 liters of oxygen...

Lower WER (17.8%) — but wrong:
...we're gonna say Whitehall's stable. He is on 2 liters of oxygen...

Judge Winner (21.6% WER):
...we're gonna say vital's stable. He is on 2 liters of oxygen...

The lower-WER transcript hallucinated "Whitehall's stable" — a proper noun that doesn't exist in the conversation. The judge picked the version that said "vital's stable" (close to "vitals stable"), even though it had 3.8 points higher WER overall.

Answer Flipped: Patient's "no" turned into "I know"

Dataset: Simulated Patient-Physician Medical Interviews

Doctor asks if pain radiates down the leg. Patient answers.

Ground Truth:
...pain radiate anywhere like down the leg or or up into the thigh?

Um no.
Lower WER (17.8%) — but wrong:
...pain radiate anywhere, like down the leg or, or up into the thigh?

I know.
Judge Winner (21.6% WER):
...pain radiate anywhere, like down the leg or, or up into the thigh?

Ah, no.

The patient said "um no" — the pain doesn't radiate. The lower-WER transcript turned this into "I know," which an LLM would interpret as an affirmative. The judge picked the higher-WER version that correctly captured "ah, no." A triage agent reading these two transcripts would make the opposite call.

Wrong Entity: GE's aircraft leasing arm confused with its financial unit

Dataset: Earnings-21

The speaker is discussing GE's aviation business and GECAS (their aircraft leasing division).

Ground Truth:
In aviation and at GECAS, airlines are conserving cash, not flying the planes they have...

Lower WER (10.8%) — but wrong:
...primarily in Aviation and GE Capital with negative marks and impairments...

Judge Winner (12.9% WER):
In aviation and at GECAS, airlines are conserving cash, not flying the planes they have...

GECAS is GE's aircraft leasing division. GE Capital is their financial services arm — a completely different business unit. The lower-WER transcript substituted the wrong entity. An LLM generating an investor brief would attribute the discussion to the wrong part of the company.

In each case, WER picked the wrong winner. The transcript with lower WER had a meaning-changing error that the higher-WER transcript avoided. The LLM judge caught this because it evaluates meaning, not word counts. It's more expensive to run than WER — you wouldn't use it on every file — but for model comparisons and periodic quality audits, it catches the errors that matter.

A/B test in production

Honestly, the best signal often comes from just shipping both options and measuring what happens. Run model A on half your traffic and model B on the other half, then track the KPIs that actually matter to your product — task completion rate, agent success rate, user corrections, support tickets, whatever your downstream metric is.

This sidesteps the entire ground truth problem. You're not comparing against a reference transcript that might be wrong — you're measuring whether the transcript was good enough for the thing it needed to do. If model B's transcripts lead to fewer failed bookings or more accurate clinical notes, that's the answer. WER might say model A is better, but your production metrics don't care about WER.

The downside is it's slower and you need enough traffic to get statistical significance. But if you have the volume, an A/B test will tell you more in a week than any offline evaluation.
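For the significance check, a plain two-proportion z-test on whatever downstream success rate you track is usually enough. A sketch with made-up numbers (only scipy's normal tail function is assumed):

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(success_a: int, total_a: int,
                         success_b: int, total_b: int) -> float:
    """Two-sided p-value for the difference between two success rates."""
    p_a, p_b = success_a / total_a, success_b / total_b
    p_pool = (success_a + success_b) / (total_a + total_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    return 2 * norm.sf(abs(z))

# Hypothetical week of traffic: booking completion rate per STT model.
p = two_proportion_ztest(success_a=4312, total_a=5000,   # model A: 86.2% completed
                         success_b=4435, total_b=5000)   # model B: 88.7% completed
print(f"p-value: {p:.4f}")  # well below 0.05 here, so model B's lift is unlikely to be noise
```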

The speech industry is still largely evaluating models with a single number. That worked when the bar was "can the model transcribe words at all." It doesn't work when the bar is "can an LLM trust this transcript enough to write a medical chart from it."