Technical Evaluation of Whisper large-v3 vs YouTube Subtitles: an Editorial Case Study in French (Defined Scope)
Automatic speech-to-text systems are now ubiquitous. Yet their evaluation is most often superficial: a few impressions of readability, sometimes a visual comparison, and rarely an instrumented analysis. This approach becomes insufficient as soon as transcription is no longer a mere accessibility aid, but a textual source in its own right.
This article presents a technical and contextualized evaluation of Whisper large-v3 compared to YouTube’s automatic subtitles, based on a real editorial use case: a long, narrative-analytical speech in French. The goal is not to proclaim a universal winner, but to determine what the data actually demonstrate, what they do not, and under which conditions Whisper large-v3 constitutes a credible alternative.
Research problem and hypothesis
Why this question matters
In many editorial workflows — fact-checking, media monitoring, archiving, indexing, RAG — transcription is not a secondary artifact. It becomes:
- a searchable text base,
- a source of quotations,
- an input for algorithmic analysis,
- sometimes a partial substitute for the original audio.
In this context, a transcription that appears “readable” may still be editorially problematic. Apparently minor lexical divergences can:
- weaken an argument,
- distort a quotation,
- introduce bias into automated analysis.
The central question is therefore not “is the transcription comfortable to read?” but “how faithful is it, in measurable terms, to the spoken discourse?”
Research question
On a real French editorial corpus, how do Whisper large-v3 and YouTube subtitles compare in terms of measurable lexical fidelity, once the transcriptions are properly aligned with the audio?
Working hypothesis
We hypothesize that Whisper large-v3 produces a lexically more faithful transcription than YouTube subtitles on this type of content. However, we do not assume in advance either the exact editorial impact of the errors or the generalization of the results to other contexts.
Experimental protocol
Description of the audio corpus
The analyzed corpus consists of a continuous narrative-analytical speech in French, with the following characteristics:
| Attribute | Value |
|---|---|
| Genre | Narrative and analytical speech |
| Language | French |
| Speakers | 1 |
| Duration | 38 min 23 s |
| Structure | Continuous, no artificial segmentation |
| Speaking rate | Natural, sustained |
| Vocabulary | Abstract, conceptual |
| Ambient noise | Very low |
| Audio quality | Good (clean recording) |
This corpus is deliberately bounded. It represents a specific subset: long, structured, single-speaker discourse in contemporary French.
It does not represent:
- multi-speaker conversations,
- regional or strongly accented French,
- noisy audio,
- spontaneous speech with frequent hesitations,
- specialized technical domains (medical, legal).
The conclusions of this study apply only to this scope.
Compared transcription pipelines
Two transcriptions were produced from the same audio file.
Pipeline A — Whisper large-v3
- Model: Whisper large-v3
- Implementation: whisper-ctranslate2 (CTranslate2 backend)
- Command used:
whisper-ctranslate2 audio.mp3 --model large-v3 --language fr --output_format srt
- Objective: maximum textual fidelity in offline mode
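For scripted workflows, the same model can also be driven from Python through the faster-whisper library, on which whisper-ctranslate2 is built. The following is a minimal sketch under that assumption (SRT writing omitted; the file name audio.mp3 matches the command above):

```python
from faster_whisper import WhisperModel

# Load Whisper large-v3 through CTranslate2 (CPU shown here; use device="cuda" if a GPU is available)
model = WhisperModel("large-v3", device="cpu", compute_type="int8")

# Transcribe in French; `segments` is a generator of timed text chunks
segments, info = model.transcribe("audio.mp3", language="fr")

for segment in segments:
    print(f"[{segment.start:8.2f} -> {segment.end:8.2f}] {segment.text.strip()}")
```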
Pipeline B — YouTube subtitles
- Automatically generated YouTube subtitles
- SRT file extracted
- Objective: accessibility and real-time readability
These two systems pursue fundamentally different goals, which must be taken into account when interpreting the results.
Why a raw comparison would be misleading
Directly comparing two SRT files without preparation is methodologically incorrect. YouTube subtitles exhibit:
- an initial temporal offset,
- segmentation constrained by video display,
- high tolerance for insertions.
Without prior realignment, any metric would be artificially degraded.
Temporal alignment with the audio
Offset identification
An explicit lexical anchor shared by both transcriptions made it possible to identify an estimated initial offset of 25.84 seconds in the YouTube subtitles.
This offset is consistent across the file.
Alignment applied
All YouTube timecodes were adjusted accordingly before any analysis. This alignment step is essential for any audio-aligned analysis, especially when working with temporal windows.
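As an illustration, here is a minimal sketch of such an adjustment, assuming the subtitles live in a file named youtube.srt (the file names are illustrative) and that the 25.84 s offset must be subtracted; the sign of the correction depends on which track runs ahead:

```python
import re
from datetime import timedelta

OFFSET = timedelta(seconds=25.84)  # offset estimated from the shared lexical anchor
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def shift(match: re.Match) -> str:
    """Shift one SRT timestamp by OFFSET, clamping at 00:00:00,000."""
    h, m, s, ms = map(int, match.groups())
    t = timedelta(hours=h, minutes=m, seconds=s, milliseconds=ms) - OFFSET
    t = max(t, timedelta(0))
    total_ms = int(t.total_seconds() * 1000)
    h, rest = divmod(total_ms, 3_600_000)
    m, rest = divmod(rest, 60_000)
    s, ms = divmod(rest, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

with open("youtube.srt", encoding="utf-8") as f:
    raw = f.read()

with open("youtube_aligned.srt", "w", encoding="utf-8") as f:
    f.write(TIMESTAMP.sub(shift, raw))
```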
Evaluation methodology
Normalization and tokenization
Both transcriptions underwent strictly identical preprocessing, designed to isolate lexical fidelity from typographical noise.
| Processing step | Choice |
|---|---|
| Case | Lowercase |
| Punctuation | Removed (apostrophes preserved) |
| Accents | Preserved |
| Multiple spaces | Normalized |
| Hyphens | Normalized |
| Numbers | Tokenized as words |
The analysis is performed at the word level.
This normalization is intentional. Punctuation and case are not spoken; including them would introduce stylistic divergences unrelated to the audio signal.
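As a concrete reference, here is a minimal sketch of this normalization; the exact handling of hyphens (replaced by spaces below) is one possible reading of the table and should be treated as an assumption:

```python
import re

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation (keeping apostrophes and accents),
    normalize whitespace and hyphens, and return word-level tokens."""
    text = text.lower()
    text = text.replace("\u2019", "'")              # curly apostrophe -> straight
    text = re.sub(r"[\u2010-\u2015-]+", " ", text)  # hyphens/dashes treated as separators
    text = re.sub(r"[^\w\s']", " ", text)           # drop punctuation, keep accented letters
    text = re.sub(r"\s+", " ", text)                # collapse multiple spaces
    return text.strip().split()

# Example: normalize("L'évaluation, dit-il, reste FIABLE à 95 %.")
# -> ["l'évaluation", "dit", "il", "reste", "fiable", "à", "95"]
```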
Main metric: Word Error Rate (WER)
The Word Error Rate measures lexical distance between two word sequences:
WER = (Substitutions + Insertions + Deletions) / Number of reference words
In this study:
- Whisper large-v3 serves as the lexical reference,
- YouTube subtitles are compared against this reference.
WER does not measure:
- semantic impact,
- readability,
- overall editorial quality.
It measures only raw lexical alteration.
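For readers who want to reproduce the counts, here is a minimal, self-contained sketch of the word-level edit-distance computation; a dedicated library such as jiwer yields equivalent figures more efficiently:

```python
def wer_counts(reference: list[str], hypothesis: list[str]) -> dict:
    """Levenshtein alignment at the word level, returning the
    substitution / insertion / deletion counts and the WER."""
    R, H = len(reference), len(hypothesis)
    # dp[i][j] = (total_cost, subs, ins, dels) for reference[:i] vs hypothesis[:j]
    dp = [[(0, 0, 0, 0)] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, 0, i)                 # delete all remaining reference words
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, j, 0)                 # insert all remaining hypothesis words
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if reference[i - 1] == hypothesis[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]     # exact match, no edit
                continue
            c, s, ins, d = dp[i - 1][j - 1]
            sub = (c + 1, s + 1, ins, d)
            c, s, ins, d = dp[i][j - 1]
            insert = (c + 1, s, ins + 1, d)
            c, s, ins, d = dp[i - 1][j]
            delete = (c + 1, s, ins, d + 1)
            dp[i][j] = min(sub, insert, delete)  # lowest total cost wins
    cost, subs, ins, dels = dp[R][H]
    return {"S": subs, "I": ins, "D": dels, "WER": cost / R if R else 0.0}
```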
Global and temporal analysis
Two levels of analysis were conducted:
- Global WER over the entire corpus.
- WER over fixed 10-second temporal windows, aligned with the audio.
The second approach makes it possible to identify local fragility zones that are invisible in a global average.
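A sketch of the windowed analysis, reusing normalize() and wer_counts() from the previous sketches; segments are assumed to be (start_time_in_seconds, text) pairs parsed from the aligned SRT files:

```python
WINDOW_SECONDS = 10.0

def words_by_window(segments: list[tuple[float, str]]) -> dict[int, list[str]]:
    """Bucket normalized words into fixed 10-second windows.
    Each segment's words are attributed to the window of its start time
    (a simplification: no interpolation inside the segment)."""
    buckets: dict[int, list[str]] = {}
    for start, text in segments:
        buckets.setdefault(int(start // WINDOW_SECONDS), []).extend(normalize(text))
    return buckets

def windowed_wer(ref_segments, hyp_segments) -> dict[int, float]:
    """WER per 10-second window, Whisper as reference, YouTube as hypothesis."""
    ref = words_by_window(ref_segments)
    hyp = words_by_window(hyp_segments)
    return {w: wer_counts(ref[w], hyp.get(w, []))["WER"]
            for w in sorted(ref) if ref[w]}
```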
Quantitative results
Global summary (entire corpus)
| Indicator | Value |
|---|---|
| Audio duration | 38 min 23 s |
| Words (Whisper) | 6,323 |
| Words (YouTube) | 6,317 |
| Substitutions (YouTube) | 331 |
| Insertions (YouTube) | 182 |
| Deletions (YouTube) | 188 |
| Global WER (YouTube vs Whisper) | 11.1% |
This corresponds to 701 edit operations over 6,323 reference words — roughly one alteration every nine words.
This value must be interpreted cautiously: it aggregates trivial and potentially critical errors.
Temporal distribution of errors
Analysis over fixed 10-second windows (230 windows in total) shows a heterogeneous distribution:
- Median WER ≈ 11%
- Mean WER ≈ 12%
- Local peaks exceeding 30%
Errors are therefore not uniformly distributed. They concentrate in specific parts of the discourse.
Targeted human audit: qualitative validation
Purpose of the audit
WER indicates where divergences occur, but not what they imply. A targeted human audit was therefore conducted to assess the real editorial impact of observed divergences.
Scope and method
- Audited portion: ~25% of the corpus
- Duration: 9 min 36 s
- Volume: 1,627 words after normalization
- Reference: original audio
- Unit of analysis: interpretable divergence (not isolated words)
For each divergence between Whisper and YouTube:
- the audio was listened to,
- both transcriptions were compared,
- the impact was classified.
Impact categories used
| Category | Definition |
|---|---|
| Critical | Changes the meaning or weakens the argument |
| Moderate | Affects clarity without reversing meaning |
| Trivial | Minor variation |
| Inaudible | Impossible to decide from the audio |
Qualitative audit results
On the audited portion:
- Most of the text is identical between Whisper and YouTube after normalization.
- Where the transcriptions diverge, the error lies markedly more often, and with greater impact, on the YouTube side.
- Whisper mainly produces trivial errors.
- YouTube concentrates more errors with moderate or critical impact, often linked to spurious insertions.
This audit confirms that WER peaks correspond to editorially sensitive passages.
Qualitative analysis: nature of the observed divergences
The targeted human audit makes it possible to go beyond purely quantitative metrics and to identify the structural nature of the errors.
Dominant types of YouTube errors
Across the audited segments, divergences in YouTube subtitles fall mainly into four categories:
- Spurious insertions: addition of words or connectors not present in the audio (“so”, “actually”, “exactly”, repeated segments). → Frequently impacts argumentative logic.
- Approximate lexical substitutions: replacement of an abstract term with a more common or phonetically similar word. → Moderate to critical impact depending on context.
- Loss of nuance through omission: removal of qualifiers or modalizers that are present in the spoken discourse. → Progressive weakening of the argument.
- Discursive fragmentation: segmentation constrained by subtitle display, producing incomplete or artificially split sentences. → Readable, but detrimental to conceptual continuity.
These errors reflect a filling strategy: when ambiguity arises, the system prioritizes textual continuity over strict fidelity to the audio.
Residual errors observed in Whisper large-v3
Whisper large-v3 is not error-free. However, the observed errors mostly consist of:
- minor orthographic variations,
- literal transcription of hesitations,
- imperfect normalization at word boundaries.
These errors have low editorial impact and do not alter either the meaning or the argumentative structure of the discourse.
Interpretation: why errors are not random
The temporal WER distribution and the qualitative observations converge on the same conclusion:
- errors increase in conceptually dense passages,
- rapid transitions between ideas are high-risk zones,
- successive abstractions place greater demands on the model.
This point is critical: errors do not occur randomly, but precisely where textual fidelity matters most for editorial use.
Discussion: scope and limits of the results
What this study demonstrates
Within the studied scope (long, analytical, single-speaker French discourse):
- Whisper large-v3 exhibits higher measurable lexical fidelity than YouTube subtitles.
- YouTube divergences are more often editorially significant.
- Comparable global WER values can mask very different editorial impacts, depending on the nature of the errors.
- Temporal alignment is a necessary prerequisite for any serious comparison.
What this study does not demonstrate
This study does not allow conclusions about:
- multi-speaker conversations,
- noisy or degraded audio,
- regional or strongly accented French,
- specialized technical domains,
- comparisons with professional or human transcription services.
Any extrapolation beyond this scope would be methodologically unsound.
Methodological limitations
- Partial human audit: only 25% of the corpus was audited qualitatively. This is sufficient to identify trends, but not to establish exhaustive truth.
- Single auditor: no inter-annotator agreement could be computed. The audit aims at qualitative consistency rather than statistical robustness.
- Non-human global reference: Whisper serves as the lexical reference for global WER. A full human transcription would enable an absolute comparison, at significantly higher cost.
These limitations are known, documented, and do not undermine the contextualized conclusions of the study.
Practical implications for editorial use
When Whisper large-v3 is appropriate
Whisper large-v3 constitutes a reliable foundation for:
- transcribing long, analytical speeches,
- text indexing and search,
- RAG pipelines,
- quotation extraction (with audio verification).
Operational recommendation:
- use Whisper as a first transcription pass,
- focus human review on dense passages or high-WER segments (see the sketch after this list),
- systematically validate critical quotations against the audio.
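The sketch below illustrates how such a review queue could be built on top of windowed_wer() from the analysis section; the 20% threshold is an arbitrary assumption to be tuned to editorial risk tolerance:

```python
REVIEW_THRESHOLD = 0.20  # assumed threshold, not derived from the study

def review_queue(ref_segments, hyp_segments) -> list[tuple[int, float]]:
    """Return (window_index, WER) pairs above the threshold, worst first,
    so human review effort is spent where lexical divergence is highest."""
    scores = windowed_wer(ref_segments, hyp_segments)
    flagged = [(w, s) for w, s in scores.items() if s >= REVIEW_THRESHOLD]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```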
When Whisper large-v3 is insufficient on its own
- legal or medical content,
- heavily noisy audio,
- spontaneous or multi-speaker conversations,
- real-time accessibility requirements.
In these contexts, human transcription or specialized services remain preferable.
When YouTube subtitles are sufficient
- accessibility and video captioning,
- quick consultation,
- coarse segment identification.
They should not be used as an editorial source, for quotation, or as an analytical base.
Going beyond the numbers
While these results highlight a clear performance gap in specific contexts, they also raise a fundamental question: is Word Error Rate (WER) still the right metric for professional transcription? For a deeper dive into why raw accuracy can be a misleading indicator and how to choose the right tool for your specific needs, read our methodological analysis: Beyond the bench: Why raw accuracy is a trap in speech-to-text evaluation.
Conclusion
On a clearly defined corpus, this study shows that Whisper large-v3 offers higher measurable lexical fidelity than YouTube’s automatic subtitles, especially where editorial stakes are highest.
This conclusion is:
- grounded in explicit metrics,
- qualitatively validated through a targeted human audit,
- transparent about its limitations.
Whisper large-v3 is not a universal solution. It is, however, a credible editorial tool, provided its use is framed, documented, and complemented by a reasoned human review.
Methodological transparency
- Audio duration: 38 min 23 s
- Language: French
- Corpus: single-speaker analytical discourse
- Metrics: global WER + temporal WER
- Human audit: qualitative, targeted (25%)
- Temporal alignment: applied prior to analysis
