ASR in 2025-2026: A Deep Dive into Speech Recognition Technology Selection
Comprehensive comparison of Chinese and English ASR systems — FireRedASR2, Qwen3-ASR, Doubao/Seed-ASR, FunASR, Whisper, and 16 cloud APIs. Covers accuracy benchmarks, streaming architectures, cloud pricing, self-hosted economics, failure modes, and embedding models for bilingual knowledge bases.
Table of Contents
- Report Summary
- ASR Industry Landscape and Technology Evolution Trends
- In-Depth Comparison of Core Chinese ASR Systems
- English and Chinese-English Code-Switching ASR Evaluation
- ASR Pipeline Component Selection (VAD / Punctuation / Speaker Diarization / Alignment)
- Streaming Architecture and Deployment
- Cloud Service Market and Pricing
- Self-Hosted vs. Cloud Economics
- Typical Failure Modes and System Limitations
- Extended Application — Vector Retrieval for Bilingual Transcription Knowledge Bases
- Scenario-Based Selection and Implementation Recommendations
- Conclusion
- Appendices — A. CLI Reference · B. Benchmark Data · C. References
1. Report Summary
Three core judgments run through this report:
First, LLM architectures now fully dominate ASR accuracy; traditional non-autoregressive models have retreated to the efficiency tier. Between 2025 and 2026, every model that set a new SOTA -- FireRedASR2-LLM, Qwen3-ASR, Fun-ASR (7.7B), NVIDIA Canary-Qwen -- adopted the Conformer encoder + LLM decoder architecture. Pure acoustic models like Paraformer-Large (220M) remain irreplaceable on the efficiency dimension (RTF 0.008, 1--2GB VRAM), but the accuracy gap on AISHELL-1 has widened to approximately 3x (1.68% vs 0.57%).
Second, the Chinese and English ecosystems have fully diverged. In Chinese scenarios, Whisper's CER is 2--10x that of purpose-built Chinese models (AISHELL-1: 5.14% vs FireRedASR2-AED 0.57%), making it unsuitable for any serious Chinese ASR workload. In English scenarios, cloud APIs (ElevenLabs 2.3% WER, AssemblyAI 3.3%) deliver excellent experiences with complete features, but open-source alternatives (NVIDIA Canary 4.4%, Whisper large-v3-turbo 4.8%) are rapidly closing the gap. The economic advantage of local deployment crushes cloud in batch scenarios -- processing 71 hours of audio on a 3090Ti costs less than $1 in electricity.
Third, there is no longer a "one-size-fits-all" model. The optimal solution strictly depends on four dimensions of the business scenario: offline vs streaming, pure Chinese vs Chinese-English code-switching, edge vs cloud, and accuracy-first vs cost-first. This report provides specific pipeline configurations and cost estimates for five typical scenarios.
2. ASR Industry Landscape and Technology Evolution Trends
Between 2025 and 2026, the speech recognition industry underwent five structural shifts that redefined the fundamental assumptions of technology selection.
2.1 Architectural Paradigm Shift: From Component Stacking to End-to-End Large Models
The traditional ASR pipeline was a multi-component stack of "acoustic model + language model + VAD + punctuation + speaker diarization." Starting in 2025, mainstream models evolved toward the Encoder + LLM Decoder end-to-end architecture. FireRedASR2-LLM uses Qwen2-7B as its decoder backbone, the Fun-ASR API runs a 7.7B-parameter model on a Qwen3 base (with training data at the tens-of-millions-of-hours scale), Qwen3-ASR-1.7B unifies language ID, punctuation, and streaming recognition into a single model, and NVIDIA Canary-Qwen uses Qwen3-1.7B as its decoder to achieve 4.4% English AA-WER.
The essence of this shift is: the world knowledge of language models has been injected into speech recognition. LLM decoders inherently understand context, proper nouns, and domain terminology, making transcription results no longer purely acoustic mappings but semantically-aware text generation. The cost is a significant increase in model size and compute requirements -- from Paraformer's 220M/1--2GB to FireRedASR2-LLM's 8B+/32GB.
2.2 The Cambrian Explosion of Chinese ASR
Within 14 months, the best CER on the AISHELL-1 benchmark plummeted from Paraformer-Large's 1.68% to FireRedASR2-AED's 0.57% -- a 66% improvement. The contenders include:
- FireRedASR (Xiaohongshu, 2025.01): The first open-source model with sub-1% CER
- FireRedASR2/2S (Xiaohongshu, 2026.02--03): A complete pipeline integrating VAD/LID/punctuation
- Qwen3-ASR (Alibaba Qwen, 2026.01): 52 languages + 22 Chinese dialects, unified streaming/offline
- Fun-ASR-Nano (Alibaba Tongyi, 2025.12): Tens of millions of hours of training data, achieving 1.96% CER with 0.8B parameters
- GLM-ASR-Nano (Zhipu): 4.10% multi-scenario CER
The leap in training data scale is the root cause of this explosion. Fun-ASR-Nano used "tens-of-millions-of-hours" of real speech data, and Qwen3.5-Omni reportedly exceeded 100 million hours. By comparison, the pre-2024 benchmark Paraformer-Large used only about 60,000 hours of labeled data -- the order-of-magnitude gap directly translates into an accuracy gap.
2.3 Whisper's Stagnation and Shrinking Footprint
OpenAI has not released Whisper v4. The latest model remains the October 2024 large-v3-turbo. OpenAI's strategic focus shifted to the API-only GPT-4o-transcribe (2025.03), incorporating speech recognition into a general multimodal model rather than a standalone ASR product.
Whisper's Chinese CER is now 5--10x behind the best open-source alternatives. On WenetSpeech Meeting (noisy meeting scenario), Whisper large-v3's CER reaches 18.87%, while FireRedASR2-LLM achieves only 4.32%. This isn't a marginal difference; it's a generation gap. The Fun-ASR technical report further warns: on independent industry test sets, Whisper's CER balloons to 16--32%, while Chinese-optimized models maintain 5--11%.
Whisper remains a reasonable choice in English scenarios (large-v3-turbo 4.8% WER), and its tool ecosystem (faster-whisper, WhisperX, stable-ts, whisper.cpp) is unmatched. But for Chinese ASR, Whisper should only be used as a multilingual fallback.
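For that fallback role, faster-whisper is the standard entry point. A minimal sketch (real faster-whisper API; the model alias assumes a release that ships the large-v3-turbo conversion, and the file path is a placeholder):

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "audio.wav",
    language="zh",                     # pin the language; autodetection is a common CER source
    vad_filter=True,                   # built-in Silero VAD suppresses hallucination
    condition_on_previous_text=False,  # avoid cross-segment repetition loops
)
for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```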
2.4 Practical Edge ASR
Edge ASR matured in 2025--2026:
- SenseVoice-Small (234M): <1GB VRAM, can run on Raspberry Pi, built-in punctuation + emotion + audio event detection, CER 2.96% (still better than Whisper large-v3's 5.14%)
- Moonshine-tiny-zh (27M): Chinese-specialized version released September 2025, with claimed accuracy matching Whisper Medium despite the tiny footprint
- Paraformer-v2 (220M): v2's switch to a CTC-based architecture improved noise robustness, CPU RTF 0.05--0.1 (an i7 processes 92 seconds of audio in about 9 seconds)
- Sherpa-ONNX: The only framework covering Android/iOS/HarmonyOS/Raspberry Pi/RISC-V/NPU, bindings for 12 programming languages, CPU-only ONNX inference
Edge ASR is no longer "barely usable" -- it has become the preferred solution in specific scenarios (offline devices, low-latency embedded systems, privacy-sensitive applications).
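A minimal sherpa-onnx sketch for the edge path, assuming a locally downloaded SenseVoice ONNX export (file names are placeholders; the from_sense_voice constructor exists in recent sherpa-onnx releases):

```python
import sherpa_onnx
import soundfile as sf

recognizer = sherpa_onnx.OfflineRecognizer.from_sense_voice(
    model="sense-voice-small.onnx",  # exported SenseVoice-Small checkpoint
    tokens="tokens.txt",
    use_itn=True,                    # inverse text normalization in the output
)

samples, sample_rate = sf.read("audio.wav", dtype="float32")
stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
recognizer.decode_stream(stream)
print(stream.result.text)
```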
2.5 Unified Models Replacing Component Stacking
The "Paraformer + FSMN-VAD + CT-Punc" three-component pattern was the gold standard for Chinese ASR in 2024. In 2026, this pattern is being replaced by single models:
- FireRedASR2: Integrates FireRedVAD + FireRedLID (language identification) + FireRedPunc, punctuation F1 78.90% (vs FunASR-Punc 62.77%)
- SenseVoice: Single model outputs punctuation + ITN + emotion tags + audio event tags
- Qwen3-ASR: Single model handles language ID + punctuation + unified streaming/offline inference
The benefit of unification goes beyond simpler deployment. When a standalone VAD mis-segments a speech fragment, the downstream ASR and punctuation models are both degraded; unified models avoid this error propagation by design. Choosing a 2026-era all-in-one model can therefore deliver both better accuracy and a simpler stack.
3. In-Depth Comparison of Core Chinese ASR Systems
3.1 Benchmarks vs Real-World Performance Gap
The Chasm Between Clean Speech and Real Scenarios
Chinese ASR benchmarks have a structural problem: the performance gap between clean read speech and real-scenario audio can reach 3--4x. Taking Whisper large-v3 as an example, its CER on AISHELL-1 (standard read speech dataset) is 5.14%, but on WenetSpeech Meeting (real meeting recordings) it surges to 18.87%, a 3.7x degradation. Even the current accuracy ceiling FireRedASR2-LLM, with only 0.64% CER on AISHELL-1, rises to 4.32% on WenetSpeech Meeting, a 6.75x degradation. This gap is universal across all systems:
| Model | AISHELL-1 CER | WenetSpeech Meeting CER | Degradation Factor |
|------|:---:|:---:|:---:|
| FireRedASR2-LLM | 0.64% | 4.32% | 6.75x |
| FireRedASR2-AED | 0.57% | 4.53% | 7.95x |
| Doubao-ASR API | 1.52% | 4.74% | 3.12x |
| Qwen3-ASR-1.7B | 1.48% | 5.88% | 3.97x |
| Fun-ASR API (7.7B) | 1.64% | 5.78% | 3.52x |
| Whisper large-v3 | 5.14% | 18.87% | 3.67x |
This means that selecting a model based solely on AISHELL-1 rankings can seriously mislead decisions. A model's 0.1 percentage-point lead on a clean dataset could easily be reversed in real noisy scenarios.
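The practical countermeasure is cheap: score every candidate model on a few hours of your own audio before trusting any leaderboard. A minimal CER sketch, assuming jiwer is installed and the reference/hypothesis pairs come from your own data:

```python
import jiwer

refs = ["今天下午三点开会", "请把报告发给我"]    # ground-truth transcripts
hyps = ["今天下午三点开会", "请把报告发给我吧"]  # model outputs

# jiwer.cer scores at the character level, matching Chinese CER conventions;
# strip spaces first so whitespace does not skew the edit distance.
cer = jiwer.cer(
    [r.replace(" ", "") for r in refs],
    [h.replace(" ", "") for h in hyps],
)
print(f"CER: {cer:.2%}")
```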
The API-vs-Paper Gap
The production performance of commercial APIs shows a meaningful gap from the numbers reported in papers. Two typical cases:
Seed-ASR / Doubao-ASR: The July 2024 Seed-ASR paper (arXiv:2407.04675) reported an AISHELL-1 CER of 0.68%, but the February 2026 measured CER of the commercial API was 1.52% -- the production system is significantly worse than the paper data. The gap is even larger on LibriSpeech-other: paper 2.84%, API measured 5.69%, a 2.0x difference. This gap likely stems from engineering compromises made by the commercial service to meet latency and throughput requirements (quantization, distillation, streaming truncation, etc.).
Fun-ASR / Bailian API: Alibaba's Fun-ASR technical report claims its API outperforms open-source models on internal real-world evaluation sets, but in FireRedASR2S's public benchmark tests (24 datasets), the Fun-ASR API's four-benchmark average CER is 4.16%, ranking below FireRedASR2-LLM (2.89%), FireRedASR2-AED (3.05%), Doubao-ASR (3.69%), and Qwen3-ASR (3.76%). There is a clear contradiction between the claim and measured results.
Dialect Recognition: The Biggest Current Technical Pain Point
Dialect recognition remains the hardest unsolved problem. The FireRedASR2S technical report (arXiv:2603.10420, March 2026) tested on 19 dialect benchmarks, showing that even the best system has an average CER above 11%:
| System | Dialect 19-Set Average CER |
|------|:---:|
| FireRedASR2-LLM | 11.55% |
| FireRedASR2-AED | 11.67% |
| Qwen3-ASR-1.7B | 11.85% |
| Fun-ASR API (7.7B) | 12.76% |
| Fun-ASR-Nano (0.8B) | 15.07% |
| Doubao-ASR API | 15.39% |
CER varies enormously across specific dialects. Shanghainese conversational CER ranges from 27% (FireRedASR2-AED) to 45% (Doubao-ASR), CER on harder Sichuanese samples falls in the 20%--25% range across all systems, and Cantonese short-segment CER ranges from 5% to 10.5% (Doubao-ASR 10.5%, FireRedASR2 and Qwen3-ASR around 5%). Notably, Doubao-ASR's dialect average of 15.39% is the worst among the four major systems, 33% higher than FireRedASR2-LLM's 11.55% -- a gap completely invisible from pure Mandarin benchmarks.
The Disconnect Between Benchmark Rankings and User Perception
User reports from V2EX diverge sharply from the benchmark rankings. One user who used Doubao for six months reported that "the speech recognition feels almost error-free in practice"; another stated flatly that "Jianying (CapCut, with Volcengine backend) is the most accurate -- Whisper and Paraformer accuracy are both much worse." Yet in public benchmark rankings, Doubao-ASR ranks only third or fourth.
The root cause of this disconnect is: CER measures character-by-character accuracy, but user perception is more heavily influenced by system-level factors such as context-aware decoding, production-grade punctuation/ITN post-processing, and hotword list support. The PPO reinforcement learning and multimodal visual disambiguation (using images to resolve homophone ambiguity) introduced in Seed-ASR 2.0 are optimization directions targeting "perception" rather than "CER." The weak correlation between benchmark rankings and user satisfaction reminds us not to treat numbers as the sole criterion.
Data Leakage Risk
The Fun-ASR technical report explicitly warns: performance on public test sets "may not reflect real capability," due to data leakage -- potential overlap between training and test sets. Measured data corroborates this concern: on independent industry test sets, Whisper's CER surges from the benchmark's 5--10% to 16--32%, while Chinese-specialized models maintain 5--11%. The real-scenario gap is likely larger than what public benchmarks show.
3.2 Detailed Analysis of the Four Top Chinese Systems
The following data comes from the FireRedASR2S technical report (arXiv:2603.10420, March 2026) unified evaluation across 24 test sets. Full comparison table of the four systems:
| Benchmark (CER, %) | FireRedASR2-LLM (8B+) | FireRedASR2-AED (1B+) | Doubao-ASR API | Qwen3-ASR-1.7B | Fun-ASR API (7.7B) |
|------|:---:|:---:|:---:|:---:|:---:|
| AISHELL-1 | 0.64 | 0.57 | 1.52 | 1.48 | 1.64 |
| AISHELL-2 | 2.15 | 2.51 | 2.77 | 2.71 | 2.38 |
| WenetSpeech test_net | 4.44 | 4.57 | 5.73 | 4.97 | 6.85 |
| WenetSpeech test_meeting | 4.32 | 4.53 | 4.74 | 5.88 | 5.78 |
| Mandarin four-benchmark average | 2.89 | 3.05 | 3.69 | 3.76 | 4.16 |
| Dialect 19-set average | 11.55 | 11.67 | 15.39 | 11.85 | 12.76 |
| 24-set overall average | 9.67 | 9.80 | 12.98 | 10.12 | 10.92 |
| Song (opencpop) | 1.12 | 1.17 | 4.36 | 2.57 | 3.05 |
FireRedASR2 (Xiaohongshu)
Technical Architecture: Xiaohongshu (RedNote) released FireRedASR v1 (arXiv:2501.14350) in January 2025 and v2 (FireRedASR2S, arXiv:2603.10420) in February 2026. Two variants are provided: AED (Attention-based Encoder-Decoder, 1.1B parameters) and LLM (8.3B parameters, with Qwen2-7B as the decoder backbone). The v2 version integrates FireRedVAD (voice activity detection), FireRedLID (language identification), and FireRedPunc (punctuation restoration), forming a complete unified pipeline. Open-source license is Apache 2.0.
Core Strengths:
- Current open-source Chinese ASR accuracy ceiling. The AED variant achieves 0.57% CER on AISHELL-1, the current SOTA on Papers with Code
- The LLM variant's four-benchmark average of 2.89% CER is the current open-source lowest
- Outstanding song recognition capability, with opencpop CER of only 1.12%, 74% lower than Doubao-ASR's 4.36%
- Dialect coverage of 20+ Chinese dialects, dialect average 11.55%, nearly on par with Qwen3-ASR (11.85%)
- The v2 integrated FireRedPunc punctuation restoration module achieves 78.90% F1, significantly better than FunASR-Punc's 62.77%
Specific Data: AISHELL-1 CER 0.57% (AED) / 0.64% (LLM), AISHELL-2 CER 2.15% (LLM best), WenetSpeech Meeting 4.32% (LLM), Song 1.12% (LLM). The AED variant provides word-level timestamps via CTC branch (new in v2); the LLM variant has no timestamps. AED requires 4--6GB VRAM, LLM requires >=32GB VRAM (A100/A800 recommended), disk usage approximately 21GB (LLM). GPU RTF approximately 0.02 (AED), TensorRT can achieve 12.7x acceleration.
Known Limitations:
- The AED variant has a 60-second maximum input length limit; exceeding 200 seconds causes hallucinations due to positional encoding errors
- The LLM variant requires 32GB+ GPU VRAM (A100/A800 recommended), high deployment barrier
- The v1 version lacked punctuation restoration capability; a Zhihu comment noted "this thing is fine for learning, but not usable for products." While v2 significantly improved this, the 60-second limit remains
- The v2 FireRedVAD pipeline partially mitigates the long audio problem, but requires carefully designed chunking strategies (a minimal chunking sketch follows this list)
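A minimal chunking sketch under the constraints above, using FunASR's FSMN-VAD interface; the 55-second budget and merge policy are illustrative, and a single VAD segment that alone exceeds the budget would still need a hard split (omitted):

```python
from funasr import AutoModel

vad = AutoModel(model="fsmn-vad")
# FSMN-VAD returns speech segments as [[start_ms, end_ms], ...]
segments = vad.generate(input="long_talk.wav")[0]["value"]

MAX_MS = 55_000  # stay safely under the 60 s AED input limit
chunks = []
cur_start, cur_end = None, None
for start, end in segments:
    if cur_start is not None and end - cur_start <= MAX_MS:
        cur_end = end                       # greedily extend the current chunk
    else:
        if cur_start is not None:
            chunks.append((cur_start, cur_end))
        cur_start, cur_end = start, end
if cur_start is not None:
    chunks.append((cur_start, cur_end))

# Each (start_ms, end_ms) window is then sliced from the waveform and sent
# to FireRedASR2-AED as an independent utterance.
print(chunks)
```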
Doubao-ASR / Seed-ASR (ByteDance / Volcengine)
Technical Architecture: The product is confirmed to be Seed-ASR -- the commercial name is "Doubao-Seed-ASR-2.0," and the internal API resource ID contains the "seedasr" string. The Seed-ASR paper (arXiv:2407.04675, July 2024) describes an LLM-based end-to-end architecture using a 2B-parameter Conformer encoder, trained on over 20 million hours of speech. The v2.0 version (December 2025) added PPO reinforcement learning for context reasoning optimization, multimodal visual disambiguation (using images to resolve homophone ambiguity), and support for 13 foreign languages. Closed-source, available only through the Volcengine API.
Core Strengths:
- Context-aware capability is the strongest among the four systems -- leverages conversation history and domain context for disambiguation, a capability that pure CER metrics cannot capture
- Most complete production-grade post-processing: built-in punctuation, ITN (inverse text normalization), hotword lists (up to 200 tokens), replacement word rules
- Best subjective experience -- long-term V2EX user feedback reports "feels almost error-free in practice"
- The same technology stack powers Douyin, Feishu, Pico VR, and other products with hundreds of millions of users, with the most thorough production validation
Specific Data: AISHELL-1 CER 1.52% (API measured), AISHELL-2 CER 2.77%, WenetSpeech Meeting 4.74%, four-benchmark average 3.69%. Provides millisecond-precision sentence-level timestamps and speaker diarization. Streaming latency is sub-second. Regarding noise robustness, the FireRedASR v1 paper shows Seed-ASR (believed to be "ProviderA-Large" in the paper) has a 23.7% relative CER disadvantage on real multi-source speech (short videos, livestreams, voice input).
Known Limitations:
- Closed-source API-only, no self-deployment option; uncontrollability is the biggest risk
- Registration requires a mainland China phone number and real-name verification, creating access barriers for international developers
- The API-vs-paper gap is meaningful: AISHELL-1 dropped from the paper's 0.68% to the API's 1.52%, LibriSpeech-other from 2.84% to 5.69%
- Worst dialect performance among the four systems: 19-set average 15.39%, 33% higher than FireRedASR2-LLM
- Seed-ASR 2.0 has no published English-specific benchmarks; baseline comparisons use data from other papers rather than controlled tests
Qwen3-ASR (Alibaba Tongyi)
Technical Architecture: Released January 2026, 1.7B parameters, Conformer encoder + Qwen LLM decoder architecture. Supports 52 languages (including 22 Chinese dialects), with unified streaming/offline inference mode via dynamic FlashAttention (2-second chunks). Apache 2.0 open-source license.
Core Strengths:
- Broadest multilingual and dialect coverage: 52 languages + 22 Chinese dialects, far exceeding other systems
- Unified streaming and offline architecture, 92ms Time-to-First-Token, achieving 2000x real-time throughput at 128 concurrent sessions on GPU
- Excellent Chinese-English code-switching performance -- trained with 40,000+ synthesized English keyword data covering technology, education, finance, sports, and other domains
- Dialect 19-set average 11.85%, only slightly behind FireRedASR2, far better than Doubao-ASR's 15.39%
- Outstanding noise robustness: WER of 16.17% under extreme noise, compared to Whisper large-v3's 63.17%
Specific Data: AISHELL-1 CER approximately 1.48% (from FireRedASR2S evaluation), AISHELL-2 CER 2.71%, WenetSpeech test_net 4.97%, four-benchmark average 3.76%. No built-in timestamps; requires the separate Qwen3-ForcedAligner-0.6B model (average alignment precision 36.5ms, better than WhisperX and MFA). Timestamps not supported in streaming mode. VRAM requirement 6--8GB, CPU-only inference not supported.
Known Limitations:
- Infinite token repetition defect (GitHub issue #129), where the model gets stuck repeating phrases thousands of times in an infinite loop
- Streaming mode is "pseudo-streaming" -- each chunk is reprocessed from scratch with no cross-segment context, causing repeated transcription and format leakage at boundaries
- Approximately 148 open GitHub issues as of April 2026, raising maturity concerns
- Accuracy degradation on newer Transformers library versions (v5.3/5.4)
- The official paper does not report AISHELL-1 CER -- a notable omission for the most standard Chinese ASR benchmark
- No built-in ITN (inverse text normalization), requires community tools
FunASR / Bailian API (Alibaba)
Technical Architecture -- Key Finding: The commercial Fun-ASR and the open-source Paraformer are entirely different systems. This is the easiest trap to fall into during selection. The commercial Fun-ASR API (Bailian platform, launched late 2025) runs a 7.7B-parameter LLM architecture system (Qwen3 base), with training data reaching tens of millions of hours. The open-source Paraformer-Large is a 220M-parameter non-autoregressive model, trained on approximately 60,000 hours. They share a brand name, but their architectures are completely unrelated -- a 35x difference in parameter count, and orders of magnitude more training data.
The Bailian platform currently offers three ASR tiers: Fun-ASR (new LLM version, recommended for most scenarios, supports song recognition, 7 Chinese dialect groups, 26 regional accents, RAG hotword customization); Paraformer-v2 (lightweight fast alternative, CTC architecture v2 significantly improved noise robustness); SenseVoice (marked as "soon to be discontinued," documentation guides users to migrate).
Core Strengths:
- Most mature ecosystem: Docker deployment, 2-pass streaming, hotword customization, WebSocket server, with the most production deployment experience
- Open-source Paraformer-Large (220M parameters) remains the best cost-performance choice among sub-300M models: 1--2GB VRAM, GPU RTF 0.008, CPU RTF 0.05--0.1, supports CPU-only processing of 92 seconds of audio in about 9 seconds on an i7
- FunASR 2-pass streaming mode is validated at scale by Alibaba and Chinese industry, with 4vCPU/8GB RAM supporting 16 concurrent streams, and 64vCPU/128GB RAM supporting 100+ streams
- The commercial Fun-ASR API achieves 7.60% WER on real-world industry data, a 30%+ improvement over Paraformer-Large's approximately 11%
Specific Data: Fun-ASR API four-benchmark average 4.16%, AISHELL-1 CER 1.64%, AISHELL-2 CER 2.38% (better than Doubao-ASR and Qwen3-ASR on this metric), WenetSpeech Meeting 5.78%. Open-source Paraformer-Large four-benchmark average approximately 4.56%, AISHELL-1 CER 1.68%. Fun-ASR-Nano (0.8B) four-benchmark average 4.55%. Both word-level and sentence-level timestamps are supported.
Known Limitations:
- Open-source Paraformer accuracy has fallen significantly behind SOTA: AISHELL-1 CER 1.68% vs FireRedASR-AED's 0.55%, a 3x gap
- FunASR Python package import takes over 1 minute (dependency loading issue)
- The commercial Fun-ASR LLM model inherits the inherent hallucination risk of LLMs -- may generate text not present in the audio
- The open-source toolkit has persistent bugs: WebSocket server hotword parameters silently ignored, ONNX punctuation model loading failures, timestamp array mismatches (character/timestamp count misalignment, end truncation)
- Open-source ITN has known bugs: e.g., "shi yi dian" (eleven o'clock) incorrectly converted to "10 yi dian"
- SenseVoice is soon to be discontinued; users of this model face migration needs
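To make the component-stacking pattern concrete: the classic Paraformer + FSMN-VAD + CT-Punc pipeline assembles in a few lines through FunASR's AutoModel interface (real API; the model aliases resolve to ModelScope checkpoints, and the hotword string is illustrative):

```python
from funasr import AutoModel

# Paraformer-Large offline ASR + FSMN-VAD segmentation + CT-Punc punctuation
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
)
res = model.generate(input="meeting.wav", hotword="Paraformer 魔搭")
print(res[0]["text"])
```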
3.3 Other Chinese ASR Solutions
SenseVoice-Small / Large
A multi-task ASR model from Alibaba DAMO Academy, unique in integrating emotion detection, audio event labeling, and punctuation restoration into the recognition output. SenseVoice-Small is only 234M parameters (0.23B), requires less than 1GB GPU VRAM, can even run on a Raspberry Pi, and achieves 2.96% CER on AISHELL-1 -- still better than Whisper large-v3's 5.14%, with a model size only 1/6 of the latter. SenseVoice-Large (1.6B parameters) has lower CER: AISHELL-1 2.09%, AISHELL-2 3.04%, WenetSpeech Meeting 6.73%. Both support code-switching across 5 languages.
However, the key limitation is: SenseVoice is marked as "soon to be discontinued" on the Bailian platform (original documentation text), and users are being guided to migrate to Fun-ASR or Paraformer-v2. Additionally, SenseVoice has no native streaming support; sherpa-onnx simulates streaming through VAD + chunked offline inference, but this is not true streaming. As a lightweight option for edge deployment or resource-constrained scenarios, SenseVoice-Small still has value before its discontinuation.
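Until that happens, SenseVoice-Small remains easy to try through the same FunASR interface. A sketch showing its rich tagged output (real FunASR/SenseVoice API; the audio path is a placeholder):

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", device="cpu")
res = model.generate(input="audio.wav", language="auto", use_itn=True)

print(res[0]["text"])  # raw output with <|zh|><|HAPPY|><|Speech|>-style tags
print(rich_transcription_postprocess(res[0]["text"]))  # cleaned readable text
```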
Fun-ASR-Nano (0.8B)
Released December 2025 (Fun-ASR-Nano-2512), 0.8B parameters, trained on tens of millions of hours of real speech, VRAM requirement 2--4GB. AISHELL-1 CER 1.76% (close to Paraformer-Large's 1.68%), AISHELL-2 CER 2.80% (separately reported) / 3.02% (FireRedASR2S unified evaluation). Supports 31 languages, streaming mode, and native punctuation output. Positioned as a balance between accuracy and efficiency -- with only 1/10 the parameters of the Fun-ASR API (7.7B), its four-benchmark average of 4.55% is nearly identical to Paraformer-Large's 4.56%, while its dialect average of 15.07% sits in the same tier as Doubao-ASR's 15.39%.
Moonshine-tiny-zh (27M)
An extreme edge model released in September 2025 as part of the Moonshine "Flavors" series, with only 27M parameters (0.027B), claiming accuracy matching Whisper Medium. This is currently the smallest known Chinese ASR model. Supports deployment on CPU via sherpa-onnx, suitable for Android, iOS, Raspberry Pi, and other edge devices. Does not support code-switching, no native punctuation output. When VRAM and compute are extremely constrained, Moonshine-tiny-zh provides a "better than nothing" option.
WeNet U2++
Approximately 100M parameters (0.1B), an open-source project from Mobvoi and Tsinghua University, providing native streaming support (unified two-pass architecture). AISHELL-1 CER 4.63%; its accuracy is no longer competitive in the current landscape, but its architecture design remains influential in academia. Limited code-switching support, no native punctuation output. Valuable as a learning and research reference, but no longer recommended for production selection.
MiniMax / Moonshot / Zhipu ASR Status
MiniMax: Despite the international website (minimax.io) listing "Speech Recognition API," the official developer documentation (platform.minimaxi.com) does not expose any ASR endpoint. The API reference contains only TTS models (speech-2.8-hd, speech-2.8-turbo, etc.), voice cloning, and sound design. MiniMax's own simultaneous interpretation GitHub demo uses Whisper as the ASR component. Third-party analysis shows its STT functionality is a wrapper around Whisper Large. MiniMax does lead in the TTS domain (first place in both Artificial Analysis Speech Arena and HuggingFace TTS Arena), but has no competitive proprietary ASR offering.
Moonshot / Kimi: No commercial ASR API. Their open-source Kimi-Audio-7B achieves 0.60% WER on AISHELL-1 -- a remarkable number, approaching FireRedASR2-AED's 0.57% CER level. Note, however, that the model exists only in open-source form with no cloud API service, and reports WER rather than CER, so direct cross-system comparison requires caution.
Zhipu AI: Is the only one among emerging AI companies to offer a formal ASR product. GLM-ASR and GLM-ASR-2512 cloud APIs support streaming recognition, 8 Chinese dialects, and OpenAI-compatible SDK integration. GLM-ASR-Nano (1.5B parameters) has a multi-scenario average CER of 4.10%. Less prominent than the four leading systems, but a fully functional commercial ASR API.
3.4 The Technical Foundation of CapCut / Jianying
Volcengine Speech Team RNN-T + Conformer Architecture
CapCut / Jianying uses ByteDance's self-developed ASR technology, built by the Volcengine speech team and delivered through the Volcengine cloud platform. It is not FunASR and it is not Paraformer -- those belong to Alibaba's technology stack, an entirely different company and codebase.
The core architecture is an end-to-end streaming system based on RNN-T (Recurrent Neural Network Transducer) + Conformer encoder, integrating ASR, endpoint detection, punctuation restoration, inverse text normalization (ITN), and intelligent sentence segmentation into a unified pipeline. In January 2023, this system achieved 99% recognition accuracy in a specific certification test at the China AI National Testing Center (the specific test set and metric definitions were not disclosed; performance under real noisy scenarios may differ significantly). The same engine powers Douyin (TikTok China), Feishu (Lark), and Pico VR.
Relationship with Seed-ASR
Seed-ASR is ByteDance's next-generation ASR system, trained on over 20 million hours of speech, based on an LLM architecture. Seed-ASR achieves a 10--40% error rate reduction relative to Whisper and is currently being gradually integrated into the Doubao platform. CapCut / Jianying's ASR backend is the Volcengine ASR service, belonging to the same technology lineage as Doubao-ASR / Seed-ASR, but the specific deployed version may differ from the research model described in the Seed-ASR 2.0 paper due to productionization differences (as discussed in the API-vs-paper gap above).
AsrTools Open-Source Reverse Engineering
AsrTools (GitHub 2.6k stars) is an open-source project that reverse-engineers the cloud ASR APIs of CapCut / Jianying, Bilibili, and Kuaishou. Users need no GPU and directly call these platforms' backend services via network to obtain ByteDance-level ASR quality. For users who do not want to build their own infrastructure but still want high-quality Chinese ASR, AsrTools provides a zero-cost shortcut -- at the cost of depending on the stability of reverse-engineered interfaces, and operating in a legal gray area.
4. English and Chinese-English Code-Switching ASR Evaluation
The English ASR market underwent a significant landscape reshaping in 2025--2026: multiple cloud APIs surpassed Whisper in accuracy, while open-source solutions led by NVIDIA caught up rapidly. Unlike the Chinese market where domestic models dominate unchallenged, the English track presents a triple parallel landscape of highly mature cloud APIs, practical open-source solutions, and the emergence of the LLM transcription paradigm.
4.1 English Commercial API Tiers (AA-WER v2.0 Benchmark)
The most reliable independent English ASR benchmark currently is the AA-WER v2.0 published by Artificial Analysis, which tested 41 models across three real-world datasets (AgentTalk, VoxPopuli, Earnings-22), covering conversations, parliamentary speeches, and earnings calls. The following analysis is grouped by accuracy tier.
Tier 1: Extreme Accuracy
ElevenLabs Scribe v2, whose v2 update shipped in February 2025, ranks first among all dedicated ASR APIs at 2.3% WER. Priced at $0.40/hour, it sits at the median price level. Core features include word-level timestamps, diarization for up to 32 speakers, and unique audio event annotation (laughter, applause, etc.), with streaming latency around 150ms. The main limitation is that it is relatively new and cloud-only.
Close behind are Gemini 3 Pro (2.9% WER, $4.80/1000 min) and Gemini 3 Flash (3.1% WER, $1.92/1000 min). (Note: Gemini pricing is per 1000 minutes, not per hour as listed for other vendors in this section.) Notably, Google's LLM transcription models perform excellently, but their traditional STT product Chirp shockingly recorded 31.3% WER--a more than 10x accuracy gap between two products from the same company.
Tier 2: All-in-one Productivity
AssemblyAI Universal-3 Pro recorded 3.3% WER at $0.21/hour, making it the best accuracy-per-dollar ratio among top-tier APIs. AssemblyAI deliberately optimizes for "usable output" rather than raw WER: Universal-2 improved proper noun accuracy by 24% over its predecessor, alphanumeric handling by 21%, and human preference ratings by 73%. Its Speech Understanding API consolidates transcription, summarization, sentiment analysis, and topic detection into a single call.
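A sketch of a single AssemblyAI call that turns on diarization plus the semantic layer (assemblyai Python SDK; the API key and audio URL are placeholders, and enabled features bill on top of the base rate):

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"

config = aai.TranscriptionConfig(
    speaker_labels=True,      # diarization
    auto_chapters=True,       # chapter/topic segmentation
    sentiment_analysis=True,  # per-sentence sentiment
)
transcript = aai.Transcriber().transcribe("https://example.com/call.mp3", config)

for utt in transcript.utterances:
    print(f"Speaker {utt.speaker}: {utt.text}")
```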
Gladia Solaria-1 offers the most complete single-API feature set at 4.2% WER and a flat $0.50/hour price (no surcharges): transcription, speaker diarization, timestamps, punctuation, semantic chapterization, summarization, named entity recognition, sentiment analysis, and 99-language translation. Semantic chapterization is extremely rare among ASR APIs, and code-switching (automatic mid-stream language detection) is a signature advantage.
Speechmatics Enhanced (Ursa 2) recorded 4.3% WER at $0.40/hour with comprehensive features.
Tier 3: Extreme Cost-effectiveness and Speed
Deepgram Nova-3 trades 5.4% WER for unmatched speed (130--473x real-time) and the lowest pricing at $0.26/hour (batch). Word timestamps, speaker diarization, punctuation, keyword hints, and PII redaction are all included in a single API call with no surcharges. Its predecessor Nova-2 recorded 5.6% WER with speeds up to 473x real-time--the fastest among all APIs. (Note: Nova-3 WER figures vary across sources; AA-WER v2.0 benchmark reports 5.4% for Nova-3.)
Gemini 2.0 Flash Lite achieves 4.0% WER at only $0.19/1000 min, making it the ultimate cost choice, but it follows the LLM transcription route with limited post-processing flexibility.
Tier 4: LLM Transcription New Forces
OpenAI GPT-4o Transcribe recorded 4.1% WER with hallucinations reduced by approximately 90% compared to Whisper v2 (December 2025 data). However, key limitations exist: no word-level timestamps (only the legacy Whisper-1 model supports them), speaker diarization requires calling a separate gpt-4o-transcribe-diarize model which only provides sentence-level timestamps, and the maximum chunk limit is 23 minutes. Priced at $0.36/hour.
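Usage is a single call through the official SDK; the caller owns chunking below the 23-minute cap and any diarization (the file name is a placeholder, and OPENAI_API_KEY is read from the environment):

```python
from openai import OpenAI

client = OpenAI()
with open("earnings_call.mp3", "rb") as f:
    result = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=f,
    )
print(result.text)  # plain text; no word-level timestamps on this model
```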
Microsoft MAI-Transcribe-1 (April 2026) claims to achieve the lowest WER among commercial models across 25 languages at $0.36/hour, but independent verification is still pending.
Pitfall Guide
Google Cloud Speech-to-Text (Chirp) recorded 31.3% WER on the same benchmark--an incredibly poor number considering Google's own Gemini models can achieve 4.0%. Azure Native STT also lags behind with WER in the 10--14% range. Both giants' traditional STT products have severely fallen behind the leading tier, but both can access OpenAI models as alternatives through their respective platforms.
Amazon Transcribe maintains usable accuracy at 4.3% WER, but its $1.44/hour pricing is the most expensive in its tier. Rev.ai Reverb has an estimated WER of approximately 5--6% with $0.12/hour pricing offering excellent cost-effectiveness, but lacks independent benchmark verification.
English Commercial API Overview Table
| Vendor | Model | AA-WER v2.0 | Price/Hour | Word Timestamps | Speaker Diarization | Streaming |
|------|------|:-----------:|:---------:|:--------:|:----------:|:----:|
| ElevenLabs | Scribe v2 | 2.3% | $0.40 | ✅ | ✅ (32 speakers) | ✅ (~150ms) |
| AssemblyAI | Universal-3 Pro | 3.3% | $0.21 | ✅ | ✅ | ✅ |
| AssemblyAI | Universal-2 | 4.0% | $0.15 | ✅ | ✅ (+$0.02) | ✅ |
| OpenAI | GPT-4o Transcribe | 4.1% | $0.36 | ❌ | Separate model | ✅ |
| Gladia | Solaria-1 | 4.2% | $0.50 | ✅ | ✅ | ✅ |
| Amazon | Transcribe | 4.3% | $1.44 | ✅ | ✅ | ✅ |
| Speechmatics | Enhanced (Ursa 2) | 4.3% | $0.40 | ✅ | ✅ | ✅ |
| Deepgram | Nova-3 | 5.4% | $0.26 | ✅ | ✅ | ✅ (<300ms) |
| Rev.ai | Reverb | ~5--6% | $0.12 | ✅ | ✅ | ✅ |
| Azure | Native STT | ~10--14% | $0.36 | ✅ | ✅ | ✅ |
| Google | Chirp | 31.3% | $0.96 | ✅ | ✅ | ✅ |
4.2 Seed-ASR English Performance
ByteDance's Seed-ASR paper (arXiv:2407.04675, July 2024) reported competitive English data for its multilingual model (Seed-ASR ML). The model is based on a 2 billion parameter Conformer encoder trained with approximately 125K hours of English supervised data.
On standard benchmarks, Seed-ASR claims:
- LibriSpeech test-clean: 1.58% WER (slightly better than AssemblyAI Universal-1's 1.6%)
- LibriSpeech test-other: 2.84% WER (better than Universal-1's 3.1%)
- Switchboard: 11.59% WER (better than Whisper v2's 13.8%)
- CallHome: 12.24% WER (better than Whisper v2's 17.6%)
- AMI IHM: 13.16% WER (better than Whisper v2's 16.9%)
- Internal multi-domain English test: 5.34% WER, achieving a 42% relative reduction compared to Google USM's 9.33%
However, a critical caveat must be attached. No independent third party has verified these numbers. Seed-ASR is not open-source, does not appear on any public leaderboard, and is only available through the Volcengine commercial API. The baseline comparisons in the paper use data from other papers rather than controlled head-to-head evaluations.
Particularly noteworthy is the significant gap between production API and paper numbers. Taking the Doubao-ASR API real-world tests as an example: AISHELL-1 CER is 1.52% (paper: 0.68%), LibriSpeech-other WER is 5.69% (paper: 2.84%). This API-vs-paper gap is prevalent in commercial deployments, stemming from latency/throughput optimization tradeoffs against accuracy.
Seed-ASR 2.0 is a commercial product upgrade adding PPO reinforcement learning, multimodal visual disambiguation, and support for 13 foreign languages, but no English-specific benchmarks have been published. The paper also does not report Earnings-22, GigaSpeech, or CommonVoice results, making direct comparisons with 2025--2026 competitors incomplete.
4.3 English Open-Source Solutions
Open-source English ASR closed the gap with commercial APIs in 2025--2026, led by NVIDIA.
NVIDIA Canary-Qwen-2.5B
Using a Conformer encoder + Qwen3-1.7B decoder architecture, it recorded 4.4% WER on AA-WER v2.0 with processing speed of 418x real-time -- the highest accuracy among open-source English models. As a completely free self-hosted solution, its accuracy is already in the same range as GPT-4o Transcribe (4.1%) and Amazon Transcribe (4.3%), with low hallucination risk. The 2.5B parameter count, however, effectively requires a GPU.
NVIDIA Parakeet-TDT-0.6B
Another extreme route: a non-autoregressive CTC/TDT decoder with only 0.6B parameters, trading 6.8% WER for extreme throughput of over 2000x real-time. Its non-autoregressive architecture is structurally immune to hallucination -- there is no autoregressive decoder whose attention can diverge into repetition loops. Suitable for scenarios that demand extremely high throughput and can tolerate a slight accuracy loss.
Whisper large-v3-turbo
Although OpenAI has not released Whisper v4 (shifting to API-only GPT-4o-transcribe), Whisper large-v3-turbo remains a pragmatic choice for English self-hosting. 4.8% WER is competitive, it requires only 5--6GB VRAM to run, and its ecosystem is unmatched: faster-whisper (CTranslate2 acceleration), WhisperX (wav2vec2 forced alignment + Silero VAD), stable-ts (stable timestamps), and whisper.cpp (CPU inference) form a complete toolchain.
For English self-hosting scenarios that don't pursue extreme accuracy but value ecosystem maturity, the Whisper large-v3-turbo + WhisperX combination remains the solution with the least engineering resistance.
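The two-stage transcribe-then-align flow looks like this (WhisperX's documented pattern; the model alias and paths are placeholders, and a CUDA device is assumed):

```python
import whisperx

device = "cuda"
model = whisperx.load_model("large-v3-turbo", device, compute_type="float16")
audio = whisperx.load_audio("audio.wav")
result = model.transcribe(audio, batch_size=16)

# Stage 2: wav2vec2 forced alignment refines segments to word timestamps
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)
for word in aligned["segments"][0]["words"]:
    print(word["word"], word.get("start"), word.get("end"))
```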
English Open-Source Solution Comparison
| Model | Parameters | AA-WER / WER | Real-time Speed | VRAM | Hallucination Risk | Core Advantage |
|------|:----:|:------------:|:--------:|:----:|:--------:|----------|
| Canary-Qwen-2.5B | 2.5B | 4.4% | 418x | -- | Low | Open-source accuracy leader |
| Parakeet-TDT-0.6B | 0.6B | 6.8% | >2000x | -- | Zero | Extreme throughput, zero hallucination |
| Whisper large-v3 | 1.55B | 4.3% | 91--154x | ~10GB | High | Benchmark reference |
| Whisper large-v3-turbo | -- | 4.8% | -- | 5--6GB | High | Most mature ecosystem |
4.4 Chinese-English Code-Switching Special Topic
Why Traditional Models Struggle with Chinese-English Code-Switching
Code-switching refers to the phenomenon where speakers alternate between two or more languages within the same sentence or passage, such as "我们用transformer做的这个API有三个endpoint" (mixed Chinese-English). Traditional ASR models face three difficulties in this scenario:
- Training data bias: The vast majority of ASR training sets are monolingual, and annotated Chinese-English mixed data is extremely scarce. Acoustic and language models trained this way assume a single language distribution, so probability estimates become severely distorted at language switching points.
- Acoustic-linguistic dual uncertainty: At switching points, acoustic features abruptly shift from one language's phoneme space to another, while the language model's n-gram/context probabilities simultaneously fail. Autoregressive decoders are especially vulnerable here -- the context generated from preceding Chinese tokens cannot provide effective conditional probabilities for subsequent English tokens.
- Vocabulary and tokenization conflicts: Chinese character/word-based tokenization and English subword-based tokenization (BPE/SentencePiece) produce boundary conflicts in mixed text, reducing tokenization efficiency and causing a surge in OOV (out-of-vocabulary) tokens.
Code-Switching Strategies Across Systems
Qwen3-ASR-1.7B employs the most systematic code-switching training approach. The team synthesized 40,000+ English keywords across technology, education, finance, and sports domains, injecting them into Chinese training data to generate large amounts of Chinese-English mixed training samples. Combined with native multilingual training across 52 languages and support for 22 Chinese dialects, Qwen3-ASR has the broadest language coverage for code-switching. However, its known issues--infinite token repetition (GitHub issue #129) and pseudo-streaming boundary repetition--may be amplified in code-switching scenarios.
Seed-ASR / Doubao-ASR introduced a Function Call strategy in v2.0 specifically optimizing code-switching. Seed-ASR's core advantage lies in context-aware capability--using conversation history and domain context for disambiguation. This explains why Doubao-ASR ranks third/fourth on public benchmarks, but V2EX users subjectively perceive it as "almost error-free." However, Doubao-ASR has the worst dialect CER among the four systems (15.39% average), meaning performance may deteriorate sharply when code-switching overlaps with dialects.
Gladia Solaria-1 offers fully automatic mid-stream language detection supporting code-switching across 99 languages without requiring users to preset the language. This "zero-configuration" approach is extremely attractive for scenarios where the language composition of audio is uncertain. Its 4.2% AA-WER accuracy foundation is also solid enough. The flat $0.50/hour pricing includes code-switching capability with no additional surcharge.
FireRedASR2 supports code-switching but it is not its core selling point. The system's advantages concentrate on pure Chinese accuracy (AISHELL-1 CER 0.57%), 20+ dialect coverage, and song recognition (opencpop CER 1.12%). It can serve as a backup for Chinese-English code-switching scenarios, but is not as targeted as the three systems above.
Fun-ASR's code-switching relies on explicit user-configured language hints, with limited automatic detection capability. The open-source version's streaming Chinese-English recognition is constrained by the streaming-paraformer-bilingual-zh-en model, which is not as capable as Qwen3-ASR's native multilingual streaming. The commercial Fun-ASR API (7.7B LLM architecture) performs better on code-switching, but the specific improvement magnitude lacks public benchmark support.
4.5 "Full-featured One-stop" API Comparison
For production scenarios requiring "one API call to solve all problems," the following four APIs form differentiated competition in feature completeness.
Feature Completeness Matrix
| Feature | Deepgram Nova-3 | AssemblyAI | Gladia Solaria-1 | ElevenLabs Scribe v2 |
|------|:---------------:|:----------:|:-----------------:|:--------------------:|
| Word-level timestamps | ✅ | ✅ | ✅ | ✅ |
| Speaker diarization | ✅ (40x faster than competitors) | ✅ | ✅ | ✅ (32 speakers) |
| Punctuation restoration | ✅ | ✅ | ✅ | ✅ |
| Semantic chapterization | ❌ | ✅ (topic detection) | ✅ (chapterization) | ❌ |
| Summary generation | ❌ | ✅ | ✅ | ❌ |
| Sentiment analysis | ❌ | ✅ | ✅ | ❌ |
| NER/entity detection | ❌ | ✅ | ✅ | ❌ |
| PII redaction | ✅ | ✅ | ❌ | ❌ |
| Translation | ❌ | ❌ | ✅ (99 languages) | ❌ |
| Audio event annotation | ❌ | ❌ | ❌ | ✅ (exclusive) |
| Code-switching | ❌ | ❌ | ✅ | ❌ |
| Filler word detection | ✅ | ❌ | ❌ | ❌ |
| Price/Hour | $0.26 | $0.15--0.75 | $0.50 flat | $0.40 |
Deepgram Nova-3 is closest to "full-featured single call": word timestamps, speaker diarization, smart formatting, PII redaction, and filler word detection are all activated via parameters with no surcharges. The only missing feature is semantic chapterization. Its 130--473x real-time speed makes it irreplaceable in high-throughput scenarios.
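A sketch of that single call over Deepgram's plain REST interface, which is stable across SDK versions (the API key and file are placeholders; the query flags mirror the features above):

```python
import requests

with open("audio.wav", "rb") as f:
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={
            "model": "nova-3",
            "smart_format": "true",  # punctuation + number/date formatting
            "diarize": "true",       # speaker labels
            "filler_words": "true",  # keep "um"/"uh" in the transcript
        },
        headers={
            "Authorization": "Token YOUR_API_KEY",
            "Content-Type": "audio/wav",
        },
        data=f,
    )
transcript = resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]
print(transcript)
```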
AssemblyAI Speech Understanding provides the deepest semantic understanding through its AI analysis layer: summarization, topic detection, sentiment analysis, and entity detection are completed in a single call, and its chapter and topic detection features essentially provide semantic chapterization. However, with all features enabled, pricing may stack to 2--5x the base price (starting from $0.15/hour).
Gladia Solaria-1 is the most feature-complete single API, with all features included at a $0.50/hour flat price. Semantic chapterization, cross-99-language translation, and code-switching are its unique differentiators. For scenarios requiring "set it and forget it" with comprehensive feature needs, Gladia offers the best overall value.
ElevenLabs Scribe v2 leads exclusively in accuracy (2.3% WER), and audio event annotation (laughter, applause, etc.) is an exclusive feature, but it lacks semantic chapterization and AI analysis capabilities. Suitable for scenarios demanding extreme accuracy without deep semantic processing.
4.6 Community Consensus and Benchmark Limitations
Systematic Distortion in Benchmarks
The reliability of published benchmarks is facing increasing scrutiny. Published benchmarks are unreliable -- providers train on public datasets, and a model showing 5% WER on benchmarks may actually deliver 15--20% WER on challenging production audio. The Fun-ASR technical report itself explicitly warns that public test set results may not reflect true capabilities due to data leakage. The Seed-ASR paper reports AISHELL-1 CER of 0.68%, but the commercial API tests at 1.52%--the API-vs-paper gap is pervasive.
VoiceWriter Independent Testing
VoiceWriter conducted independent evaluations using its own clean, noisy, accented, and specialist audio across four categories. Results show: Google Cloud traditional STT consistently ranked last; Whisper and Gemini tied for first in overall scoring; but on accented and specialist audio, Gemini clearly won--its LLM knowledge base helps decode domain terminology. This finding confirms the structural advantage of LLM-based ASR at the semantic level.
Narrowing Accuracy Competition and True Differentiation Dimensions
Top English APIs have all reached <5% WER on clean audio, with differences becoming negligible. True differentiation occurs on noisy, accented, multi-speaker, and domain-specific audio. The consensus on Reddit and HN communities is: always test on your own data, do not trust published numbers.
LLM-based ASR Trend
LLM transcription models like GPT-4o Transcribe, Gemini, and Mistral Voxtral are reshaping expectations. These models outperform traditional encoder-decoder ASR in contextual understanding, proper nouns, and formatting, but are slower and sometimes more expensive. The long-term trend is clear: LLM architectures will dominate the accuracy frontier, while traditional ASR retreats to the high-throughput/low-cost niche. This is entirely consistent with the evolution path in the Chinese market--LLM-integrated architectures (FireRedASR-LLM 8.3B, Fun-ASR 7.7B, Qwen3-ASR 1.7B) have fully taken over the accuracy leaderboards.
5. ASR Pipeline Component Selection
This chapter addresses pipeline design with traditional or non-all-in-one models. In 2026, integrated models (FireRedASR2 with built-in VAD+LID+Punc, SenseVoice with built-in emotion+events+punctuation, Qwen3-ASR with built-in language ID+punctuation+streaming) are replacing the component-stacking paradigm, but component-based pipelines remain the mainstream approach for low-resource deployments, customization scenarios, and non-integrated models (e.g., Paraformer, WeNet, Whisper).
5.1 VAD -- Voice Activity Detection
VAD (Voice Activity Detection) is the first line of defense in the ASR pipeline: separating effective speech segments from silence, noise, and non-speech events in the audio stream. VAD quality directly impacts downstream ASR accuracy--especially for autoregressive models like Whisper, where VAD preprocessing is the single most effective means of suppressing hallucination.
FSMN-VAD
An industrial-grade Chinese VAD built into the FunASR ecosystem, based on Feedforward Sequential Memory Networks (FSMN), with anti-noise training on industrial Mandarin corpora. Inference is extremely fast in ONNX format, deeply integrated with the FunASR pipeline--filtering invalid speech segments directly reduces downstream CER.
Positioning: The preferred VAD for pure Chinese ASR pipelines. Proven integration cases with both Paraformer and FireRedASR2's FireRedVAD. Apache 2.0 license.
Silero VAD v5
The best VAD for general scenarios, MIT license, model size only approximately 2MB. Records 87.7% TPR at 5% FPR in standard evaluations, integrable with 3 lines of Python, supports cross-platform deployment. Its language-agnostic design makes it more flexible than FSMN-VAD in cross-language and code-switching scenarios.
Positioning: The default VAD choice for non-FunASR pipelines. Both WhisperX and faster-whisper have built-in Silero VAD integration. In Chinese-English code-switching streaming scenarios, Silero's language-agnostic nature and lower switching latency give it a slight edge over FSMN-VAD.
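The advertised few-line integration, roughly as the Silero repository documents it (the file path is a placeholder):

```python
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("audio.wav", sampling_rate=16000)
# Returns speech regions as [{'start': sample_idx, 'end': sample_idx}, ...]
print(get_speech_timestamps(wav, model, sampling_rate=16000))
```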
pyannote VAD
Academic SOTA-level accuracy, based on deep learning, best performance with GPU inference. The core advantage lies in seamless integration with the pyannote.audio speaker diarization pipeline--when a pipeline needs both VAD and diarization, pyannote VAD is the natural choice, avoiding the complexity of cross-framework integration. CC-BY-4.0 license.
Positioning: Offline batch pipelines requiring speaker diarization. Not recommended when only doing VAD without diarization--Silero is more lightweight.
WebRTC VAD
A lightweight VAD released early on by Google, BSD license. Under identical evaluation conditions it achieves only 50% TPR at 5% FPR -- a miss rate roughly four times Silero's. Processing overhead is negligible, but the accuracy no longer meets modern ASR pipeline requirements.
Positioning: Effectively obsolete, with usage justification only in legacy system compatibility scenarios.
TEN VAD
TEN Framework benchmarks show it outperforms Silero on the precision-recall curve with lower computational overhead. MIT license. The core advantage is Web/WASM deployment--the best VAD choice for browser-based real-time ASR scenarios. However, Chinese-specific capabilities lack documentation.
Positioning: Browser/WASM deployment scenarios. Server-side pipelines should still prioritize Silero or FSMN-VAD.
VAD Selection Quick Reference
| VAD Model | Accuracy | Latency | Streaming | Chinese Specialization | License | Best Scenario |
|----------|------|------|:----:|:--------:|:----:|----------|
| FSMN-VAD | Industrial-grade (anti-noise training on industrial Mandarin corpora) | Extremely fast (ONNX) | ✅ | ★★★★★ | Apache 2.0 | Chinese ASR pipeline |
| Silero v5 | 87.7% TPR@5% FPR | <1ms/chunk | ✅ | ★★★ | MIT | General/cross-language/Whisper pipeline |
| pyannote | Academic SOTA | Medium (best with GPU) | ✅ | ★★★ | CC-BY-4.0 | With speaker diarization |
| WebRTC | 50% TPR@5% FPR | Negligible | ✅ | ★★ | BSD | Legacy systems |
| TEN VAD | Better than Silero (precision-recall) | Lower than Silero | ✅ | Undocumented | MIT | Web/WASM deployment |
5.2 Punctuation Restoration and ITN
Punctuation Restoration and Inverse Text Normalization (ITN) transform the raw text output of ASR into readable written text. ITN is responsible for converting colloquial number expressions into written form (e.g., "one hundred twenty three" -> "123"), while punctuation restoration inserts sentence-breaking symbols.
CT-Punc (CT-Transformer)
The best standalone Chinese punctuation restoration model in the FunASR ecosystem. Handles Chinese punctuation and Chinese-English mixed text, supporting both offline and streaming modes. Runs on CPU via ONNX with nearly zero added latency. Validated at scale through Alibaba's industrial-grade speech services.
Positioning: When using models that do not include built-in punctuation output such as Paraformer or WeNet, CT-Punc is the preferred choice for punctuation restoration. However, for 2026 integrated models that natively output punctuation such as SenseVoice, Qwen3-ASR, and FireRedASR2, CT-Punc is no longer necessary in many pipelines.
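A minimal standalone CT-Punc sketch through FunASR (real API; the unpunctuated input string is illustrative):

```python
from funasr import AutoModel

punc = AutoModel(model="ct-punc")
res = punc.generate(input="今天天气怎么样我们下午去爬山吧")
print(res[0]["text"])  # e.g. "今天天气怎么样？我们下午去爬山吧。"
```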
FireRedPunc
A BERT-based punctuation model integrated into FireRedASR2S (v2, February 2026), achieving an F1 score of 78.90%--significantly better than FunASR-Punc's 62.77% (same-source benchmark). Notably, FireRedASR v1 had no punctuation capability at all--this was a major user complaint, with Zhihu reviews stating "good for learning, not for production." FireRedPunc in v2 completely resolved this issue.
Positioning: A built-in component of the FireRedASR2 pipeline. When used as a standalone punctuation model, its F1 has a significant advantage over CT-Punc, but its documentation and ecosystem maturity for standalone deployment are less developed than CT-Punc.
LLM Punctuation Restoration
Using large language models like Qwen-2.5-72B for punctuation restoration achieves the highest F1 on research benchmarks, but inference costs make it infeasible for real-time or high-throughput scenarios.
Pragmatic compromise: Use CT-Punc for real-time processing, and optionally use a smaller LLM (e.g., Qwen-2.5-7B) for refinement during the batch post-processing phase--balancing accuracy and cost.
Native Punctuation Output Trend
An important structural change in 2025--2026 is: the best ASR models are beginning to natively output punctuation, making standalone punctuation models unnecessary in many pipelines.
- SenseVoice: Outputs punctuation + ITN + emotion labels + audio event labels
- Qwen3-ASR: LLM decoding naturally produces punctuation and language identification
- FireRedASR2: Integrates the FireRedPunc component
- Whisper: Outputs punctuation for most languages, but Chinese punctuation quality is inconsistent
ITN Pitfalls
ITN appears simple but hides pitfalls. The FunASR open-source ITN has known bugs, such as incorrectly converting "shi yi dian" (eleven o'clock) to "10 yi dian" (a half-Chinese, half-Arabic mixed format). Qwen3-ASR does not have built-in ITN and relies on community tools. In Chinese-English mixed text, ITN complexity increases further--whether "san billion dollars" should be converted to "3 billion dollars" or "3 billion USD" depends on context, and there is currently no universal solution.
Recommendation: For scenarios demanding high ITN quality, prefer the Doubao-ASR API (most complete production-grade punctuation/ITN/hotword list) or FireRedASR2 (FireRedPunc includes ITN), and avoid relying on FunASR's open-source ITN.
5.3 Speaker Diarization
Speaker diarization answers "who said what and when," and is a hard requirement for multi-speaker scenarios (meetings, interviews, podcasts). The evaluation metric is DER (Diarization Error Rate); lower is better.
pyannote community-1
The latest open-source version on pyannote.audio 4.0, recording 11.7% DER on AISHELL-4 (strict evaluation: no collar forgiveness, evaluating overlap segments)--the lowest among all open-source solutions. Uses WeSpeaker embeddings with significant improvement over the 3.1 baseline. DER on the AMI dataset is 17.0%. CC-BY-4.0 license.
Positioning: The best overall DER choice in offline batch pipelines. Suitable for combination with any ASR model through the workflow of VAD segmentation -> batch ASR -> punctuation restoration -> speaker embedding extraction -> clustering -> merged transcription for end-to-end multi-speaker transcription.
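A minimal pyannote sketch for that workflow (the community-1 checkpoint name is assumed from the release naming and is gated behind a HuggingFace token; the token kwarg has shifted across pyannote versions):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    use_auth_token="HF_TOKEN",  # may be `token=` on newer pyannote releases
)
diarization = pipeline("meeting.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.1f}s -> {turn.end:7.1f}s  {speaker}")
```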
3D-Speaker + CAM++
Alibaba DAMO Academy's speaker recognition solution. The CAM++ speaker embedding model has only 7.2M parameters with an EER of only 0.65% on VoxCeleb1-O--extremely high speaker discrimination capability. 3D-Speaker further supports multimodal (audio+visual+semantic) diarization, specifically trained on Chinese datasets (CN-Celeb). DER on AISHELL-4 is 13.3%.
Positioning: The strongest solution for Chinese-specific speaker diarization. Deeply integrated with the FunASR pipeline, using FSMN-VAD and spectral clustering combination. Apache 2.0 license.
FunASR CAM++ Pipeline
FunASR integrates CAM++ embeddings, FSMN-VAD, and spectral clustering into a unified pipeline providing real-time speaker label output. The highest integration level in Chinese production environments.
Positioning: The preferred choice when real-time Chinese speaker diarization is needed. FunASR Docker deployment provides a WebSocket server supporting streaming speaker labels.
NeMo Sortformer v2
NVIDIA's speaker diarization solution with the core advantage of computational efficiency: 214x real-time processing speed. DER on both AISHELL-4 and AMI is ~13%. Supports streaming diarization.
Limitation: Supports a maximum of only 4 speakers, limiting use in large-scale meeting scenarios. CC-BY-NC-4.0 license (note the non-commercial restriction).
DiariZen
An emerging open-source alternative with no authoritative benchmark data yet. Suitable to track as an exploratory option.
Speaker Diarization Selection Quick Reference
| System | DER (AISHELL-4) | DER (AMI) | Chinese Support | Streaming | License | Best Scenario |
|------|:---------------:|:---------:|:--------:|:----:|:----:|----------|
| pyannote community-1 | 11.7% | 17.0% | ★★★★ | Partial | CC-BY-4.0 | Overall best (offline) |
| 3D-Speaker + CAM++ | 13.3% | -- | ★★★★★ | ❌ | Apache 2.0 | Chinese-optimized |
| FunASR CAM++ pipeline | Competitive | -- | ★★★★★ | ✅ | Apache 2.0 | Chinese production integration |
| NeMo Sortformer v2 | ~13% | ~13% | ★★★ | ✅ | CC-BY-NC-4.0 | Efficient streaming (<=4 speakers) |
| DiariZen | -- | -- | ★★★ | ❌ | Open-source | Exploratory |
5.4 Timestamps and Word-level Alignment
Word-level timestamps are the foundational capability for subtitle generation, audio/video retrieval, and interactive transcription. Timestamp support varies significantly across models.
Timestamp Support by Model
Doubao-ASR (Seed-ASR) provides the most complete timestamp solution: sentence-level timestamps + speaker association with millisecond precision. Production-grade punctuation and ITN are deeply integrated with timestamps. As a closed-source API, users need not handle alignment details.
FireRedASR2-AED outputs word-level timestamps via its CTC branch (v2 only; v1 lacked this capability). Its LLM variant (8B+) does not support timestamps--a common limitation of LLM decoder architectures.
Qwen3-ASR has no built-in timestamp output. Alibaba released a separate Qwen3-ForcedAligner-0.6B model specifically for speech-text forced alignment. This model achieves a measured average alignment precision of 36.5ms--better than WhisperX and Montreal Forced Aligner (MFA). However, it requires additional inference steps and GPU resources. There is no timestamp support at all in streaming mode.
Alibaba Bailian API provides word-level and sentence-level timestamps across all models, but GitHub issues report alignment bugs: character-to-timestamp count mismatches, end-time truncation, and other problems. Alignment quality verification is needed when using it.
FunASR Paraformer-Large provides character-level timestamps (char-level); the CIF (Continuous Integrate-and-Fire) mechanism enables its non-autoregressive architecture to also produce alignment information. The v2 version improved noise robustness after switching to CTC-based alignment.
Whisper natively provides word-level timestamps, but precision is inconsistent on Chinese (timestamps become coarse-grained when outputting Chinese character by character).
SenseVoice does not support timestamp output.
WhisperX: The Mature Solution for English Word-level Alignment
WhisperX is currently the most mature solution for English word-level timestamps. Its core approach is: first use Whisper (or faster-whisper) for transcription, then use wav2vec2 forced phoneme alignment to precisely align the transcribed text to the audio at the word level.
Compared to Whisper's native timestamps, WhisperX's alignment precision is significantly improved: wav2vec2's pre-trained phoneme representations provide sub-word-level acoustic anchors. Combined with Silero VAD segmentation and the condition_on_previous_text=False setting, WhisperX simultaneously solves both timestamp precision and hallucination issues.
On a 3090Ti, WhisperX processes 71 hours of English audio in batched inference in only ~2.5--3.5 hours (<8GB VRAM) while producing word-level timestamps--making it the de facto standard tool for English subtitle generation.
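A sketch of the two-stage flow described above -- batched transcription followed by wav2vec2 forced alignment; the file name and batch size are illustrative:

```python
# Sketch: WhisperX batched transcription + wav2vec2 forced alignment.
import whisperx

device = "cuda"
model = whisperx.load_model("large-v3", device, compute_type="float16")

audio = whisperx.load_audio("podcast_ep01.wav")
result = model.transcribe(audio, batch_size=16)  # VAD-segmented batched inference

# Second stage: align the transcript to audio with a wav2vec2 phoneme model
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for seg in aligned["segments"]:
    for w in seg["words"]:
        # some tokens (e.g., bare numbers) may lack timestamps, hence .get()
        print(f'{w["word"]}\t{w.get("start")}\t{w.get("end")}')
```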
Timestamp Capability Overview
| Model/System | Timestamp Type | Precision | Standalone Alignment Tool | Notes |
|-----------|-----------|------|:-----------:|------|
| Doubao-ASR API | Sentence-level + speaker | Millisecond-level | Not needed | Most complete |
| FireRedASR2-AED | Word-level (CTC) | Good | Not needed | v2 AED only; LLM has no timestamps |
| Qwen3-ASR | No built-in | -- | Qwen3-ForcedAligner (36.5ms) | Requires additional inference step |
| Bailian API | Word-level + sentence-level | Bug reports exist | Not needed | Alignment quality verification needed |
| Paraformer-Large | Character-level (CIF) | Good | Not needed | v2 CTC improved noise robustness |
| Whisper + WhisperX | Word-level (wav2vec2) | High | Built into WhisperX | Most mature English solution |
| SenseVoice | Not supported | -- | -- | No timestamp capability |
6. Streaming Architecture and Deployment
Streaming ASR is a hard requirement for real-time interaction (simultaneous interpretation, voice input, meeting minutes). However, not all models possess true streaming capability. This chapter dissects the streaming architecture of five mainstream solutions one by one and provides quantified hardware requirement references.
6.1 Streaming Solution Comparison
6.1.1 FunASR 2-pass: Most Production-Proven Chinese Streaming Benchmark
FunASR's 2-pass architecture is currently the most production-proven Chinese streaming ASR solution, deployed at scale by Alibaba and across Chinese industry. Its core design is a dual-stage pipeline:
- Stage 1 (Streaming Paraformer): Low-latency initial output, acoustic latency 480–600ms (depending on chunk configuration)
- Stage 2 (Offline Paraformer-Large): Triggers full-sentence correction at sentence boundaries, pulling accuracy back to offline levels
On the deployment side, FunASR's C++ runtime supports WebSocket connections:
| Hardware Configuration | Concurrent Streams |
|----------|----------|
| 4 vCPU / 8GB RAM | 16 streams |
| 64 vCPU / 128GB RAM | 100+ streams |
GPU offline mode RTF is 0.0076, achieving 1200x acceleration with multi-thread speedup. Docker deployment, hot-word customization, WebSocket server, and other production-grade toolchains are all available, making it the safe choice for streaming Chinese ASR today.
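For reference, a sketch of stage-1 streaming recognition via FunASR's documented chunked Python interface (the stage-2 full-sentence correction runs in the C++/WebSocket runtime, not in this loop):

```python
# Sketch: stage-1 streaming Paraformer, following FunASR's chunked interface.
import soundfile
from funasr import AutoModel

model = AutoModel(model="paraformer-zh-streaming")

chunk_size = [0, 10, 5]        # 10 * 60ms = 600ms acoustic chunks
encoder_chunk_look_back = 4    # encoder self-attention history
decoder_chunk_look_back = 1    # decoder cross-attention history

speech, sample_rate = soundfile.read("input_16k.wav")
chunk_stride = chunk_size[1] * 960   # 600ms at 16kHz
cache = {}                           # carries decoder state across chunks

total_chunks = (len(speech) - 1) // chunk_stride + 1
for i in range(total_chunks):
    chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    res = model.generate(
        input=chunk, cache=cache, is_final=(i == total_chunks - 1),
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    print(res[0]["text"], end="", flush=True)
```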
6.1.2 Sherpa-ONNX: Widest Deployment Coverage Cross-Platform Solution
Sherpa-ONNX's core value lies not in accuracy but in deployment coverage. It is the only ASR framework that covers everything from server racks to microcontrollers:
- Language bindings: 12 programming languages (C/C++/Python/Java/C#/Go/Kotlin/Swift/JS/Dart/Rust/Ruby)
- Platform support: Android / iOS / HarmonyOS / Raspberry Pi / RISC-V / Ascend NPU
- Runtime: Pure CPU + ONNX Runtime, no GPU dependency
- Bilingual models: streaming-zipformer-bilingual-zh-en provides true Chinese-English bilingual streaming recognition; streaming-paraformer-trilingual-zh-cantonese-en supports Mandarin-Cantonese-English trilingual
Sherpa-ONNX has expanded to support ONNX inference for mainstream models including FireRedASR2, Cohere Transcribe, FunASR Nano, and Qwen3-ASR. Accuracy is lower than FunASR 2-pass and Qwen3-ASR, but it is the only viable option for edge/embedded scenarios.
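A sketch of streaming bilingual decoding with the sherpa-onnx Python API; the file paths below are placeholders for wherever the downloaded streaming-zipformer-bilingual-zh-en model files live:

```python
# Sketch: CPU-only streaming recognition with sherpa-onnx's transducer API.
import soundfile
import sherpa_onnx

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="zipformer-bilingual-zh-en/tokens.txt",
    encoder="zipformer-bilingual-zh-en/encoder.onnx",
    decoder="zipformer-bilingual-zh-en/decoder.onnx",
    joiner="zipformer-bilingual-zh-en/joiner.onnx",
    num_threads=2,        # pure CPU inference, no GPU dependency
    sample_rate=16000,
    feature_dim=80,
)

stream = recognizer.create_stream()
samples, sr = soundfile.read("mixed_zh_en_16k.wav", dtype="float32")
for i in range(0, len(samples), 3200):          # feed 200ms chunks
    stream.accept_waveform(sr, samples[i:i + 3200])
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    print(recognizer.get_result(stream))
```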
6.1.3 Qwen3-ASR: Highest Accuracy Streaming, but with "Pseudo-Streaming" Issues
Qwen3-ASR-1.7B achieves streaming inference through dynamic FlashAttention, processing audio in 2-second chunks. Key performance metrics:
- TTFT (Time-to-First-Token): 92ms (GPU)
- Throughput: 2000x real-time (concurrency 128)
- Language coverage: 52 languages, including 22 Chinese dialects
However, Qwen3-ASR's streaming implementation has a fundamental limitation: each chunk is processed from scratch with no cross-segment context passing. This causes two practical problems:
- Boundary transcription duplication: Overlapping regions between adjacent chunks produce duplicate text
- Format leakage: Internal formatting markers from the model occasionally appear in the output
Strictly speaking, this is pseudo-streaming (chunked offline) rather than true streaming (incremental decoding). GitHub issue #129 reported a more severe infinite token repetition problem. For production environments requiring stable streaming output, careful evaluation is needed.
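A possible client-side mitigation for the boundary-duplication problem is to trim the longest textual overlap between consecutive chunk outputs -- a heuristic workaround sketched below, not an official fix:

```python
# Workaround sketch for pseudo-streaming boundary duplication: when two
# consecutive chunk transcripts overlap, drop the duplicated prefix of the
# new chunk before appending it.
def merge_chunk_texts(prev: str, new: str, max_overlap: int = 30) -> str:
    """Append `new` to `prev`, trimming the longest suffix of `prev`
    that duplicates a prefix of `new`."""
    limit = min(len(prev), len(new), max_overlap)
    for k in range(limit, 0, -1):
        if prev.endswith(new[:k]):
            return prev + new[k:]
    return prev + new

transcript = ""
for chunk_text in ["今天我们讨论 stream", "stream 模式的边界问题"]:  # example outputs
    transcript = merge_chunk_texts(transcript, chunk_text)
print(transcript)  # -> 今天我们讨论 stream 模式的边界问题
```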
6.1.4 Whisper: No True Streaming Solution
Whisper's Transformer encoder-decoder architecture requires a complete 30-second audio window as input, ruling out true streaming capability by design. The community's best efforts:
| Solution | Latency | Status |
|------|------|------|
| UFAL Whisper-Streaming | ~3.3s (English), higher for other languages | Usable but latency too high |
| WhisperRT (2025) | Improved (causal encoder fine-tuning) | Not widely deployed |
~3.3-second latency is unacceptable for real-time applications. WhisperRT improved latency performance through causal encoder fine-tuning, but as of April 2026 it has not entered mainstream deployment. OpenAI has pivoted to the GPT-4o-transcribe cloud API, effectively abandoning Whisper's streaming evolution.
6.1.5 SenseVoice: No Native Streaming
SenseVoice's non-autoregressive architecture does not support native streaming inference. The current workaround is to achieve simulated streaming through Sherpa-ONNX via VAD segmentation + chunked offline inference, which essentially cuts long audio into segments and processes each offline rather than performing incremental decoding. Latency depends on the VAD segmentation strategy and per-segment inference time.
6.1.6 Streaming Solutions Overview
| Solution | True Streaming | Latency | Concurrency | Code-Switch | Production Maturity |
|------|--------|------|----------|-------------|------------|
| FunASR 2-pass | Yes | 480–600ms | 16–100+ streams | Limited (offline variant supports) | Highest (Alibaba production-grade) |
| Sherpa-ONNX | Yes | Medium | Hardware-dependent | Yes (bilingual models) | High (cross-platform verified) |
| Qwen3-ASR | Pseudo-streaming | 92ms TTFT | 2000x real-time | Yes (52 languages) | Low (148 open issues) |
| Whisper | No | >=3.3s | N/A | Weak | Not applicable |
| SenseVoice | No (simulated) | Depends on segmentation | N/A | Yes (5 languages) | Medium |
6.2 Hardware Requirements Overview
The table below lists GPU VRAM requirements, CPU-only feasibility, and real-time factor (RTF = processing time / audio duration; lower is faster) for each mainstream pipeline.
| Pipeline | GPU VRAM | CPU-Only Feasibility | RTF (GPU) | RTF (CPU) |
|------|----------|-----------------|-----------|-----------|
| Paraformer-Large (FunASR, ONNX) | 1–2 GB | Excellent | 0.008 | 0.05–0.1 |
| SenseVoice-Small (234M) | <1 GB | Yes (including Raspberry Pi) | <<0.01 | <0.1 |
| FireRedASR-AED (1.1B) | 4–6 GB | Slow (usable but impractical) | ~0.02 | >0.3 |
| Qwen3-ASR-1.7B | 6–8 GB | Not feasible | 0.015–0.13 | -- |
| Whisper large-v3 (faster-whisper, FP16) | ~4.5 GB | Barely usable | 0.05–0.15 | 0.3–0.8 |
| Fun-ASR-Nano (0.8B) | 2–4 GB | Barely usable | Near real-time | -- |
| FireRedASR2-LLM (8.3B) | >=32 GB (A100/A800 recommended) | Not feasible | -- | -- |
| Paraformer-v2 (220M, CTC) | -- | Excellent (i7 CPU 92s audio ~9s) | 0.009 | -- |
Key findings:
- 1–2 GB VRAM is sufficient for production-grade Chinese ASR: Paraformer-Large and SenseVoice-Small can run in real-time on consumer GPUs or even CPUs, suitable for resource-constrained deployment.
- 6–8 GB is the entry threshold for LLM-level accuracy: Qwen3-ASR-1.7B and FireRedASR-AED require mid-range GPUs (e.g., T4/RTX 3060), but accuracy is significantly better than lightweight models.
- 32 GB+ is needed for top-tier models: FireRedASR2-LLM (8.3B) and the 7.7B model behind the Fun-ASR API require A100/A800-class GPUs, not suitable for individual developers to self-host.
7. Cloud Service Market and Pricing
Chinese ASR cloud service pricing spans a ~36x range -- from Alibaba Bailian's ¥0.288/hour to AWS Transcribe's ¥10.37/hour. When selecting a service, price, accuracy, and features are all indispensable. This chapter uses a ¥7.2/$1 exchange rate for unified conversion, with data as of April 2026.
7.1 Batch/File Recognition Pricing
The table below covers 16 vendors, sorted by hourly cost in ascending order.
| Vendor | Product | ¥/hour | $/hour | Free Tier |
|------|------|--------|--------|----------|
| Alibaba Bailian | Fun-ASR / Paraformer-v2 | ¥0.288 | $0.04 | 90-day credits |
| Volcengine | Standard Recording (500K package) | ¥0.40 | $0.06 | Trial credits |
| Alibaba ISI | Off-peak (100K package) | ¥0.45 | $0.06 | 3-month trial |
| Baidu | File Transcription (500K package) | ¥0.60 | $0.08 | 2 million free calls (~8,333 hours) |
| Volcengine | Doubao Recording 2.0 (300K package) | ¥0.67 | $0.09 | Trial credits |
| Tencent | Recording File (300K package) | ¥0.70 | $0.10 | 10 hours/month |
| Volcengine | LLM Recording (off-peak) | ¥1.00 | $0.14 | Trial credits |
| AssemblyAI | Universal-2 batch | ¥1.08 | $0.15 | $50 credits |
| Tencent | Recording LLM Edition (300K package) | ¥1.20 | $0.17 | None |
| OpenAI | GPT-4o Mini Transcribe | ¥1.30 | $0.18 | $5 credits |
| Google Cloud | Dynamic Batch V2 | ¥1.30 | $0.18 | 60 min/month + $300 |
| Deepgram | Nova-2 batch | ¥1.86 | $0.26 | $200 credits |
| Baidu | File Transcription (pay-as-you-go) | ¥2.00 | $0.28 | 2 million free calls |
| Huawei | Recording File | ¥2.50 | $0.35 | None |
| Alibaba ISI | Recording File (pay-as-you-go) | ¥2.50 | $0.35 | 3-month trial |
| OpenAI | Whisper-1 / GPT-4o Transcribe | ¥2.59 | $0.36 | $5 credits |
| Azure | Batch STT | ¥2.59 | $0.36 | 5 hours/month |
| iFlytek | Long-text Batch | ¥4.90–9.90 | $0.68–1.38 | 50 hours |
| AWS Transcribe | Standard (Tier 1) | ¥10.37 | $1.44 | 60 min/month x 12 months |
Key takeaway: Baidu's 2 million free calls (equivalent to approximately 8,333 hours of short audio) is the best entry point for testing and small-scale usage. Alibaba Bailian leads in both batch and streaming at ¥0.288/hour -- one-half to one-third the price of most competitors.
7.2 Real-Time/Streaming Pricing
| Vendor | Product | ¥/hour | $/hour |
|------|------|--------|--------|
| Alibaba Bailian | Fun-ASR Realtime / Paraformer-v2 | ¥0.288 | $0.04 |
| Volcengine | Doubao Streaming 2.0 (1K package) | ¥0.90 | $0.13 |
| Tencent | Real-time Standard (1K package) | ¥1.80 | $0.25 |
| Volcengine | Standard Streaming (1K package) | ¥1.80 | $0.25 |
| iFlytek | Real-time (10K package) | ¥2.00 | $0.28 |
| Baidu | Real-time ASR (pay-as-you-go) | ¥3.00 | $0.42 |
| Huawei | Real-time ASR | ¥3.20 | $0.44 |
| Tencent | Real-time LLM Edition (1K package) | ¥4.20 | $0.58 |
| Volcengine | LLM Streaming ASR (pay-as-you-go) | ¥4.50 | $0.63 |
| Azure | Real-time STT | ¥7.20 | $1.00 |
| iFlytek | Real-time Transcription (20h package) | ¥9.90 | $1.38 |
Streaming pricing is generally higher than batch pricing, but Alibaba Bailian is the exception -- batch and streaming are priced identically at ¥0.288/hour. Volcengine Doubao Streaming 2.0 ranks second at ¥0.90/hour, and is the most cost-effective streaming API for Chinese-English code-switching scenarios requiring Seed-ASR's context-aware capabilities.
7.3 Feature Matrix
| Feature | Best Vendors | Notes |
|------|----------|------|
| Code-switching (Chinese-English mixing) | Volcengine Seed-ASR, Alibaba Fun-ASR, iFlytek, Zhipu GLM-ASR | Seed-ASR has context-aware recognition; v2.0 uses Function Call strategy for dedicated optimization |
| Dialect coverage | Tencent (27+ dialects), iFlytek (20+), Volcengine (13+) | Huawei only 3 dialects, significant gap |
| Speaker diarization | Alibaba, iFlytek, Tencent, Azure, Volcengine | Baidu's speaker diarization capability is limited |
| Automatic punctuation | All vendors | Now a universal feature |
| Word-level timestamps | All vendors | Now a universal feature |
| Max file duration | Alibaba Bailian (12 hours), Huawei/Tencent (5 hours) | Alibaba leads by a wide margin |
| Emotion detection | Alibaba Paraformer, Tencent (surcharge) | Niche feature |
| Song/singing recognition | Alibaba Fun-ASR (cloud API) | FireRedASR2 and other open-source models also support this; Bailian API is the only provider among commercial APIs |
| Meeting minutes + summarization | Volcengine (Doubao Voice Notes) | ASR + AI summarization bundled |
| Simultaneous interpretation | Volcengine, Tencent (added in 2026) | Chinese-English bidirectional |
| Private deployment | Volcengine, iFlytek | The only option for data-sensitive scenarios |
7.4 Data Residency and Privacy
Data residency is a hard constraint for Chinese enterprises when choosing ASR services. The compliance posture differs significantly across three vendor categories:
Mainland China local storage (compliant by default):
- Alibaba Cloud / Baidu / Tencent / Volcengine / iFlytek / Huawei
- All store data in mainland China by default, compliant with Chinese data protection regulations
- Huawei explicitly states no retention of processed data
- iFlytek holds MLPS Level 3 security certification
Cross-border compliance (mainland China region available):
- Azure China (operated by 21Vianet) and AWS China (Beijing/Ningxia regions) provide complete mainland data residency
- Suitable for scenarios requiring international vendor branding while data cannot leave the country
No China region (data must leave the country):
- Google Cloud / OpenAI / Deepgram / AssemblyAI
- These four are directly excluded when data must remain within the country
- Unless using their self-hosted/open-source models (e.g., Whisper) for local deployment
7.5 Bailian "Price Killer" Strategy Analysis
Alibaba Bailian's ¥0.288/hour pricing stands out anomalously in the Chinese ASR market, and understanding the logic behind it is necessary:
Pricing strategy: Batch and streaming are priced identically at ¥0.288/hour, 30% lower than the runner-up (Volcengine Standard Recording at ¥0.40), and roughly half the price of Baidu (¥0.60) and Tencent (¥0.70). This is not normal market-competitive pricing but a classic move of strategically low pricing to drive platform adoption.
Sustainability risk: This pricing advantage may not be durable. As Bailian serves as the entry point for Alibaba Cloud's large model platform, ASR pricing functions as customer acquisition cost rather than an independent profitability metric. Once market share targets are reached or platform strategy shifts, there is room for price increases.
Accuracy match: Despite the lowest price, Bailian API's accuracy is not the worst -- the commercial Fun-ASR (7.7B LLM) achieves 7.60% average WER in real-world evaluations, significantly better than open-source Paraformer-Large. The combination of price and accuracy makes it the top choice for usage below 500 hours/month.
7.6 Bailian Fun-ASR ≠ Open-Source Paraformer: Same Brand Name, Unrelated Architecture
This is one of the most misleading findings in this research:
| Dimension | Commercial Fun-ASR (Bailian API) | Open-Source Paraformer-Large (FunASR Toolkit) |
|------|-------------------------|---------------------------------------|
| Parameters | 7.7B | 220M |
| Architecture | LLM architecture (Qwen3 base) | Non-autoregressive (NAR, CIF-based) |
| Training data | Tens of millions of hours | ~60,000 hours |
| AISHELL-1 CER | 1.64% | 1.68% |
| Real-world WER | 7.60% (internal industrial evaluation) | ~11% (estimated) |
| Dialects | 7 major dialects + 26 regional accents | Limited |
| Song recognition | Supported (opencpop CER 3.05%) | Not supported |
| Hot-word customization | RAG-enhanced | Limited (WebSocket has silent-ignore bug) |
A 35x parameter gap and training data differing by orders of magnitude: the two share a brand name but are architecturally unrelated. When someone says "use FunASR," always clarify whether they mean the Paraformer family in the open-source toolkit or the commercial LLM model on the Bailian platform.
The Bailian platform currently offers three ASR tiers:
- Fun-ASR (recommended): 7.7B LLM model, highest accuracy
- Paraformer-v2: Lightweight fast alternative, CTC architecture with improved noise robustness over the original CIF
- SenseVoice: Marked as "scheduled for deprecation," users need to migrate
7.7 Doubao-ASR = Seed-ASR: Product Confirmation and v2.0 New Features
The relationship between Volcengine's Doubao ASR product and ByteDance Research's Seed-ASR model has been definitively confirmed:
- Product naming: The commercial API is named "Doubao-Seed-ASR-2.0"
- Paper statement: The Seed-ASR paper explicitly states this model powers Doubao's commercial speech recognition
- Internal identifier: The API resource ID contains "seedasr"
Key new features of Seed-ASR v2.0 (released December 2025):
- PPO reinforcement learning: Optimizes contextual reasoning during inference through reinforcement learning
- Multimodal visual disambiguation: Uses image information to assist homophone disambiguation (e.g., distinguishing homophones via screenshots)
- Foreign language expansion: Added support for 13 new foreign languages
- Context-aware recognition: Uses conversation history to improve recognition accuracy, which is the core reason behind V2EX user feedback rating Doubao as "subjectively the best"
The accuracy gap between API and paper warrants caution: Seed-ASR paper reports AISHELL-1 CER 0.68%, but the commercial API measures at 1.52% -- a meaningful API-vs-paper gap likely caused by latency/throughput optimizations in the production environment.
7.8 MiniMax ASR Absence
MiniMax ranks #1 globally in TTS (text-to-speech) -- simultaneously topping both the Artificial Analysis Speech Arena and the HuggingFace TTS Arena. But in ASR:
- Official developer documentation (platform.minimaxi.com) does not expose any ASR endpoint; the API reference only lists TTS models (speech-2.8-hd, speech-2.8-turbo, etc.), voice cloning, and voice design
- MiniMax's own GitHub simultaneous interpretation demo uses Whisper as the ASR component, not a self-developed model
- Third-party investigation indicates that the "speech recognition API" advertised on its international site (minimax.io) is essentially a wrapper around Whisper Large
MiniMax's absence in the ASR domain stands in stark contrast to its absolute dominance in the TTS domain. Similarly, Moonshot/Kimi has no commercial ASR API, but its open-source Kimi-Audio-7B achieves 0.60% WER on AISHELL-1 (near SOTA). Among emerging AI companies, the only one formally offering an ASR product is Zhipu AI, whose GLM-ASR / GLM-ASR-2512 cloud API supports streaming, 8 Chinese dialects, and is compatible with the OpenAI SDK interface.
8. Self-Hosted vs. Cloud Economics
The choice between self-hosted and cloud is fundamentally an economics problem -- trading operational costs for compute cost discounts, and accuracy control for deployment convenience. This chapter provides quantified answers with three tables.
8.1 TCO Comparison Table
The table below compares total cost of ownership across five solutions at different monthly volumes. Engineering operations overhead for self-hosted solutions is estimated at $200–500/month (including DevOps, troubleshooting, model updates). FunASR's Docker deployment and mature toolchain push operational burden to the lower end of this range.
| Monthly Volume | Serverless GPU | Dedicated T4 ($219/month) | Bailian API (¥0.288/h) | Volcengine Doubao 2.0 (¥0.75/h) | iFlytek (¥2/h package) |
|--------|----------------|-------------------|----------------------|---------------------------|-------------------|
| 100 hours | ~$1 + ops | $219 (heavily idle) | ~¥29 ($4) | ~¥75 ($10) | ~¥200 ($28) |
| 1,000 hours | ~$8 + ops | $219 | ~¥288 ($40) | ~¥750 ($104) | ~¥2,000 ($278) |
| 10,000 hours | ~$77 + ops | $219 | ~¥2,880 ($400) | ~¥7,500 ($1,042) | ~¥20,000 ($2,778) |
Interpretation:
- 100 hours/month: Bailian API crushes all solutions at $4/month. While Serverless GPU compute cost is only $1, adding operations overhead far exceeds the API. The dedicated T4's $219/month is completely wasted.
- 1,000 hours/month: Bailian API at $40/month remains extremely attractive. Serverless GPU $8 + ops $200–500 starts to match or exceed API cost. The dedicated T4 at $219/month begins to show value (bare compute cost $0.22/hour).
- 10,000 hours/month: Self-hosted solutions dominate. Serverless GPU $77/month + ops $200–500, totaling $277–577/month, is 0.7–1.4x of Bailian but allows choosing higher-accuracy models. The dedicated T4's $219/month fixed cost brings the per-unit cost down to ~$0.02/hour, about half of Bailian's $0.04/hour (the sketch below reproduces this arithmetic).
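A small calculator reproducing this arithmetic, so the crossover can be re-checked for other volumes; prices come from the table above, ops is modeled as a flat $200–500/month floor, and FX is fixed at ¥7.2/$1:

```python
# Sketch: monthly TCO per option, using the report's prices and ops range.
def monthly_costs(hours: float) -> dict:
    fx = 7.2
    return {
        "bailian_api":   hours * 0.288 / fx,    # ¥0.288/h, zero ops
        "doubao_api":    hours * 0.75 / fx,     # ¥0.75/h blended rate
        "dedicated_t4":  219.0,                 # fixed cost, idle or not
        "serverless_lo": hours * 0.0077 + 200,  # ~$77 per 10K h + ops floor
        "serverless_hi": hours * 0.0077 + 500,  # same compute + ops ceiling
    }

for h in (100, 1_000, 10_000):
    costs = monthly_costs(h)
    best = min(costs, key=costs.get)
    print(h, {k: round(v, 1) for k, v in costs.items()}, "->", best)
# 100h and 1,000h resolve to the Bailian API; 10,000h flips to the dedicated T4.
```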
8.2 Local Batch Processing Performance Estimate
Using RTX 3090Ti (24GB VRAM) processing 71 hours of audio as the benchmark, estimated performance for five solutions:
| Solution | Estimated Processing Time | GPU VRAM Usage |
|------|-------------|--------------|
| Vanilla Whisper large-v3 | ~71 hours | 10 GB |
| faster-whisper large-v3 (FP16) | ~14–18 hours | 5 GB |
| faster-whisper large-v3-turbo | ~8–12 hours | 3–5 GB |
| WhisperX batched (large-v3) | ~2.5–3.5 hours | <8 GB |
| FunASR Paraformer-Large | ~6 hours | 1–2 GB |
Electricity cost estimate: 350W power draw x 3–18 hours @ $0.15/kWh = $0.20–$1.00.
Corresponding cloud API costs (processing the same 71 hours of audio):
| Cloud Service | Cost |
|--------|------|
| Alibaba Bailian | ~¥20 ($2.84) |
| AssemblyAI | ~$10.65 |
| Deepgram | ~$18–33 |
| Google Cloud | ~$68 |
| AWS Transcribe | ~$102 |
Local 3090Ti processing of 71 hours of audio costs $0.20–$1.00 in electricity -- only 3–14x cheaper than the Alibaba Bailian API ($2.84), but one to two orders of magnitude cheaper than Deepgram ($18–33) or AWS Transcribe ($102).
Note: While WhisperX has the fastest processing speed (2.5–3.5 hours), Paraformer-Large has lower VRAM usage (1–2 GB vs <8 GB), allowing other tasks to run simultaneously on the same GPU. For pure Chinese content, Paraformer-Large's accuracy is also significantly better than Whisper (CER 1.68% vs 5.14%).
8.3 Cost Crossover Point Analysis
Combining data from 8.1 and 8.2, the cost crossover point between self-hosted and cloud presents a clear tiered structure:
First tier: <500 hours/month -- Cloud API wins
Bailian API cost <¥144 ($20)/month, below the operations overhead floor of any self-hosted solution ($200/month). In this range:
- Using Bailian Fun-ASR API (¥0.288/hour) is the optimal solution
- Zero operations, zero hardware investment, pay-as-you-go
- And the commercial Fun-ASR (7.7B) has better accuracy than self-hosted Paraformer-Large (220M)
Second tier: 500–1,000 hours/month -- Transition zone
- Bailian API cost ¥144–288 ($20–40)/month
- Serverless GPU compute cost $4–8/month + ops $200–500/month
- The decision in this range depends on operations team cost: If there is dedicated DevOps (operations cost is a sunk cost), self-hosted starts to pay off; if operations requires additional hires or consumes developer time, API still wins
Third tier: >1,000 hours/month -- Self-hosted dominates
- Bailian API $40+/month and growing linearly
- Dedicated T4 $219/month fixed cost, processing capacity far exceeds 1,000 hours
- Self-hosted saves 5–18x compute cost
- And allows choosing higher-accuracy open-source models (e.g., FireRedASR-AED 1.1B, CER approximately 30% lower than Paraformer)
Hybrid strategy recommendation:
For medium volume (500–2,000 hours/month), the optimal solution may be hybrid deployment:
- Use Bailian API for streaming and low-latency needs daily (¥0.288/hour)
- Process batch tasks with self-hosted FireRedASR-AED (1.1B, 4–6 GB VRAM), achieving approximately 30% CER improvement
- Supplement Chinese-English code-switching scenarios with Volcengine Doubao 2.0 streaming API (¥0.90/hour), leveraging Seed-ASR's context-aware capabilities
This hybrid strategy covers the optimal solutions across accuracy, latency, and language switching dimensions while keeping costs under control.
9. Typical Failure Modes and System Limitations
ASR system selection cannot rely solely on accuracy leaderboards. Any model that performs well on benchmarks will expose its own unique failure modes once deployed in production. This chapter systematically catalogs Whisper's hallucination crisis and the known defects of four major Chinese ASR systems, providing a practical pitfall guide for production deployment.
9.1 Whisper's "Hallucination" Crisis
9.1.1 Problem Scale
The "Careless Whisper" study (ACM FAccT 2024) conducted the most systematic quantitative analysis of Whisper's hallucination problem to date. Core finding: approximately 1% of audio segments produce entirely fabricated text—not misrecognition of individual words, but generation of complete sentences entirely unrelated to the original audio. More alarmingly, among these hallucinated outputs, 38% contain harmful content, including violence references, racial commentary, and false medical terminology. As a control, the study tested Google, Amazon, AssemblyAI, and Rev.ai on the same audio—none produced comparable hallucinations.
9.1.2 Root Cause Analysis
The root cause of Whisper's hallucinations lies in its autoregressive decoder architecture. Research by Baranski et al. (ICASSP 2025) revealed two key mechanisms:
- Erroneous transcription of non-speech audio: When the input is pure noise, silence, or ambient sound, Whisper large-v3 transcribes 55.2% of non-speech segments as "so" rather than correctly outputting empty text. This indicates that the decoder, lacking valid acoustic signals, tends to generate high-frequency training tokens rather than remain silent.
- YouTube training data artifacts: Whisper was trained on 680,000 hours of internet audio (largely from YouTube), leading to common hallucinations including "thanks for watching", "subtitles by the Amara.org community", and other subtitle template text.
9.1.3 Calm-Whisper: Precise Localization and Targeted Repair
Calm-Whisper (May 2025) provides an elegant diagnostic and repair approach. The researchers found that of the 20 attention heads in Whisper's decoder, only 3 cause more than 75% of hallucinations. Hallucinations are therefore not diffused throughout the model, but highly concentrated in a few attention heads.
Based on this finding, targeted fine-tuning of these 3 heads achieves 80% hallucination reduction with <0.1% WER degradation.
9.1.4 Mitigation Strategy Checklist
Currently validated hallucination mitigation strategies, ranked by effectiveness:
| Strategy | Mechanism | Effectiveness | Implementation Cost |
|----------|-----------|---------------|---------------------|
| VAD Preprocessing (Silero VAD / pyannote) | Strips silence and non-speech segments from input — the single most effective fix | Highest | Very low, 3-line code integration |
| WhisperX Default Config | Sets condition_on_previous_text=False, uses VAD segmentation, cuts off decoding-chain cascading | High | Low, switch inference framework |
| faster-whisper Parameters | hallucination_silence_threshold parameter + Silero VAD integration, automatically filters suspicious outputs | High | Low, parameter tuning |
| Calm-Whisper Fine-tuning | Targeted fine-tuning of 3 hallucination attention heads | High (80% reduction) | Medium, requires fine-tuning pipeline |
| NAR Model Replacement | FunASR Paraformer, NVIDIA Parakeet, and other non-autoregressive models, architecturally immune to this class of hallucination | Fundamental cure | Requires model switch |
Key conclusion: If the use case cannot tolerate any hallucination risk (e.g., medical transcription, legal records), a non-autoregressive (NAR) architecture model should be chosen directly. VAD preprocessing can significantly mitigate but cannot fundamentally cure the inherent problem of autoregressive decoders. For pipelines still using Whisper, never feed raw audio containing long stretches of silence directly into the model.
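The VAD and parameter mitigations from the checklist combine in a single faster-whisper call; a sketch follows, where the threshold value is an illustrative choice and hallucination_silence_threshold only takes effect with word timestamps enabled:

```python
# Sketch: stacking the top mitigation strategies in faster-whisper.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, info = model.transcribe(
    "raw_recording.wav",
    vad_filter=True,                                  # Silero VAD strips non-speech
    vad_parameters={"min_silence_duration_ms": 500},
    condition_on_previous_text=False,                 # break decoding-chain cascades
    word_timestamps=True,                             # required by the threshold below
    hallucination_silence_threshold=2.0,              # skip suspicious long silences
)
for seg in segments:
    print(f"[{seg.start:.2f}-{seg.end:.2f}] {seg.text}")
```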
9.2 System-Specific Bugs of Four Major Systems
9.2.1 FireRedASR2: Engineering Flaws Beneath the Accuracy Ceiling
FireRedASR2 set a new open-source accuracy record on AISHELL-1 with 0.57% CER (AED variant), but there is a significant gap between its engineering maturity and academic accuracy.
60-Second Input Limit and Crash Issues
The AED variant has a hard constraint of 60-second maximum input length. This is not a soft limit where "performance degrades," but causes two severe consequences:
- Beyond 60 seconds: The model begins hallucinating, outputting text unrelated to the audio
- Beyond 200 seconds: Triggers a positional encoding overflow error, crashing outright
v2 introduced the FireRedVAD pipeline for automatic long audio segmentation, but this means users must depend on the accuracy of its VAD module to avoid crashes—in scenarios where VAD segmentation is inaccurate (e.g., continuous long conversations, speech mixed with background music), there remains risk of hitting the limit.
Hardware Resource Requirements
| Variant | GPU VRAM | Disk Space | Recommended Hardware |
|---------|----------|------------|----------------------|
| FireRedASR2-AED (1B+) | 4-6 GB | Moderate | Consumer GPU |
| FireRedASR2-LLM (8B+) | >=32 GB | ~21 GB | A100 / A800 |
The LLM variant's 32GB VRAM threshold excludes most consumer GPUs. Even the RTX 4090 (24GB) cannot run it.
v1 User Reviews and v2 Improvements
A representative early tester review on Zhihu for v1: "Good for learning, not for production." Core issues of v1 included no punctuation output, no VAD integration, and difficulty with long audio processing. v2 significantly improved productionization capability through integration of FireRedVAD + FireRedLID + FireRedPunc (78.90% F1 vs FunASR-Punc's 62.77%), but the 60-second/200-second hard limits remain.
9.2.2 Qwen3-ASR-1.7B: Stability Risks Under Multilingual Coverage
Qwen3-ASR-1.7B, with its coverage of 52 languages + 22 Chinese dialects and Apache 2.0 license, is the most attractive open-source multilingual solution, but its engineering stability issues cannot be ignored.
Endless Token Repetition (GitHub Issue #129)
This is Qwen3-ASR's most severe known bug: the model enters endless token repetition on certain audio inputs, repeatedly generating the same phrase or token thousands of times until reaching maximum sequence length. This is a classic failure mode of autoregressive decoders—the decoder gets trapped in a local probability peak and cannot escape.
This issue has been independently reported by multiple users on GitHub, and as of April 2026, there is no official fix.
Pseudo-Streaming Architecture Boundary Issues
Qwen3-ASR's streaming mode is not true streaming, but pseudo-streaming: each 2-second audio chunk is processed independently from scratch, with no cross-segment context passing. This causes two practical issues:
- Boundary duplicate transcription: Overlapping regions of adjacent audio chunks may produce duplicate text
- Format leakage: Internal control tokens or format markers occasionally appear in final output
Open-Source Community Health
As of April 2026, Qwen3-ASR has approximately 148 open issues on GitHub. This figure reflects two facts: the user base is large (the model is attractive), and bug fixes cannot keep pace with incoming reports.
Transformers Version Compatibility
Using newer versions of the Transformers library (v5.3 / v5.4) causes accuracy degradation. This is an easily overlooked engineering pitfall—developers typically use the latest dependency versions, but Qwen3-ASR requires version pinning.
AISHELL-1 CER Not Reported
Qwen3-ASR's official paper does not report AISHELL-1 CER -- the most standard and widely used benchmark in Chinese ASR. The omission is conspicuous: it may mean the model underperforms competitors on this benchmark, or there may be other undisclosed reasons. Third-party tests put its AISHELL-1 CER at 1.48% -- excellent, but indeed trailing FireRedASR2-AED's 0.57%.
9.2.3 Alibaba Bailian / FunASR: Contrast Between Ecosystem Maturity and Code Quality
Alibaba's ASR ecosystem is the most mature among the four systems—Docker deployment, 2-pass streaming, hot-word support, WebSocket server, all fully featured. But open-source code quality issues and architectural differences from the commercial version create multiple points of confusion in actual use.
FunASR Package Import Performance Issue
The FunASR Python package takes over a minute on first import, caused by an excessively long dependency-loading chain. For use cases that start inference processes frequently (e.g., serverless functions, CI/CD pipelines), this cold-start time is unacceptable.
Fun-ASR LLM Model Hallucination Risk
The commercial Fun-ASR API runs a 7.7B parameter LLM architecture model (based on Qwen3), possessing LLM advantages (context understanding, formatting) but also inheriting LLM's inherent risk—generating text that does not exist in the audio. This contrasts sharply with the open-source Paraformer's non-autoregressive architecture: Paraformer is naturally immune to hallucinations, but its accuracy has fallen significantly behind SOTA.
Hot-Word Parameters Silently Ignored
In FunASR's WebSocket server, hot-word parameters passed via API are silently ignored, producing no error messages. This is one of the most dangerous bug types—developers believe the hot-word feature is working when it has completely no effect, only exposed when production recognition accuracy falls short of expectations.
ONNX Punctuation Model Loading Failure
The CT-Transformer punctuation restoration model in the open-source FunASR toolkit has loading failures in ONNX format. Additionally, the ITN (Inverse Text Normalization) module has known bugs, such as incorrectly converting "shi yi dian" (eleven o'clock) into the half-converted "10 yi dian".
Timestamp Array Mismatch
Although the Alibaba Bailian API supports word-level and sentence-level timestamps, multiple alignment bugs have been reported on GitHub:
- Character count and timestamp count mismatch
- End-time cutoff
These bugs are especially impactful for applications requiring precise timestamps (e.g., subtitle generation, audio-video synchronized editing).
SenseVoice Approaching Deprecation
SenseVoice on the Alibaba Bailian platform has been marked as "approaching deprecation," requiring users to migrate to Fun-ASR or Paraformer-v2. This imposes migration costs on teams that have already built pipelines based on SenseVoice.
9.2.4 Doubao-ASR: Accuracy Gap Within the Closed-Source Walls
Doubao-ASR (ByteDance / Volcengine) received the highest subjective user experience ratings, but its closed-source nature and access barriers limit its applicable scope.
Closed-Source and Access Barriers
Doubao-ASR is the only system among the four that is completely closed-source, API-only, with no self-deployment option. Registering a Volcengine account requires:
- A mainland China phone number
- Real-name authentication
These two requirements constitute rigid barriers for international developers, and also mean data must be sent to ByteDance's servers for processing, unable to meet data residency compliance requirements.
Significant Gap Between API Accuracy and Paper Accuracy
This is Doubao-ASR's most noteworthy issue. There is a meaningful API-vs-paper gap between data reported in the Seed-ASR paper (July 2024) and commercial API data measured in February 2026:
| Benchmark | Seed-ASR Paper | Doubao-ASR API Measured | Gap |
|-----------|---------------|------------------------|-----|
| AISHELL-1 CER | 0.68% | 1.52% | 2.2x |
| LibriSpeech-other WER | 2.84% | 5.69% | 2.0x |
This gap reminds us: paper data does not equal production accuracy; selection must be based on API measurement.
Worst Dialect Performance Among Four Systems
Despite ByteDance possessing massive voice data (from Douyin, Feishu, and other products), Doubao-ASR's dialect recognition performance is surprisingly poor:
- Average dialect CER: 15.39%
- Comparison: FireRedASR2-LLM 11.55%, Qwen3-ASR 11.85%, Fun-ASR 12.76%
15.39% dialect CER is 33% higher than the best system, reaching 45% CER in Shanghainese conversational scenarios. This weakness is a key exclusion factor in use cases requiring dialect user coverage.
9.3 Failure Mode Quick-Reference Matrix
| System | Most Fatal Bug | Trigger Condition | Impact | Workaround |
|--------|---------------|-------------------|--------|------------|
| Whisper | Hallucination: fabricates complete sentences | Silence/non-speech input | ~1% of segments, 38% harmful | VAD preprocessing / switch to NAR model |
| FireRedASR2 | Overtime crash | Input > 200 seconds | Process termination | FireRedVAD auto-segmentation / control input length |
| Qwen3-ASR | Endless token repetition | Certain audio patterns | Output completely unusable | Set max_length / output length detection |
| FunASR | Hot-word silently fails | WebSocket server | Feature not effective but no warning | HTTP API alternative / log verification |
| Doubao-ASR | API vs paper accuracy gap | All inputs | CER doubles | Use API measured data as ground truth |
10. Extended Application -- Vector Retrieval for Bilingual Transcription Knowledge Bases
Transforming ASR-transcribed text into a searchable knowledge base is a key step in realizing the value of speech data. This chapter focuses on embedding model selection, retrieval pitfalls for Chinese-English mixed text, and best practice configurations for a 200-hour transcription scale.
10.1 Embedding Model Evaluation Reversal
In 2025-2026, the embedding model field experienced a profound landscape upheaval: open-source models comprehensively surpassed commercial APIs, and the former leader OpenAI fell to a mid-tier position.
MTEB / MMTEB Leaderboard Current State
| Model | Type | MTEB/MMTEB Score | Dimensions | Max Tokens | Price (/MTok) | Chinese-English Capability |
|-------|------|-----------------|------------|------------|--------------|---------------------------|
| Qwen3-Embedding-8B | Open-source | 70.58 (MMTEB) | 32-7,168 | 32K | Free | Excellent |
| Gemini Embedding 001/2 | Cloud API | 68.32 | 3,072 | 2K-32K | $0.15 | Excellent (R@1 0.997) |
| Voyage voyage-3-large | Cloud API | 66.80 | 2,048 | 32K | $0.18 | Good |
| Cohere embed-v4 | Cloud API | ~65.2 | 1,024 | 128K | $0.12 | Strong cross-lingual |
| Jina v3 | Partially open-source | 65.52 | 1,024 | 8K | $0.018 | Good |
| OpenAI text-embedding-3-large | Cloud API | 64.6 | 3,072 | 8K | $0.13 | Medium |
| Qwen3-Embedding-0.6B | Open-source | 64.33 (MMTEB) | 32-1,024 | 32K | Free | Strong |
| BGE-M3 | Open-source | ~63.0 | 1,024 | 8K | Free | Excellent (hybrid retrieval) |
Key findings:
- Qwen3-Embedding-8B (Alibaba) reached the top of MMTEB multilingual leaderboard at #1 with 70.58, while achieving 75.22 on MTEB English v2
- OpenAI text-embedding-3-large dropped to approximately #9 at 64.6—18 months ago it was the undisputed industry standard
- Gemini Embedding 001/2 ranked first among commercial APIs at 68.32, with near-perfect cross-lingual retrieval capability (R@1 0.997)
- BGE-M3, while not the highest in absolute scores, has a unique advantage in technical text containing proper nouns and precise phrases thanks to its unique hybrid dense + sparse + multi-vector retrieval capability
The underlying logic of this reversal is: vendors such as Alibaba (Qwen3) invested massive amounts of Chinese-English training data and instruction fine-tuning, while OpenAI's embedding model has seen virtually no substantive iteration since its early 2024 release.
10.2 Retrieval Pitfalls for Chinese-English Mixed Text
Bilingual transcription knowledge bases face a fundamental challenge: ASR-transcribed text naturally contains Chinese-English mixed content (code-switching), e.g., "wo men yong transformer zuo de zhe ge API you san ge endpoint" (we used transformer to build this API with three endpoints). If the embedding model cannot understand this mixed text, retrieval quality degrades severely.
CCKM Benchmark: Binary Split
The CCKM benchmark (166 Chinese-English parallel pairs, including hard negatives) reveals a binary split in the embedding model world—models either can handle Chinese-English mixing, or completely cannot, with almost no middle ground:
Top-tier multilingual models (cross-lingual R@1 > 0.93):
| Model | Cross-lingual R@1 |
|-------|-------------------|
| Gemini Embedding 2 | 0.997 |
| Qwen3-Embedding-8B | 0.988 |
| BGE-M3 | 0.940 |
Completely failing English-only models (cross-lingual R@1 near zero):
| Model | Cross-lingual R@1 |
|-------|-------------------|
| Nomic Embed | 0.12 |
| Stella (English-only) | Near zero |
| mxbai-embed-large | 0.16 |
These English-only models cannot process Chinese at all and must be excluded at the selection stage.
Actual Business Impact
Measured data from a Chinese e-commerce platform provides an intuitive reference: using Qwen3 to replace OpenAI embeddings for processing Chinese queries improved search relevance by 34%. This is not a marginal improvement from fine-tuning, but a fundamental difference in the model's depth of understanding of the Chinese semantic space.
Code-Switching Processing Principles
For transcription text containing Chinese-English mixing, the correct approach is to embed as-is, without language separation:
- Correct: Embed "wo men yong transformer zuo de zhe ge API you san ge endpoint" as a single complete chunk
- Incorrect: Split the Chinese and English parts and embed them separately
Modern multilingual embedding models (Qwen3, Gemini, BGE-M3) have already processed large amounts of code-switching data during training and can encode mixed-language content in a unified vector space. Artificial separation instead destroys semantic integrity.
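A sketch of embed-as-is with BGE-M3's hybrid output via the FlagEmbedding interface; the sample sentence is the code-switched example from above, written in its original mixed form:

```python
# Sketch: embedding code-switched chunks whole with BGE-M3 (dense + sparse).
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

chunks = ["我们用 transformer 做的这个 API 有三个 endpoint"]  # embed the mix as-is
out = model.encode(chunks, return_dense=True, return_sparse=True)

dense_vecs = out["dense_vecs"]           # 1024-d vectors for semantic recall
sparse_weights = out["lexical_weights"]  # term weights for exact matching
                                         # ("transformer", "API", "endpoint")
```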
10.3 Best Practice Configuration (200 Hours of Transcription)
Chunking Strategy
Recommended approach: Semantic/topic-based chunking, target 256-512 tokens per chunk, 30-50 token overlap (a code sketch follows the list below).
The core basis comes from a controlled study on clinical decision support:
| Chunking Method | Retrieval Accuracy |
|----------------|-------------------|
| Topic-aligned chunking (semantic/topic-aligned) | 87% |
| Fixed-size chunking | 13% |
The 87% vs 13% gap shows that chunking strategy has far greater impact on final retrieval quality than the choice of embedding model. Video transcriptions have natural topic transition nodes; these nodes should be leveraged for adaptive chunking:
- Each chunk contains 3-8 semantically coherent sentences
- 30-50 token overlap ensures cross-chunk contextual continuity
- Each chunk augmented with metadata: video title, timestamp range, speaker ID, AI-generated micro-header summary
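A sketch of this recipe under two stated simplifications: token counts are approximated heuristically (swap in the embedding model's real tokenizer for production), and sentences are assumed to arrive pre-grouped between topic-transition nodes:

```python
# Sketch: topic-aligned chunking with token-overlap carry-over and metadata.
def approx_tokens(text: str) -> int:
    # Heuristic only: ~1 token per CJK character, ~1.3 per ASCII word.
    cjk = sum(1 for c in text if "\u4e00" <= c <= "\u9fff")
    words = sum(1 for w in text.split() if w.isascii())
    return cjk + int(words * 1.3)

def chunk_transcript(sentences, meta, target=384, overlap=40):
    """Group sentence dicts ({'text', 'start', 'end'}) into ~target-token
    chunks, carrying ~overlap tokens of trailing context into each new chunk."""
    chunks, cur, cur_tok = [], [], 0

    def flush():
        chunks.append({**meta,
                       "text": " ".join(s["text"] for s in cur),
                       "start": cur[0]["start"], "end": cur[-1]["end"]})

    for sent in sentences:
        t = approx_tokens(sent["text"])
        if cur and cur_tok + t > target:
            flush()
            tail, tok = [], 0
            for s in reversed(cur):          # keep ~overlap tokens of context
                tail.insert(0, s)
                tok += approx_tokens(s["text"])
                if tok >= overlap:
                    break
            cur, cur_tok = tail, tok
        cur.append(sent)
        cur_tok += t
    if cur:
        flush()
    return chunks

# chunks = chunk_transcript(asr_sentences, {"video": "ep01", "speaker": "A"})
```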
Dimensions: Full Resolution, No Truncation
At the 200-hour transcription scale (approximately 20K-40K chunks), storage overhead is entirely negligible:
| Dimension Config | Storage Requirement (30K chunks) |
|-----------------|--------------------------------|
| 1,024 dimensions (BGE-M3, Cohere) | ~120 MB |
| 3,072 dimensions (OpenAI, Gemini) | ~360 MB |
| 7,168 dimensions (Qwen3 maximum) | ~840 MB |
Even with the highest dimension configuration, storage is under 1 GB. Matryoshka dimension compression is meaningless at this scale—saving a few dozen MB of storage is not worth sacrificing any retrieval precision. The conclusion is clear: use full dimensions, no truncation.
Cost Estimation
200 hours of dense transcription text amounts to approximately 5 million tokens. The complete embedding cost table:
| Provider | Model | Total Cost |
|----------|-------|------------|
| OpenAI | text-embedding-3-small | $0.10 |
| Voyage AI | voyage-3.5 | $0.30 |
| Cohere | embed-v4 | $0.60 |
| OpenAI | text-embedding-3-large | $0.65 |
| Google | Gemini Embedding | $0.75-$1.00 |
| Self-hosted | Qwen3-Embedding-8B (A100, ~1h) | $1-$3 |
| Self-hosted | BGE-M3 (CPU, ~2h) | $0 |
Cost is a non-factor. Even choosing the most expensive option, a single full embedding of the entire corpus costs no more than $3. Even if re-embedding is needed due to model upgrades or parameter adjustments, the cost is negligible. Selection should be based entirely on retrieval quality.
Two-Stage Retrieve-then-Rerank Architecture
For high-value bilingual transcription knowledge bases, a two-stage retrieval architecture is recommended:
Query --> Embedding Model (recall) --> Top-K candidates --> Reranker (precision) --> Final results
- Stage one (recall): Use Qwen3-Embedding-8B or Gemini Embedding to retrieve Top-20/50 candidate documents, optimizing recall
- Stage two (precision): Use Qwen3-Reranker-8B to re-rank candidate documents, optimizing precision
The advantage of the two-stage architecture is: the embedding model is responsible for rapidly narrowing the range across massive documents (high recall), the reranker is responsible for fine-ranking within a small range (high precision).
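The shape of the pipeline in code: dense cosine recall over the chunk matrix, then a rerank step. The rerank function below is a token-overlap stand-in for a real Qwen3-Reranker-8B (or any cross-encoder) scoring call:

```python
# Sketch: two-stage retrieve-then-rerank over precomputed chunk embeddings.
import numpy as np

def recall(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 50) -> np.ndarray:
    """Stage one: cosine-similarity top-k over the chunk embedding matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))[:k]

def rerank(query: str, candidates: list[str]) -> list[int]:
    """Stage two stand-in: order candidates by token overlap with the query.
    Replace with Qwen3-Reranker-8B (or any cross-encoder) scores in production."""
    q = set(query.lower().split())
    scores = [len(q & set(c.lower().split())) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: -scores[i])

# Usage shape (embed() and chunk_matrix come from the embedding stage):
# top_ids = recall(embed(query), chunk_matrix, k=50)
# order = rerank(query, [chunks[j]["text"] for j in top_ids])
# final = [top_ids[i] for i in order][:5]
```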
10.4 Model Selection Decision Tree
Based on deployment constraints, four paths each have a clear optimal choice:
Path One: Cloud Simple Solution
Recommended: Gemini Embedding 2
- Cross-lingual R@1: 0.997 (near perfect)
- Context window: 32K tokens
- Dimensions: 3,072
- Total cost: < $1
- Rationale: Strongest cross-lingual retrieval capability, requires zero infrastructure investment, 32K context window sufficient for long chunks. The only limitation is dependency on the Google ecosystem.
Path Two: Self-Hosted GPU Best Solution
Recommended: Qwen3-Embedding-8B + Qwen3-Reranker-8B
- MMTEB multilingual ranking: #1 (70.58)
- License: Apache 2.0
- Context window: 32K tokens
- Dimensions: 32-7,168 (Matryoshka adjustable)
- Hardware requirement: ~16 GB VRAM (4-6 GB after quantization)
- Rationale: Highest absolute accuracy, Chinese and English are both Alibaba's core optimization directions, paired with the same-family Reranker for the optimal two-stage retrieval pipeline. Suitable for scenarios with GPU resources pursuing ultimate retrieval quality.
Path Three: Self-Hosted CPU Lightweight Solution
Recommended: BGE-M3
- Parameter count: 568M
- License: MIT
- Deployment: CPU-runnable
- Cross-lingual R@1: 0.940
- Unique advantage: Hybrid dense + sparse + multi-vector retrieval
- Rationale: The only solution in GPU-free environments that combines good Chinese-English retrieval capability with practical deployment cost. Its hybrid retrieval capability is especially valuable for technical terms, proper nouns, and precise phrases in transcription text—sparse retrieval compensates for dense vectors' weakness in low-frequency exact matching.
Path Four: Zero-Chunking Minimal Solution
Recommended: Cohere embed-v4
- Context window: 128K tokens
- Dimensions: 1,024
- Supports Matryoshka + quantization
- Cross-lingual capability: Strong
- Rationale: The 128K context window can embed an entire video transcription as a single document, completely skipping the chunking step. Suitable for rapid prototyping or scenarios with no engineering investment in chunking strategy. The trade-off is coarser retrieval granularity—cannot locate specific paragraphs.
11. Scenario-Based Selection and Implementation Recommendations
This chapter converges all technical analysis into concrete implementation plans for five typical use cases, each providing a recommended pipeline, key parameters, and cost estimates.
Scenario One: Pure Chinese, Ultra-High Accuracy Offline Batch Processing
Typical requirement: Batch transcription of large volumes of Chinese audio/video material, with extreme CER requirements, no real-time requirement, acceptable longer processing time.
Self-hosted first choice -- FireRedASR2-AED + FireRedVAD + FireRedPunc
This is the current open-source accuracy ceiling solution. FireRedASR2-AED achieves 0.57% CER on AISHELL-1, four-benchmark average 3.05%, achieving 67% / 33% CER reduction compared to Paraformer-Large's 1.68% / 4.56%. v2 integrates VAD + language identification + punctuation restoration, an all-in-one pipeline requiring no additional components. Hardware requirement: 4-6GB VRAM (AED variant), 3090Ti more than sufficient. Note the 60-second input length limit, requiring FireRedVAD for fine-grained segmentation.
Cloud first choice -- Alibaba Bailian Fun-ASR API
CNY 0.288/hour, commercial Fun-ASR (7.7B LLM) achieves 7.60% average WER in real-world evaluation, significantly outperforming open-source Paraformer-Large. Zero operations, zero hardware, the optimal solution for <500 hours/month usage. Supports 12-hour long files, hot-word customization, and song recognition.
Resource-constrained alternative -- SenseVoice-Small
234M parameters, <1GB VRAM, can run on Raspberry Pi. AISHELL-1 CER 2.96%—still superior to Whisper large-v3's 5.14%, with model size only 1/6 of the latter. Built-in punctuation, emotion, and audio event detection. Note: Bailian platform has marked it as "approaching deprecation."
Maximum accuracy regardless of cost -- FireRedASR2-LLM (8B+)
Four-benchmark average 2.89%, current absolute SOTA. Requires >=32GB VRAM (A100/A800). Suitable for research institutions or commercial scenarios with extreme accuracy requirements.
Scenario Two: Chinese-English Mixed, Real-Time Streaming Recognition
Typical requirement: Simultaneous interpretation, real-time meeting minutes, voice input, requiring sub-second latency, content includes Chinese-English mixing.
Open-source frontier -- Qwen3-ASR-1.7B + Silero VAD
52-language native streaming, 2-second chunking, TTFT 92ms. Excellent Chinese-English code-switching performance (40K+ English keyword synthetic training data). 6-8GB VRAM. Note pseudo-streaming issues (boundary repetition) and infinite repetition bug (#129); thorough testing required before production deployment.
Production-grade stability -- FunASR 2-pass + FSMN-VAD + CT-Punc
Alibaba production-validated, 480-600ms acoustic latency, 4vCPU/8GB supports 16-way concurrency. Most mature ecosystem: Docker deployment, WebSocket server, hot-word customization. Code-switching support is limited to the offline variant; pure streaming Chinese-English capability is inferior to Qwen3-ASR.
Cloud first choice -- Volcengine Doubao 2.0 Streaming
Seed-ASR context-aware technology, CNY 0.90/hour. Best subjective experience among four systems (V2EX user long-term feedback). v2.0 PPO reinforcement learning + multimodal disambiguation. Note: requires mainland China phone number for registration.
Edge/Embedded -- Sherpa-ONNX + streaming-zipformer-bilingual-zh-en
CPU-only, 12 language bindings, full platform support for Android/iOS/RPi. Accuracy lower than the above solutions but deployment flexibility is irreplaceable.
Scenario Three: Low-Cost Bilingual Podcast Subtitle Generation (CLI Batch Processing)
Typical requirement: Individual developer or small team, batch processing Chinese-English bilingual podcasts to generate SRT subtitles, local 3090Ti deployment, extreme cost efficiency.
Dual-engine architecture:
- Chinese pipeline: FunASR Paraformer-Large (paraformer-zh + fsmn-vad + ct-punc), single command handles VAD + transcription + punctuation + timestamps. 1-2GB VRAM, 12x real-time, 71 hours of audio takes approximately 6 hours to process
- English pipeline: WhisperX (faster-whisper large-v3-turbo + Silero VAD + wav2vec2 alignment), batch inference 60-70x real-time, word-level timestamps far superior to native Whisper. 3-5GB VRAM, 71 hours of audio takes approximately 2-4 hours to process
Cost: 350W x 3-18 hours @ $0.15/kWh = $0.20-$1.00 in electricity. Equivalent cloud API: AssemblyAI $10.65, Deepgram $18-33, AWS $102.
Scenario Four: Pure English "Out-of-the-Box" Production-Grade Solution
Typical requirement: Overseas business, English content platform, need a single API call to solve the entire transcription workflow.
Accuracy king -- ElevenLabs Scribe v2
2.3% WER, $0.40/hour. 32-speaker diarization, word-level timestamps, audio event annotation (laughter/applause). Suitable for scenarios with extreme accuracy requirements.
All-in-one -- AssemblyAI Universal-3 Pro or Gladia Solaria-1
AssemblyAI 3.3% WER / $0.21/hour, Speech Understanding API includes transcription + summarization + topic + sentiment, but add-on stacking can reach 2-5x base price. Gladia 4.2% WER / $0.50/hour flat rate (no add-ons), includes semantic chaptering + NER + translation + 99-language code-switching—the most feature-complete single API.
Ultimate price-performance -- Deepgram Nova-3
5.4% WER / $0.26/hour, 130-473x real-time. Word timestamps + diarization + punctuation + PII redaction all included, no add-ons. The winner on both speed and cost.
Scenario Five: Ultra-Large Volume Minimal Cost Reduction
Typical requirement: Massive audio processing, most cost-sensitive, acceptable non-SOTA accuracy.
- On-device CPU -- SenseVoice-Small: 234M parameters, direct CPU processing, eliminates pipeline dependencies. Built-in punctuation + emotion + events, AISHELL-1 CER 2.96%
- CPU-only deployment -- Paraformer-v2 (220M): RTF 0.05-0.1 (CPU), i7 processes 92-second audio in about 9 seconds. 1GB VRAM
- Maximum throughput -- Qwen3-ASR-0.6B: 2000x real-time (concurrency 128), suitable for large-scale concurrent processing
Embedding Selection Quick Reference
| Requirement | Recommended Solution | Rationale |
|-------------|---------------------|-----------|
| Bilingual embedding (self-hosted GPU) | Qwen3-Embedding-8B + Qwen3-Reranker-8B | MMTEB #1, Apache 2.0, Chinese-English R@1 0.988 |
| Bilingual embedding (cloud) | Gemini Embedding 2 | Cross-lingual R@1 0.997, 32K context, <$1 total cost |
| Bilingual embedding (CPU self-hosted) | BGE-M3 | 568M params, CPU-runnable, MIT license, hybrid retrieval helps with proper nouns |
| Zero-chunking long text | Cohere embed-v4 | 128K context, skips the chunking step |
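As a retrieval sketch for the self-hosted option, the snippet below uses sentence-transformers with a Qwen3-Embedding checkpoint (the 0.6B variant, purely so it runs on modest hardware; the 8B model recommended above is a drop-in replacement). The query-side prompt_name follows the Qwen3-Embedding model card and should be treated as an assumption to verify.

```python
from sentence_transformers import SentenceTransformer

# Smaller checkpoint for illustration; swap in "Qwen/Qwen3-Embedding-8B"
# for the MMTEB-topping quality cited in the table above.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

queries = ["嘉宾对模型量化怎么看?"]  # cross-lingual query against English chunks
chunks = [
    "...the guest argued that 4-bit quantization is basically free...",
    "...this chunk is just the sponsorship read...",
]

# Qwen3-Embedding applies an instruction prompt on the query side only.
q_emb = model.encode(queries, prompt_name="query")
c_emb = model.encode(chunks)

# Cosine similarities; in the two-stage setup, take top-k here and rerank.
print(model.similarity(q_emb, c_emb))
```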
12. Conclusion
Five Core Insights
1. Whisper's dominance depends on language. In English scenarios, Whisper large-v3-turbo (4.8% WER) remains a reasonable local solution with an unmatched ecosystem. But in Chinese scenarios, Whisper's CER is 2-10x that of purpose-built models (AISHELL-1: 5.14% vs FireRedASR2-AED's 0.57%), and its 18.87% on WenetSpeech Meeting versus FireRedASR2-LLM's 4.32% is a generational gap. Chinese ASR should no longer use Whisper as the baseline.
2. Hallucination is not a footnote but an architectural defect. The ACM FAccT 2024 study confirmed that ~1% of audio segments are transcribed by Whisper into entirely fabricated text, with 38% of those fabrications containing harmful content. This is not an isolated bug but an inherent problem of autoregressive decoders -- Whisper large-v3 transcribes 55.2% of non-speech segments as "so". Non-autoregressive models (Paraformer, Parakeet) are architecturally immune to this failure mode. Scenarios with hard transcription-accuracy requirements (medical, legal, subtitle publishing) should prioritize NAR architectures.
3. Local inference economics crush cloud for batch tasks. A 3090Ti processing 71 hours of podcast audio costs $0.20-$1.00 in electricity, while the equivalent AWS Transcribe job costs $102 -- a gap exceeding 100x. Even against the cheapest cloud API (Bailian at CNY 0.288/h = $2.84 for 71h), the local solution keeps a 3-14x cost advantage. Above 1000 hours/month, self-hosting saves 5-18x in compute costs.
4. The "API vs paper" gap is a hidden risk in selection. The Seed-ASR paper reports AISHELL-1 CER 0.68%, while the commercial API measures 1.52%. The Bailian Fun-ASR API (7.7B) and open-source Paraformer (220M) share a brand name but are architecturally unrelated (35x parameter gap). Paper numbers cannot be equated with production performance.
5. Open-source embeddings have surpassed commercial APIs. Qwen3-Embedding-8B tops the MMTEB multilingual leaderboard at 70.58, while OpenAI text-embedding-3-large has slipped to ~#9 at 64.6. Embedding 200 hours of transcription costs <$1 in total, so model quality is effectively the sole decision factor. English-only models (Chinese-English R@1: Nomic 0.12, mxbai 0.16) cannot handle Chinese at all and must be excluded at the selection stage.
Action Recommendations
- <500 hours/month: Use the Alibaba Bailian Fun-ASR API (CNY 0.288/h); zero ops burden, pay-as-you-go, accuracy superior to self-hosted Paraformer
- >1000 hours/month: Self-host FireRedASR-AED (1.1B, 4-6GB VRAM), CER reduced ~67% compared to Paraformer
- Chinese-English mixed streaming: Supplement with Volcengine Doubao 2.0 Streaming (CNY 0.90/h), which leverages Seed-ASR's context-aware capability
- English batch local: WhisperX (faster-whisper turbo + wav2vec2 alignment), most mature ecosystem, 3090Ti processes 71h in only 2.5-3.5h
- Bilingual knowledge base: Qwen3-Embedding-8B self-hosted or Gemini Embedding 2 cloud, semantic chunking 256-512 tokens, two-stage retrieve-then-rerank
Appendices
A. SRT Generation CLI Tools
| Tool | Tech Stack | Core Features |
|------|------------|---------------|
| whisply | faster-whisper + WhisperX | Cross-platform CLI, --export srt, subtitle length control |
| VideoCaptioner | Whisper + LLM sentence splitting | Closest open-source quality to CapCut, LLM-driven intelligent sentence splitting |
| FunClip (Alibaba) | Paraformer | Automatic SRT generation + hot-word support |
| AsrTools (2.6k stars) | CapCut/Bilibili/Kuaishou reverse-engineered API | No GPU needed, cloud CapCut-level quality, legal gray area |
| Subtitle Edit | 7 Whisper engines | GUI/CLI, recommended faster-whisper-XXL + large-v3-turbo |
B. Complete Benchmark Data Table (all figures CER %; source: FireRedASR2S arXiv:2603.10420)
| Model | Params | AISHELL-1 | AISHELL-2 | WenetSpeech Net | WenetSpeech Meeting | Four-Benchmark Average | Dialect 19-Group Average | Song (opencpop) |
|-------|--------|-----------|-----------|-----------------|---------------------|------------------------|--------------------------|-----------------|
| FireRedASR2-LLM | 8B+ | 0.64 | 2.15 | 4.44 | 4.32 | 2.89 | 11.55 | 1.12 |
| FireRedASR2-AED | 1B+ | 0.57 | 2.51 | 4.57 | 4.53 | 3.05 | 11.67 | 1.17 |
| Doubao-ASR API | Closed-source | 1.52 | 2.77 | 5.73 | 4.74 | 3.69 | 15.39 | 4.36 |
| Qwen3-ASR-1.7B | 1.7B | 1.48 | 2.71 | 4.97 | 5.88 | 3.76 | 11.85 | 2.57 |
| Fun-ASR API | 7.7B | 1.64 | 2.38 | 6.85 | 5.78 | 4.16 | 12.76 | 3.05 |
| Fun-ASR-Nano | 0.8B | 1.96 | 3.02 | 6.93 | 6.29 | 4.55 | 15.07 | 2.95 |
| Paraformer-Large | 0.22B | 1.68 | 2.85 | 6.74 | 6.97 | ~4.56 | — | — |
| SenseVoice-Small | 0.23B | 2.96 | 3.80 | 7.84 | 7.44 | — | — | — |
| SenseVoice-Large | 1.6B | 2.09 | 3.04 | 6.01 | 6.73 | — | — | — |
| Whisper large-v3 | 1.55B | 5.14 | 4.96 | 10.48 | 18.87 | ~9.86 | — | — |
C. References
- FireRedASR2S Technical Report. arXiv:2603.10420, March 2026.
- Seed-ASR: Understanding Diverse Speech and Contexts. arXiv:2407.04675, July 2024.
- Fun-ASR Technical Report. Alibaba Tongyi Lab, September 2025.
- Koenecke, A. et al. "Careless Whisper: Speech-to-Text Hallucination Harms." ACM FAccT 2024.
- Calm-Whisper: Targeted Decoder Head Fine-tuning for Hallucination Reduction. May 2025.
- Baranski, K. et al. "Hallucination Patterns in Whisper." ICASSP 2025.
- Artificial Analysis. AA-WER v2.0 Benchmark (41 models, 3 datasets).
- MTEB / MMTEB Embedding Leaderboard. HuggingFace, 2026.
- CCKM Cross-lingual Embedding Benchmark (166 Chinese-English parallel pairs).
- VoiceWriter Independent ASR Comparison (clean/noisy/accented/specialist audio).