Benchmarking Nonverbal Vocalization Synthesis
in Expressive Text-to-Speech Models
While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVVs), their evaluation lacks standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that treats NVVs as communicative acts rather than acoustic artifacts. NV-Bench comprises 1,651 multilingual, in-the-wild utterances with paired human reference audio, balanced across 14 categories. We introduce a dual-dimensional evaluation protocol: (1) Instruction Alignment, which uses our proposed Paralinguistic Character Error Rate (PCER) to assess controllability; and (2) Acoustic Fidelity, which quantifies the distributional gap between synthesized and real speech. We evaluate diverse TTS models and develop two reference baselines. Experimental results demonstrate a strong correlation between our objective metrics and human perception, establishing NV-Bench as a standardized evaluation framework.
Figure 1. Overview of the NV-Bench pipeline — from raw data collection through multi-lingual NVASR transcription to dual-dimensional evaluation.
Listen and compare: each example provides a prompt audio clip (reference speaker), the target text with NVV tags, the ground-truth recording, and synthesized outputs from five TTS systems.
NVVs are organized into three functional levels based on communicative intent.
Biological reflexes grounding speech in physical realism.
Valenced vocalizations conveying emotion or instant reactions.
Interaction-management cues — filled pauses and prosodic particles.
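As a sketch, this three-level taxonomy can be encoded as a simple lookup table. The level and category names below follow the Mandarin taxonomy table on this page; the mapping structure and function name are our illustration, not the paper's code.

```python
# Illustrative encoding of the three-level NVV taxonomy.
# Category names are taken from the Mandarin taxonomy table on this page;
# the dict layout itself is an assumption for illustration.
NVV_TAXONOMY = {
    "vegetative_sounds": ["Breathing", "Cough", "Sigh"],
    "affect_bursts": ["Laughter", "Surprise-ah", "Surprise-oh", "Dissatisfaction-hnn"],
    "conversational_grunts": [
        "Uhm", "Confirmation-en", "Question-ei",
        "Question-ah", "Question-en", "Question-oh",
    ],
}

def level_of(category: str) -> str:
    """Return the functional level a given NVV category belongs to."""
    for level, categories in NVV_TAXONOMY.items():
        if category in categories:
            return level
    raise KeyError(f"unknown NVV category: {category}")
```

For example, `level_of("Cough")` resolves to the vegetative (biological-reflex) level, while `level_of("Uhm")` resolves to the interaction-management level.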
565K clips (~1,560 hrs) filtered via Emilia-Pipeline & MiMo-Audio for single-speaker verification.
SenseVoice-Small fine-tuned on 6 datasets with unified label taxonomy.
10 annotators, Cohen's κ > 0.85 → 1,651 prompt-GT pairs (7.9 hrs).
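Inter-annotator agreement is reported as Cohen's κ. For two annotators' category labels it can be computed as follows; this is a minimal stdlib sketch (function name ours), not the paper's annotation tooling.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' equal-length label sequences (sketch)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's label marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A κ above 0.85, as reported for the 10 annotators here, indicates near-perfect agreement under the usual interpretation scales.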
Can the model generate the specified NVV events at the correct positions?
How realistic is the synthesized speech compared to real recordings?
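Instruction Alignment is scored with PCER. As a minimal sketch, it can be read as an error rate over the NVV tag sequence: extract the tags from reference and hypothesis transcripts, then compute an edit distance over them. The bracketed `[Tag]` markup, function names, and normalization by reference tag count are our illustrative assumptions, not necessarily the paper's exact definition.

```python
import re

def nvv_tags(text: str) -> list[str]:
    """Extract bracketed NVV tags, e.g. '[Laughter]', in order of appearance.
    The [Tag] markup is an assumed format for illustration."""
    return re.findall(r"\[([^\]]+)\]", text)

def edit_distance(ref, hyp):
    """Standard Levenshtein distance between two sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def pcer(ref_text: str, hyp_text: str) -> float:
    """Paralinguistic error rate over NVV tag sequences, normalized by the
    number of reference tags (sketch, not the paper's exact formula)."""
    ref, hyp = nvv_tags(ref_text), nvv_tags(hyp_text)
    return edit_distance(ref, hyp) / max(len(ref), 1)
```

Under this reading, a synthesized utterance that drops one of two requested NVV events scores a PCER of 50%, regardless of how the lexical content was rendered.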
Strictly balanced — exactly one NVV event per utterance (50 samples per category). Isolates fundamental generation capabilities.
Challenging utterances with two or more NVV events, testing robustness under dense paralinguistic conditions; categories are kept approximately balanced.
| Dataset | Language | Test Set | Category Balance | Prompt Audio |
|---|---|---|---|---|
| SynParaSpeech | ZH | ✗ | — | — |
| NVS | ZH / EN | ✗ | — | — |
| Emilia-NV | ZH | ✗ | — | — |
| NVTTS | EN | ✓ | ✗ | ✗ |
| SMIIP-NV | ZH | ✓ | ✗ | ✗ |
| NV-Bench (Ours) | ZH / EN | ✓ | ✓ | ✓ |
| Language | Vegetative Sounds | Affect Bursts | Conversational Grunts |
|---|---|---|---|
| Mandarin | Breathing, Cough, Sigh | Laughter, Surprise-ah, Surprise-oh, Dissatisfaction-hnn | Uhm, Confirmation-en, Question-ei, Question-ah, Question-en, Question-oh |
| English | Breathing, Cough, Sigh | Laughter, Surprise-oh | Uhm, Question-huh |
NV-CV3 achieves the lowest PCER (27.69%) on the single-label Mandarin subset.
NV-FlexiVoice achieves the lowest FAD (0.29) and FD (2.72), closest to real distribution.
IMOS shows significant correlation with PCER (ρ = −0.62, p < 0.001), confirming reliability.
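The reported ρ is a Spearman rank correlation between IMOS and PCER scores. A stdlib-only sketch, assuming no tied values (ties would require midrank averaging):

```python
from statistics import mean

def spearman_rho(x, y):
    """Spearman rank correlation for equal-length sequences without ties
    (sketch; tied values would need midrank handling)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

A negative ρ is expected here by construction: higher perceived instruction accuracy (IMOS) should coincide with lower paralinguistic error (PCER).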
| System | FAD ↓ | FD ↓ | KL ↓ |
|---|---|---|---|
| GT (Human) | — | — | — |
| Orpheus-TTS | 5.71 | 24.49 | 1.08 |
| SMIIP-NV-CV2 | 1.32 | 6.71 | 0.59 |
| Emilia-NV-CV2 | 1.08 | 5.57 | 0.44 |
| CosyVoice3 | 0.90 | 9.46 | 0.43 |
| NV-FlexiVoice | 0.29 | 2.72 | 0.76 |
| NV-CV3 | 0.86 | 3.94 | 0.39 |
Bold = best, underline = second-best. ↓ lower is better. SIM/DNSMOS are reported on the single-label Mandarin subset in the detailed tables below.
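FAD (Fréchet Audio Distance) and FD are Fréchet distances between Gaussians fitted to embeddings of real and synthesized audio. Below is a stdlib-only sketch under the simplifying assumption of diagonal covariances; the full metric uses full covariance matrices, computing ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^{1/2}), and a pretrained audio embedding model.

```python
from statistics import mean, pvariance

def frechet_distance_diag(real_embs, fake_embs):
    """Frechet distance between two Gaussians fitted to embedding sets,
    assuming DIAGONAL covariances (a simplifying assumption for this sketch;
    the full FAD/FD uses the matrix form with a matrix square root).
    real_embs/fake_embs: lists of equal-length embedding vectors."""
    dims = len(real_embs[0])
    dist = 0.0
    for d in range(dims):
        r = [e[d] for e in real_embs]
        f = [e[d] for e in fake_embs]
        mu_r, mu_f = mean(r), mean(f)
        s_r, s_f = pvariance(r), pvariance(f)
        # Per-dimension term of ||mu1-mu2||^2 + s1 + s2 - 2*sqrt(s1*s2).
        dist += (mu_r - mu_f) ** 2 + s_r + s_f - 2 * (s_r * s_f) ** 0.5
    return dist
```

Identical embedding distributions yield a distance of zero; the score grows as the synthesized embeddings drift from the real ones in mean or spread.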
10 annotators rated 50 utterances per model on a 5-point scale for two dimensions.
| System | IMOS ↑ (Instruction Accuracy) | NMOS ↑ (Naturalness) |
|---|---|---|
| GT (Human) | 4.32 | 4.34 |
| Orpheus-TTS | 3.41 | 3.57 |
| SMIIP-NV-CV2 | 3.45 | 3.44 |
| Emilia-NV-CV2 | 3.84 | 4.04 |
| CosyVoice3 | 3.53 | 3.96 |
| NV-FlexiVoice | 3.91 | 4.04 |
| NV-CV3 | 3.93 | 4.11 |
Bold = best, underline = second-best. ↑ higher is better.
Results on the Mandarin subsets. Columns 2–6: single-label subset; columns 7–11: multi-label subset. Alignment metrics: CER%, PCER%, OCER% (↓); fidelity metrics: SIM, DNSMOS (↑).

| System | CER%↓ | PCER%↓ | OCER%↓ | SIM↑ | DNSMOS↑ | CER%↓ | PCER%↓ | OCER%↓ | SIM↑ | DNSMOS↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| GT | 3.85 | 9.54 | 4.06 | 0.786 | 3.12 | 4.11 | 23.51 | 5.37 | 0.796 | 3.12 |
| Orpheus-TTS | 11.36 | 88.77 | 13.91 | — | 3.43 | 19.83 | 84.85 | 24.38 | — | 3.40 |
| SMIIP-NV-CV2 | 8.80 | 75.64 | 11.34 | 0.719 | 3.22 | 10.66 | 77.20 | 14.79 | 0.715 | 3.07 |
| Emilia-NV-CV2 | 5.05 | 40.00 | 6.64 | 0.740 | 3.21 | 5.54 | 48.74 | 8.09 | 0.746 | 3.24 |
| CosyVoice3 | 3.85 | 57.69 | 5.86 | 0.764 | 3.30 | 4.75 | 61.94 | 8.26 | 0.715 | 3.31 |
| NV-FlexiVoice | 6.98 | 31.08 | 8.15 | 0.748 | 3.22 | 8.20 | 39.37 | 10.39 | 0.750 | 3.07 |
| NV-CV3 | 3.80 | 27.69 | 4.90 | 0.768 | 3.29 | 3.44 | 30.04 | 4.84 | 0.776 | 3.29 |
Bold = best, underline = second-best.
Results on the English subsets. Columns 2–6: single-label subset; columns 7–11: multi-label subset. Alignment metrics: CER%, PCER%, OCER% (↓); fidelity metrics: SIM, DNSMOS (↑).

| System | CER%↓ | PCER%↓ | OCER%↓ | SIM↑ | DNSMOS↑ | CER%↓ | PCER%↓ | OCER%↓ | SIM↑ | DNSMOS↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| GT | 6.35 | 7.74 | 6.67 | 0.781 | 3.14 | 7.45 | 21.03 | 8.81 | 0.773 | 3.14 |
| Orpheus-TTS | 9.94 | 80.24 | 11.57 | — | 3.34 | 8.68 | 71.46 | 11.89 | — | 3.34 |
| SMIIP-NV-CV2 | 20.70 | 52.02 | 22.55 | 0.584 | 3.00 | 20.93 | 54.49 | 23.87 | 0.580 | 2.97 |
| Emilia-NV-CV2 | 14.06 | 58.74 | 16.06 | 0.644 | 3.21 | 11.71 | 60.28 | 14.63 | 0.655 | 3.26 |
| CosyVoice3 | 8.12 | 64.18 | 11.29 | 0.707 | 3.27 | 6.39 | 57.84 | 10.69 | 0.715 | 3.31 |
| NV-FlexiVoice | 12.94 | 57.88 | 15.61 | 0.693 | 3.22 | 9.60 | 51.32 | 13.76 | 0.708 | 3.07 |
| NV-CV3 | 8.11 | 51.58 | 10.38 | 0.702 | 3.25 | 6.70 | 47.13 | 10.10 | 0.721 | 3.30 |
Bold = best, underline = second-best.
Our NVASR model maintains high-quality general ASR while significantly outperforming baselines on NVV-specific tasks.
| Dataset | SenseVoice | Qwen2.5-Omni | NVASR (Ours) |
|---|---|---|---|
| WenetSpeech test-net | 5.77 | 20.14 | 5.55 |
| LibriSpeech test-other | 12.79 | 23.35 | 9.90 |
| SMIIP-NV | 3.12 | 3.59 (4.17) | 1.29 (1.36) |
| NVTTS | 14.45 | 21.69 (26.95) | 13.52 (16.10) |
Values in parentheses denote OCER. All other values are CER (%).