NV-Bench

Benchmarking Nonverbal Vocalization Synthesis
in Expressive Text-to-Speech Models


Abstract

While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVVs), their evaluation lacks standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that treats NVVs as communicative acts rather than acoustic artifacts. NV-Bench comprises 1,651 multilingual, in-the-wild utterances with paired human reference audio, balanced across 14 categories. We introduce a dual-dimensional evaluation protocol: (1) Instruction Alignment, which uses our proposed Paralinguistic Character Error Rate (PCER) to assess controllability, and (2) Acoustic Fidelity, which quantifies the distributional gap between synthesized and real speech. We evaluate diverse TTS models and develop two reference baselines. Experimental results demonstrate a strong correlation between our objective metrics and human perception, establishing NV-Bench as a standardized evaluation framework.


Overview


Figure 1. Overview of the NV-Bench pipeline — from raw data collection through multi-lingual NVASR transcription to dual-dimensional evaluation.


Audio Examples

Listen and compare: each example provides prompt audio (reference speaker), the target text with NVV tags, the ground-truth recording, and synthesized outputs from six TTS systems.

Single-Label Example Mandarin
Ground Truth
Target Text

因为刚刚一直待着不动,所以我现在就想动一动,[Laughter]
(Translation: Since I've been sitting still this whole time, I just want to move around a bit now, [Laughter])

Prompt Audio
Synthesized Outputs
NV-CV3
NV-FlexiVoice
CosyVoice3
Emilia-NV-CV2
SMIIP-NV-CV2
Orpheus TTS (Single-Speaker Model)
Multi-Label Example Mandarin
Ground Truth
Target Text

提灯人,别说我没有警告过你。[Dissatisfaction-hnn],要说最害人的就是安逸。,惊得我真是恨不得从悬崖上跳下去。手下别留情?反正我是不会的。美女与野兽,都是在说我?[Uhm],为什么从来没人叫我重装子弹?,一塌糊涂,但效果很好。
(Translation: Lantern Bearer, don't say I never warned you. [Dissatisfaction-hnn], if anything does the most harm, it's complacency. It startled me so badly I wanted to jump off the cliff. Show no mercy? I certainly won't. Beauty and the Beast, both describing me? [Uhm], why has no one ever told me to reload? A complete mess, but it works well.)

Prompt Audio
Synthesized Outputs
NV-CV3
NV-FlexiVoice
CosyVoice3
Emilia-NV-CV2
SMIIP-NV-CV2
Orpheus TTS (Single-Speaker Model)
Single-Label Example English
Ground Truth
Target Text

Yes, you are definitely here for something other than that which falls within my regular line of work, aren't you?

Prompt Audio
Synthesized Outputs
NV-CV3
NV-FlexiVoice
CosyVoice3
Emilia-NV-CV2
SMIIP-NV-CV2
Orpheus TTS (Single-Speaker Model)
Multi-Label Example English
Ground Truth
Target Text

It's been a few minutes, Gary. Head up to h. I'm already up. [Laughter] [Question-huh]

Prompt Audio
Synthesized Outputs
NV-CV3
NV-FlexiVoice
CosyVoice3
Emilia-NV-CV2
SMIIP-NV-CV2
Orpheus TTS (Single-Speaker Model)

Functional Taxonomy

NVVs are organized into three functional levels based on communicative intent.


Level 1: Vegetative Sounds

Biological reflexes grounding speech in physical realism.

Breathing, Cough, Sigh

Level 2: Affect Bursts

Valenced vocalizations conveying emotion or instant reactions.

Laughter, Surprise-ah, Surprise-oh, Dissatisfaction-hnn

Level 3: Conversational Grunts

Interaction-management cues — filled pauses and prosodic particles.

Uhm, Confirmation-en, Question-ei, Question-ah, Question-en, Question-oh, Question-huh
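The three-level inventory above can be captured as a small lookup structure. The sketch below simply mirrors this page's own tag lists; `NVV_TAXONOMY` and `level_of` are hypothetical names for illustration, not identifiers from the NV-Bench codebase.

```python
# Three-level functional taxonomy of NVV tags, as listed on this page.
# NVV_TAXONOMY and level_of are hypothetical names used for illustration.
NVV_TAXONOMY = {
    "Vegetative Sounds": ["Breathing", "Cough", "Sigh"],
    "Affect Bursts": ["Laughter", "Surprise-ah", "Surprise-oh",
                      "Dissatisfaction-hnn"],
    "Conversational Grunts": ["Uhm", "Confirmation-en", "Question-ei",
                              "Question-ah", "Question-en", "Question-oh",
                              "Question-huh"],
}

def level_of(tag: str):
    """Return the functional level (1-3) of an NVV tag, or None if unknown."""
    for level, tags in enumerate(NVV_TAXONOMY.values(), start=1):
        if tag in tags:
            return level
    return None
```

The three lists together cover the benchmark's 14 NVV categories.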

Pipeline

1. Data Processing

565K clips (~1,560 hrs) filtered via Emilia-Pipeline & MiMo-Audio for single-speaker verification.

2. Multi-lingual NVASR

SenseVoice-Small fine-tuned on 6 datasets with unified label taxonomy.

3. Human Verification

10 annotators, Cohen's κ > 0.85 → 1,651 prompt-GT pairs (7.9 hrs).
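For context, inter-annotator agreement of the kind reported above (Cohen's κ) is chance-corrected pairwise agreement. The helper below is a generic, stdlib-only sketch with a hypothetical name, not the project's actual verification tooling:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators
    labeling the same items. Returns 1.0 for perfect agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[k] * counts_b.get(k, 0) for k in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)
```

A κ above 0.85, as reported here, indicates near-perfect agreement on standard interpretation scales.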

Evaluation Protocol


Instruction Alignment

Can the model generate the specified NVV events at the correct positions?

  • CER — Character Error Rate
  • PCER — Paralinguistic CER
  • OCER — Overall CER
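Intuitively, PCER applies the character-error-rate recipe to the sequence of bracketed NVV tags rather than to the full transcript. The sketch below is a minimal approximation of that idea under that assumption; `extract_nvv_tags`, `edit_distance`, and `pcer` are hypothetical helper names, and the exact definition follows the paper:

```python
import re

def extract_nvv_tags(text):
    """Collect bracketed NVV tags such as [Laughter], in order of appearance."""
    return re.findall(r"\[([^\]]+)\]", text)

def edit_distance(ref, hyp):
    """Levenshtein distance over token sequences (insert/delete/substitute)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def pcer(ref_text, hyp_text):
    """Tag-level error rate: edit distance over the NVV tag sequences,
    normalized by the number of reference tags."""
    ref_tags = extract_nvv_tags(ref_text)
    hyp_tags = extract_nvv_tags(hyp_text)
    if not ref_tags:
        return 0.0 if not hyp_tags else 1.0
    return edit_distance(ref_tags, hyp_tags) / len(ref_tags)
```

For example, a synthesis that drops one of two reference tags scores a PCER of 0.5 under this sketch.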

Acoustic Fidelity

How realistic is the synthesized speech compared to real recordings?

  • FAD / FD / KL — Distribution Distance
  • SIM — Speaker Similarity (WavLM)
  • DNSMOS — Perceptual Quality
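To make the distribution-distance idea concrete, here is a minimal NumPy sketch of a Fréchet distance between two sets of per-clip audio embeddings, using the standard closed form for Gaussians. The function names are hypothetical, and a real FAD pipeline would first extract embeddings with a pretrained audio encoder, which is assumed away here:

```python
import numpy as np

def _psd_sqrt(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_audio_distance(emb_real, emb_synth):
    """Frechet distance between Gaussians fitted to two embedding sets.
    Inputs are (num_clips, dim) arrays of per-clip audio embeddings."""
    mu1, mu2 = emb_real.mean(axis=0), emb_synth.mean(axis=0)
    s1 = np.cov(emb_real, rowvar=False)
    s2 = np.cov(emb_synth, rowvar=False)
    # Tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2}): symmetric form of the
    # usual cross term, so only PSD square roots are needed.
    s1_half = _psd_sqrt(s1)
    cross = _psd_sqrt(s1_half @ s2 @ s1_half)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * cross))
```

Identical embedding sets score (numerically) zero; a pure mean shift contributes its squared norm.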

NV-Bench Data

  • 1,651 total utterances (with paired GT audio)
  • 7.9 h audio duration (MP3 @ 24 kHz)
  • 14 NVV categories (3 functional levels)
  • 2 languages (Mandarin & English)

Single-label Subset

Strictly balanced — exactly one NVV event per utterance (50 samples per category). Isolates fundamental generation capabilities.

  • 650 Mandarin
  • 350 English

Multi-label Subset

Challenging utterances with 2+ NVV events — tests robustness under dense paralinguistic conditions with relative balance.

  • 41–91 samples per label (ZH)
  • 75–112 samples per label (EN)

Comparison with Existing NVV Datasets

| Dataset | Language | Testset | Balance | Prompt |
| --- | --- | --- | --- | --- |
| SynParaSpeech | ZH | | | |
| NVS | ZH / EN | | | |
| Emilia-NV | ZH | | | |
| NVTTS | EN | | | |
| SMIIP-NV | ZH | | | |
| NV-Bench (Ours) | ZH / EN | | | |

Unified Label Inventory

| Language | Vegetative Sounds | Affect Bursts | Conversational Grunts |
| --- | --- | --- | --- |
| Mandarin | Breathing, Cough, Sigh | Laughter, Surprise-ah, Surprise-oh, Dissatisfaction-hnn | Uhm, Confirmation-en, Question-ei, Question-ah, Question-en, Question-oh |
| English | Breathing, Cough, Sigh | Laughter, Surprise-oh | Uhm, Question-huh |
leaderboard

NV-Bench Evaluation


Best Controllability

NV-CV3 achieves the lowest PCER (27.69%) on the single-label Mandarin subset.


Best Acoustic Match

NV-FlexiVoice achieves the lowest FAD (0.29) and FD (2.72), closest to real distribution.


Human Correlation

IMOS shows a significant negative correlation with PCER (ρ = −0.62, p < 0.001), confirming the metric's reliability.
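The ρ reported here is Spearman's rank correlation, i.e. the Pearson correlation computed on ranks. The helper below is a hypothetical, stdlib-only sketch of that computation, not the benchmark's actual analysis code:

```python
def _average_ranks(xs):
    """1-based ranks, with ties assigned the average rank of their group."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = _average_ranks(x), _average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A negative ρ is expected here, since better instruction following (higher IMOS) should coincide with fewer tag errors (lower PCER).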

Acoustic Fidelity

| System | FAD ↓ | FD ↓ | KL ↓ |
| --- | --- | --- | --- |
| GT (Human) | – | – | – |
| Orpheus-TTS | 5.71 | 24.49 | 1.08 |
| SMIIP-NV-CV2 | 1.32 | 6.71 | 0.59 |
| Emilia-NV-CV2 | 1.08 | 5.57 | 0.44 |
| CosyVoice3 | 0.90 | 9.46 | 0.43 |
| NV-FlexiVoice | 0.29 | 2.72 | 0.76 |
| NV-CV3 | 0.86 | 3.94 | 0.39 |

Bold = best, underline = second-best. ↓ lower is better, ↑ higher is better. SIM/DNSMOS from single-label Mandarin subset.

Human Evaluation


10 annotators rated 50 utterances per model on a 5-point scale for two dimensions.

| System | IMOS ↑ (Instruction Accuracy) | NMOS ↑ (Naturalness) |
| --- | --- | --- |
| GT (Human) | 4.32 | 4.34 |
| Orpheus-TTS | 3.41 | 3.57 |
| SMIIP-NV-CV2 | 3.45 | 3.44 |
| Emilia-NV-CV2 | 3.84 | 4.04 |
| CosyVoice3 | 3.53 | 3.96 |
| NV-FlexiVoice | 3.91 | 4.04 |
| NV-CV3 | 3.93 | 4.11 |

Bold = best, underline = second-best. ↑ higher is better.

Detailed Results — Mandarin


Detailed Results — English


Multi-lingual NVASR Performance
