VoCodec Demo: Low-bitrate streamable neural speech codec

VoCodec: A Low-bitrate Streamable Neural Speech Codec with Voicing-driven Quantization

Authors: Xiao-Hang Jiang¹, Yang Ai¹, Rui-Chen Zheng¹, Li-Rong Dai¹, Zhen-Hua Ling¹, Ji Wu²
¹ University of Science and Technology of China, China
² Tsinghua University, China
Abstract

Neural speech codecs, as key components for compressing and reconstructing speech signals, play a significant role in speech transmission and storage. However, most existing codecs employ a uniform quantization strategy across all speech frames, allocating the same bitrate regardless of content. This approach is suboptimal for speech signals: it spends bits unnecessarily and leaves room for further compression. In this paper, we present a low-bitrate streamable neural speech codec called VoCodec. Unlike existing codecs, VoCodec employs a voicing-driven quantization strategy, assigning different bitrates to voiced and unvoiced frames according to how sensitive human auditory perception is to each. Specifically, VoCodec incorporates a voicing detector into its fully causal encoder–quantizer–decoder neural coding framework to identify voicing characteristics in the input speech. Based on the detected voicing, it applies a more complex residual scalar-vector quantization to voiced frames and a simple scalar quantization to unvoiced frames. Experiments on the LibriTTS dataset at a 16 kHz sampling rate show that VoCodec outperforms baseline neural speech codecs even at a bitrate as low as 1.1 kbps. Further experiments confirm that voicing-driven quantization reduces the bitrate by approximately 27% compared with the original uniform quantization strategy.
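The effect of giving voiced and unvoiced frames different bit budgets can be illustrated with simple arithmetic. The Python sketch below is not VoCodec's implementation: the function names, frame rate, per-frame bit budgets and voiced-frame ratio are all hypothetical, chosen only so that the result lands near the figures quoted in the abstract. It computes the average bitrate as the frame rate times the expected number of bits per frame.

```python
def average_bitrate_bps(frame_rate_hz, bits_voiced, bits_unvoiced, voiced_ratio):
    """Average bitrate when voiced and unvoiced frames get different bit budgets.

    All arguments are illustrative assumptions, not values from the paper.
    """
    if not 0.0 <= voiced_ratio <= 1.0:
        raise ValueError("voiced_ratio must be in [0, 1]")
    # Expected bits per frame, weighted by how often each frame type occurs.
    bits_per_frame = (voiced_ratio * bits_voiced
                      + (1.0 - voiced_ratio) * bits_unvoiced)
    return frame_rate_hz * bits_per_frame


def relative_saving(uniform_bps, variable_bps):
    """Fraction of bitrate saved relative to the uniform strategy."""
    return 1.0 - variable_bps / uniform_bps


# Hypothetical configuration (NOT the paper's actual settings).
FRAME_RATE_HZ = 50    # frames per second
BITS_VOICED = 30      # bits spent on each voiced frame
BITS_UNVOICED = 10    # bits spent on each unvoiced frame
VOICED_RATIO = 0.6    # fraction of frames detected as voiced

# Uniform strategy: every frame gets the voiced-frame budget.
uniform = average_bitrate_bps(FRAME_RATE_HZ, BITS_VOICED, BITS_VOICED, 1.0)
# Voicing-driven strategy: unvoiced frames get the smaller budget.
variable = average_bitrate_bps(FRAME_RATE_HZ, BITS_VOICED, BITS_UNVOICED,
                               VOICED_RATIO)
saving = relative_saving(uniform, variable)

print(f"uniform:  {uniform:.0f} bps")
print(f"variable: {variable:.0f} bps")
print(f"saving:   {saving:.1%}")
```

With these made-up values the uniform strategy costs 1500 bps, the voicing-driven one 1100 bps, a saving of about 27%. The actual saving depends on the real frame rate, the bit budgets, and the proportion of voiced frames in the speech material.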

Paper Figure

Fig. 1. Overall architecture of the proposed VoCodec. Here, MDCT, IMDCT, Uni-LSTM, SQ, IVQ, FFT, ABS and SUM stand for modified discrete cosine transform, inverse modified discrete cosine transform, unidirectional long short-term memory layer, scalar quantizer, improved vector quantizer, fast Fourier transform, absolute value calculation and summation, respectively.

3.3 Comparison with Baseline Codecs

Sampling rate: 16 kHz   •   Bitrate: 1.1 kbps

Example 1

121_121726_000004_000003.wav
Ground Truth, DAC, BigCodec, AudioDec, MDCTCodec-S, StreamCodec, VoCodec

Example 2

237_126133_000021_000000.wav
Ground Truth, DAC, BigCodec, AudioDec, MDCTCodec-S, StreamCodec, VoCodec

Example 3

1284_1180_000043_000001.wav
Ground Truth, DAC, BigCodec, AudioDec, MDCTCodec-S, StreamCodec, VoCodec

Example 4

1580_141084_000016_000003.wav
Ground Truth, DAC, BigCodec, AudioDec, MDCTCodec-S, StreamCodec, VoCodec

Example 5

2300_131720_000033_000003.wav
Ground Truth, DAC, BigCodec, AudioDec, MDCTCodec-S, StreamCodec, VoCodec

48 kHz Samples

Sampling rate: 48 kHz   •   Bitrate: 2.7 kbps

Example 1

p360_009.wav
Ground Truth, DAC, BigCodec, AudioDec, MDCTCodec-S, StreamCodec, VoCodec

Example 2

p360_037.wav
Ground Truth, DAC, BigCodec, AudioDec, MDCTCodec-S, StreamCodec, VoCodec

Example 3

p361_007.wav
Ground Truth, DAC, BigCodec, AudioDec, MDCTCodec-S, StreamCodec, VoCodec

Example 4

p361_237.wav
Ground Truth, DAC, BigCodec, AudioDec, MDCTCodec-S, StreamCodec, VoCodec

Example 5

p361_262.wav
Ground Truth, DAC, BigCodec, AudioDec, MDCTCodec-S, StreamCodec, VoCodec