Speech Samples

VoCodec: A Low-bitrate Streamable Neural Speech Codec with Voicing-driven Quantization

Authors: Xiao-Hang Jiang¹, Yang Ai¹, Rui-Chen Zheng¹, Li-Rong Dai¹, Zhen-Hua Ling¹, Ji Wu²
¹University of Science and Technology of China, China;
²Tsinghua University, China

Abstract

Neural speech codecs, as key components for compressing and reconstructing speech signals, play a significant role in speech transmission and storage. However, most existing codecs employ a uniform quantization strategy across all speech frames, allocating the same bitrate regardless of content. This approach is suboptimal for speech signals, resulting in unnecessary bitrate consumption and leaving potential for further compression. In this paper, we present a low-bitrate streamable neural speech codec called VoCodec. Unlike existing codecs, VoCodec employs a voicing-driven quantization strategy, assigning different bitrates to voiced and unvoiced frames based on their sensitivity to human auditory perception. Specifically, VoCodec incorporates a voicing detector into its fully causal encoder–quantizer–decoder neural coding framework to identify voicing characteristics in the input speech. Based on this, it adopts complex residual scalar-vector quantization for voiced frames and simple scalar quantization for unvoiced frames during quantization. Experiments show that on the LibriTTS dataset at a 16 kHz sampling rate, VoCodec outperforms baseline neural speech codecs even at a bitrate as low as 1.1 kbps. Our further experiments also confirm that introducing voicing-driven quantization can effectively reduce the bitrate by approximately 27% compared to the original uniform quantization strategy.

Paper Figure

Fig. 1. Overall architecture of the proposed VoCodec. Here, MDCT, IMDCT, Uni-LSTM, SQ, IVQ, FFT, ABS and SUM stand for modified discrete cosine transform, inverse modified discrete cosine transform, unidirectional long short-term memory layer, scalar quantizer, improved vector quantizer, fast Fourier transform, absolute value calculation and summation, respectively.