Speech Samples

An Ultra-low-bitrate Neural Speech Codec with Plain-to-Pseudo Synergistic Vector Quantization

Authors: Xiao-hang Jiang¹, Yang Ai¹, Fei Liu¹, Jian-Qing Gao², Zhen-Hua Ling¹, Ji Wu³
¹NERC-SLIP, University of Science and Technology of China, Hefei, China;
²iFLYTEK Research, China;
³Department of Electronic Engineering, Tsinghua University, Beijing, China

Abstract

Most neural speech codecs adopt a residual vector quantizer (RVQ), in which successive VQs contribute with decreasing importance. However, assigning the same bitrate to every VQ wastes bits and results in a relatively high overall bitrate. To address this issue, we propose an ultra-low-bitrate neural speech codec, termed P2PSynCodec, which incorporates a plain-to-pseudo synergistic vector quantizer (P2PSVQ). The P2PSVQ extends RVQ on the key principle of allocating zero bitrate to the less important VQs at the rear stages. Specifically, the P2PSVQ is a cascaded structure composed of a plain VQ and multiple pseudo VQs that work in a synergistic manner. The plain VQ serves as the foundation, producing basic tokens through quantization, while the pseudo VQs generate auxiliary tokens through prediction; these auxiliary tokens are excluded from the bitrate calculation. Experiments show that P2PSynCodec, despite operating at merely 0.5 kbps, maintains speech quality comparable to that of competing codecs at 2.0 kbps.
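
To make the bitrate argument in the abstract concrete, the following is a minimal sketch of the arithmetic, not the authors' implementation. It assumes that each VQ whose tokens count toward the bitrate contributes one codebook index per frame. The frame rate, codebook size, and stage count are hypothetical values chosen only so that the totals match the 2.0 kbps and 0.5 kbps figures quoted above; the helper name `bitrate_kbps` is also an assumption.

```python
import math


def bitrate_kbps(frame_rate_hz, codebook_size, num_counted_vqs):
    """Bitrate in kbps when each counted VQ contributes one index per frame."""
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * bits_per_index * num_counted_vqs / 1000.0


# Hypothetical settings, NOT taken from the paper.
FRAME_RATE_HZ = 50     # frames per second (assumed)
CODEBOOK_SIZE = 1024   # entries per codebook, i.e. 10 bits per index (assumed)
NUM_STAGES = 4         # total cascaded VQ stages (assumed)

# Conventional RVQ: the tokens of every stage count toward the bitrate.
rvq_kbps = bitrate_kbps(FRAME_RATE_HZ, CODEBOOK_SIZE, NUM_STAGES)

# P2PSVQ-style allocation: only the plain VQ's tokens count; the pseudo VQs'
# auxiliary tokens are predicted and therefore cost zero bits.
p2psvq_kbps = bitrate_kbps(FRAME_RATE_HZ, CODEBOOK_SIZE, 1)

print(f"RVQ, all {NUM_STAGES} stages counted: {rvq_kbps:.1f} kbps")
print(f"Plain VQ only, pseudo VQs predicted: {p2psvq_kbps:.1f} kbps")
```

Under these assumed settings, the conventional RVQ costs 2.0 kbps while counting only the plain VQ costs 0.5 kbps, which illustrates why assigning zero bitrate to the rear stages lowers the total without reducing the number of token streams available to the decoder.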

Paper Figures

Fig. 1. Overview of the proposed P2PSynCodec and its pseudo-VQ training process.

Table I: objective and subjective results

Table I. Objective experimental results on decoded speech quality and complexity of the compared codecs at 0.5 kbps on the LibriTTS test set (16 kHz) and at 1.5 kbps on the VCTK test set (48 kHz). Bold and underlined numbers indicate the best and second-best results, respectively.

Fig. 2. Average preference scores (%) of ABX tests comparing P2PSynCodec at 0.5 kbps and other codecs at high bitrates on the LibriTTS test set (16 kHz). Here, N/P denotes no preference, and p is the paired t-test p-value.