P2PSynCodec Demo: Ultra-low-bitrate Speech Codec

An Ultra-low-bitrate Neural Speech Codec with Plain-to-Pseudo Synergistic Vector Quantization

Authors: Xiao-hang Jiang¹, Yang Ai¹, Fei Liu¹, Jian-Qing Gao², Zhen-Hua Ling¹, Ji Wu³
¹ NERC-SLIP, University of Science and Technology of China, Hefei, China
² iFLYTEK Research, China
³ Department of Electronic Engineering, Tsinghua University, Beijing, China
Abstract

Most neural speech codecs adopt a residual vector quantizer (RVQ), in which successive VQs contribute with decreasing importance. However, assigning the same bitrate to every VQ wastes bits on the less important stages and results in a relatively high overall bitrate. To address this issue, we propose an ultra-low-bitrate neural speech codec, termed P2PSynCodec, which incorporates a plain-to-pseudo synergistic vector quantizer (P2PSVQ). The P2PSVQ extends RVQ by allocating zero bitrate to the less important VQs at the rear stages. Specifically, the P2PSVQ is a cascaded structure composed of a plain VQ and multiple pseudo VQs that work synergistically. The plain VQ serves as the foundation, producing basic tokens through quantization, while the pseudo VQs generate auxiliary tokens through prediction; the auxiliary tokens are not counted in the bitrate. Experiments show that P2PSynCodec, operating at merely 0.5 kbps, maintains speech quality comparable to that of competing codecs at 2.0 kbps.
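The bitrate argument in the abstract can be illustrated with a small calculation. The sketch below is not the authors' configuration: the frame rate, codebook size, and number of stages are hypothetical values picked only so that the arithmetic is easy to follow. It assumes the usual accounting in which each transmitted VQ index costs log2(codebook size) bits per frame, and it treats pseudo-VQ tokens as costing zero bits because they are predicted rather than quantized and counted.

```python
import math


def vq_bitrate_kbps(frame_rate_hz, codebook_sizes):
    """Bitrate in kbps of a stack of VQs whose indices are all counted."""
    bits_per_frame = sum(math.log2(size) for size in codebook_sizes)
    return frame_rate_hz * bits_per_frame / 1000.0


# Hypothetical values for illustration only (not taken from the paper).
FRAME_RATE_HZ = 50      # assumed token frames per second
CODEBOOK_SIZE = 1024    # assumed entries per codebook, i.e. 10 bits per index
NUM_STAGES = 4          # assumed number of cascaded VQ stages

# Conventional RVQ: the index of every stage counts toward the bitrate.
rvq_kbps = vq_bitrate_kbps(FRAME_RATE_HZ, [CODEBOOK_SIZE] * NUM_STAGES)

# P2PSVQ-style accounting: only the plain VQ counts; the remaining
# stages are pseudo VQs whose tokens are predicted, so they add 0 bits.
p2p_kbps = vq_bitrate_kbps(FRAME_RATE_HZ, [CODEBOOK_SIZE])

print(f"RVQ, {NUM_STAGES} stages: {rvq_kbps:.1f} kbps")
print(f"plain VQ + {NUM_STAGES - 1} pseudo VQs: {p2p_kbps:.1f} kbps")
```

Under these assumed values the four-stage RVQ costs 2.0 kbps while the plain-plus-pseudo arrangement costs 0.5 kbps, which shows how moving the rear stages to zero bitrate reduces the total without reducing the number of token streams available to the decoder.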

Paper Figures

Fig. 1. Overview of the proposed P2PSynCodec and its pseudo-VQ training process.
Table I. Objective experimental results on decoded speech quality and complexity of the compared codecs at 0.5 kbps on the LibriTTS test set (16 kHz) and 1.5 kbps on the VCTK test set (48 kHz). Bold and underlined numbers indicate the best and second-best results, respectively.
Fig. 2. Average preference scores (%) of ABX tests comparing P2PSynCodec at 0.5 kbps with other codecs at higher bitrates on the LibriTTS test set (16 kHz). Here, N/P denotes no preference, and p is the p-value of a paired t-test.

Comparisons with Baseline Codecs

Sampling rate: 16 kHz

Setting: Comparisons at equal ultra-low bitrate (0.5 kbps)

Example 1

2961_961_000004_000002.wav
Ground Truth
MDCTCodec @ 0.5 kbps | DAC @ 0.5 kbps | BigCodec @ 0.5 kbps | WavTokenizer @ 0.5 kbps | P2PSynCodec @ 0.5 kbps

Example 2

4970_29093_000037_000000.wav
Ground Truth
MDCTCodec @ 0.5 kbps | DAC @ 0.5 kbps | BigCodec @ 0.5 kbps | WavTokenizer @ 0.5 kbps | P2PSynCodec @ 0.5 kbps

Example 3

260_123288_000033_000000.wav
Ground Truth
MDCTCodec @ 0.5 kbps | DAC @ 0.5 kbps | BigCodec @ 0.5 kbps | WavTokenizer @ 0.5 kbps | P2PSynCodec @ 0.5 kbps

Example 4

6829_68769_000071_000001.wav
Ground Truth
MDCTCodec @ 0.5 kbps | DAC @ 0.5 kbps | BigCodec @ 0.5 kbps | WavTokenizer @ 0.5 kbps | P2PSynCodec @ 0.5 kbps

Example 5

8463_294825_000015_000002.wav
Ground Truth
MDCTCodec @ 0.5 kbps | DAC @ 0.5 kbps | BigCodec @ 0.5 kbps | WavTokenizer @ 0.5 kbps | P2PSynCodec @ 0.5 kbps

Comparisons with High-Bitrate Codecs

Setting: Baselines operate at higher bitrates (1.5 or 2.0 kbps); P2PSynCodec remains at 0.5 kbps.

Example 1

2961_961_000004_000002.wav
Ground Truth
MDCTCodec @ 2.0 kbps | DAC @ 2.0 kbps | SQCodec @ 1.5 kbps | WavTokenizer @ 2.0 kbps | P2PSynCodec @ 0.5 kbps

Example 2

4970_29093_000037_000000.wav
Ground Truth
MDCTCodec @ 2.0 kbps | DAC @ 2.0 kbps | SQCodec @ 1.5 kbps | WavTokenizer @ 2.0 kbps | P2PSynCodec @ 0.5 kbps

Example 3

260_123288_000033_000000.wav
Ground Truth
MDCTCodec @ 2.0 kbps | DAC @ 2.0 kbps | SQCodec @ 1.5 kbps | WavTokenizer @ 2.0 kbps | P2PSynCodec @ 0.5 kbps

Example 4

6829_68769_000071_000001.wav
Ground Truth
MDCTCodec @ 2.0 kbps | DAC @ 2.0 kbps | SQCodec @ 1.5 kbps | WavTokenizer @ 2.0 kbps | P2PSynCodec @ 0.5 kbps

Example 5

8463_294825_000015_000002.wav
Ground Truth
MDCTCodec @ 2.0 kbps | DAC @ 2.0 kbps | SQCodec @ 1.5 kbps | WavTokenizer @ 2.0 kbps | P2PSynCodec @ 0.5 kbps