Academic Research

Many of IngenID's most important voice biometric functions began as academic research projects by a team of highly talented PhDs, PhD candidates, and master's students.

Research topics include:

  • Speaker Verification Systems
    Including robust noise handling, channel normalization, etc.

  • Synthetic Speech Detection
    Both on its own and in combination with speaker verification

  • Speech Synthesis and Conversion
    Text-to-speech, voice cloning, and more

  • Emotion Classification
    As well as other classifiers, such as gender and channel

A Sampling of Published Research Papers

Spoof Detection
Title: SingFake: Singing Voice Deepfake Detection

Abstract

The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances. These unique properties make singing voice deepfake detection a relevant but significantly different problem from synthetic speech detection. In this work, we propose the singing voice deepfake detection task. We first present SingFake, the first curated in-the-wild dataset consisting of 28.93 hours of bonafide and 29.40 hours of deepfake song clips in five languages from 40 singers. We provide a train/val/test split where the test sets include various scenarios. We then use SingFake to evaluate four state-of-the-art speech countermeasure systems trained on speech utterances. We find these systems lag significantly behind their performance on speech test data. When trained on SingFake, either using separated vocal tracks or song mixtures, these systems show substantial improvement. However, our evaluations also identify challenges associated with unseen singers, communication codecs, languages, and musical contexts, calling for dedicated research into singing voice deepfake detection. The SingFake dataset and related resources are available online.

Authors: Yongyi, Neil, Zhiyao
Submitted: September 13, 2023
Spoof Detection
Title: Phase Perturbation Improves Channel Robustness for Speech Spoofing Countermeasures

Abstract

In this paper, we aim to address the problem of channel robustness in speech countermeasure (CM) systems, which are used to distinguish synthetic speech from human natural speech. On the basis of two hypotheses, we suggest an approach for perturbing phase information during the training of time-domain CM systems. Communication networks often employ lossy compression codecs that encode only magnitude information, thereby heavily altering phase information. Also, state-of-the-art CM systems rely on phase information to identify spoofed speech. Thus, we believe the information loss in the phase domain induced by lossy compression codecs degrades performance on unseen channels. We first establish the dependence of time-domain CM systems on phase information by perturbing phase in evaluation, showing strong degradation. Then, we demonstrate that perturbing phase during training leads to a significant performance improvement, whereas perturbing magnitude leads to further degradation.

Yongyi
Neil
Zhiyao
Published: August 20, 2023
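
The phase-perturbation idea described above can be sketched in a few lines: keep each frame's magnitude spectrum but randomize its phase before overlap-add resynthesis. The function below is an illustrative sketch, not the authors' implementation; the frame size, hop, and `strength` parameter are assumptions.

```python
import numpy as np

def perturb_phase(wave, frame_len=512, hop=256, strength=1.0):
    """Randomize per-frame STFT phase while preserving magnitude (sketch)."""
    n_frames = 1 + (len(wave) - frame_len) // hop
    window = np.hanning(frame_len)
    out = np.zeros(len(wave))
    norm = np.zeros(len(wave))
    rng = np.random.default_rng(0)
    for i in range(n_frames):
        start = i * hop
        frame = wave[start:start + frame_len] * window
        spec = np.fft.rfft(frame)
        # Keep the magnitude, add random phase offsets scaled by `strength`.
        noise = rng.uniform(-np.pi, np.pi, spec.shape) * strength
        spec = np.abs(spec) * np.exp(1j * (np.angle(spec) + noise))
        out[start:start + frame_len] += np.fft.irfft(spec, frame_len) * window
        norm[start:start + frame_len] += window ** 2
    # Normalize the overlap-add by the accumulated window energy.
    return out / np.maximum(norm, 1e-8)
```

At `strength=0.0` the routine reconstructs the input away from the edges (a useful sanity check); at `strength=1.0` the waveform changes substantially while each frame's magnitude spectrum is unchanged, which is exactly the kind of information a magnitude-only codec discards.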
Spoof Detection
Title: Generalizing Voice Presentation Attack Detection to Unseen Synthetic Attacks and Channel Variation

Abstract

Automatic Speaker Verification (ASV) systems aim to verify a speaker’s claimed identity through voice. However, voice can be easily forged with replay, text-to-speech (TTS), and voice conversion (VC) techniques, which may compromise ASV systems. Voice presentation attack detection (PAD) is developed to improve the reliability of speaker verification systems against such spoofing attacks. One main issue of voice PAD systems is their generalization ability to unseen synthetic attacks, i.e., synthesis methods that are not seen during training of the presentation attack detection models. We propose one-class learning, where the model compacts the distribution of learned representations of bona fide speech while pushing away spoofing attacks to improve the results. Another issue is the robustness to variations of acoustic and telecommunication channels. To alleviate this issue, we propose channel-robust training strategies, including data augmentation, multi-task learning, and adversarial learning. In this chapter, we analyze the two issues within the scope of synthetic attacks, i.e., TTS and VC, and demonstrate the effectiveness of our proposed methods.

Authors: Neil, Ge, Zhiyao
Published: February 24, 2023
Spoof Detection
Title: SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing

Abstract

Voice anti-spoofing systems are crucial auxiliaries for automatic speaker verification (ASV) systems. A major challenge is caused by unseen attacks empowered by advanced speech synthesis technologies. Our previous research on one-class learning has improved the generalization ability to unseen attacks by compacting the bona fide speech in the embedding space. However, such compactness lacks consideration of the diversity of speakers. In this work, we propose speaker attractor multi-center one-class learning (SAMO), which clusters bona fide speech around a number of speaker attractors and pushes away spoofing attacks from all the attractors in a high-dimensional embedding space. For training, we propose an algorithm for the co-optimization of bona fide speech clustering and bona fide/spoof classification. For inference, we propose strategies to enable anti-spoofing for speakers without enrollment. Our proposed system outperforms existing state-of-the-art single systems with a relative improvement of 38% on equal error rate (EER) on the ASVspoof2019 LA evaluation set.

Authors: Neil, Zhiyao
Accepted: February 16, 2023
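
The inference-time intuition behind multi-center one-class scoring can be illustrated briefly: a test embedding is scored by its best cosine similarity to any speaker attractor, so bona fide speech lands near some attractor while spoofed speech stays far from all of them. This is a toy sketch of that scoring idea only, not the paper's co-optimization training algorithm; all names are illustrative.

```python
import numpy as np

def samo_score(embedding, attractors):
    """Score = highest cosine similarity between a test embedding and
    any speaker attractor (illustrative sketch of multi-center scoring)."""
    e = embedding / np.linalg.norm(embedding)
    a = attractors / np.linalg.norm(attractors, axis=1, keepdims=True)
    return float(np.max(a @ e))
```

A threshold on this score then separates bona fide trials (high best-attractor similarity) from spoofed ones (low similarity to every attractor).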
Speaker Verification
Title: A Probabilistic Fusion Framework for Spoofing Aware Speaker Verification

Abstract

The performance of automatic speaker verification (ASV) systems could be degraded by voice spoofing attacks. Most existing works aimed to develop standalone spoofing countermeasure (CM) systems. Relatively little work has targeted developing an integrated spoofing aware speaker verification (SASV) system. In the recent SASV challenge, the organizers encouraged the development of such integration by releasing official protocols and baselines. In this paper, we build a probabilistic framework for fusing the ASV and CM subsystem scores. We further propose fusion strategies for direct inference and fine-tuning to predict the SASV score based on the framework. Surprisingly, these strategies significantly improve the SASV equal error rate (EER) from 19.31% of the baseline to 1.53% on the official evaluation trials of the SASV challenge. We verify the effectiveness of our proposed components through ablation studies and provide insights with score distribution analysis.

Authors: Neil, Ge, Zhiyao
Updated: April 22, 2022
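
One simple instance of probabilistic score fusion is to treat calibrated ASV and CM scores as independent log-odds and multiply the resulting probabilities that a trial is (a) the target speaker and (b) bona fide. The sketch below illustrates that product rule only; it is not the paper's exact direct-inference or fine-tuning strategy, and the score calibration is assumed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fuse_sasv_score(asv_score, cm_score):
    """Fuse calibrated ASV and CM log-odds into one SASV score (sketch).

    A trial is accepted only if it is both the target speaker AND
    bona fide, so under an independence assumption the joint
    probability is the product of the two per-subsystem probabilities.
    """
    p_target = sigmoid(asv_score)    # P(trial is the claimed speaker)
    p_bonafide = sigmoid(cm_score)   # P(trial is human, not spoofed)
    return p_target * p_bonafide
```

A spoofed trial with a high ASV score is still rejected, because a low CM probability drives the product toward zero.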
Spoof Detection
Title: UR Channel-Robust Synthetic Speech Detection System for ASVspoof 2021

Abstract

In this paper, we present the UR-AIR system submission to the logical access (LA) and the speech deepfake (DF) tracks of the ASVspoof 2021 Challenge. The LA and DF tasks focus on synthetic speech detection (SSD), i.e. detecting text-to-speech and voice conversion as spoofing attacks. Different from previous ASVspoof challenges, the LA task this year presents codec and transmission channel variability, while the new DF task presents general audio compression. Built upon our previous research work on improving the robustness of SSD systems to channel effects, we propose a channel-robust synthetic speech detection system for the challenge. To mitigate the channel variability issue, we use an acoustic simulator to apply transmission codecs, compression codecs, and convolutional impulse responses to augment the original datasets. For the neural network backbone, we propose to use Emphasized Channel Attention, Propagation and Aggregation Time Delay Neural Networks (ECAPA-TDNN) as our primary model. We also incorporate one-class learning with channel-robust training strategies to further learn a channel-invariant speech representation. Our submission achieved EER 20.33% in the DF task; EER 5.46% and min-tDCF 0.3094 in the LA task.

Authors: Neil, Ge, Zhiyao
Updated: October 10, 2021
Spoof Detection
Title: An Empirical Study on Channel Effects for Synthetic Voice Spoofing Countermeasure Systems

Abstract

Spoofing countermeasure (CM) systems are critical in speaker verification; they aim to discern spoofing attacks from bona fide speech trials. In practice, however, acoustic condition variability in speech utterances may significantly degrade the performance of CM systems. In this paper, we conduct a cross-dataset study on several state-of-the-art CM systems and observe significant performance degradation compared with their single-dataset performance. Observing differences of average magnitude spectra of bona fide utterances across the datasets, we hypothesize that channel mismatch among these datasets is one important reason. We then verify it by demonstrating a similar degradation of CM systems trained on original but evaluated on channel-shifted data. Finally, we propose several channel robust strategies (data augmentation, multi-task learning, adversarial learning) for CM systems, and observe a significant performance improvement on cross-dataset experiments.

Authors: Neil, Ge, Zhiyao
Updated: October 10, 2021
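
The channel-shift experiments and data-augmentation strategy above boil down to a simple operation: convolve clean speech with a channel impulse response so the training data covers more acoustic conditions. The helper below is a minimal sketch of that augmentation step (the impulse response and level-matching choice are illustrative, not the authors' simulator).

```python
import numpy as np

def apply_channel(wave, impulse_response):
    """Simulate a transmission channel by convolving speech with a
    (hypothetical) channel impulse response (sketch)."""
    out = np.convolve(wave, impulse_response)[: len(wave)]
    # Re-match the original peak level so the augmentation changes the
    # channel characteristics, not the overall loudness.
    peak = np.max(np.abs(out)) + 1e-12
    return out / peak * np.max(np.abs(wave))
```

Training a countermeasure on a mix of clean and channel-shifted copies is one of the augmentation strategies the study evaluates.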
Spoof Detection
Title: One-class Learning Towards Synthetic Voice Spoofing Detection

Abstract

Human voices can be used to authenticate the identity of the speaker, but automatic speaker verification (ASV) systems are vulnerable to voice spoofing attacks, such as impersonation, replay, text-to-speech, and voice conversion. Recently, researchers developed anti-spoofing techniques to improve the reliability of ASV systems against spoofing attacks. However, most methods encounter difficulties in detecting unknown attacks in practical use, which often have different statistical distributions from known attacks. Especially, the fast development of synthetic voice spoofing algorithms is generating increasingly powerful attacks, putting ASV systems at risk of unseen attacks. In this work, we propose an anti-spoofing system to detect unknown synthetic voice spoofing attacks (i.e., text-to-speech or voice conversion) using one-class learning. The key idea is to compact the bona fide speech representation and inject an angular margin to separate the spoofing attacks in the embedding space. Without resorting to any data augmentation methods, our proposed system achieves an equal error rate (EER) of 2.19% on the evaluation set of the ASVspoof 2019 Challenge logical access scenario, outperforming all existing single systems (i.e., those without model ensemble).

Authors: Neil, Zhiyao
Updated: February 9, 2021
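
The "compact bona fide speech and inject an angular margin" idea above can be written down as a one-class loss on cosine similarities: bona fide embeddings are pushed above one margin against a learned direction, spoofed embeddings below another. The sketch below illustrates that loss shape; the margins, scale, and variable names are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def oc_softmax_loss(embeddings, labels, w, alpha=20.0, m_real=0.9, m_fake=0.2):
    """One-class softmax-style loss sketch.

    labels: 0 = bona fide, 1 = spoof. Bona fide embeddings are pulled to
    cosine similarity above m_real with direction w; spoofed embeddings
    are pushed below m_fake. alpha sharpens the penalty.
    """
    w = w / np.linalg.norm(w)
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos = x @ w  # cosine similarity to the bona fide center direction
    # Margin term flips sign depending on the class.
    margins = np.where(labels == 0, m_real - cos, cos - m_fake)
    return float(np.mean(np.log1p(np.exp(alpha * margins))))
```

When bona fide embeddings align with `w` and spoofed ones point away, every margin term is negative and the loss is near zero; mislabeled or poorly separated embeddings make it grow quickly.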
Speaker Verification
Title: Y-Vector: Multiscale Waveform Encoder for Speaker Embedding

Abstract

State-of-the-art text-independent speaker verification systems typically use cepstral features or filter bank energies of speech utterances as input features. With the ability of deep neural networks to learn representations from raw data, recent studies attempted to extract speaker embeddings directly from raw waveforms and showed competitive results. In this paper, we propose a new speaker embedding called raw-x-vector for speaker verification in the time domain, combining a multi-scale waveform encoder and an x-vector network architecture. We show that the proposed approach outperforms existing raw-waveform-based speaker verification systems by a large margin. We also show that the proposed multi-scale encoder improves over single-scale encoders for both the proposed system and another state-of-the-art raw-waveform-based speaker verification system. A further analysis of the learned filters shows that the multi-scale encoder focuses on different frequency bands at its different scales while resulting in a flatter overall frequency response than any of the single-scale counterparts.

Authors: Ge, Zhiyao
Updated: June 9, 2021
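
The multi-scale encoder idea above amounts to filtering the raw waveform with filter banks of several kernel lengths, so each scale sees a different temporal (and hence frequency) resolution. The toy sketch below uses random, untrained filters purely to show the data flow; in the actual system these filters are learned end-to-end, and the kernel sizes and pooling here are illustrative assumptions.

```python
import numpy as np

def multiscale_encode(wave, kernel_sizes=(32, 128, 512), n_filters=4, seed=0):
    """Toy multi-scale waveform encoder: filter banks at several kernel
    lengths, each followed by simple energy pooling (illustrative only)."""
    rng = np.random.default_rng(seed)
    feats = []
    for k in kernel_sizes:
        # One random (untrained) filter bank per scale.
        filters = rng.standard_normal((n_filters, k)) / np.sqrt(k)
        # Convolve each filter with the waveform, then pool log-energy
        # so every utterance yields a fixed-length feature vector.
        responses = [np.convolve(wave, f, mode="valid") for f in filters]
        feats.append([float(np.log1p(np.mean(r ** 2))) for r in responses])
    return np.concatenate(feats)
```

Short kernels resolve fine temporal detail (high frequencies); long kernels capture slower structure, which is why combining scales yields the flatter overall frequency response the paper reports.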