Abstract: Intra-speaker variability caused by emotional speech is a real threat to the performance of speaker recognition systems. In fact, as human beings, we are constantly changing our emotional state. While many efforts have been made to increase the robustness of automatic speaker verification (ASV) against channel effects or spoofing attacks, only a handful of studies have addressed the detrimental consequences of affective speech. In this work, we propose a new method to minimize the mismatch between neutral and affective speech. To this end, a Gaussian mixture model (GMM) is used to learn a prior probability distribution of a given speaker's neutral speech (i.e., to characterize his/her source space). This knowledge is then used to minimize the differences between the target (affective) and source (neutral) spaces. The proposed method is validated on four multilingual emotional datasets. Experimental results show a consistent improvement in performance across eight emotional states, with significant reductions in equal error rate (EER) relative to the baseline.
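To make the core idea concrete, below is a minimal sketch of the GMM-based approach described in the abstract: fit a GMM on a speaker's neutral-speech features to model the source space, then shift affective features toward it. The feature arrays, the number of components, the nearest-component interpolation rule, and the weight alpha are all illustrative assumptions, not the authors' exact method.

```python
# Minimal sketch (assumptions, not the paper's implementation): model a
# speaker's neutral (source) space with a GMM, then map affective (target)
# features toward it to reduce the neutral/affective mismatch.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical speaker features (e.g., frame-level embeddings).
neutral_feats = rng.normal(0.0, 1.0, size=(500, 20))    # neutral enrollment data
affective_feats = rng.normal(0.7, 1.2, size=(200, 20))  # emotionally colored test data

# Learn a prior probability distribution over the neutral (source) space.
gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
gmm.fit(neutral_feats)

# One simple mapping: pull each affective frame toward the mean of its most
# likely neutral component (interpolation weight alpha is an assumption).
resp = gmm.predict_proba(affective_feats)        # (n_frames, n_components)
nearest_means = gmm.means_[resp.argmax(axis=1)]  # closest neutral mean per frame
alpha = 0.5
mapped_feats = alpha * affective_feats + (1.0 - alpha) * nearest_means

# The mapped features should score higher under the neutral prior.
print("avg log-likelihood before:", gmm.score(affective_feats))
print("avg log-likelihood after: ", gmm.score(mapped_feats))
```

In an ASV pipeline, such mapped features would then be passed to the verification back-end, so that affective test utterances are scored in a space closer to the speaker's neutral enrollment data.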
Bio: Prof. Anderson Ávila is an Assistant Professor at INRS-EMT, working in the INRS-UQO Mixed Research Unit in Cybersecurity. Prior to joining INRS-UQO, Dr. Ávila was a research scientist in natural language and speech processing, working on projects related to model compression, low latency, and robustness of spoken language understanding. His main research interests are in data privacy via federated learning, combating misinformation using AI, and the robustness of biometrics.