# TDY-CNN for Text-Independent Speaker Verification
Official implementation of
**Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemic Analysis**
by Seong-Hu Kim, Hyeonuk Nam, Yong-Hwa Park @ Human Lab, Mechanical Engineering Department, KAIST
Accepted to ICASSP 2022.

This code was written mainly with reference to the VoxCeleb_trainer of the paper "In defence of metric learning for speaker recognition".
## Temporal Dynamic Convolutional Neural Network (TDY-CNN)
TDY-CNN efficiently applies adaptive convolution depending on the time bin by changing the computation order as follows:

$$y(f,t) = \sigma\left(\sum_{k=1}^{K} \pi_k(t)\left(\tilde{W}_k \ast x\,(f,t) + \tilde{b}_k\right)\right)$$

$$\text{subject to } \; 0 \le \pi_k(t) \le 1, \quad \sum_{k=1}^{K} \pi_k(t) = 1,$$

where $x$ and $y$ are the input and output of the TDY-CNN module, which depend on the frequency feature $f$ and the time feature $t$ of the time-frequency domain data. The $k$-th basis kernel $\tilde{W}_k$ is convolved with the input and the $k$-th bias $\tilde{b}_k$ is added; the results are aggregated using the attention weights $\pi_k(t)$, which depend on the time bin. $K$ is the number of basis kernels, and $\sigma$ is the ReLU activation function. Each attention weight has a value between 0 and 1, and the weights over all basis kernels on a single time bin sum to 1, as they are processed by a softmax.
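As a concrete illustration, below is a minimal PyTorch sketch of this computation. It is not the repository's exact module: the attention branch (frequency-average pooling followed by two 1×1 convolutions) and all hyper-parameters here are assumptions made for the example; only the aggregation of K basis convolutions with per-time-bin softmax weights follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TDYConv2d(nn.Module):
    """Temporal dynamic convolution sketch: a per-time-bin weighted sum of K
    basis convolutions. The attention branch and hyper-parameters are
    illustrative assumptions, not the repo's exact design."""

    def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=6, att_hidden=32):
        super().__init__()
        self.K = num_kernels
        self.out_ch = out_ch
        self.pad = kernel_size // 2
        # K basis kernels W_k and biases b_k
        self.weight = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(num_kernels, out_ch))
        # Attention branch predicting pi_k(t) for every time bin
        self.att = nn.Sequential(
            nn.Conv1d(in_ch, att_hidden, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(att_hidden, num_kernels, kernel_size=1),
        )

    def forward(self, x):
        # x: (B, C_in, F, T) time-frequency input
        B, C, Fr, T = x.shape

        # pi_k(t): softmax over the K kernels per time bin -> (B, K, T);
        # each weight is in [0, 1] and they sum to 1 on every time bin
        pi = F.softmax(self.att(x.mean(dim=2)), dim=1)

        # Convolve with all K basis kernels in one call by folding K into the
        # output channels: (K, C_out, C_in, k, k) -> (K*C_out, C_in, k, k)
        w = self.weight.view(self.K * self.out_ch, C, *self.weight.shape[-2:])
        z = F.conv2d(x, w, padding=self.pad).view(B, self.K, self.out_ch, Fr, T)
        z = z + self.bias.view(1, self.K, self.out_ch, 1, 1)  # add k-th bias

        # Aggregate the K results with the time-dependent attention weights
        y = (pi.view(B, self.K, 1, 1, T) * z).sum(dim=1)
        return F.relu(y)  # sigma = ReLU


# Example: a 16-channel TDY layer on a (batch, 1, freq, time) spectrogram
layer = TDYConv2d(in_ch=1, out_ch=16)
out = layer(torch.randn(4, 1, 64, 200))  # -> torch.Size([4, 16, 64, 200])
```

Convolving the input with each basis kernel first and aggregating afterwards means K ordinary convolutions suffice, even though the effective kernel differs at every time bin.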
## Requirements and versions used

Python 3.7.10 is used with the following libraries:

- pytorch == 1.8.1
- torchaudio == 0.8.1
- numpy == 1.19.2
- scipy == 1.5.3
- scikit-learn == 0.23.2
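Assuming a pip-based environment, the pinned versions above can be installed in one step (note that the PyTorch packages are named `torch` and `torchaudio` on PyPI):

```
pip install torch==1.8.1 torchaudio==0.8.1 numpy==1.19.2 scipy==1.5.3 scikit-learn==0.23.2
```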
## Dataset

We used the VoxCeleb1 & 2 datasets in this paper. You can download them by referring to VoxCeleb1 and VoxCeleb2.
## Training

You can train and save the model in the `exps` folder by running:

```
python trainSpeakerNet.py --model TDy_ResNet34_half --log_input True --encoder_type AVG --trainfunc softmaxproto --save_path exps/TDY_CNN_ResNet34 --nPerSpeaker 2 --batch_size 400
```
This implementation also supports faster training through distributed training and mixed precision training; an example command follows this list.

- Use the `--distributed` flag to enable distributed training and the `--mixedprec` flag to enable mixed precision training.
- GPU indices should be set before training with `os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'` in `trainSpeakerNet.py`.
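For example, a distributed, mixed-precision run could look like the following. This is a sketch that simply appends the two flags to the training command above; check `trainSpeakerNet.py` for the exact argument behavior.

```
python trainSpeakerNet.py --model TDy_ResNet34_half --log_input True --encoder_type AVG --trainfunc softmaxproto --save_path exps/TDY_CNN_ResNet34 --nPerSpeaker 2 --batch_size 400 --distributed --mixedprec
```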
## Results

| Network | # Params | EER (%) | C_det |
|---|---|---|---|
| TDY-VGG-M | 71.2M | 3.04 | 0.237 |
| TDY-ResNet-34(×0.25) | 13.3M | 1.58 | 0.116 |
| TDY-ResNet-34(×0.5) | 51.9M | 1.48 | 0.118 |
The figure shows a low-dimensional t-SNE projection of frame-level speaker embeddings of speakers MHRM0 and FDAS1 using (a) the baseline model ResNet-34(×0.25) and (b) TDY-ResNet-34(×0.25). The left column shows embeddings colored by speaker, and the right column shows embeddings colored by phoneme class.

Embeddings from TDY-ResNet-34(×0.25) are closely clustered regardless of phoneme group, which shows that the temporal dynamic model extracts consistent speaker information regardless of phonemes.
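For reference, a projection like this can be produced with scikit-learn's t-SNE. This is a minimal sketch: `embeddings` and `labels` are hypothetical placeholders for the frame-level embeddings and their speaker or phoneme labels, not variables from this repository, and `matplotlib` is an extra dependency.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embeddings: (N, D) frame-level speaker embeddings extracted from a model
# labels:     (N,)   speaker or phoneme-class label for each frame
embeddings = np.random.randn(500, 256)      # placeholder data
labels = np.random.randint(0, 2, size=500)  # placeholder labels

# Project the D-dimensional embeddings down to 2-D for visualization
emb_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=labels, s=5, cmap='tab10')
plt.title('t-SNE projection of frame-level speaker embeddings')
plt.show()
```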
## Pretrained models

Pretrained models are provided in the `pretrained_model` folder. For example, you can check an EER of 1.4786% by running the following script using TDY-ResNet-34(×0.5):

```
python trainSpeakerNet.py --eval --model TDy_ResNet34_half --log_input True --encoder_type AVG --trainfunc softmaxproto --save_path exps/test --eval_frames 400 --initial_model pretrained_model/pretrained_TDy_ResNet34_half.model
```
## Citation

```bibtex
@article{kim2021tdycnn,
  title={Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemic Analysis},
  author={Kim, Seong-Hu and Nam, Hyeonuk and Park, Yong-Hwa},
  journal={arXiv preprint arXiv:2110.03213},
  year={2021}
}
```
Please contact Seong-Hu Kim at [email protected] for any queries.