Examining the cross-dataset generalisation ability of Speech Emotion Recognition systems
Description:
Speech Emotion Recognition is the task of predicting the emotion levels conveyed by an input speech signal. Several datasets exist for this task, and they are built on different design choices. Among other aspects, they differ in (i) the type of speech (acted, improvised, etc.), (ii) the definition of emotion levels, (iii) the context (e.g., theater actors are more expressive than speakers in everyday situations), (iv) the spoken language, and (v) the recording setup. The aim of this thesis is to empirically examine how well models trained on one type of dataset generalise to other datasets, and to investigate the factors that affect the transfer of knowledge among speech emotion recognition datasets.
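One possible way to organise such an examination is a train-on-one, test-on-all evaluation matrix. The sketch below is illustrative only and is not a prescribed method for the thesis: the function names, the placeholder corpus names, the toy features, and the majority-class baseline are all assumptions, and it presumes that emotion labels have already been mapped to a shared label set across datasets.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Sequence, Tuple

# A dataset is a pair (features, labels). Labels are assumed to be
# already mapped to a label set shared by all datasets (an assumption).
Dataset = Tuple[Sequence[Any], Sequence[str]]


def cross_dataset_matrix(
    datasets: Dict[str, Dataset],
    fit: Callable[[Sequence[Any], Sequence[str]], Any],
    predict: Callable[[Any, Sequence[Any]], List[str]],
) -> Dict[str, Dict[str, float]]:
    """Train one model per dataset and report its accuracy on every dataset.

    results[a][b] is the accuracy of the model trained on a and tested on b.
    Diagonal entries reuse the training data here, so they are optimistic;
    a real study would use a held-out split of each dataset instead.
    """
    results: Dict[str, Dict[str, float]] = {}
    for train_name, (x_train, y_train) in datasets.items():
        model = fit(x_train, y_train)
        results[train_name] = {}
        for test_name, (x_test, y_test) in datasets.items():
            preds = predict(model, x_test)
            correct = sum(p == t for p, t in zip(preds, y_test))
            results[train_name][test_name] = correct / len(y_test)
    return results


# Hypothetical stand-in for a real SER model: always predicts the most
# frequent training label and ignores the features.
def fit_majority(x: Sequence[Any], y: Sequence[str]) -> str:
    return Counter(y).most_common(1)[0][0]


def predict_majority(model: str, x: Sequence[Any]) -> List[str]:
    return [model for _ in x]


# Toy placeholder data, not taken from any real corpus.
toy = {
    "corpus_a": ([[0.1], [0.2], [0.3]], ["angry", "angry", "sad"]),
    "corpus_b": ([[0.4], [0.5]], ["sad", "sad"]),
}

matrix = cross_dataset_matrix(toy, fit_majority, predict_majority)
for train_name, row in matrix.items():
    print(train_name, row)
```

In this layout the off-diagonal entries measure cross-dataset transfer, so comparing them with the diagonal gives a rough view of how much performance is lost when the dataset type changes. Grouping rows and columns by one of the design choices listed above (for example, acted versus improvised speech, or spoken language) would be one way to relate that loss to a specific factor.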