"Multimodal learning is a subfield of machine learning concerned with jointly learning from different types of signals produced by the same underlying phenomenon. A typical example is Multimodal Language, where it is crucial to combine three types of information: (i) the audio waveform of speech, (ii) the speech content (text), and (iii) visual cues (e.g., gestures). Multimodal Sentiment Analysis is a core task in this area, whose goal is to infer the sentiment of a multimodal language signal, for instance, positive or negative sentiment. Despite their broad real-world applicability, such systems remain largely confined to research. This is mostly due to (i) high model complexity (in terms of the number of parameters), (ii) the absence of end-to-end systems that operate on raw input signals, since most systems use pre-extracted and pre-aligned feature vectors, and (iii) poor performance under missing or noisy modalities.
The goal of this thesis is to conduct an empirical study of current models from the literature, evaluating their performance and complexity in an end-to-end setting.
For an introduction to Multimodal Learning, see https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8269806"
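To make the task concrete, the following is a minimal illustrative sketch (not taken from the thesis) of early fusion for multimodal sentiment analysis: per-modality feature vectors are concatenated into one joint vector and scored by a linear layer. The feature dimensions and the `fuse` helper are hypothetical choices for illustration only; real systems learn the weights and typically use far more elaborate fusion mechanisms.

```python
# Illustrative sketch (hypothetical, not the thesis' method): early fusion
# of audio, text, and visual feature vectors by concatenation, followed by
# a linear layer producing a scalar sentiment score.
import numpy as np

rng = np.random.default_rng(0)

def fuse(audio, text, visual):
    """Concatenate per-modality feature vectors into one joint vector."""
    return np.concatenate([audio, text, visual])

# Hypothetical, pre-extracted feature dimensions for the three modalities.
audio = rng.standard_normal(74)    # e.g. acoustic features
text = rng.standard_normal(300)    # e.g. pooled word embeddings
visual = rng.standard_normal(35)   # e.g. facial-gesture features

joint = fuse(audio, text, visual)
assert joint.shape == (74 + 300 + 35,)

# A (randomly initialized, untrained) linear layer maps the joint vector
# to a scalar score; its sign is read as positive vs. negative sentiment.
w = rng.standard_normal(joint.shape[0])
score = float(w @ joint)
print("positive" if score > 0 else "negative")
```

Note that this sketch sidesteps exactly the limitations the abstract lists: it assumes pre-extracted, already-aligned feature vectors rather than raw signals, and it has no mechanism for handling a missing or noisy modality.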