The goal of this thesis is to train and evaluate deep learning models that identify visual attributes of artworks and map them to sounds that are relevant in terms of content, style, mood, and other contextual characteristics. To this end, we will consider existing datasets and also build new ones, focusing in particular on annotating explicit correlations between visual and auditory characteristics. These datasets will enable us to train models that learn a joint multimodal embedding space for the visual and auditory information associated with artworks.