Solos: A Dataset for Audio-Visual Music Analysis - Abstract and Intro

8 Jun 2024


(1) Juan F. Montesinos, Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain;

(2) Olga Slizovskaia, Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain;

(3) Gloria Haro, Department of Information and Communications Technologies, Universitat Pompeu Fabra, Barcelona, Spain.


In this paper, we present a new dataset of music performance videos which can be used for training machine learning methods for multiple tasks such as audio-visual blind source separation and localization, cross-modal correspondences, cross-modal generation and, in general, any audio-visual self-supervised task. These videos, gathered from YouTube, consist of solo musical performances of 13 different instruments. Compared to previously proposed audio-visual datasets, Solos is cleaner, since a large share of its recordings are auditions and manually checked recordings, ensuring that there is no background noise and that no effects were added in video post-processing. Besides, it is, to the best of our knowledge, the only dataset that contains the whole set of instruments present in the URMP [1] dataset, a high-quality dataset of 44 audio-visual recordings of multi-instrument classical music pieces with individual audio tracks. URMP was intended to be used for source separation; thus, we evaluate the performance on the URMP dataset of two different source-separation models trained on Solos. The dataset is publicly available at

Index Terms—audio-visual, dataset, multimodal, music


There is a growing interest in multimodal techniques for solving Music Information Retrieval (MIR) problems. Music performances are highly multimodal, and the modalities involved are highly correlated: sounds are produced by the motion of the performer and, in chamber music performances, the scores constitute an additional encoding that may also be leveraged for the automatic analysis of music [2].

Visual inspection of the scene, in turn, can reveal the number of sound sources, their type, their spatio-temporal location and also their motion, which naturally relates to the emitted sound. Besides, it is possible to carry out self-supervised tasks in which one modality supervises the other. This gives rise to another research field, cross-modal correspondence (CMC).

Pioneering works exist for both problems, blind source separation (BSS) and CMC: [11], [12] use audio-visual data for sound localization, and [13], [14], [15] for speech separation. In the context of music, visual information has also proven to help model-based methods both in source separation [16], [17] and localization [2]. With the flourishing of deep learning techniques, many recent works exploit both audio and video content to perform music source separation [18]–[20], source association [21], localization [22] or both [23]. Some CMC works explore features generated from synchronization [24], [25] and show that these features can be reused for source separation. These works use networks that have been trained in a self-supervised way, using pairs of corresponding/non-corresponding audio-visual signals for localization purposes [22] or the mix-and-separate approach for source separation [18]–[20], [23].

Although deep learning has made it possible to solve classical problems in a different way, it has also contributed to creating new research fields such as cross-modal generation, in which the main aim is to generate video from audio [26], [27] or vice versa [28]. More recent works on human motion use skeletons as an intermediate representation of the body that can be further converted into video [29], [30], which shows the potential of skeletons.

The main contribution of this paper is Solos, a new dataset of musical performance recordings of soloists that can be used to train deep neural networks for any of the aforementioned fields. Unlike the similar dataset of musical instruments presented in [23] and its extended version [31], our dataset contains the same types of chamber orchestra instruments as the URMP dataset. Solos is a dataset of 755 real-world recordings gathered from YouTube which provides several features missing in the aforementioned datasets: skeletons and high-quality timestamps.

Source localization is usually learned indirectly by networks, so providing a practical localization ground truth is not straightforward. Nevertheless, networks often point to the player's hands as if they were the sound source. We expect that hand localization can provide additional cues to improve audio-visual BSS, or can be used as ground-truth source localization. To show the benefits of using Solos, we trained some popular BSS architectures and compared their results.
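To illustrate why recordings of soloists are useful for the mix-and-separate approach mentioned above, the following is a minimal sketch of how a single training sample could be built: isolated solo signals are summed into an artificial mixture, and the known solos provide the separation targets for free. This is not the pipeline of any of the cited works. The function names, the STFT parameters, the synthetic sine "solos" and the choice of a ratio mask as target are all assumptions made for illustration.

```python
import numpy as np


def stft_magnitude(signal, frame_len=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform.

    frame_len and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Result has shape (n_frames, frame_len // 2 + 1).
    return np.abs(np.fft.rfft(frames, axis=1))


def mix_and_separate_sample(solos, eps=1e-8):
    """Build one (hypothetical) training sample from isolated solo signals.

    Returns the magnitude spectrogram of the artificial mixture (network
    input) and one ratio mask per source (network targets).
    """
    source_mags = [stft_magnitude(s) for s in solos]
    # The mixture is simply the sum of the solo waveforms.
    mixture_mag = stft_magnitude(np.sum(solos, axis=0))
    # Ratio mask: share of each source in the summed source magnitudes.
    total = np.sum(source_mags, axis=0) + eps
    masks = [mag / total for mag in source_mags]
    return mixture_mag, masks


# Toy stand-ins for two solo recordings: one second of two pure tones.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
solo_a = np.sin(2 * np.pi * 440.0 * t)
solo_b = np.sin(2 * np.pi * 1200.0 * t)

mixture, masks = mix_and_separate_sample([solo_a, solo_b])
print(mixture.shape, masks[0].shape, masks[1].shape)
```

Because the targets are derived from the solos themselves, no manual annotation is needed; in an audio-visual setting, the video frames of each solo would additionally condition the network on which source to extract.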

This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.