Make That Sound More Metallic:
Towards a Perceptually Relevant Control of the Timbre of Synthesizer Sounds Using Variational Autoencoder

Fanny ROCHE *, Thomas HUEBER ¤, Maëva GARNIER ¤, Samuel LIMIER * and Laurent GIRIN ¤

This web page presents additional material to the article "Make that sound more metallic: Towards a perceptually relevant control of the timbre of synthesizer sounds using variational autoencoder", submitted to the Transactions of the International Society for Music Information Retrieval (TISMIR).


  In this article, we propose a new method of sound transformation based on control parameters that are intuitive and relevant for musicians. This method uses a variational autoencoder (VAE) model that is first trained in an unsupervised manner on a large dataset of synthesizer sounds. Then, a perceptual regularization term is added to the loss function to be optimized, and a supervised fine-tuning of the model is carried out using a small subset of perceptually labeled sounds. The labels were obtained from a perceptual test of Verbal Attribute Magnitude Estimation in which listeners rated this training sound dataset along eight perceptual dimensions (French equivalents of metallic, warm, breathy, vibrant, percussive, resonant, evolving, aggressive). These dimensions were identified as relevant for the description of synthesizer sounds in a first Free Verbalization test. The resulting VAE model was evaluated by objective reconstruction measures and a perceptual test. Both showed that the model was able, to a certain extent, to capture the acoustic properties of most of the perceptual dimensions and to transform sound timbre along at least two of them (aggressive and vibrating) in a perceptually relevant manner. Moreover, it was able to generalize to unseen samples even though a small set of labeled sounds was used.

Keywords: Synthesizer sounds, timbre perception and verbal description, variational autoencoders, machine learning, audio synthesis.


In the article, perceptual score vectors are given for a few samples that are each representative of one particular dimension. The corresponding audio material is presented here, together with some other examples of perceptual score vectors.

Perceptual score vectors (one column per audio sample; each sample is representative of the dimension that names its column)

Perceptual dimension       Métallique   Agressif   Chaud    Qui vibre   Soufflé
Métallique (metallic)       0.733        0.672     -0.258    0.178       0.009
Chaud (warm)               -0.849       -0.901      0.805   -0.090       0.194
Soufflé (breathy)          -0.870       -0.559     -0.543   -0.482       0.644
Qui vibre (vibrant)        -0.044        0.874      0.130    0.871       0.477
Percussif (percussive)      0.753        0.065     -0.647   -0.048      -0.210
Qui résonne (resonant)      0.225       -0.504     -0.869    0.426       0.470
Qui évolue (evolving)       0.229        0.309     -0.570   -0.314      -0.011
Agressif (aggressive)       0.097        0.782     -0.667   -0.325      -0.548


In the article, we compared the different models using two objective measures (RMSE and PEMO-Q). Some complementary audio examples of reconstructions are presented here. These samples are reconstructed in an analysis-resynthesis process, i.e. the sequence of spectrum frames is passed through the encoder and the decoder without any intermediate modification. The audio samples are then synthesized with an inverse STFT using the original phase spectrogram.
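The analysis-resynthesis chain described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the 1024-point FFT (chosen because it yields 513 frequency bins, matching the model input size), the hop of 256 samples, the Hann window and the `model` callable (standing in for the encoder-decoder pass) are all assumptions.

```python
import numpy as np

N_FFT = 1024  # assumed FFT size: gives 1024 // 2 + 1 = 513 frequency bins
HOP = 256     # assumed hop size (75% overlap)
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window (assumed)


def stft(x):
    """Complex spectrogram of shape (n_frames, 513)."""
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WINDOW
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec):
    """Inverse STFT with weighted overlap-add."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    length = N_FFT + HOP * (len(spec) - 1)
    out, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    # Floor the normalization so that the poorly covered edges fade out
    # instead of being amplified.
    return out / np.maximum(norm, 0.1 * norm.max())


def analysis_resynthesis(x, model):
    """Pass magnitude frames through `model`, then reuse the original phase."""
    spec = stft(x)
    magnitude, phase = np.abs(spec), np.angle(spec)
    magnitude_hat = model(magnitude)  # hypothetical encoder-decoder pass
    return istft(magnitude_hat * np.exp(1j * phase))
```

With an identity `model`, the output matches the input away from the signal edges, which makes this a convenient sanity check before plugging in a trained model.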

For this experiment, we varied two settings of the perceptually-regularized models:
- the value of the α coefficient,
- the number of iterations of the 2-step learning procedure.
The examples presented below are reconstructed using the following models with an encoding dimension of 32:

Variation of the α coefficient while keeping only 1 iteration of the 2-step learning procedure:
- classic VAE: [513, 128, 32, 128, 513] architecture, (tanh, lin) activation functions and β = 1 × 10⁻⁶,
- perceptually-regularized (with same architecture and β) and α = 0.01,
- perceptually-regularized (with same architecture and β) and α = 0.1,
- perceptually-regularized (with same architecture and β) and α = 1.
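To make the roles of the architecture, β and α concrete, here is a rough NumPy sketch of a forward pass and loss for such a model. Several points are assumptions rather than facts stated on this page: (tanh, lin) is read as tanh on the hidden layers and linear latent and output layers, the weights are random instead of trained, and the perceptual term is a placeholder (squared error between the first latent means and the perceptual scores). The actual regularization term is defined in the article.

```python
import numpy as np

rng = np.random.default_rng(0)
BETA = 1e-6  # weight of the KL term, from the text


def dense(n_in, n_out):
    """Randomly initialized affine layer (untrained, for illustration only)."""
    return rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out)


# [513, 128, 32, 128, 513] architecture from the text.
params = {
    "enc_hidden": dense(513, 128),
    "enc_mean": dense(128, 32),
    "enc_logvar": dense(128, 32),
    "dec_hidden": dense(32, 128),
    "dec_out": dense(128, 513),
}


def affine(x, layer):
    weights, bias = layer
    return x @ weights + bias


def forward(x):
    """Encode, sample with the reparameterization trick, decode."""
    hidden = np.tanh(affine(x, params["enc_hidden"]))
    mean = affine(hidden, params["enc_mean"])      # linear (assumed)
    logvar = affine(hidden, params["enc_logvar"])  # linear (assumed)
    z = mean + np.exp(0.5 * logvar) * rng.standard_normal(mean.shape)
    hidden_dec = np.tanh(affine(z, params["dec_hidden"]))
    return affine(hidden_dec, params["dec_out"]), mean, logvar


def loss(x, x_hat, mean, logvar, alpha=0.0, scores=None):
    """Reconstruction + BETA * KL (+ alpha * placeholder perceptual term)."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    kl = np.mean(-0.5 * np.sum(1 + logvar - mean ** 2 - np.exp(logvar), axis=1))
    total = recon + BETA * kl
    if alpha > 0 and scores is not None:
        # ASSUMPTION: placeholder penalty tying the first latent means to the
        # perceptual scores; not the regularization term used in the article.
        n_dims = scores.shape[1]
        total += alpha * np.mean(np.sum((mean[:, :n_dims] - scores) ** 2, axis=1))
    return total
```

Setting `alpha=0` recovers the classic VAE objective, which corresponds to the "Classic VAE (α = 0)" row of the examples.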

Example 1 Example 2 Example 3 Example 4 Example 5

Variation of α
Classic VAE (α = 0)
α = 0.01
α = 0.1
α = 1


As explained in the article, here are some examples of sound transformations along the 5 selected perceptual dimensions, i.e. those not related to temporal characteristics (Agressif, Chaud, Métallique, Qui vibre and Soufflé).
To perform the transformation on the samples, we first encoded the spectrogram of the target sound using our perceptually-regularized VAE, then applied an offset to the latent trajectory corresponding to the desired dimension, decoded the new latent trajectory and finally reconstructed the time signal by applying the Griffin and Lim algorithm followed by an inverse short-time Fourier transform with overlap-add.
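The transformation steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: `encode` and `decode` are hypothetical callables standing in for the trained VAE, the decoder is assumed to output a magnitude spectrogram, and the STFT settings (1024-point FFT, hop of 256, Hann window), the number of iterations and the random phase initialization are arbitrary choices.

```python
import numpy as np

N_FFT, HOP = 1024, 256               # assumed settings (513 frequency bins)
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann window (assumed)


def stft(x):
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * WINDOW
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec):
    """Inverse STFT with weighted overlap-add."""
    frames = np.fft.irfft(spec, n=N_FFT, axis=1) * WINDOW
    length = N_FFT + HOP * (len(spec) - 1)
    out, norm = np.zeros(length), np.zeros(length)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
        norm[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    # Floor the normalization so the edges fade out instead of blowing up.
    return out / np.maximum(norm, 0.1 * norm.max())


def griffin_lim(magnitude, n_iter=50, seed=0):
    """Estimate a phase for `magnitude` by alternating ISTFT and STFT."""
    rng = np.random.default_rng(seed)
    phase = rng.uniform(-np.pi, np.pi, magnitude.shape)
    for _ in range(n_iter):
        signal = istft(magnitude * np.exp(1j * phase))
        phase = np.angle(stft(signal))
    return istft(magnitude * np.exp(1j * phase))


def shift_latent(z, dim, offset):
    """Add a constant offset to one coordinate of the latent trajectory."""
    z_shifted = z.copy()
    z_shifted[:, dim] += offset
    return z_shifted


def transform(magnitude, encode, decode, dim, offset, n_iter=50):
    """Encode, offset one latent dimension, decode, then rebuild the signal."""
    z = encode(magnitude)  # latent trajectory, shape (n_frames, latent_dim)
    magnitude_new = np.maximum(decode(shift_latent(z, dim, offset)), 0.0)
    return griffin_lim(magnitude_new, n_iter)
```

Unlike the analysis-resynthesis examples, the original phase is no longer consistent with the modified magnitudes, which is why a phase estimation step is needed here.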

Sweep on the different perceptual dimensions

This section presents examples of samples obtained for the 5 dimensions using different offset values when transforming the target sounds:
- Low offset value (first offset value for which there is a perceptual difference between the encoded-decoded sample and the modified one, as explained in the article),
- 50% of the threshold value (the threshold being the offset value for which the output is constant),
- 75% of the threshold value.
For each dimension, we selected one representative sample, one unrepresentative sample and one sample from the test set.
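As an illustration of how the three offset values relate, the sketch below searches for the threshold numerically and derives the 50% and 75% values from it. The search procedure, the tolerance `tol`, the candidate grid and the tanh stand-in decoder are assumptions made for this example; the low offset is a perceptual quantity, so it is passed in by hand rather than computed.

```python
import numpy as np


def saturation_threshold(decode, z, dim, candidate_offsets, tol=1e-3):
    """Return the offset from which the decoded output stops changing.

    Candidates are tried in increasing order; the search stops when two
    consecutive decoded outputs differ by less than `tol` everywhere.
    """
    previous_output, previous_offset = None, None
    for offset in candidate_offsets:
        z_shifted = z.copy()
        z_shifted[:, dim] += offset
        output = decode(z_shifted)
        if (previous_output is not None
                and np.max(np.abs(output - previous_output)) < tol):
            return previous_offset
        previous_output, previous_offset = output, offset
    return candidate_offsets[-1]  # no saturation found on this grid


def sweep_offsets(threshold, low_offset):
    """Offsets used for the three transformed versions of a sample."""
    return {"low": low_offset,
            "50%": 0.5 * threshold,
            "75%": 0.75 * threshold}


# Toy usage with a saturating stand-in decoder (tanh), not the trained model.
z = np.zeros((10, 64))  # latent trajectory: 10 frames, 64 dimensions
threshold = saturation_threshold(np.tanh, z, dim=0,
                                 candidate_offsets=np.arange(0.0, 20.0, 0.5))
offsets = sweep_offsets(threshold, low_offset=0.5)  # low offset chosen by ear
```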

All the results presented here are obtained with a perceptually-regularized VAE with a [513, 128, 64, 128, 513] architecture, a (tanh, lin) pair of activation functions, β = 1 × 10⁻⁶ and α = 0.1.

AGRESSIF Original sample Low offset 50% threshold 75% threshold
Representative sample
Unrepresentative sample
Test sample

CHAUD Original sample Low offset 50% threshold 75% threshold
Representative sample
Unrepresentative sample
Test sample

METALLIQUE Original sample Low offset 50% threshold 75% threshold
Representative sample
Unrepresentative sample
Test sample

QUI VIBRE Original sample Low offset 50% threshold 75% threshold
Representative sample
Unrepresentative sample
Test sample

SOUFFLE Original sample Low offset 50% threshold 75% threshold
Representative sample
Unrepresentative sample
Test sample