
AI researchers from Sony show GANstrument, a neural synthesizer that transforms arbitrary input sounds into instrument sounds.

Generative AI systems such as DALL-E 2, Midjourney, and Stable Diffusion are currently shaking up the visual arts. These text-to-image systems produce impressive results even from simple text prompts.

Comparably powerful systems do not yet exist in music. But here, too, recent projects such as the generative text-to-music model of the US start-up Mubert show where the journey could lead.

Apart from end-to-end music synthesis, there is a second focus in the research field: the synthesis of individual notes that are then sequenced in a symbolic format such as MIDI (Musical Instrument Digital Interface). This decouples note control (via MIDI) from timbre, making the process compatible with typical production workflows in the music industry.
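In such note-based systems, pitch is typically specified as a MIDI note number, and the mapping from note number to frequency follows the equal-tempered scale. A minimal sketch (assuming the "Pitch 48/55/60" labels in the demo clips below are MIDI note numbers, which the article does not state explicitly):

```python
def midi_to_hz(note: int) -> float:
    # A4 (440 Hz) is MIDI note 69; each semitone multiplies
    # frequency by 2**(1/12) in equal temperament.
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

for n in (48, 55, 60):           # pitches used in the demo clips
    print(n, round(midi_to_hz(n), 2))
# 48 -> 130.81 Hz (C3), 55 -> 196.0 Hz (G3), 60 -> 261.63 Hz (middle C)
```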


In a new paper, AI researchers at Sony are now demonstrating GANstrument, a neural synthesizer for instrument sounds.

GANstrument: Sony shows GAN-based neural synthesizer

Currently, realistic instrument sounds are synthesized with samplers that play back recorded sounds. Although any sound material can be used, it is difficult to synthesize a completely new timbre or to combine multiple sounds in an intelligent way, Sony says. Generative AI models for audio synthesis, however, have shown that AI can create and mix a wide variety of timbres.

The research team therefore aims to develop a neural synthesizer that combines the flexibility of classic samplers with the generative power of neural networks. Such a tool would let users freely control the timbre based on existing sound material.

For its neural synthesizer, Sony uses a GAN (Generative Adversarial Network) trained on waveforms transformed into mel spectrograms. Instead of the class conditioning usually used in GAN training, the team relies on so-called instance conditioning.
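A mel spectrogram is a short-time Fourier spectrogram projected onto a perceptually spaced mel filterbank. The following self-contained numpy sketch illustrates that preprocessing step; the FFT size, hop length, and mel band count are illustrative placeholders, not the parameters Sony used:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # triangular filters spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def mel_spectrogram(wave, sr=16000, n_fft=1024, hop=256, n_mels=80):
    # frame the signal, window it, take the FFT power, apply the filterbank
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return mel_filterbank(sr, n_fft, n_mels) @ power.T  # (n_mels, n_frames)

# one second of a 440 Hz sine as a toy stand-in for an instrument note
sr = 16000
t = np.arange(sr) / sr
spec = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(spec.shape)
```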

Class conditioning sorts the data into a fixed number of discrete distributions with no overlap, one per class label, whereas instance conditioning models the data as many overlapping local distributions, each centered on a training example.


GANstrument can turn a rooster into a cello piece

Along with other improvements, such as a pitch-invariant feature extractor, GANstrument achieves better and more diverse synthesized sounds as well as generalization to varied sound inputs, the team writes. After training, GANstrument can transform flute sounds into brass sounds, for example, or organ sounds into guitar sounds.





The AI system can also smoothly mix different instruments and thus merge two input instruments into one track, for example.
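Conceptually, this kind of timbre mixing can be pictured as interpolating between the feature embeddings of two inputs before generation. A minimal sketch, assuming linear interpolation in feature space (the embeddings here are made-up stand-ins, not GANstrument's actual encoder outputs):

```python
import numpy as np

feat_a = np.array([1.0, 0.0, 0.5])   # stand-in embedding of a mallet sound
feat_b = np.array([0.0, 1.0, 0.5])   # stand-in embedding of a reed sound

def interpolate(a, b, alpha):
    # alpha = 0 keeps timbre A, alpha = 1 keeps timbre B
    return (1.0 - alpha) * a + alpha * b

mix = interpolate(feat_a, feat_b, 0.5)  # halfway between the two timbres
print(mix)
```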

[Audio examples: Melody (Mallet to Reed) — Input 1, Input 2, and an interpolation from Input 1 to 2]

The system also works with input sounds it has never heard before. It can transform them into known instrument sounds or change the pitch of the input. GANstrument can therefore also convert the crow of a rooster or a cat's meow into sounds of different pitches.

[Audio examples: a rooster's crow transformed to pitches 48, 55, and 60]

According to Sony, GANstrument generates a sound in 1.62 seconds on an Intel Core i7-7800X CPU.

Our novel neural synthesizer, GANStrument, generates pitched instrument sounds reflecting one-shot input timbre within an interactive time. It incorporates two key features: 1) instance conditioning, resulting in better generation quality and generalization ability to various inputs and 2) pitch-invariant feature extraction based on adversarial training, resulting in significantly improved pitch accuracy and timbre consistency.


The authors believe that GANstrument can produce novel instrument sounds and make desired timbres freely explorable by using a variety of sound materials. Further examples can be found on the GANstrument project page.

  • AI researchers from Sony demonstrate GANstrument, a neural synthesizer for instrument sound synthesis.
  • From a single input sound, GANstrument can generate pitched notes that reflect the input's timbre in under two seconds, and it can seamlessly interpolate between multiple sounds.
  • Unlike the end-to-end music synthesis of generative AI systems, GANstrument allows independent control of MIDI and timbre, which is compatible with typical music industry production workflows.
Max is managing editor at THE DECODER. As a trained philosopher, he deals with consciousness, AI, and the question of whether machines can really think or just pretend to.