
Sony shows neural synthesizer GANstrument

Maximilian Schreiner
Image: A rooster plays a cello while being watched by chicks, digital art (DALL-E 2 prompted by THE DECODER)

AI researchers from Sony show GANstrument, a neural synthesizer that transforms arbitrary input sounds into instrument sounds.

Generative AI systems such as DALL-E 2, Midjourney, or Stable Diffusion are currently shaking up the visual arts. These text-to-image systems produce impressive results even from simple text prompts.

Comparably powerful systems do not yet exist for music. But here, too, recent projects such as the generative text-to-music model from the US start-up Mubert show where things could be headed.

Apart from end-to-end music synthesis, the research field has a second focus: the synthesis of individual notes that are then played back from a symbolic format such as MIDI (Musical Instrument Digital Interface). This allows the notes (via MIDI) and the timbre to be controlled independently, which makes the approach compatible with production workflows in the music industry.

In a new paper, AI researchers at Sony are now demonstrating GANstrument, a neural synthesizer for instrument sounds.

GANstrument: Sony shows GAN-based neural synthesizer

Currently, realistic instrument sounds are synthesized with samplers that use recorded sounds. Although any sound material can be used, it is difficult to synthesize a completely new timbre or combine multiple sounds in an intelligent way, Sony said. Generative AI models for audio synthesis, however, have shown that AI can create and mix a variety of timbres.

The research team, therefore, aims to develop a neural synthesizer that combines the flexibility of classic samplers with the generative power of neural networks. With such a tool, users would be able to freely control the timbre based on existing sound material.

For its neural synthesizer, Sony uses a GAN (Generative Adversarial Network) trained on waveforms that have been transformed into mel spectrograms. Instead of the class conditioning commonly used in GAN training, the team relies on so-called instance conditioning.

Class conditioning sorts the data into different distributions with no overlap, whereas instance conditioning sorts the data into many overlapping local distributions.
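The difference between the two kinds of conditioning can be illustrated with a small toy sketch. It is not Sony's implementation: the sound labels, the 2-D feature values, and the choice of k-nearest-neighbor neighborhoods are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical toy data: (label, 2-D timbre feature). Not taken from the paper.
SOUNDS = [
    ("flute", (0.0, 0.0)),
    ("flute", (0.2, 0.1)),
    ("brass", (1.0, 1.0)),
    ("brass", (0.9, 1.2)),
    ("organ", (0.5, 0.6)),
]


def class_groups(sounds):
    """Class conditioning (sketch): every sound falls into exactly one label group."""
    groups = defaultdict(set)
    for index, (label, _feature) in enumerate(sounds):
        groups[label].add(index)
    return dict(groups)


def instance_neighborhoods(sounds, k=3):
    """Instance conditioning (sketch): one neighborhood per sound, made of its
    k nearest sounds in feature space, the sound itself included.
    Using k nearest neighbors is an assumption of this example."""
    neighborhoods = []
    for _label, feature in sounds:
        ranked = sorted(
            range(len(sounds)),
            key=lambda j: math.dist(feature, sounds[j][1]),
        )
        neighborhoods.append(set(ranked[:k]))
    return neighborhoods


groups = class_groups(SOUNDS)
neighborhoods = instance_neighborhoods(SOUNDS, k=3)
```

In this toy data the label groups never share a sound, while the neighborhoods of the first flute and the first brass sound both contain the organ sound, which is the kind of overlap the paragraph above describes.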

GANstrument can turn a rooster into a cello piece

Along with other improvements, such as a feature extractor that is invariant to pitch, GANstrument thus achieves better and more diverse synthesized sounds and generalizes to different sound inputs, the team writes. After training, GANstrument can transform, for example, flute sounds into brass sounds or organ sounds into guitar sounds.

Flute

https://the-decoder.de/wp-content/uploads/2022/11/query1_audio-1.wav?_=1

Brass

https://the-decoder.de/wp-content/uploads/2022/11/query2_audio-1.wav?_=2

Interpolation (Input 1 to 2)

https://the-decoder.de/wp-content/uploads/2022/11/FluteBrass.mp3?_=3

The AI system can also smoothly mix different instruments and thus merge two input instruments into one track, for example.

Melody (Mallet to Reed)

Input 1

https://the-decoder.de/wp-content/uploads/2022/11/query1_audio.wav?_=4

Input 2

https://the-decoder.de/wp-content/uploads/2022/11/query2_audio.wav?_=5

Interpolation (Input 1 to 2)

https://the-decoder.de/wp-content/uploads/2022/11/MalletReed.wav?_=6
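One simple way to picture the interpolation examples above is as a gradual blend between the timbre features of two input sounds. The article does not say how GANstrument performs the blend, so the following is only a minimal sketch assuming linear interpolation of feature vectors; the vectors and function names are hypothetical.

```python
def interpolate(a, b, t):
    """Linear blend of two equal-length feature vectors.
    t = 0 returns a, t = 1 returns b."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def interpolation_path(a, b, steps):
    """Return `steps` evenly spaced blends from a to b, endpoints included."""
    if steps < 2:
        raise ValueError("need at least two steps")
    return [interpolate(a, b, i / (steps - 1)) for i in range(steps)]


# Hypothetical 3-D timbre features for two input sounds (illustrative values).
mallet = [1.0, 0.0, 0.5]
reed = [0.0, 1.0, 0.5]
path = interpolation_path(mallet, reed, steps=5)
```

Each vector in `path` would then condition the generation of one note, so that a melody drifts from the first timbre to the second.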

The system also works with input sounds it has never heard before. It can transform them into known instrument sounds or change the pitch of the input. GANstrument can therefore also convert the crow of a rooster or a cat's meow into sounds of different pitches.

Rooster Chicken

https://the-decoder.de/wp-content/uploads/2022/11/query_audio.wav?_=7

Pitch 48

https://the-decoder.de/wp-content/uploads/2022/11/fake_audio_pitch_27.wav?_=8

Pitch 55

https://the-decoder.de/wp-content/uploads/2022/11/fake_audio_pitch_34.wav?_=9

Pitch 60

https://the-decoder.de/wp-content/uploads/2022/11/fake_audio_pitch_39.wav?_=10
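If the pitch labels above are MIDI note numbers, which the article does not state explicitly, they can be translated into note names and frequencies with the standard equal-temperament formula. The sketch below assumes A4 = 440 Hz and the convention that MIDI note 60 is C4.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_hz(note, a4_hz=440.0):
    """Frequency of a MIDI note in twelve-tone equal temperament.
    MIDI note 69 is A4; each semitone multiplies the frequency by 2**(1/12)."""
    return a4_hz * 2.0 ** ((note - 69) / 12.0)


def midi_to_name(note):
    """Note name with octave, using the convention that MIDI 60 is C4."""
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"


# Pitch labels from the rooster example, assumed to be MIDI note numbers.
for pitch in (48, 55, 60):
    print(pitch, midi_to_name(pitch), round(midi_to_hz(pitch), 2))
```

Under these assumptions, the three rooster examples correspond to C3, G3, and C4.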

According to Sony, GANstrument generates a sound in 1.62 seconds on an Intel Core i7-7800X CPU.

Our novel neural synthesizer, GANStrument, generates pitched instrument sounds reflecting one-shot input timbre within an interactive time. It incorporates two key features: 1) instance conditioning, resulting in better generation quality and generalization ability to various inputs and 2) pitch-invariant feature extraction based on adversarial training, resulting in significantly improved pitch accuracy and timbre consistency.

Sony

The authors believe that GANstrument can produce novel instrument sounds and make desired timbres freely explorable by using a variety of sound materials. Further examples can be found on the GANstrument project page.
