Researchers at Google have developed an AI that can compose music based on text inputs, much like how ChatGPT can turn a text prompt into a story and DALL-E can generate images from written prompts.
The AI model can quickly transform a user’s written words into music lasting several minutes, or convert hummed melodies into other instruments. This comes just a month after the tech giant issued a “code red” alert over the rise of ChatGPT.
The company has published its findings on GitHub, naming the AI model MusicLM and providing a number of samples created with the model’s aid. It has also released MusicCaps, a dataset of 5,500 music-text pairs with rich text descriptions written by human experts.
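For a sense of how a dataset like this is typically structured, here is a minimal sketch of browsing a local copy with pandas; the file name and column names are assumptions for illustration, not the official schema of Google’s release.

```python
# Hypothetical sketch only: the file name and column names below are assumed,
# not taken from the official MusicCaps release.
import pandas as pd

caps = pd.read_csv("musiccaps.csv")  # assumed local copy of the dataset

# Each row is expected to pair a short audio clip with an expert-written caption.
for _, row in caps.head(3).iterrows():
    print(row["clip_id"], "->", row["caption"])
```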
I was very impressed by the sample tracks. Convincing-sounding songs can be generated from as little as a single word, such as “melodic techno,” in clips as short as 30 seconds, or stretched to as long as five minutes from a paragraph outlining the genre, mood, and instruments to be used. Perhaps the most interesting demonstration is the “story mode,” in which the model is essentially given a script and morphs between prompts.
Take, for instance, the following prompt:
The above sample was described as one that “induces the experience of being lost in space,” and I must say that I agree with that assessment.
This next example was generated from a description that begins with the sentence “The main soundtrack of an arcade game.” It makes sense, doesn’t it?
The Google team demonstrates that their system can build on preexisting melodies, whether they were hummed, sung, whistled, or played on an instrument. In addition, MusicLM is capable of transforming a series of written descriptions (such as “time to meditate,” “time to wake up,” “time to run,” and “time to give 100%”) into a musical “story” or narrative of up to several minutes in length, ideal for use as a film score.
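MusicLM has no public API, but as a rough illustration of what such a “story” request amounts to, the sketch below represents it as a timed schedule of captions; the structure and timings here are hypothetical, not taken from Google’s paper.

```python
# Hypothetical illustration only: MusicLM has no public API, so this simply
# sketches how a "story mode" prompt schedule could be represented as data.
from dataclasses import dataclass

@dataclass
class PromptSegment:
    caption: str    # text description for this section of the piece
    start_s: float  # when the section begins, in seconds
    end_s: float    # when the section ends, in seconds

story = [
    PromptSegment("time to meditate", 0, 45),
    PromptSegment("time to wake up", 45, 90),
    PromptSegment("time to run", 90, 135),
    PromptSegment("time to give 100%", 135, 180),
]

for seg in story:
    print(f"{seg.start_s:>3.0f}s-{seg.end_s:>3.0f}s: {seg.caption}")
```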
I can totally picture a human being sitting down and writing this, though I understand if others don’t share my opinion (I also listened to it on loop dozens of times while writing this article).
The demo site also features examples of the model’s output when asked to generate 10-second clips of instruments like the cello or maracas (the latter is one where the system does a relatively poor job), eight-second clips in a particular genre, music that would fit a prison escape, and even what a beginner piano player would sound like compared with an advanced one. The model also offers its interpretations of words and phrases like “futuristic club” and “accordion death metal.”
MusicLM can even mimic human singing, and while it sounds fairly accurate in terms of pitch and volume, there’s still something off about it: the vocals have a grainy or staticky quality to them.
The previous examples don’t make that quality quite so apparent, but I think the following one does.
In case you were wondering, that’s what happens when you tell it to make music that would be played in a gym.
While I can’t claim to have any insight into Google’s methodology, the company has released a research paper seen by Innovation Village. It goes into great detail, if you’re the type to appreciate charts like the following:
It’s no secret that AI has been used to create music for decades; some systems have even been given credit for composing pop songs, copying Bach better than a human could in the 1990s, and providing accompaniment for live performances.
One of the most recent approaches uses the AI image-generation engine Stable Diffusion to convert text prompts into spectrograms, which are then transformed into music.
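To give a sense of the final step in that spectrogram-based pipeline, here is a minimal sketch of turning a spectrogram image back into a waveform with the Griffin-Lim algorithm via librosa; the file names, the 8-bit dB scaling, and the STFT parameters are assumptions for illustration, not values taken from either project.

```python
# Minimal sketch: inverting a generated spectrogram image back to audio with
# Griffin-Lim. File names, the assumed dB scaling, and the STFT parameters
# are illustrative assumptions.
import numpy as np
import librosa
import soundfile as sf
from PIL import Image

# Load a grayscale spectrogram image; assume brighter pixels are louder bins
# and low frequencies sit at the bottom, so flip it so row 0 is the lowest bin.
img = np.array(Image.open("generated_spectrogram.png").convert("L"), dtype=np.float32)
magnitude_db = np.flipud(img) / 255.0 * 80.0 - 80.0   # map 0..255 to roughly -80..0 dB
magnitude = librosa.db_to_amplitude(magnitude_db)      # undo the assumed dB scaling

# Griffin-Lim iteratively estimates the missing phase so the magnitude
# spectrogram can be inverted back into a waveform.
audio = librosa.griffinlim(magnitude, n_iter=32, hop_length=256)

sf.write("reconstructed.wav", audio, samplerate=22050)
```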
The paper claims that MusicLM can outperform competing systems, based in part on its ability to take in audio and mimic the melody.
The final bit is arguably the most impressive demonstration the researchers have made. Here, you can listen to the input audio, in which a person hums or whistles a melody, and then hear how the model transforms that melody into different musical styles, such as an electronic synth lead, a string quartet, a guitar solo, and so on. Having listened to the examples, I can attest that it performs admirably.