MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies

In this paper, we present MusicLDM, a state-of-the-art text-to-music model that adapts the Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the HiFi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. We then leverage a beat-tracking model and propose two mixup strategies for data augmentation, beat-synchronous audio mixup and beat-synchronous latent mixup, to encourage the model to generate music that is more diverse while remaining faithful to the corresponding style.
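As a rough sketch of the idea (not the actual training code), beat-synchronous audio mixup aligns two training clips at their tracked downbeats and mixes them with a random ratio. The function name, beat-tracker interface, and mixing-ratio distribution below are illustrative assumptions:

```python
import numpy as np

def beat_sync_mixup(audio_a, beats_a, audio_b, beats_b, sr=16000, ratio=None):
    """Hedged sketch of beat-synchronous audio mixup.

    audio_a, audio_b : 1-D float waveforms at the same sampling rate.
    beats_a, beats_b : downbeat times in seconds from a beat tracker
                       (the exact tracker used by MusicLDM may differ).
    """
    if ratio is None:
        ratio = np.random.beta(5.0, 5.0)  # assumed mixing-ratio prior

    # Shift clip B so its first downbeat lines up with clip A's first downbeat.
    offset = int(round((beats_a[0] - beats_b[0]) * sr))
    b_aligned = np.roll(audio_b, offset)

    # Trim to a common length, mix, and peak-normalize.
    n = min(len(audio_a), len(b_aligned))
    mixed = ratio * audio_a[:n] + (1.0 - ratio) * b_aligned[:n]
    return mixed / (np.max(np.abs(mixed)) + 1e-8)
```

Beat-synchronous latent mixup follows the same beat alignment but mixes the clips' latent (VAE) representations instead of the raw waveforms.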

On this page, we present generation demos of MusicLDM and compare them with generations from Riffusion, MuBERT, and AudioLDM. MusicLDM was trained on the Audiostock dataset, which contains 10,000 text-music pairs with a total duration of 455.35 hours; the demos illustrate the generation quality achievable with a dataset of this magnitude. All generations are 10 seconds long at a 16 kHz sampling rate.

Please wait a few seconds for the webpage to load all generation samples.

Quick Demo Video


General Music & Electronic Dance Music (EDM)


Text Description | MusicLDM (ours) | Riffusion | MuBERT | AudioLDM
Western music, chill out, folk instrument R & B beat
Cute toy factory theme loop
Fashionable Pop Cute Electr
The royal road! kawaii future bass
Blessed harpsichord
Light rhythm techno
Futuristic drum and bass
Fashionable lo-fi hip hop
A tropical country resort style BGM characterized by ukulele
Feeling pounding and mysterious synth sound
Bingo music box
Fun, Lively, Pets, Information, Exciting
Dramatic EDM

String & Orchestra


Text Description | MusicLDM (ours) | Riffusion | MuBERT | AudioLDM
Royal Film Music Orchestra
Castle town music, a gentle orchestra
Elegant and gentle tunes of string quartet + harp
Orchestra foretelling the beginning of an adventure
Dreamy fantasy orchestra clip

Piano


Text Description | MusicLDM (ours) | Riffusion | MuBERT | AudioLDM
Pop OP/ED with acoustic guitar and piano
Japanese healing music of piano and shakuhachi
A fantastic piece of music with the deep sound of overlapping pianos
Quiet piano
Piano ballad remembering fresh youth

Guitar


Text Description | MusicLDM (ours) | Riffusion | MuBERT | AudioLDM
Bright and energetic guitars, synths, drums, etc
Refreshing and relaxed guitar sound
Gentle live acoustic guitar
Performance of guitar and keyboard harmonica, bright tune
Relaxing guitar pop BG
Acoustic guitar hiphop
Exciting guitar rock short jingle
Jazz-funk-taste guitar instrument

Saxophone


Text Description | MusicLDM (ours) | Riffusion | MuBERT | AudioLDM
Lyrical ballad sung by saxophone

MusicLDM Generation from Rilke's Poem


We set up a creative process involving MusicLDM, ChatGPT, and Rilke's poem "End of Autumn": we first discuss the poem with ChatGPT and let it generate a music description for each poem line. Then, we send these descriptions to MusicLDM to generate different pieces of music. Finally, we combine these pieces to obtain MusicLDM's generation from Rilke's poem "End of Autumn". This creative process suggests a potential application in which text-to-music generation systems establish a new paradigm of machine composition, leveraging the advances of diffusion models and large language models to benefit the computer music field.
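The pipeline can be sketched roughly as follows; `chatgpt_describe` and `musicldm_generate` are hypothetical placeholders standing in for the actual ChatGPT API call and MusicLDM inference interface:

```python
import numpy as np

def poem_to_music(poem_lines, chatgpt_describe, musicldm_generate):
    """Hedged sketch of the poem-to-music pipeline.

    chatgpt_describe(line) -> str    : hypothetical call asking ChatGPT for a
                                       music description of one poem line.
    musicldm_generate(text) -> array : hypothetical call returning a 10-second
                                       waveform generated by MusicLDM.
    """
    clips = []
    for line in poem_lines:
        description = chatgpt_describe(line)   # e.g. "melancholic strings, slow tempo"
        clips.append(musicldm_generate(description))
    # Concatenate the per-line clips into one continuous piece.
    return np.concatenate(clips)
```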

Similarity on Training Set


To verify whether the model copies audio tracks from the training set, we use the CLAP score to measure the similarity between each audio embedding in the training set and its most similar audio embedding among the generated pieces. This lets us check whether Beat-Synchronous Latent Mixup (BLM) prevents the model from copying training-set audio tracks during generation.
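As a sketch of this check, assuming CLAP audio embeddings have already been extracted and L2-normalized, the similarity reported for each training track is the maximum cosine similarity to any generated clip (function and variable names below are illustrative):

```python
import numpy as np

def max_clap_similarity(query_emb, reference_emb):
    """For each query CLAP embedding, return the cosine similarity
    to its nearest neighbor among the reference embeddings.

    query_emb     : (N, D) array, e.g. training-set audio embeddings.
    reference_emb : (M, D) array, e.g. embeddings of generated clips.
    Both are assumed to be L2-normalized, so dot products are cosine similarities.
    """
    sims = query_emb @ reference_emb.T   # (N, M) cosine similarities
    return sims.max(axis=1)              # nearest-neighbor similarity per query
```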

Music Track in Training Set | Most Similar Generation from MusicLDM-Original | Most Similar Generation from MusicLDM-BLM
Similarity: N/A | Similarity: 0.935 | Similarity: 0.654
Similarity: N/A | Similarity: 0.872 | Similarity: 0.709
Similarity: N/A | Similarity: 0.879 | Similarity: 0.565

Similarity on Test Set


Similarly, and in the reverse direction, we use the CLAP score to measure the similarity between each audio embedding of the generated pieces and its most similar audio embedding in the training set. This verification, analogous to recall, determines whether BLM can prevent the model from copying audio tracks in the training set.
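In code, this is the same nearest-neighbor computation as the sketch above with the two embedding sets swapped (again assuming precomputed, L2-normalized CLAP embeddings):

```python
# Hypothetical usage of the max_clap_similarity sketch above, roles reversed:
# for each generated clip, find its closest training track.
gen_to_train_sim = max_clap_similarity(gen_emb, train_emb)
```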

MusicLDM-Original's Generation | Most Similar Track in the Training Set | MusicLDM-BLM's Generation (from the same prompt) | Most Similar Track in the Training Set