In recent years, computer scientists have created a variety of high-performing machine learning tools that can generate text, images, videos, songs and other content. Most of these computational models are designed to create content based on text-based instructions provided by users.

Researchers at the Hong Kong University of Science and Technology recently introduced AudioX, a model that can generate high-quality audio and music tracks using text, video footage, images, music and audio recordings as inputs. Their model, introduced in a paper published on the arXiv preprint server, relies on a diffusion transformer, an advanced machine learning algorithm that leverages the so-called transformer architecture to generate content by progressively removing noise from a random signal, guided by the user-provided inputs.
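To make the idea of diffusion-based generation concrete, here is a minimal, hypothetical PyTorch sketch: a small network repeatedly predicts and subtracts noise from a random latent, conditioned on an embedding of the user's prompt. All names, shapes and the simplified update rule are illustrative assumptions for exposition only, not AudioX's actual architecture or code.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for a diffusion transformer (hypothetical): predicts the
    noise present in a latent at a given timestep, conditioned on an
    embedding of the prompt (text, video, image or audio)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + dim, 128),  # latent + timestep + condition
            nn.GELU(),
            nn.Linear(128, dim),
        )

    def forward(self, x, t, cond):
        t = t.expand(x.shape[0], 1)  # broadcast the scalar timestep over the batch
        return self.net(torch.cat([x, t, cond], dim=-1))

@torch.no_grad()
def sample(model, cond, steps=50, dim=64):
    """Start from pure noise and progressively denoise it into a latent
    that a decoder could then turn into an audio waveform."""
    x = torch.randn(cond.shape[0], dim)
    for i in reversed(range(steps)):
        t = torch.tensor([[i / steps]])
        eps = model(x, t, cond)   # predicted noise at this step
        x = x - eps / steps       # crude Euler-style update toward clean data
    return x

model = TinyDenoiser()
prompt_embedding = torch.randn(1, 64)  # placeholder for fused multi-modal conditioning
latent = sample(model, prompt_embedding)
print(latent.shape)  # torch.Size([1, 64])
```

In a real system the denoiser would be a transformer operating on audio latents and the conditioning embedding would be produced by modality-specific encoders, but the overall loop, iteratively refining noise under guidance, is the core mechanism the sketch illustrates.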

“Our research stems from a fundamental question in artificial intelligence: how can intelligent systems achieve unified cross-modal understanding and generation?” Wei Xue, the corresponding author of the paper, told Tech Xplore. “Human creation is a seamlessly integrated process, where information from different sensory channels is naturally fused by the brain. Traditional systems have often relied on specialized models, failing to capture and fuse these intrinsic connections between modalities.”
