In recent years, computer scientists have created a variety of high-performing machine learning tools that can generate text, images, videos, songs and other content. Most of these computational models are designed to create content based on text-based instructions provided by users.

Researchers at the Hong Kong University of Science and Technology recently introduced AudioX, a model that can generate high-quality audio and music tracks using text, video footage, images, music and audio recordings as inputs. Their model, introduced in a paper published on the arXiv preprint server, relies on a diffusion transformer, an advanced machine learning algorithm that leverages the so-called transformer architecture to generate content by progressively removing noise from a random signal, guided by the user-provided inputs.
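To make the idea of diffusion-based generation concrete, here is a minimal, hypothetical PyTorch sketch: a small network repeatedly predicts and subtracts noise from a random latent, conditioned on an embedding of the user's prompt. All names, shapes and the simplified update rule are illustrative assumptions for exposition only, not AudioX's actual architecture or code.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for a diffusion transformer (hypothetical): predicts the
    noise present in a latent at a given timestep, conditioned on an
    embedding of the prompt (text, video, image or audio)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + dim, 128),  # latent + timestep + condition
            nn.GELU(),
            nn.Linear(128, dim),
        )

    def forward(self, x, t, cond):
        t = t.expand(x.shape[0], 1)  # broadcast the scalar timestep over the batch
        return self.net(torch.cat([x, t, cond], dim=-1))

@torch.no_grad()
def sample(model, cond, steps=50, dim=64):
    """Start from pure noise and progressively denoise it into a latent
    that a decoder could then turn into an audio waveform."""
    x = torch.randn(cond.shape[0], dim)
    for i in reversed(range(steps)):
        t = torch.tensor([[i / steps]])
        eps = model(x, t, cond)   # predicted noise at this step
        x = x - eps / steps       # crude Euler-style update toward clean data
    return x

model = TinyDenoiser()
prompt_embedding = torch.randn(1, 64)  # placeholder for fused multi-modal conditioning
latent = sample(model, prompt_embedding)
print(latent.shape)  # torch.Size([1, 64])
```

In a real system the denoiser would be a transformer operating on audio latents and the conditioning embedding would be produced by modality-specific encoders, but the overall loop, iteratively refining noise under guidance, is the core mechanism the sketch illustrates.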

“Our research stems from a fundamental question in artificial intelligence: how can intelligent systems achieve unified cross-modal understanding and generation?” Wei Xue, the corresponding author of the paper, told Tech Xplore. “Human creation is a seamlessly integrated process, where information from different sensory channels is naturally fused by the brain. Traditional systems have often relied on specialized models, failing to capture and fuse these intrinsic connections between modalities.”
