{"id":156008,"date":"2023-01-22T23:25:25","date_gmt":"2023-01-23T05:25:25","guid":{"rendered":"https:\/\/lifeboat.com\/blog\/2023\/01\/vall-e"},"modified":"2023-01-22T23:25:25","modified_gmt":"2023-01-23T05:25:25","slug":"vall-e","status":"publish","type":"post","link":"https:\/\/lifeboat.com\/blog\/2023\/01\/vall-e","title":{"rendered":"VALL-E"},"content":{"rendered":"<p><a class=\"aligncenter blog-photo\" href=\"https:\/\/lifeboat.com\/blog.images\/vall-e3.jpg\"><\/a><\/p>\n<p>Chengyi Wang*, Sanyuan Chen*, Yu Wu*, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei.<\/p>\n<p><b>Microsoft<\/b><\/p>\n<p><b>Abstract.<\/b> We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, hundreds of times larger than existing systems. VALL-E exhibits in-context learning capabilities and can synthesize high-quality personalized speech from only a 3-second enrolled recording of an unseen speaker used as an acoustic prompt. Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that VALL-E can preserve the speaker\u2019s emotion and the acoustic environment of the acoustic prompt in synthesis.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Chengyi Wang*, Sanyuan Chen*, Yu Wu*, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei. Microsoft Abstract. We introduce a language modeling approach for text to speech synthesis (TTS). 
Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from [\u2026]<\/p>\n","protected":false},"author":661,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[47],"tags":[],"class_list":["post-156008","post","type-post","status-publish","format-standard","hentry","category-neuroscience"],"_links":{"self":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/156008","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/users\/661"}],"replies":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/comments?post=156008"}],"version-history":[{"count":0,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/156008\/revisions"}],"wp:attachment":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/media?parent=156008"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/categories?post=156008"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/tags?post=156008"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}