Large-scale text-to-speech (TTS) models are typically categorized into autoregressive and non-autoregressive systems. Although autoregressive systems have advantages in speech naturalness, their token-by-token generation makes it difficult to precisely control the duration of the synthesized speech. This is a significant limitation in applications such as video dubbing, where strict audio-visual synchronization is required. This paper introduces IndexTTS2, which proposes a novel, general method for speech duration control that is compatible with autoregressive models. The method supports two generation modes: one allows the number of generated tokens to be specified explicitly, enabling precise control over speech duration; the other requires no token count, letting the model generate speech freely in an autoregressive manner while faithfully reproducing the prosodic characteristics of the input prompt. Furthermore, IndexTTS2 disentangles emotional expression from speaker identity, enabling independent control of timbre and emotion. In the zero-shot setting, the model can accurately reproduce the emotional characteristics of the input prompt. Users may also provide a separate emotion prompt, which can come from a different speaker than the timbre prompt, so that the model reconstructs the target timbre while conveying the specified emotional tone. To improve speech clarity during strong emotional expression, we incorporate GPT latent representations, which improve the stability of the generated speech. To lower the barrier to emotion control, we design a soft instruction mechanism based on textual descriptions by fine-tuning Qwen3, so that natural-language input can guide speech generation toward the desired emotional tone.
Finally, experimental results on multiple datasets show that IndexTTS2 outperforms existing state-of-the-art zero-shot TTS models in word error rate, speaker similarity, and emotional fidelity. To promote further research and practical adoption, we will release both the model weights and the inference code, enabling the community to reproduce and build upon our work.

Audio-Prompt | Text | GroundTruth | Model | Audio (duration 0.75x) | Audio (duration 1.0x) | Audio (duration 1.25x) |
---|---|---|---|---|---|---|
The equipment needed to do this includes rock saws and polishers. | IndexTTS2 | |||||
MaskGCT | ||||||
F5-TTS | ||||||
There is no wine in this country, the young man said. | IndexTTS2 | |||||
MaskGCT | ||||||
F5-TTS | ||||||
只有当科技为本地社群创造价值的时候,才真正有意义。 | IndexTTS2 | |||||
MaskGCT | ||||||
F5-TTS | ||||||
类推可用于颠覆惯性思维,以便为新的创意开路。 | IndexTTS2 | |||||
MaskGCT | ||||||
F5-TTS |
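
The duration factors in the table above (0.75x, 1.0x, 1.25x) can be related to the explicit token-count mode described in the abstract. The following is a minimal Python sketch, assuming that a target length is obtained by scaling a reference token count. The helper `target_token_count` and the reference count of 200 are hypothetical and are not taken from the IndexTTS2 code.

```python
def target_token_count(reference_tokens: int, duration_scale: float) -> int:
    """Scale a reference token count by a duration factor.

    Hypothetical helper: assumes speech duration is proportional to the
    number of generated tokens, so scaling the count scales the duration.
    """
    if reference_tokens <= 0:
        raise ValueError("reference_tokens must be positive")
    if duration_scale <= 0:
        raise ValueError("duration_scale must be positive")
    # Round to the nearest whole token and never return fewer than one.
    return max(1, round(reference_tokens * duration_scale))


# Assumed reference length of 200 tokens, scaled by the table's factors.
for scale in (0.75, 1.0, 1.25):
    print(scale, target_token_count(200, scale))
```

Under this assumption, the resulting count would be supplied to the model as the number of tokens to generate; how IndexTTS2 consumes that value internally is not covered here.
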

Emotion | Audio-Prompt | Text | IndexTTS2 | IndexTTS | MaskGCT | CosyVoice2 | SparkTTS | F5-TTS | IndexTTS2-wog | IndexTTS2-wos2m |
---|---|---|---|---|---|---|---|---|---|---|
Angry | 你在我们屋子里走路的时候,发现路程遥远,这是不足为怪的。 | |||||||||
似乎科琳完成的这身午夜蓝套,裙与旧时代的职业女性并无分别。 | ||||||||||
Cry | 共同建设面向未来的交通,和出行服务新生态 | 汤姆,我真愿意信你的话,这样可以一肥遮百丑。 | ||||||||
Fear | 但到投票前日,内菲斯竟以黑马之姿冲过席尔瓦,日渐下降的支持率。 | |||||||||
过了一会一切都结束了,这座山在月光下显得幽静而静谧。 | ||||||||||
Depressed | 基本上隔一天,小如便会因为不听话而挨揍。 | |||||||||
狗狗阿黄同志,当森林学校的门卫有五年啦,工作尽职尽责。 | ||||||||||
Happy | 更傻眼的是过了没多久,银行就开始催款了。 | |||||||||
其中一只正又两条前肢,抓住一只有自己身体五倍大的死蜘蛛。 | ||||||||||
Surprise | 他希望能看到灯笼闪一下光,这虽然让他害怕。 | |||||||||
比如有的业主,贪便宜找马路上的游击队来装修。 | ||||||||||
Calm | 攀爬上官场高位后,开始给家里的各种亲戚安排工作。 | |||||||||
近日,除了葛洲坝股价下跌外,其余三家均有不同程度的上涨。 |

Timbre-Audio-Prompt | Emotion-Audio-Prompt | Text | Emotion Weight: 0 | Emotion Weight: 0.6 | Emotion Weight: 1.0 | Emotion Weight: 1.4 |
---|---|---|---|---|---|---|
这一天,天上的乌云又多又厚又沉,整个森林暗得就像黑夜一样。 | ||||||
这他妈就是你给的解决方案?老子连续加班三个月,就换来一沓废纸!现在、立刻、马上给我滚出! | ||||||
我站在人海中,却感觉比任何时候都要孤独。 | ||||||
尾号四四九幺的乘客刚夸了你,厉害了我的师傅,你真是个活地图。 | ||||||
有些人走了就再也没有回来过,所以等待和犹豫是这个世界上最无情的杀手。 | ||||||
做一个温暖的人,将岁月里的凝重、安暖,写意成简单,将过往的风景,安放在清浅的时光中。 |
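
One way to read the emotion weights in the table above (0, 0.6, 1.0, 1.4) is as a scalar that moves a conditioning vector from a baseline toward the emotion prompt. The sketch below is an illustration under that assumption only: the function `blend_emotion`, the plain-list vectors, and the linear form are all hypothetical, and the source does not state how IndexTTS2 applies the weight.

```python
from typing import List


def blend_emotion(baseline: List[float], emotion: List[float], weight: float) -> List[float]:
    """Linearly move from a baseline vector toward an emotion vector.

    Assumed behaviour (not confirmed by the source):
      weight = 0   -> baseline only (e.g. derived from the timbre prompt)
      weight = 1   -> the emotion prompt's vector
      weight > 1   -> extrapolates past the emotion vector
    """
    if len(baseline) != len(emotion):
        raise ValueError("vectors must have the same length")
    return [b + weight * (e - b) for b, e in zip(baseline, emotion)]


# Toy two-dimensional vectors, evaluated at the weights used in the table.
baseline_vec = [0.0, 1.0]
emotion_vec = [1.0, 3.0]
for w in (0.0, 0.6, 1.0, 1.4):
    print(w, blend_emotion(baseline_vec, emotion_vec, w))
```

With this form, a weight above 1.0 exaggerates the emotion relative to the prompt, which is one plausible reading of the 1.4 column.
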

Timbre-Audio-Prompt | Emotion-Description | Text | Audio |
---|---|---|---|
I feel really down | 这究竟是我的福,还是我的孽?岂止是皇上错了,我更是错了!这几年的情爱与时光,究竟是错付了! | ||
有点快乐,哈哈 | |||
巨巨巨巨巨巨巨巨难过 | I feel really down | Was this my blessing, or my curse? It’s not just the Emperor who was wrong — I was even more mistaken! All these years of love and devotion… in the end, were they nothing but a wasted heart? | |
有点快乐,哈哈 | |||
巨巨巨巨巨巨巨巨难过 | |||
书桓走的第一天,想他,想他,想他。 | 书桓走的第一天,想他,想他,想他。 | ||
On the first day that Shuhan left, all I did was miss him. Miss him. Miss him. | |||
书桓走的第二天,想他,想他,想他。 | 书桓走的第二天,想他,想他,想他。 | ||
The second day Shuhan is gone, and still — I miss him. I miss him. I miss him. | |||
书桓走了第三天了,想他,想他,想他,发疯一样的想他。 | 书桓走了第三天了,想他,想他,想他,发疯一样的想他。 | ||
The third day Shuhan has been gone… and I still miss him. Miss him. Miss him. I miss him like I've lost my mind. | |||
超级无敌爆炸angry的情感,就像刚中了彩票被人偷拿了 | 你问他为什么我没谈恋爱,我就失恋了,你问他,为什么这么对我,我以为我会问,可是我见到他之后,我就不想问了,因为人家根本就不想说,人家甚至都不想见到你,我为什么在那儿犯贱呢,所以我不是放过他,我是想放过我自己,人家不联系你怎么了,不回你微信怎么了,伤害你又怎么了,你算他谁啊? | ||
又生气又委屈 | 我为什么非得知道发生了什么呢,我不就是想给自己找一个原因嘛,我只是想找一个原因,我原谅他,可是我为什么非得原谅他呢,我干嘛把自己搞得这么卑微啊? | ||
我们正在做一些神奇的事情,给我来一种又fear,但是又有点开心的情感。 | 这游戏太刺激了,心跳都快停了...但我们又能感受到那种挑战未知的兴奋和快乐。说真的,我现在是又害怕又期待,紧张得手心都在冒汗! |

Audio-Prompt | Text | GroundTruth | IndexTTS2 | IndexTTS | MaskGCT | CosyVoice2 | SparkTTS | F5-TTS | IndexTTS2-wog | IndexTTS2-wos2m |
---|---|---|---|---|---|---|---|---|---|---|
家居养娃的李娜又重新出现在媒体大众的面前 | ||||||||||
These are two of only three known formations to have dinosaur fossils in Antarctica. | ||||||||||
The man looked at him without responding. | ||||||||||
胡萝卜凉拌或炒鸡蛋味道都是棒极的,胡萝卜骄傲地说。 | ||||||||||
rodolfo arrived at his own house without any impediment and leocadia's parents reached theirs heart broken and despairing | ||||||||||
那些袖珍衣服挂在架子上,远远看上去就像一幅画,可漂亮了。 |