Natural Human Voice Synthesis: Orpheus-TTS

Project Overview

Finally, here is a TTS (text-to-speech) system that pays real attention to the timbre, naturalness, and human-likeness of Chinese speech. Orpheus TTS is a state-of-the-art (SOTA) open-source text-to-speech system built on a Llama-3B backbone. Orpheus demonstrates the emergent speech-synthesis capabilities of large language models (LLMs).

Key Features

  • Human-like Voice: Natural intonation, emotion, and rhythm, surpassing SOTA closed-source models.
  • Zero-shot Voice Cloning: Clone voices without prior fine-tuning.
  • Guided Emotion and Intonation: Control voice and emotional characteristics with simple tags (see the sketch after this list).
  • Low Latency: Streaming latency of approximately 200 ms for real-time applications, reducible to ~100 ms with input streaming.
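
As a quick illustration of tag-guided emotion: paralinguistic tags such as <laugh>, <sigh>, and <gasp> are written inline with the text, and the model renders them as sounds rather than reading them aloud. A minimal sketch, assuming the fine-tuned model and voice used later in this post (check the project README for the exact tag set supported by your model version):

from orpheus_tts import OrpheusModel

model = OrpheusModel(model_name="canopylabs/orpheus-tts-0.1-finetune-prod")

# Tags are embedded inline; the model renders them as non-verbal sounds
# (a gasp, a laugh) instead of speaking the tag text.
prompt = "I just heard the news <gasp> ... honestly, that's hilarious <laugh>."
audio_chunks = model.generate_speech(prompt=prompt, voice="tara")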

Models

We provide two English models, along with data processing scripts and sample datasets to help you easily create your own fine-tuned versions:

  • Fine-tuned Model: Suitable for everyday TTS applications.
  • Pre-trained Model: Our base model, trained on over 100,000 hours of English speech data.

Additionally, we offer a research release of multilingual models:

  • Multilingual Series: Seven pairs of pre-trained and fine-tuned models.

Inference

A simple setup on Colab is provided with standardized prompts across languages. These notebooks demonstrate how to use our models in English.

  • Colab for Fine-tuned Model (non-streaming; see below for real-time streaming): Suitable for everyday TTS applications.
  • Colab for Pre-trained Model: Set up for conditional generation but can be extended to various tasks.

Streaming Inference Example

Clone this repository:

git clone https://github.com/canopyai/Orpheus-TTS.git  

Navigate and install packages:

cd Orpheus-TTS && pip install orpheus-speech  # Uses vllm under the hood for fast inference  

A slightly buggy version of vllm was pushed on March 18, so if you run into issues, roll back to the previous release:

pip install vllm==0.7.3  
pip install orpheus-speech  

Run the example below:

from orpheus_tts import OrpheusModel
import wave
import time

# Load the production fine-tuned model (served via vllm under the hood)
model = OrpheusModel(model_name="canopylabs/orpheus-tts-0.1-finetune-prod")
prompt = '''Man, the way social media has, um, completely changed how we interact is just wild, right? Like, we're all connected 24/7 but somehow people feel more alone than ever. And don't even get me started on how it's messing with kids' self-esteem and mental health and whatnot.'''

start_time = time.monotonic()
# generate_speech returns a generator yielding raw 16-bit PCM audio chunks
syn_tokens = model.generate_speech(
    prompt=prompt,
    voice="tara",
)

# Write the streamed chunks to a mono, 16-bit, 24 kHz WAV file
with wave.open("output.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(24000)

    total_frames = 0
    chunk_counter = 0
    for audio_chunk in syn_tokens:  # output streaming: chunks arrive as they are synthesized
        chunk_counter += 1
        frame_count = len(audio_chunk) // (wf.getsampwidth() * wf.getnchannels())
        total_frames += frame_count
        wf.writeframes(audio_chunk)
    duration = total_frames / wf.getframerate()

    end_time = time.monotonic()
    print(f"It took {end_time - start_time} seconds to generate {duration:.2f} seconds of audio")

Pre-trained Model

This is a very straightforward process, similar to training an LLM using Trainer and Transformers. The provided base model was trained on 100,000+ hours of data. We recommend avoiding synthetic data for training, as it yields poorer results when fine-tuning for specific voices. This may be because synthetic voices lack diversity and map onto the same small set of tokens during tokenization (i.e., low codebook utilization).

We train the 3B model on sequences of length 8192, using the same dataset format for pre-training as for TTS fine-tuning. We concatenate input_ids sequences to improve training efficiency. The required text dataset format is described in issue #37. If you plan to scale up training of this model (e.g., for other languages or styles), we recommend starting from fine-tuning only (without a text dataset). The main idea behind the text dataset is discussed in the blog post. (Summary: it keeps the model from forgetting too much semantic/reasoning ability, so it can better understand how to pronounce/express phrases; however, most forgetting happens very early in training, i.e., within the first <100,000 rows, so unless you do very extensive fine-tuning it should not have a significant impact.)
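
To make the packing step concrete, here is a minimal sketch of concatenating tokenized examples (input_ids) into fixed 8192-token training sequences; the helper is our illustration, not the project's actual data pipeline:

from typing import Iterable, Iterator, List

SEQ_LEN = 8192  # training sequence length used above

def pack_sequences(examples: Iterable[List[int]], seq_len: int = SEQ_LEN) -> Iterator[List[int]]:
    """Concatenate tokenized examples and emit fixed-length chunks."""
    buffer: List[int] = []
    for input_ids in examples:
        buffer.extend(input_ids)
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]  # one full-length training sequence
            buffer = buffer[seq_len:]
    # Any final partial buffer is dropped here; it could instead be padded.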

Project Link

https://github.com/canopyai/Orpheus-TTS
