Executive Summary
OptimumLLM Audio 1.0 is a text-to-speech foundation model built on Meta's openly licensed LLaMA architecture. This document provides a comprehensive end-to-end guide for developing, training, and deploying a production-quality TTS system that produces natural, emotionally expressive, human-like speech.
Key Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Base LLM | LLaMA 3.2 3B | Best balance of quality, trainability on Colab Pro, proven TTS capability |
| Audio Codec | SNAC (24kHz) | Multi-scale tokenization, superior quality at low bitrate, proven in Orpheus |
| Training Approach | Full fine-tuning | LoRA adapters cannot learn audio token distribution deeply enough |
| Languages | US + British English | Focused quality over breadth |
| Speakers | Multi-speaker + zero-shot | Built-in voices + reference audio cloning |
| Emotion | Data + controllable tags | Natural emotion from data + explicit tags like [happy], [sad] |
| Training Hardware | Colab Pro (A100 40GB) | Feasible with gradient checkpointing + DeepSpeed ZeRO-3 |
| Hosting | HuggingFace Hub | Open weights, community access, inference API |
What This Document Covers
The LLM-TTS Revolution
Why Language Models Are the Future
Evolution of TTS Technology
| Era | Period | Approach | Example | Quality |
|---|---|---|---|---|
| Concatenative | 1990s–2010 | Splicing recorded phonemes | Festival, MaryTTS | Robotic |
| Statistical Parametric | 2005–2016 | HMM-based acoustic models | HTS | Monotone |
| Neural Seq2Seq | 2017–2021 | Encoder-decoder + attention | Tacotron 2 | Good |
| Diffusion-Based | 2021–2023 | Score-based generative models | Grad-TTS | Very Good |
| LLM-Based | 2023–Now | Autoregressive LMs on audio tokens | VALL-E, Orpheus | Human-level |
Why Language Models Excel at TTS
The key insight: speech is just another language. When audio is discretized into tokens using neural codecs, an autoregressive language model can learn to “speak” just as it learns to “write”.
Provide reference audio as prompt context — the LLM continues in that voice via in-context learning.
LLMs capture long-range dependencies better than encoder-decoder models, producing natural rhythm and intonation.
Trained on diverse speech, LLMs learn emotional expression implicitly without explicit prosody labels.
The same scaling laws that improved text LLMs apply: more data + more params = better speech.
One model handles TTS, voice conversion, speech continuation, and dialogue generation.
The Core Pipeline
┌──────────┐ ┌──────────────┐ ┌───────────────┐ ┌──────────┐
│ Text │───▶│ Text Tokenizer│───▶│ LLM Decoder │───▶│ Audio │
│ Input │ │ (BPE/SPM) │ │ (LLaMA-based) │ │ Tokens │
└──────────┘ └──────────────┘ └───────────────┘ └────┬─────┘
│
▼
┌──────────┐ ┌──────────────┐ ┌───────────────┐ ┌──────────┐
│ Output │◀───│ Vocoder │◀───│ Audio Codec │◀───│ Decode │
│ .wav │ │ (optional) │ │ Decoder │ │ Tokens │
└──────────┘ └──────────────┘ └───────────────┘ └──────────┘

Figure 1 — The LLM-based TTS pipeline: text tokens in, audio tokens out, codec decodes to waveform.
Landscape Analysis
8 LLaMA-Based TTS Models
We analyzed eight state-of-the-art TTS models to inform the design of OptimumLLM Audio 1.0. Each model represents a different approach to the LLM-TTS paradigm.
Comparative Architecture
Full comparison matrix
| Model | Params | Codec | CB | SR | Clone | AR | License |
|---|---|---|---|---|---|---|---|
| Orpheus TTS | 3B | SNAC | 3 | 24kHz | Zero-shot | Yes | Apache 2.0 |
| Sesame CSM | 1B | Mimi | 32 | 24kHz | Context-based | Yes + NAR | Apache 2.0 |
| OuteTTS 0.3 | 1B | DAC | 4 | 24kHz | One-shot | Yes | Apache 2.0 |
| Chatterbox | 350M–500M | S3Tokenizer | Custom | 24kHz | Zero-shot | Yes | MIT |
| Dia TTS | 1.6B | DAC | 9 | 44.1kHz | Speaker tags | No (NAR) | Apache 2.0 |
| Kokoro | 82M | None (mel) | N/A | 24kHz | Style vectors | No | Apache 2.0 |
| Parler TTS | 880M–2.3B | DAC | 4 | 24kHz | Text description | Yes | Apache 2.0 |
| XTTS v2 | ~1B | VQ-VAE | 1 | 24kHz | 6s reference | Yes | MPL 2.0 |
The most successful LLaMA-based TTS models share a common formula:
LLaMA Base (1B–3B) + SNAC/DAC Codec + Full Fine-Tuning + Emotion Data = SOTA Quality

Audio Tokenization
The Bridge Between Text and Sound
Audio tokenization is the most critical component of LLM-based TTS. The quality of your codec directly determines the quality ceiling of your model.
SNAC — Multi-Scale Neural Audio Codec ✅ Recommended
Input Audio (24kHz waveform)
│
▼
┌───────────────────┐
│ SNAC Encoder │
│ │
│ ┌─────────────┐ │
│ │ Codebook 1 │ │ ← 12 Hz (coarse: prosody, pitch, speaker identity)
│ │ 4096 codes │ │
│ └─────────────┘ │
│ ┌─────────────┐ │
│ │ Codebook 2 │ │ ← 24 Hz (mid: phonetic detail, formants)
│ │ 4096 codes │ │
│ └─────────────┘ │
│ ┌─────────────┐ │
│ │ Codebook 3 │ │ ← 47 Hz (fine: texture, breathiness, micro-detail)
│ │ 4096 codes │ │
│ └─────────────┘ │
└───────────────────┘
│
▼
Output: 7 tokens per frame at 12 Hz = 84 tokens/second

Figure 2 — SNAC multi-scale encoder: 3 codebooks at different temporal resolutions.
Codec Comparison Matrix
| Codec | Codebooks | Tokens/sec | PESQ | Size | Best For | License |
|---|---|---|---|---|---|---|
| SNACRec. | 3 | 84 | 3.8+ | ~10M | LLM-TTS (proven) | MIT |
| Encodec | 2–32 | Variable | 3.5+ | ~15M | General audio | MIT |
| DAC | 4–9 | 48–108 | 3.7+ | ~74M | High-quality speech | MIT |
| Mimi | 32 | ~400 | 3.6+ | ~20M | Conversational | Apache 2.0 |
For a 30-second audio clip: 30s × 84 tokens/s = 2,520 audio tokens. Plus ~50–100 text tokens for transcript. Total: ~2,620 tokens — well within LLaMA's 8,192 context window.
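The budget arithmetic above is worth wiring into the data pipeline so over-long samples are caught before tokenization. A minimal sketch (the 84 tokens/s rate comes from the SNAC section; the 8,192 context length is the training window assumed in this document):

```python
# Token-budget estimate for one (transcript, audio) training sample.
AUDIO_TOKENS_PER_SEC = 84   # SNAC: 7 tokens per 12 Hz frame
CONTEXT_WINDOW = 8192       # LLaMA training context assumed here

def sample_token_budget(audio_seconds: float, text_tokens: int) -> int:
    """Total sequence length for one (transcript, audio) pair."""
    return int(audio_seconds * AUDIO_TOKENS_PER_SEC) + text_tokens

total = sample_token_budget(30, 100)
print(total)                    # 2620
print(total <= CONTEXT_WINDOW)  # True
```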
The TTS Pipeline
Text → Tokens → Speech
┌─────────────────────────────────────────────────────────────────────────┐
│ OptimumLLM Audio 1.0 Pipeline │
│ │
│ INPUT │
│ ├── Text: "Hello, welcome to OptimumAI" │
│ ├── Speaker: "male_us_01" OR audio_prompt.wav │
│ └── Emotion: "friendly" (optional) │
│ │
│ STEP 1: TEXT TOKENIZATION │
│ ├── BPE tokenizer (LLaMA) → [15043, 29892, 12345, ...] │
│ └── Add special tokens: <|text_start|> ... <|text_end|> │
│ │
│ STEP 2: SPEAKER CONDITIONING │
│ ├── IF audio_prompt: Encode via SNAC → speaker_tokens │
│ ├── IF speaker_id: Lookup speaker embedding │
│ └── Add: <|speaker_start|> ... <|speaker_end|> │
│ │
│ STEP 3: EMOTION CONDITIONING (optional) │
│ └── Add: <|emotion:friendly|> │
│ │
│ STEP 4: LLM GENERATION │
│ ├── LLaMA 3.2 3B generates SNAC audio tokens autoregressively │
│ ├── Token pattern: [CB1, CB2a, CB2b, CB3a, CB3b, CB3c, CB3d] × N │
│ └── Stops at <|audio_end|> │
│ │
│ STEP 5: AUDIO DECODING │
│ ├── Separate tokens back into 3 SNAC codebooks │
│ ├── SNAC decoder reconstructs waveform │
│ └── Output: 24kHz WAV file │
└─────────────────────────────────────────────────────────────────────────┘

Figure 3 — Complete OptimumLLM Audio 1.0 inference pipeline.
Multi-Codebook Interleaving Strategies
- Flat (frame-level) interleaving (Orpheus): each frame is self-contained: [CB1₀, CB2₀ₐ, CB2₀ᵦ, CB3₀ₐ, CB3₀ᵦ, CB3₀ᵧ, CB3₀ᵨ].
- Delay pattern (MusicGen): fine codebooks see more coarse context with increasing delay.
- Sequential (various models): all of CB1, then all of CB2, then all of CB3.
- Hierarchical, two-stage (Sesame CSM): Stage 1, the LLM generates CB1; Stage 2, a small decoder generates CB2–CB32.
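The flat interleaving scheme can be sketched in a few lines. Per 12 Hz frame, CB2 contributes 2 codes and CB3 contributes 4, giving the 7-token frame pattern; `deinterleave` inverts the mapping for the SNAC decoder:

```python
def interleave(cb1, cb2, cb3):
    """Flatten 3 SNAC codebooks into one stream, 7 tokens per 12 Hz frame:
    [CB1_t, CB2_2t, CB2_2t+1, CB3_4t, CB3_4t+1, CB3_4t+2, CB3_4t+3]."""
    stream = []
    for t in range(len(cb1)):
        stream += [cb1[t], cb2[2*t], cb2[2*t + 1],
                   cb3[4*t], cb3[4*t + 1], cb3[4*t + 2], cb3[4*t + 3]]
    return stream

def deinterleave(stream):
    """Invert interleave(): recover the three codebooks for the SNAC decoder."""
    cb1, cb2, cb3 = [], [], []
    for i in range(0, len(stream), 7):
        cb1.append(stream[i])
        cb2 += stream[i + 1:i + 3]
        cb3 += stream[i + 3:i + 7]
    return cb1, cb2, cb3

# Round-trip check on a 2-frame toy example
c1, c2, c3 = [1, 2], [10, 11, 12, 13], [20, 21, 22, 23, 24, 25, 26, 27]
assert deinterleave(interleave(c1, c2, c3)) == (c1, c2, c3)
```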
Why LoRA Adapters Fail
For Base TTS Models
LoRA (Low-Rank Adaptation) works for text-domain fine-tuning but fundamentally fails for building base TTS models. This was validated through direct experimentation with OuteTTS + Unsloth.
Why This Fails for Audio
Creating a TTS base model isn't “adapting” text knowledge — it's learning an entirely new modality. LoRA's low-rank constraint (rank 16–64) cannot capture acoustic physics, timing, and speaker characteristics.
Audio tokens (4,096 SNAC codes per codebook) require new embeddings trained from scratch. LoRA doesn't apply to embedding layers meaningfully.
The LM head must learn when to switch between text and audio prediction. Cross-modal attention patterns cannot form in low-rank space.
LoRA tries to stay “close” to the original model. Audio token prediction is very far from text prediction — LoRA's constraint prevents the necessary large weight changes.
Approach Comparison
| Approach | Trainable Params | GPU Memory | Quality |
|---|---|---|---|
| LoRA r=16 | ~10M (0.3%) | 8GB | Poor |
| LoRA r=64 | ~40M (1.3%) | 12GB | Mediocre |
| Full FT (1B) | 1B (100%) | 24–40GB | Good |
| Full FT (3B) | 3B (100%) | 40–80GB | Excellent |
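The trainable-parameter gap in the table is easy to estimate. A rough sketch, assuming LoRA targets the four attention projections and (for simplicity) treats each as a square 3072×3072 weight across LLaMA 3.2 3B's 28 layers; GQA k/v projections are actually smaller, so this slightly overestimates:

```python
HIDDEN = 3072
LAYERS = 28           # LLaMA 3.2 3B
ATTN_PROJECTIONS = 4  # q, k, v, o (treated as square d x d here)

def lora_trainable_params(rank: int) -> int:
    """Each adapted d x d weight gains two low-rank factors: d x r and r x d."""
    per_module = 2 * rank * HIDDEN
    return per_module * ATTN_PROJECTIONS * LAYERS

FULL_FT = 3_000_000_000
for r in (16, 64):
    p = lora_trainable_params(r)
    # Roughly 11M (0.37%) for r=16 and 44M (1.47%) for r=64,
    # in line with the table above.
    print(f"LoRA r={r}: {p / 1e6:.0f}M params ({100 * p / FULL_FT:.2f}% of full FT)")
```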
OptimumLLM Audio 1.0
Architecture Design
Architecture Diagram
┌─────────────────────────────────────────┐
│ INPUT FORMATTING │
│ <|text_start|> Hello, welcome <|text_end|>│
│ <|emotion:friendly|> │
│ <|speaker_start|> [SNAC ref] <|speaker_end|>│
│ <|audio_start|> │
└────────────────┬────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────────┐
│ LLaMA 3.2 3B Backbone │
│ ┌──────────────┐ │
│ │ Extended │ ~140K vocab (128K text + 12K audio) │
│ │ Embedding │ │
│ └──────┬───────┘ │
│ ┌──────▼───────┐ │
│ │ Transformer │ 28 layers, 3072 hidden dim │
│ │ Blocks ×28 │ GQA attention, RoPE, RMSNorm, SwiGLU │
│ └──────┬───────┘ │
│ ┌──────▼───────┐ │
│ │ Extended │ Logits over full vocab (text + audio) │
│ │ LM Head │ │
│ └──────┬───────┘ │
└─────────┼──────────────────────────────────────────────────────────────┘
│
▼ (autoregressively generates SNAC tokens)
┌──────────────────┐
│ De-interleave │ → CB1 @ 12Hz, CB2 @ 24Hz, CB3 @ 47Hz
└────────┬─────────┘
▼
┌──────────────────┐
│ SNAC Decoder │ Neural codec → 24kHz WAV
└──────────────────┘

Figure 4 — OptimumLLM Audio 1.0 architecture: Extended LLaMA 3.2 3B with SNAC tokenization.
Vocabulary Extension
| Category | Count | Token Format | Purpose |
|---|---|---|---|
| SNAC CB1 codes | 4,096 | <|snac1_0|> to <|snac1_4095|> | Coarse audio (12Hz) |
| SNAC CB2 codes | 4,096 | <|snac2_0|> to <|snac2_4095|> | Mid audio (24Hz) |
| SNAC CB3 codes | 4,096 | <|snac3_0|> to <|snac3_4095|> | Fine audio (47Hz) |
| Special tokens | ~20 | <|audio_start|>, etc. | Structure |
| Emotion tokens | ~15 | <|emotion:happy|>, etc. | Emotion control |
| Speaker tokens | ~10 | <|speaker_start|>, etc. | Speaker conditioning |
| Total new | ~12,333 | — | — |
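Generating the 12,288 SNAC token strings is mechanical. A sketch that builds them in the format shown in the table; in practice they would be registered with the standard Transformers calls `tokenizer.add_tokens(...)` and `model.resize_token_embeddings(len(tokenizer))` (noted in comments, not run here):

```python
def snac_token_strings() -> list[str]:
    """All SNAC code tokens, matching the formats in the vocabulary table."""
    return [f"<|snac{cb}_{code}|>"
            for cb in (1, 2, 3)        # codebooks 1-3
            for code in range(4096)]   # 4,096 codes each

tokens = snac_token_strings()
print(len(tokens))             # 12288
print(tokens[0], tokens[-1])   # <|snac1_0|> <|snac3_4095|>

# Registration (requires transformers; shown for context):
#   tokenizer.add_tokens(tokens + special_tokens)
#   model.resize_token_embeddings(len(tokenizer))
```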
Built-in Speaker Voices
| Voice ID | Gender | Accent | Character |
|---|---|---|---|
| male_us_01 | Male | US English | Warm, authoritative |
| male_us_02 | Male | US English | Young, energetic |
| female_us_01 | Female | US English | Professional, clear |
| female_us_02 | Female | US English | Warm, conversational |
| male_uk_01 | Male | British (RP) | Formal, polished |
| male_uk_02 | Male | British | Casual, friendly |
| female_uk_01 | Female | British (RP) | Professional |
| female_uk_02 | Female | British | Natural, warm |

Inference Code Example
from optimumllm_audio import OptimumTTS
model = OptimumTTS.from_pretrained("optimumai/OptimumLLM-Audio-1.0")
# Generate with built-in speaker
wav = model.generate(
text="Welcome to OptimumAI.",
speaker="male_us_01",
emotion="friendly"
)
# Voice cloning from audio prompt
wav = model.generate(
text="This is my cloned voice.",
audio_prompt="reference_voice.wav", # 10-30 seconds
emotion="excited"
)
# Two-speaker dialogue
wav = model.generate_dialogue(
script=[
("speaker1", "Welcome to the show!"),
("speaker2", "Thanks for having me."),
],
speaker1="male_us_01",
speaker2="custom_voice.wav",
)

Audio Datasets
13 datasets evaluated for training
| Dataset | Hours | Speakers | SR | Emotion | Accent | Quality | Priority |
|---|---|---|---|---|---|---|---|
| LibriTTS-R | 585 | 2,456 | 24k | — | US | ★★★★★ | P0 |
| VCTK | 44 | 110 | 48k | — | British+ | ★★★★★ | P0 |
| HiFi-TTS | 292 | 10 | 44.1k | — | US | ★★★★★ | P1 |
| Expresso | 40 | 4 | 48k | ✓ | US | ★★★★★ | P0 |
| RAVDESS | 7 | 24 | 48k | ✓ | US | ★★★★☆ | P1 |
| EmoV-DB | 7 | 4 | 16k | ✓ | US/UK | ★★★☆☆ | P2 |
| GigaSpeech | 10,000 | Many | 16k | — | US | ★★★☆☆ | P1 |
| MLS (en) | 44,000 | 5,000+ | 16k | — | US/UK | ★★★☆☆ | P2 |
| LJSpeech | 24 | 1 | 22k | — | US | ★★★★☆ | P2 |
Recommended Training Mix
- Pre-training (~1,000h): LibriTTS-R (460h) + GigaSpeech (300h) + HiFi-TTS (292h)
- Quality fine-tuning (184h): LibriTTS-R clean (100h) + VCTK (44h) + Expresso (40h)
- Emotion/control alignment (54h): Expresso (40h) + RAVDESS (7h) + EmoV-DB (7h)
Data Preparation
Audio preprocessing pipeline
Audio Format Requirements
| Parameter | Requirement | Reason |
|---|---|---|
| Sample Rate | 24,000 Hz | SNAC operates at 24kHz |
| Channels | Mono | Speech is mono; stereo wastes tokens |
| Format | WAV (16-bit PCM) | Lossless; MP3 introduces artifacts |
| Loudness | -23 LUFS (±1 dB) | Consistent volume across speakers |
| Max Duration | 30 seconds | Fits in LLaMA context window |
| Min Duration | 1 second | Shorter clips lack context |
| SNR | ≥20 dB | Reject noisy samples |
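The hard constraints in the table can be enforced with a simple accept/reject gate before SNAC encoding. A sketch with the table's thresholds hard-coded; actual resampling, downmixing, and LUFS/SNR measurement would happen upstream with an audio library:

```python
TARGET_SR = 24_000  # SNAC operates at 24 kHz

def passes_quality_gate(sample_rate: int, duration_s: float,
                        channels: int, snr_db: float, lufs: float) -> bool:
    """Accept a clip only if it meets the format requirements table.
    Assumes resampling/downmixing and loudness measurement happen upstream."""
    return (sample_rate == TARGET_SR
            and channels == 1                    # mono only
            and 1.0 <= duration_s <= 30.0        # fits context, has context
            and snr_db >= 20.0                   # reject noisy samples
            and abs(lufs - (-23.0)) <= 1.0)      # -23 LUFS +/- 1 dB

print(passes_quality_gate(24_000, 12.5, 1, 32.0, -23.4))  # True
print(passes_quality_gate(24_000, 45.0, 1, 32.0, -23.0))  # False: too long
```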
Training Pipeline
End-to-end, 4 stages
| Stage | Goal | LR | Data | Est. Time |
|---|---|---|---|---|
| 1 | Extend LLaMA vocab with ~12K audio tokens, initialize embeddings | N/A | N/A | 5 min |
| 2 | Teach the model text-to-audio token generation on diverse data | 5e-5 | 1,000h | 72–120h |
| 3 | Refine on curated, highest-quality audio data | 1e-5 | 184h | 24–40h |
| 4 | Align speaker prompts and emotion tags for accurate control | 5e-6 | 54h | 12–20h |

Hardware Requirements
| GPU | VRAM | Feasible? | Strategy |
|---|---|---|---|
| A100 40GB | 40 GB | ✅ Yes | BF16 + gradient checkpointing + ZeRO-2 |
| L4 24GB | 24 GB | ⚠️ Tight | ZeRO-3 + CPU offload, batch=1 |
| T4 16GB | 16 GB | ❌ No | Cannot fit 3B model |
| V100 16GB | 16 GB | ❌ No | Same limitation |
Total training time: ~108–180 hours on A100. Colab Pro ≈ $0.50–1.50/hour → $54–270 total.
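Those totals follow directly from the per-stage estimates. A quick sanity check (hour ranges taken from the stage estimates above; the Colab Pro rate is the assumed $0.50–1.50/hour range):

```python
# (min_hours, max_hours) per training stage
STAGES = {
    "vocab_extension":  (0.1, 0.1),  # ~5 minutes
    "audio_pretrain":   (72, 120),
    "quality_finetune": (24, 40),
    "control_align":    (12, 20),
}

lo = sum(a for a, _ in STAGES.values())
hi = sum(b for _, b in STAGES.values())
print(f"Total: {lo:.0f}-{hi:.0f} GPU-hours")           # ~108-180
print(f"Colab Pro cost: ${lo * 0.5:.0f}-${hi * 1.5:.0f}")  # ~$54-270
```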
Inference Architecture
Generation, cloning, and emotion control
Emotion Control at Inference
| Emotion | Pitch | Speed | Energy | Breathiness | Use Case |
|---|---|---|---|---|---|
| neutral | Normal | Normal | Normal | Normal | Default narration |
| happy | +15% | +10% | +20% | Low | Positive announcements |
| sad | −10% | −15% | −20% | High | Somber content |
| angry | +20% | +5% | +30% | Low | Confrontational |
| excited | +20% | +20% | +30% | Low | Sports commentary |
| whisper | −20% | −10% | −40% | Very high | Intimate, secretive |
| calm | −5% | −10% | −15% | Medium | Meditation |
Voice Cloning Quality vs Reference Length
| Reference Length | Quality | Use Case |
|---|---|---|
| 5 seconds | Fair | Quick testing |
| 10 seconds | Good | General use |
| 15 seconds | Very good | Recommended |
| 30 seconds | Excellent | Best quality |
| 60+ seconds | Diminishing returns | Not needed |
Hosting on HuggingFace
# Install HuggingFace CLI
pip install huggingface_hub
# Login & create repo
huggingface-cli login
huggingface-cli repo create OptimumLLM-Audio-1.0 --type model
# Upload model files
huggingface-cli upload optimumai/OptimumLLM-Audio-1.0 \
./optimum-tts/checkpoints/final/ \
  --include "*.safetensors" "*.json" "*.txt" "*.md"

Evaluation & Benchmarks
Benchmark Comparison Target
| Metric | Kokoro (82M) | OuteTTS (1B) | Orpheus (3B) | OptimumLLM Target |
|---|---|---|---|---|
| MOS | 3.7 | 3.5 | 4.2 | ≥ 4.0 |
| WER | 4% | 6% | 3% | ≤ 5% |
| Speaker Sim | N/A | 0.78 | 0.87 | ≥ 0.85 |
| PESQ | 3.3 | 3.1 | 3.8 | ≥ 3.5 |
| RTF | 0.05x | 0.8x | 1.2x | ≤ 1.5x |
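WER, the table's intelligibility metric, is the word-level edit distance between the reference transcript and an ASR transcript of the generated audio, divided by the reference length. A minimal self-contained implementation (production evaluation would typically use an established package such as `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / ref words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, single-row dynamic programming
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev_diag + (ref[i - 1] != hyp[j - 1]))  # sub/match
            prev_diag = cur
    return dp[-1] / len(ref)

# One substitution ("world" -> "word") + one deletion ("are") over 5 words
print(wer("hello world how are you", "hello word how you"))  # 0.4
```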