
OptimumLLM Audio 1.0

Complete TTS Research
& Development Guide

Building a state-of-the-art text-to-speech model on LLaMA architecture. From landscape analysis to deployment — every decision documented.

Section 1

Executive Summary

OptimumLLM Audio 1.0 is a text-to-speech foundation model built on Meta's LLaMA open-source architecture. This document provides a comprehensive end-to-end guide for developing, training, and deploying a production-quality TTS system that produces natural, emotionally expressive, human-like speech.

Key Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Base LLM | LLaMA 3.2 3B | Best balance of quality, trainability on Colab Pro, proven TTS capability |
| Audio Codec | SNAC (24kHz) | Multi-scale tokenization, superior quality at low bitrate, proven in Orpheus |
| Training Approach | Full fine-tuning | LoRA adapters cannot learn the audio token distribution deeply enough |
| Languages | US + British English | Focused quality over breadth |
| Speakers | Multi-speaker + zero-shot | Built-in voices + reference audio cloning |
| Emotion | Data + controllable tags | Natural emotion from data + explicit tags like [happy], [sad] |
| Training Hardware | Colab Pro (A100 40GB) | Feasible with gradient checkpointing + DeepSpeed ZeRO-3 |
| Hosting | HuggingFace Hub | Open weights, community access, inference API |

What This Document Covers

Landscape analysis of 8 state-of-the-art TTS systems
Audio tokenization deep dive (SNAC, Encodec, DAC, Mimi)
Architecture design for OptimumLLM Audio 1.0
Dataset curation (13 datasets evaluated, optimal training mix)
Data preparation pipeline with code
Training pipeline (4 stages, Colab Pro-optimized)
Inference architecture with voice cloning and emotion control
HuggingFace deployment guide
Section 2

The LLM-TTS Revolution
Why Language Models Are the Future

Evolution of TTS Technology

| Era | Period | Approach | Example | Quality |
|---|---|---|---|---|
| Concatenative | 1990s–2010 | Splicing recorded phonemes | Festival, MaryTTS | Robotic |
| Statistical Parametric | 2005–2016 | HMM-based acoustic models | HTS | Monotone |
| Neural Seq2Seq | 2017–2021 | Encoder-decoder + attention | Tacotron 2 | Good |
| Diffusion-Based | 2021–2023 | Score-based generative models | Grad-TTS | Very Good |
| LLM-Based | 2023–Now | Autoregressive LMs on audio tokens | VALL-E, Orpheus | Human-level |

Why Language Models Excel at TTS

The key insight: speech is just another language. When audio is discretized into tokens using neural codecs, an autoregressive language model can learn to “speak” just as it learns to “write”.

Zero-shot Voice Cloning

Provide reference audio as prompt context — the LLM continues in that voice via in-context learning.

Natural Prosody

LLMs capture long-range dependencies better than encoder-decoder models, producing natural rhythm and intonation.

Emergent Emotion

Trained on diverse speech, LLMs learn emotional expression implicitly without explicit prosody labels.

Scalability

The same scaling laws that improved text LLMs apply: more data + more params = better speech.

Multi-task

One model handles TTS, voice conversion, speech continuation, and dialogue generation.

The Core Pipeline

┌──────────┐    ┌──────────────┐    ┌───────────────┐    ┌──────────┐
│   Text   │───▶│ Text Tokenizer│───▶│   LLM Decoder │───▶│  Audio   │
│  Input   │    │  (BPE/SPM)   │    │ (LLaMA-based) │    │  Tokens  │
└──────────┘    └──────────────┘    └───────────────┘    └────┬─────┘
                                                              │
                                                              ▼
┌──────────┐    ┌──────────────┐    ┌───────────────┐    ┌──────────┐
│  Output  │◀───│   Vocoder    │◀───│ Audio Codec   │◀───│  Decode  │
│   .wav   │    │  (optional)  │    │  Decoder      │    │  Tokens  │
└──────────┘    └──────────────┘    └───────────────┘    └──────────┘

Figure 1 — The LLM-based TTS pipeline: text tokens in, audio tokens out, codec decodes to waveform.

Section 3

Landscape Analysis
8 State-of-the-Art TTS Models

We analyzed eight state-of-the-art TTS models to inform the design of OptimumLLM Audio 1.0. Each model represents a different approach to the LLM-TTS paradigm.

Orpheus TTS (Canopy AI) · 3B
Base: LLaMA 3.1 3B · Codec: SNAC · Codebooks: 3 · Sample Rate: 24kHz · Voice Clone: Zero-shot · Emotion: Emotion tags

Sesame CSM (Sesame AI Labs) · 1B
Base: LLaMA 3.2 1B · Codec: Mimi · Codebooks: 32 · Sample Rate: 24kHz · Voice Clone: Context-based · Emotion: Conversational

OuteTTS 0.3 (edwko) · 1B
Base: LLaMA 3.2 1B · Codec: DAC · Codebooks: 4 · Sample Rate: 24kHz · Voice Clone: One-shot · Emotion: Limited

Chatterbox (Resemble AI) · 350M–500M
Base: LLaMA 3 (mod) · Codec: S3Tokenizer · Codebooks: Custom · Sample Rate: 24kHz · Voice Clone: Zero-shot · Emotion: Exaggeration param

Dia TTS (Nari Labs) · 1.6B
Base: Custom Transformer · Codec: DAC · Codebooks: 9 · Sample Rate: 44.1kHz · Voice Clone: Speaker tags · Emotion: Non-verbal

Kokoro (hexgrad) · 82M
Base: StyleTTS 2 · Codec: None (mel) · Codebooks: N/A · Sample Rate: 24kHz · Voice Clone: Style vectors · Emotion: Style transfer

Parler TTS (HuggingFace) · 880M–2.3B
Base: T5 Enc-Dec · Codec: DAC · Codebooks: 4 · Sample Rate: 24kHz · Voice Clone: Text description · Emotion: Via description

XTTS v2 (Coqui AI) · ~1B
Base: GPT-2-like · Codec: VQ-VAE · Codebooks: 1 · Sample Rate: 24kHz · Voice Clone: 6s reference · Emotion: Limited
Section 4

Comparative Architecture
Full comparison matrix

| Model | Params | Codec | CB | SR | Clone | AR | License |
|---|---|---|---|---|---|---|---|
| Orpheus TTS | 3B | SNAC | 3 | 24kHz | Zero-shot | Yes | Apache 2.0 |
| Sesame CSM | 1B | Mimi | 32 | 24kHz | Context-based | Yes + NAR | Apache 2.0 |
| OuteTTS 0.3 | 1B | DAC | 4 | 24kHz | One-shot | Yes | Apache 2.0 |
| Chatterbox | 350M–500M | S3Tokenizer | Custom | 24kHz | Zero-shot | Yes | MIT |
| Dia TTS | 1.6B | DAC | 9 | 44.1kHz | Speaker tags | No (NAR) | Apache 2.0 |
| Kokoro | 82M | None (mel) | N/A | 24kHz | Style vectors | No | Apache 2.0 |
| Parler TTS | 880M–2.3B | DAC | 4 | 24kHz | Text description | Yes | Apache 2.0 |
| XTTS v2 | ~1B | VQ-VAE | 1 | 24kHz | 6s reference | Yes | MPL 2.0 |
Key Insight: The Winning Pattern

The most successful LLaMA-based TTS models share a common formula:

LLaMA Base (1B–3B) + SNAC/DAC Codec + Full Fine-Tuning + Emotion Data = SOTA Quality
Section 5

Audio Tokenization
The Bridge Between Text and Sound

Audio tokenization is the most critical component of LLM-based TTS. The quality of your codec directly determines the quality ceiling of your model.

SNAC — Multi-Scale Neural Audio Codec ✅ Recommended

Input Audio (24kHz waveform)
        │
        ▼
┌───────────────────┐
│   SNAC Encoder    │
│                   │
│  ┌─────────────┐  │
│  │ Codebook 1  │  │ ← 12 Hz  (coarse: prosody, pitch, speaker identity)
│  │ 4096 codes  │  │
│  └─────────────┘  │
│  ┌─────────────┐  │
│  │ Codebook 2  │  │ ← 24 Hz  (mid: phonetic detail, formants)
│  │ 4096 codes  │  │
│  └─────────────┘  │
│  ┌─────────────┐  │
│  │ Codebook 3  │  │ ← 47 Hz  (fine: texture, breathiness, micro-detail)
│  │ 4096 codes  │  │
│  └─────────────┘  │
└───────────────────┘
        │
        ▼
Output: 7 tokens per frame at 12 Hz = 84 tokens/second

Figure 2 — SNAC multi-scale encoder: 3 codebooks at different temporal resolutions.

Codec Comparison Matrix

| Codec | Codebooks | Tokens/sec | PESQ | Size | Best For | License |
|---|---|---|---|---|---|---|
| SNAC (recommended) | 3 | 84 | 3.8+ | ~10M | LLM-TTS (proven) | MIT |
| Encodec | 2–32 | Variable | 3.5+ | ~15M | General audio | MIT |
| DAC | 4–9 | 48–108 | 3.7+ | ~74M | High-quality speech | MIT |
| Mimi | 32 | ~400 | 3.6+ | ~20M | Conversational | Apache 2.0 |
Token Budget

For a 30-second audio clip: 30s × 84 tokens/s = 2,520 audio tokens, plus ~50–100 text tokens for the transcript. Total: ~2,620 tokens — comfortably within even a conservative 8,192-token training context window.
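The arithmetic above can be checked in a couple of lines (a sketch; the constant and function names are ours, not part of any library):

```python
# Token-budget arithmetic from the paragraph above: SNAC emits 7 tokens per
# 12 Hz frame, i.e. 7 * 12 = 84 audio tokens per second of speech.
TOKENS_PER_FRAME = 7
FRAMES_PER_SECOND = 12

def token_budget(audio_seconds, text_tokens=100):
    """Total sequence length: audio tokens plus transcript tokens."""
    return audio_seconds * TOKENS_PER_FRAME * FRAMES_PER_SECOND + text_tokens

print(token_budget(30))  # 2520 audio tokens + ~100 text tokens = 2620
```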

Section 6

The TTS Pipeline
Text → Tokens → Speech

┌─────────────────────────────────────────────────────────────────────────┐
│                     OptimumLLM Audio 1.0 Pipeline                       │
│                                                                         │
│  INPUT                                                                  │
│  ├── Text: "Hello, welcome to OptimumAI"                               │
│  ├── Speaker: "male_us_01" OR audio_prompt.wav                         │
│  └── Emotion: "friendly" (optional)                                    │
│                                                                         │
│  STEP 1: TEXT TOKENIZATION                                             │
│  ├── BPE tokenizer (LLaMA) → [15043, 29892, 12345, ...]              │
│  └── Add special tokens: <|text_start|> ... <|text_end|>              │
│                                                                         │
│  STEP 2: SPEAKER CONDITIONING                                          │
│  ├── IF audio_prompt: Encode via SNAC → speaker_tokens                 │
│  ├── IF speaker_id: Lookup speaker embedding                           │
│  └── Add: <|speaker_start|> ... <|speaker_end|>                       │
│                                                                         │
│  STEP 3: EMOTION CONDITIONING (optional)                               │
│  └── Add: <|emotion:friendly|>                                         │
│                                                                         │
│  STEP 4: LLM GENERATION                                               │
│  ├── LLaMA 3.2 3B generates SNAC audio tokens autoregressively        │
│  ├── Token pattern: [CB1, CB2a, CB2b, CB3a, CB3b, CB3c, CB3d] × N    │
│  └── Stops at <|audio_end|>                                            │
│                                                                         │
│  STEP 5: AUDIO DECODING                                               │
│  ├── Separate tokens back into 3 SNAC codebooks                       │
│  ├── SNAC decoder reconstructs waveform                                │
│  └── Output: 24kHz WAV file                                           │
└─────────────────────────────────────────────────────────────────────────┘

Figure 3 — Complete OptimumLLM Audio 1.0 inference pipeline.
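Steps 1–3 amount to concatenating tagged segments into a single prompt. A minimal sketch, assuming the special-token names shown in the diagram — the `build_prompt` helper itself is hypothetical, not part of the released API:

```python
# Hypothetical prompt assembly for Steps 1-3; special-token names follow the
# pipeline diagram, everything else is illustrative.
def build_prompt(text_ids, emotion=None, speaker_tokens=None):
    parts = ["<|text_start|>", *map(str, text_ids), "<|text_end|>"]
    if emotion is not None:                 # Step 3: optional emotion tag
        parts.append(f"<|emotion:{emotion}|>")
    if speaker_tokens is not None:          # Step 2: SNAC tokens from reference audio
        parts += ["<|speaker_start|>", *speaker_tokens, "<|speaker_end|>"]
    parts.append("<|audio_start|>")         # Step 4: generation continues from here
    return " ".join(parts)

prompt = build_prompt([15043, 29892], emotion="friendly")
print(prompt)
```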

Multi-Codebook Interleaving Strategies

Delay Pattern (used by MusicGen)

Fine codebooks see more coarse context with increasing delay.

Pro: Better coarse-to-fine conditioning. Con: More complex, variable group sizes.

Depth-First (used by various models)

All of CB1, then all of CB2, then all of CB3.

Pro: Clean separation between codebooks. Con: Very long sequences; cannot stream.

Two-Stage (used by Sesame CSM)

Stage 1: LLM generates CB1; Stage 2: small decoder generates CB2–CB32.

Pro: LLM handles shorter sequences. Con: Requires training two separate models.
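The flat-interleaved layout used later in this document (one 7-token group per 12 Hz frame: CB1 once, CB2 twice, CB3 four times) can be sketched as a pair of inverse helpers; the function names are ours:

```python
# Flat interleaving of SNAC codebooks into the 7-token frame pattern
# [CB1, CB2a, CB2b, CB3a, CB3b, CB3c, CB3d], and its inverse.
def interleave(cb1, cb2, cb3):
    flat = []
    for i in range(len(cb1)):                       # one group per 12 Hz frame
        flat += [cb1[i], cb2[2*i], cb2[2*i + 1],
                 cb3[4*i], cb3[4*i + 1], cb3[4*i + 2], cb3[4*i + 3]]
    return flat

def deinterleave(flat):
    cb1, cb2, cb3 = [], [], []
    for f in range(len(flat) // 7):
        frame = flat[7*f : 7*f + 7]
        cb1.append(frame[0])
        cb2 += frame[1:3]
        cb3 += frame[3:7]
    return cb1, cb2, cb3

flat = interleave([1, 2], [3, 4, 5, 6], list(range(10, 18)))
print(flat[:7])  # first frame: [1, 3, 4, 10, 11, 12, 13]
```

The round trip is lossless, which is exactly what Step 5 of the pipeline relies on when splitting generated tokens back into three codebooks.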
Section 7

Why LoRA Adapters Fail
For Base TTS Models

Critical Finding

LoRA (Low-Rank Adaptation) works for text-domain fine-tuning but fundamentally fails for building base TTS models. This was validated through direct experimentation with OuteTTS + Unsloth.

Why This Fails for Audio

1. New Modality, Not New Style

Creating a TTS base model isn't “adapting” text knowledge — it's learning an entirely new modality. LoRA's low-rank constraint (rank 16–64) cannot capture acoustic physics, timing, and speaker characteristics.

2. Embedding Layer Cannot Be LoRA-Adapted

Audio tokens (4,096 SNAC codes per codebook) require new embeddings trained from scratch. LoRA doesn't apply to embedding layers meaningfully.

3. Output Head Mismatch

The LM head must learn when to switch between text and audio prediction. Cross-modal attention patterns cannot form in low-rank space.

4. Distribution Shift

LoRA tries to stay “close” to the original model. Audio token prediction is very far from text prediction — LoRA's constraint prevents the necessary large weight changes.

Approach Comparison

| Approach | Trainable Params | GPU Memory | Quality |
|---|---|---|---|
| LoRA r=16 | ~10M (0.3%) | 8GB | Poor |
| LoRA r=64 | ~40M (1.3%) | 12GB | Mediocre |
| Full FT (1B) | 1B (100%) | 24–40GB | Good |
| Full FT (3B) | 3B (100%) | 40–80GB | Excellent |
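The LoRA rows can be reproduced from first principles. A sketch assuming adapters on the four attention projections of LLaMA 3.2 3B (hidden size 3072, 28 layers, grouped-query KV projections of width 1024 — dimensions from the public model config, the helper itself is ours):

```python
# Reproducing the LoRA parameter counts in the table. Each adapted weight
# W (d_out x d_in) gains low-rank factors A (r x d_in) and B (d_out x r),
# i.e. r * (d_in + d_out) extra parameters per matrix.
def lora_params(rank, layers=28, hidden=3072, kv_dim=1024):
    per_layer = (
        rank * (hidden + hidden)    # q_proj: 3072 -> 3072
        + rank * (hidden + kv_dim)  # k_proj: 3072 -> 1024 (GQA)
        + rank * (hidden + kv_dim)  # v_proj: 3072 -> 1024 (GQA)
        + rank * (hidden + hidden)  # o_proj: 3072 -> 3072
    )
    return per_layer * layers

print(lora_params(16), lora_params(64))  # ~9.2M and ~36.7M -> the ~10M / ~40M rows
```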
Section 8

OptimumLLM Audio 1.0
Architecture Design

Design goals:

Quality over speed: prioritize natural, expressive output
Open weights: full model published on HuggingFace
English focus: US English + British English for v1.0
Multi-speaker: built-in voices + zero-shot cloning
Emotion-aware: implicit + explicit emotion control
Colab-trainable: feasible on A100 40GB

Architecture Diagram

                    ┌────────────────────────────────────────────────┐
                    │                INPUT FORMATTING                │
                    │  <|text_start|> Hello, welcome <|text_end|>    │
                    │  <|emotion:friendly|>                          │
                    │  <|speaker_start|> [SNAC ref] <|speaker_end|>  │
                    │  <|audio_start|>                               │
                    └───────────────────────┬────────────────────────┘
                                            │
                                            ▼
┌────────────────────────────────────────────────────────────────────────┐
│                        LLaMA 3.2 3B Backbone                          │
│  ┌──────────────┐                                                      │
│  │  Extended     │   ~140K vocab (128K text + 12K audio)              │
│  │  Embedding    │                                                      │
│  └──────┬───────┘                                                      │
│  ┌──────▼───────┐                                                      │
│  │  Transformer  │   28 layers, 3072 hidden dim                       │
│  │  Blocks ×28   │   GQA attention, RoPE, RMSNorm, SwiGLU            │
│  └──────┬───────┘                                                      │
│  ┌──────▼───────┐                                                      │
│  │  Extended     │   Logits over full vocab (text + audio)            │
│  │  LM Head      │                                                      │
│  └──────┬───────┘                                                      │
└─────────┼──────────────────────────────────────────────────────────────┘
          │
          ▼  (autoregressively generates SNAC tokens)
┌──────────────────┐
│  De-interleave   │  → CB1 @ 12Hz, CB2 @ 24Hz, CB3 @ 47Hz
└────────┬─────────┘
         ▼
┌──────────────────┐
│  SNAC Decoder    │  Neural codec → 24kHz WAV
└──────────────────┘

Figure 4 — OptimumLLM Audio 1.0 architecture: Extended LLaMA 3.2 3B with SNAC tokenization.

Vocabulary Extension

| Category | Count | Token Format | Purpose |
|---|---|---|---|
| SNAC CB1 codes | 4,096 | <|snac1_0|> to <|snac1_4095|> | Coarse audio (12Hz) |
| SNAC CB2 codes | 4,096 | <|snac2_0|> to <|snac2_4095|> | Mid audio (24Hz) |
| SNAC CB3 codes | 4,096 | <|snac3_0|> to <|snac3_4095|> | Fine audio (47Hz) |
| Special tokens | ~20 | <|audio_start|>, etc. | Structure |
| Emotion tokens | ~15 | <|emotion:happy|>, etc. | Emotion control |
| Speaker tokens | ~10 | <|speaker_start|>, etc. | Speaker conditioning |
| Total new | ~12,333 | | |
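The SNAC rows above can be enumerated programmatically. A sketch (helper name ours) that generates the 12,288 codec token strings in the table's format — in practice these would then be passed to the tokenizer's vocabulary-extension routine:

```python
# Enumerate the new SNAC audio tokens: 3 codebooks x 4,096 codes each,
# using the <|snacN_i|> string format from the table above.
def snac_token_strings(codebooks=3, codes=4096):
    return [
        f"<|snac{cb}_{i}|>"
        for cb in range(1, codebooks + 1)
        for i in range(codes)
    ]

tokens = snac_token_strings()
print(len(tokens))              # 12288
print(tokens[0], tokens[-1])    # <|snac1_0|> <|snac3_4095|>
```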

Built-in Speaker Voices

| Voice | Gender · Accent | Style |
|---|---|---|
| male_us_01 | Male · US English | Warm, authoritative |
| male_us_02 | Male · US English | Young, energetic |
| female_us_01 | Female · US English | Professional, clear |
| female_us_02 | Female · US English | Warm, conversational |
| male_uk_01 | Male · British (RP) | Formal, polished |
| male_uk_02 | Male · British | Casual, friendly |
| female_uk_01 | Female · British (RP) | Professional |
| female_uk_02 | Female · British | Natural, warm |

Inference Code Example

python
from optimumllm_audio import OptimumTTS

model = OptimumTTS.from_pretrained("optimumai/OptimumLLM-Audio-1.0")

# Generate with built-in speaker
wav = model.generate(
    text="Welcome to OptimumAI.",
    speaker="male_us_01",
    emotion="friendly"
)

# Voice cloning from audio prompt
wav = model.generate(
    text="This is my cloned voice.",
    audio_prompt="reference_voice.wav",  # 10-30 seconds
    emotion="excited"
)

# Two-speaker dialogue
wav = model.generate_dialogue(
    script=[
        ("speaker1", "Welcome to the show!"),
        ("speaker2", "Thanks for having me."),
    ],
    speaker1="male_us_01",
    speaker2="custom_voice.wav",
)
Section 9

Audio Datasets
13 datasets evaluated for training

| Dataset | Hours | Speakers | Sample Rate | Emotion | Accent | Quality | Priority |
|---|---|---|---|---|---|---|---|
| LibriTTS-R | 585 | 2,456 | 24kHz | No | US | ★★★★★ | P0 |
| VCTK | 44 | 110 | 48kHz | No | British+ | ★★★★★ | P0 |
| HiFi-TTS | 292 | 10 | 44.1kHz | No | US | ★★★★★ | P1 |
| Expresso | 40 | 4 | 48kHz | Yes | US | ★★★★★ | P0 |
| RAVDESS | 7 | 24 | 48kHz | Yes | US | ★★★★☆ | P1 |
| EmoV-DB | 7 | 4 | 16kHz | Yes | US/UK | ★★★☆☆ | P2 |
| GigaSpeech | 10,000 | Many | 16kHz | No | US | ★★★☆☆ | P1 |
| MLS (en) | 44,000 | 5,000+ | 16kHz | No | US/UK | ★★★☆☆ | P2 |
| LJSpeech | 24 | 1 | 22kHz | No | US | ★★★★☆ | P2 |

Recommended Training Mix

Phase 1: Large-Scale Pretraining (~1,050 hours)

LibriTTS-R (460h) + GigaSpeech (300h) + HiFi-TTS (292h)

Phase 2: Quality Fine-Tuning (~184 hours)

LibriTTS-R clean (100h) + VCTK (44h) + Expresso (40h)

Phase 3: Emotion & Speaker Alignment (~54 hours)

Expresso (40h) + RAVDESS (7h) + EmoV-DB (7h)
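The phase totals can be sanity-checked directly from the per-dataset hours listed above (the dict layout and phase keys are ours):

```python
# Per-phase training mix, hours per source taken from the recommended mix.
PHASES = {
    "phase1_pretrain": {"LibriTTS-R": 460, "GigaSpeech": 300, "HiFi-TTS": 292},
    "phase2_quality":  {"LibriTTS-R clean": 100, "VCTK": 44, "Expresso": 40},
    "phase3_emotion":  {"Expresso": 40, "RAVDESS": 7, "EmoV-DB": 7},
}

totals = {name: sum(mix.values()) for name, mix in PHASES.items()}
print(totals)  # {'phase1_pretrain': 1052, 'phase2_quality': 184, 'phase3_emotion': 54}
```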

Section 10

Data Preparation
Audio preprocessing pipeline

Audio Format Requirements

| Parameter | Requirement | Reason |
|---|---|---|
| Sample Rate | 24,000 Hz | SNAC operates at 24kHz |
| Channels | Mono | Speech is mono; stereo wastes tokens |
| Format | WAV (16-bit PCM) | Lossless; MP3 introduces artifacts |
| Loudness | −23 LUFS (±1 dB) | Consistent volume across speakers |
| Max Duration | 30 seconds | Fits in LLaMA context window |
| Min Duration | 1 second | Shorter clips lack context |
| SNR | ≥20 dB | Reject noisy samples |
Preprocessing Code
python
import torch, torchaudio
import pyloudnorm as pyln

def preprocess_audio(input_path, output_path, target_sr=24000):
    waveform, sr = torchaudio.load(input_path)
    
    # Convert to mono
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    
    # Resample to 24kHz
    if sr != target_sr:
        waveform = torchaudio.transforms.Resample(sr, target_sr)(waveform)
    
    # Loudness normalization to -23 LUFS
    audio_np = waveform.squeeze().numpy()
    meter = pyln.Meter(target_sr)
    loudness = meter.integrated_loudness(audio_np)
    audio_np = pyln.normalize.loudness(audio_np, loudness, -23.0)
    
    waveform = torch.from_numpy(audio_np).unsqueeze(0).clamp(-1, 1)
    torchaudio.save(output_path, waveform, target_sr)
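Alongside the waveform transform, a metadata gate enforcing the requirements table might look like this — a sketch, and the clip-dict keys are hypothetical rather than a fixed schema:

```python
# Reject clips that violate the audio format requirements table.
# The dictionary keys here are illustrative, not a fixed schema.
def passes_filters(clip):
    return (
        clip["sample_rate"] == 24000          # SNAC operates at 24kHz
        and clip["channels"] == 1             # mono only
        and 1.0 <= clip["duration_s"] <= 30.0 # 1s-30s window
        and clip["snr_db"] >= 20.0            # reject noisy samples
        and abs(clip["lufs"] - (-23.0)) <= 1.0  # -23 LUFS +/- 1
    )

ok = {"sample_rate": 24000, "channels": 1, "duration_s": 12.5,
      "snr_db": 28.0, "lufs": -23.4}
print(passes_filters(ok))  # True
```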
Section 11

Training Pipeline
End-to-end, 4 stages

Stage 1: Vocabulary Extension & Initialization

Extend LLaMA vocab with ~12K audio tokens, initialize embeddings.

LR: N/A · Data: N/A · Est: 5 min

Stage 2: Large-Scale Pretraining

Teach the model text-to-audio token generation on diverse data.

LR: 5e-5 · Data: 1,000h · Est: 72–120h

Stage 3: High-Quality Fine-Tuning

Refine on curated, highest-quality audio data.

LR: 1e-5 · Data: 184h · Est: 24–40h

Stage 4: Speaker & Emotion Alignment

Align speaker prompts and emotion tags for accurate control.

LR: 5e-6 · Data: 54h · Est: 12–20h

Hardware Requirements

| GPU | VRAM | Feasible? | Strategy |
|---|---|---|---|
| A100 40GB | 40 GB | ✅ Yes | BF16 + gradient checkpointing + ZeRO-2 |
| L4 24GB | 24 GB | ⚠️ Tight | ZeRO-3 + CPU offload, batch=1 |
| T4 16GB | 16 GB | ❌ No | Cannot fit 3B model |
| V100 16GB | 16 GB | ❌ No | Same limitation |
Cost Estimate

Total training time: ~108–180 hours on A100. Colab Pro ≈ $0.50–1.50/hour → $54–270 total.
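The cost bounds follow directly from the two ranges (a trivial check; variable names ours):

```python
# Cost bounds implied above: 108-180 A100-hours at $0.50-$1.50 per hour.
low = 108 * 0.50    # best case: fewest hours, cheapest rate
high = 180 * 1.50   # worst case: most hours, priciest rate
print(low, high)    # 54.0 270.0
```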

Section 12

Inference Architecture
Generation, cloning, and emotion control

Emotion Control at Inference

| Emotion | Pitch | Speed | Energy | Breathiness | Use Case |
|---|---|---|---|---|---|
| neutral | Normal | Normal | Normal | Normal | Default narration |
| happy | +15% | +10% | +20% | Low | Positive announcements |
| sad | −10% | −15% | −20% | High | Somber content |
| angry | +20% | +5% | +30% | Low | Confrontational |
| excited | +20% | +20% | +30% | Low | Sports commentary |
| whisper | −20% | −10% | −40% | Very high | Intimate, secretive |
| calm | −5% | −10% | −15% | Medium | Meditation |

Voice Cloning Quality vs Reference Length

| Reference Length | Quality | Use Case |
|---|---|---|
| 5 seconds | Fair | Quick testing |
| 10 seconds | Good | General use |
| 15 seconds | Very good | Recommended |
| 30 seconds | Excellent | Best quality |
| 60+ seconds | Diminishing returns | Not needed |
Section 13

Hosting on HuggingFace

bash
# Install HuggingFace CLI
pip install huggingface_hub

# Login & create repo
huggingface-cli login
huggingface-cli repo create OptimumLLM-Audio-1.0 --type model

# Upload model files
huggingface-cli upload optimumai/OptimumLLM-Audio-1.0 \
    ./optimum-tts/checkpoints/final/ \
    --include "*.safetensors" "*.json" "*.txt" "*.md"
Section 14

Evaluation & Benchmarks

Benchmark Comparison Target

| Metric | Kokoro (82M) | OuteTTS (1B) | Orpheus (3B) | OptimumLLM Target |
|---|---|---|---|---|
| MOS | 3.7 | 3.5 | 4.2 | ≥ 4.0 |
| WER | 4% | 6% | 3% | ≤ 5% |
| Speaker Sim | N/A | 0.78 | 0.87 | ≥ 0.85 |
| PESQ | 3.3 | 3.1 | 3.8 | ≥ 3.5 |
| RTF | 0.05x | 0.8x | 1.2x | ≤ 1.5x |
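Of the metrics above, WER is the easiest to state precisely: word-level edit distance divided by reference length. A minimal implementation (ours, for sanity checks; production evaluation would use a tested library):

```python
# Word Error Rate: Levenshtein distance over words / reference word count.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # classic dynamic-programming edit distance
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("welcome to optimum ai", "welcome to optimal ai"))  # 0.25
```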
Section 15

Roadmap & Future Work

v1.0 — Current
Architecture design (LLaMA 3.2 3B + SNAC)
Data pipeline implementation
SNAC tokenization of training data
Stage 2–4: Training pipeline
Evaluation & benchmarking
HuggingFace deployment
Built-in speakers (4 US + 4 UK English)
v1.1 — Planned
Streaming inference
Paralinguistic tags: [laugh], [sigh]
ONNX/TensorRT export
Voice conversion mode
v2.0 — Future
Multilingual support (10+ languages)
Two-stage architecture
Turbo variant (350M)
Natural language voice description
Real-time streaming for agents
Section 16

References

Key Papers

[20] VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (Wang et al., 2023)
[21] SoundStorm: Efficient Parallel Audio Generation (Borsos et al., 2023)
[22] NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec (Ju et al., 2024)
[23] StyleTTS 2: Towards Human-Level TTS through Style Diffusion (Li et al., 2023)
[24] Voicebox: Text-Guided Multilingual Universal Speech Generation (Le et al., 2023)
Section 17

Appendices

Appendix A: Token Format Specification
text
# Full input format for OptimumLLM Audio 1.0

<|text_start|>
[BPE-tokenized text content]
<|text_end|>

<|emotion:{emotion_tag}|>

<|speaker_start|>
[SNAC tokens from reference audio - optional]
<|speaker_end|>

<|audio_start|>
[Generated SNAC tokens in flat-interleaved format]
[CB1_0] [CB2_0a] [CB2_0b] [CB3_0a] [CB3_0b] [CB3_0c] [CB3_0d]
[CB1_1] [CB2_1a] [CB2_1b] [CB3_1a] [CB3_1b] [CB3_1c] [CB3_1d]
...
<|audio_end|>
Appendix B: SNAC Quick Test
python
# pip install snac torchaudio   (run in a shell before executing this script)

import torch, torchaudio
from snac import SNAC

snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to("cuda")

# 1 second sine wave at 440 Hz
t = torch.linspace(0, 1, 24000).unsqueeze(0)
audio = torch.sin(2 * torch.pi * 440 * t)

codes = snac.encode(audio.unsqueeze(0).to("cuda"))
print(f"CB1: {codes[0].shape}")  # ~12 tokens at 12 Hz
print(f"CB2: {codes[1].shape}")  # ~24 tokens at 24 Hz
print(f"CB3: {codes[2].shape}")  # ~47 tokens at 47 Hz

reconstructed = snac.decode(codes)
torchaudio.save("test_snac.wav", reconstructed.squeeze(0).cpu(), 24000)
Appendix D: Glossary
| Term | Definition |
|---|---|
| AR | Autoregressive — generating one token at a time |
| NAR | Non-autoregressive — generating all tokens in parallel |
| BPE | Byte-Pair Encoding — text tokenization method |
| CB | Codebook — a vocabulary of discrete audio codes |
| CFG | Classifier-Free Guidance — generation control technique |
| DAC | Descript Audio Codec |
| F0 | Fundamental frequency (pitch) |
| FSDP | Fully Sharded Data Parallel |
| GQA | Grouped Query Attention |
| MOS | Mean Opinion Score (1–5 subjective quality) |
| PESQ | Perceptual Evaluation of Speech Quality |
| RoPE | Rotary Position Embedding |
| RVQ | Residual Vector Quantization |
| SNAC | Multi-Scale Neural Audio Codec |
| STOI | Short-Time Objective Intelligibility |
| TTS | Text-to-Speech |
| WER | Word Error Rate |
| ZeRO | Zero Redundancy Optimizer (DeepSpeed) |