
OptimumLLM Audio 1.0

Complete TTS Research
& Development Guide

Building a state-of-the-art text-to-speech model on LLaMA architecture. From landscape analysis to deployment — every decision documented.

Section 1

Executive Summary

OptimumLLM Audio 1.0 is a text-to-speech foundation model built on Meta's LLaMA open-source architecture. This document provides a comprehensive end-to-end guide for developing, training, and deploying a production-quality TTS system that produces natural, emotionally expressive, human-like speech.

Key Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Base LLM | LLaMA 3.2 3B | Best balance of quality, trainability on Colab Pro, proven TTS capability |
| Audio Codec | SNAC (24kHz) | Multi-scale tokenization, superior quality at low bitrate, proven in Orpheus |
| Training Approach | Full fine-tuning | LoRA adapters cannot learn the audio token distribution deeply enough |
| Languages | US + British English | Focused quality over breadth |
| Speakers | Multi-speaker + zero-shot | Built-in voices + reference audio cloning |
| Emotion | Data + controllable tags | Natural emotion from data + explicit tags like [happy], [sad] |
| Training Hardware | Colab Pro (A100 40GB) | Feasible with gradient checkpointing + DeepSpeed ZeRO-3 |
| Hosting | HuggingFace Hub | Open weights, community access, inference API |

What This Document Covers

Landscape analysis of 8 state-of-the-art TTS systems
Audio tokenization deep dive (SNAC, Encodec, DAC, Mimi)
Architecture design for OptimumLLM Audio 1.0
Dataset curation (13 datasets evaluated, optimal training mix)
Data preparation pipeline with code
Training pipeline (4 stages, Colab Pro-optimized)
Inference architecture with voice cloning and emotion control
HuggingFace deployment guide
Section 2

The LLM-TTS Revolution
Why Language Models Are the Future

Evolution of TTS Technology

| Era | Period | Approach | Example | Quality |
|---|---|---|---|---|
| Concatenative | 1990s–2010 | Splicing recorded phonemes | Festival, MaryTTS | Robotic |
| Statistical Parametric | 2005–2016 | HMM-based acoustic models | HTS | Monotone |
| Neural Seq2Seq | 2017–2021 | Encoder-decoder + attention | Tacotron 2 | Good |
| Diffusion-Based | 2021–2023 | Score-based generative models | Grad-TTS | Very Good |
| LLM-Based | 2023–Now | Autoregressive LMs on audio tokens | VALL-E, Orpheus | Human-level |

Why Language Models Excel at TTS

The key insight: speech is just another language. When audio is discretized into tokens using neural codecs, an autoregressive language model can learn to “speak” just as it learns to “write”.

Zero-shot Voice Cloning

Provide reference audio as prompt context — the LLM continues in that voice via in-context learning.

Natural Prosody

LLMs capture long-range dependencies better than encoder-decoder models, producing natural rhythm and intonation.

Emergent Emotion

Trained on diverse speech, LLMs learn emotional expression implicitly without explicit prosody labels.

Scalability

The same scaling laws that improved text LLMs apply: more data + more params = better speech.

Multi-task

One model handles TTS, voice conversion, speech continuation, and dialogue generation.

The Core Pipeline

┌──────────┐    ┌──────────────┐    ┌───────────────┐    ┌──────────┐
│   Text   │───▶│ Text Tokenizer│───▶│   LLM Decoder │───▶│  Audio   │
│  Input   │    │  (BPE/SPM)   │    │ (LLaMA-based) │    │  Tokens  │
└──────────┘    └──────────────┘    └───────────────┘    └────┬─────┘
                                                              │
                                                              ▼
┌──────────┐    ┌──────────────┐    ┌───────────────┐    ┌──────────┐
│  Output  │◀───│   Vocoder    │◀───│ Audio Codec   │◀───│  Decode  │
│   .wav   │    │  (optional)  │    │  Decoder      │    │  Tokens  │
└──────────┘    └──────────────┘    └───────────────┘    └──────────┘

Figure 1 — The LLM-based TTS pipeline: text tokens in, audio tokens out, codec decodes to waveform.

Section 3

Landscape Analysis
8 State-of-the-Art TTS Models

We analyzed eight state-of-the-art TTS models to inform the design of OptimumLLM Audio 1.0. Each model represents a different approach to the LLM-TTS paradigm.

Orpheus TTS (Canopy AI) · 3B
Base: LLaMA 3.1 3B · Codec: SNAC · Codebooks: 3 · Sample Rate: 24kHz · Voice Clone: Zero-shot · Emotion: Emotion tags

Sesame CSM (Sesame AI Labs) · 1B
Base: LLaMA 3.2 1B · Codec: Mimi · Codebooks: 32 · Sample Rate: 24kHz · Voice Clone: Context-based · Emotion: Conversational

OuteTTS 0.3 (edwko) · 1B
Base: LLaMA 3.2 1B · Codec: DAC · Codebooks: 4 · Sample Rate: 24kHz · Voice Clone: One-shot · Emotion: Limited

Chatterbox (Resemble AI) · 350M–500M
Base: LLaMA 3 (mod) · Codec: S3Tokenizer · Codebooks: Custom · Sample Rate: 24kHz · Voice Clone: Zero-shot · Emotion: Exaggeration param

Dia TTS (Nari Labs) · 1.6B
Base: Custom Transformer · Codec: DAC · Codebooks: 9 · Sample Rate: 44.1kHz · Voice Clone: Speaker tags · Emotion: Non-verbal

Kokoro (hexgrad) · 82M
Base: StyleTTS 2 · Codec: None (mel) · Codebooks: N/A · Sample Rate: 24kHz · Voice Clone: Style vectors · Emotion: Style transfer

Parler TTS (HuggingFace) · 880M–2.3B
Base: T5 Enc-Dec · Codec: DAC · Codebooks: 4 · Sample Rate: 24kHz · Voice Clone: Text description · Emotion: Via description

XTTS v2 (Coqui AI) · ~1B
Base: GPT-2-like · Codec: VQ-VAE · Codebooks: 1 · Sample Rate: 24kHz · Voice Clone: 6s reference · Emotion: Limited
Section 4

Comparative Architecture
Full comparison matrix

| Model | Params | Codec | CB | SR | Clone | AR | License |
|---|---|---|---|---|---|---|---|
| Orpheus TTS | 3B | SNAC | 3 | 24kHz | Zero-shot | Yes | Apache 2.0 |
| Sesame CSM | 1B | Mimi | 32 | 24kHz | Context-based | Yes + NAR | Apache 2.0 |
| OuteTTS 0.3 | 1B | DAC | 4 | 24kHz | One-shot | Yes | Apache 2.0 |
| Chatterbox | 350M–500M | S3Tokenizer | Custom | 24kHz | Zero-shot | Yes | MIT |
| Dia TTS | 1.6B | DAC | 9 | 44.1kHz | Speaker tags | No (NAR) | Apache 2.0 |
| Kokoro | 82M | None (mel) | N/A | 24kHz | Style vectors | No | Apache 2.0 |
| Parler TTS | 880M–2.3B | DAC | 4 | 24kHz | Text description | Yes | Apache 2.0 |
| XTTS v2 | ~1B | VQ-VAE | 1 | 24kHz | 6s reference | Yes | MPL 2.0 |
Key Insight: The Winning Pattern

The most successful LLaMA-based TTS models share a common formula:

LLaMA Base (1B–3B) + SNAC/DAC Codec + Full Fine-Tuning + Emotion Data = SOTA Quality
Section 5

Audio Tokenization
The Bridge Between Text and Sound

Audio tokenization is the most critical component of LLM-based TTS. The quality of your codec directly determines the quality ceiling of your model.

SNAC — Multi-Scale Neural Audio Codec ✅ Recommended

Input Audio (24kHz waveform)
        │
        ▼
┌───────────────────┐
│   SNAC Encoder    │
│                   │
│  ┌─────────────┐  │
│  │ Codebook 1  │  │ ← 12 Hz  (coarse: prosody, pitch, speaker identity)
│  │ 4096 codes  │  │
│  └─────────────┘  │
│  ┌─────────────┐  │
│  │ Codebook 2  │  │ ← 24 Hz  (mid: phonetic detail, formants)
│  │ 4096 codes  │  │
│  └─────────────┘  │
│  ┌─────────────┐  │
│  │ Codebook 3  │  │ ← 47 Hz  (fine: texture, breathiness, micro-detail)
│  │ 4096 codes  │  │
│  └─────────────┘  │
└───────────────────┘
        │
        ▼
Output: 7 tokens per frame at 12 Hz = 84 tokens/second

Figure 2 — SNAC multi-scale encoder: 3 codebooks at different temporal resolutions.

Codec Comparison Matrix

| Codec | Codebooks | Tokens/sec | PESQ | Size | Best For | License |
|---|---|---|---|---|---|---|
| SNAC (recommended) | 3 | 84 | 3.8+ | ~10M | LLM-TTS (proven) | MIT |
| Encodec | 2–32 | Variable | 3.5+ | ~15M | General audio | MIT |
| DAC | 4–9 | 48–108 | 3.7+ | ~74M | High-quality speech | MIT |
| Mimi | 32 | ~400 | 3.6+ | ~20M | Conversational | Apache 2.0 |
Token Budget

For a 30-second audio clip: 30s × 84 tokens/s = 2,520 audio tokens, plus ~50–100 text tokens for the transcript. Total: ~2,620 tokens — comfortably within even a conservative 8,192-token training context window.
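The arithmetic above can be checked in a couple of lines (a sketch; the constant and function names are ours, not part of any library):

```python
# Token-budget arithmetic from the paragraph above: SNAC emits 7 tokens per
# 12 Hz frame, i.e. 7 * 12 = 84 audio tokens per second of speech.
TOKENS_PER_FRAME = 7
FRAMES_PER_SECOND = 12

def token_budget(audio_seconds, text_tokens=100):
    """Total sequence length: audio tokens plus transcript tokens."""
    return audio_seconds * TOKENS_PER_FRAME * FRAMES_PER_SECOND + text_tokens

print(token_budget(30))  # 2520 audio tokens + ~100 text tokens = 2620
```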

Section 6

The TTS Pipeline
Text → Tokens → Speech

┌─────────────────────────────────────────────────────────────────────────┐
│                     OptimumLLM Audio 1.0 Pipeline                       │
│                                                                         │
│  INPUT                                                                  │
│  ├── Text: "Hello, welcome to OptimumAI"                               │
│  ├── Speaker: "male_us_01" OR audio_prompt.wav                         │
│  └── Emotion: "friendly" (optional)                                    │
│                                                                         │
│  STEP 1: TEXT TOKENIZATION                                             │
│  ├── BPE tokenizer (LLaMA) → [15043, 29892, 12345, ...]              │
│  └── Add special tokens: <|text_start|> ... <|text_end|>              │
│                                                                         │
│  STEP 2: SPEAKER CONDITIONING                                          │
│  ├── IF audio_prompt: Encode via SNAC → speaker_tokens                 │
│  ├── IF speaker_id: Lookup speaker embedding                           │
│  └── Add: <|speaker_start|> ... <|speaker_end|>                       │
│                                                                         │
│  STEP 3: EMOTION CONDITIONING (optional)                               │
│  └── Add: <|emotion:friendly|>                                         │
│                                                                         │
│  STEP 4: LLM GENERATION                                               │
│  ├── LLaMA 3.2 3B generates SNAC audio tokens autoregressively        │
│  ├── Token pattern: [CB1, CB2a, CB2b, CB3a, CB3b, CB3c, CB3d] × N    │
│  └── Stops at <|audio_end|>                                            │
│                                                                         │
│  STEP 5: AUDIO DECODING                                               │
│  ├── Separate tokens back into 3 SNAC codebooks                       │
│  ├── SNAC decoder reconstructs waveform                                │
│  └── Output: 24kHz WAV file                                           │
└─────────────────────────────────────────────────────────────────────────┘

Figure 3 — Complete OptimumLLM Audio 1.0 inference pipeline.
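Steps 1–3 amount to concatenating tagged segments into a single prompt. A minimal sketch, assuming the special-token names shown in the diagram — the `build_prompt` helper itself is hypothetical, not part of the released API:

```python
# Hypothetical prompt assembly for Steps 1-3; special-token names follow the
# pipeline diagram, everything else is illustrative.
def build_prompt(text_ids, emotion=None, speaker_tokens=None):
    parts = ["<|text_start|>", *map(str, text_ids), "<|text_end|>"]
    if emotion is not None:                 # Step 3: optional emotion tag
        parts.append(f"<|emotion:{emotion}|>")
    if speaker_tokens is not None:          # Step 2: SNAC tokens from reference audio
        parts += ["<|speaker_start|>", *speaker_tokens, "<|speaker_end|>"]
    parts.append("<|audio_start|>")         # Step 4: generation continues from here
    return " ".join(parts)

prompt = build_prompt([15043, 29892], emotion="friendly")
print(prompt)
```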

Multi-Codebook Interleaving Strategies

Delay Pattern (used by MusicGen)

Fine codebooks see more coarse context with increasing delay.

Pro: Better coarse-to-fine conditioning. Con: More complex, variable group sizes.

Depth-First (used by various models)

All of CB1, then all of CB2, then all of CB3.

Pro: Clean separation between codebooks. Con: Very long sequences; cannot stream.

Two-Stage (used by Sesame CSM)

Stage 1: LLM generates CB1; Stage 2: small decoder generates CB2–CB32.

Pro: LLM handles shorter sequences. Con: Requires training two separate models.
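The flat-interleaved layout used later in this document (one 7-token group per 12 Hz frame: CB1 once, CB2 twice, CB3 four times) can be sketched as a pair of inverse helpers; the function names are ours:

```python
# Flat interleaving of SNAC codebooks into the 7-token frame pattern
# [CB1, CB2a, CB2b, CB3a, CB3b, CB3c, CB3d], and its inverse.
def interleave(cb1, cb2, cb3):
    flat = []
    for i in range(len(cb1)):                       # one group per 12 Hz frame
        flat += [cb1[i], cb2[2*i], cb2[2*i + 1],
                 cb3[4*i], cb3[4*i + 1], cb3[4*i + 2], cb3[4*i + 3]]
    return flat

def deinterleave(flat):
    cb1, cb2, cb3 = [], [], []
    for f in range(len(flat) // 7):
        frame = flat[7*f : 7*f + 7]
        cb1.append(frame[0])
        cb2 += frame[1:3]
        cb3 += frame[3:7]
    return cb1, cb2, cb3

flat = interleave([1, 2], [3, 4, 5, 6], list(range(10, 18)))
print(flat[:7])  # first frame: [1, 3, 4, 10, 11, 12, 13]
```

The round trip is lossless, which is exactly what Step 5 of the pipeline relies on when splitting generated tokens back into three codebooks.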
Section 7

Why LoRA Adapters Fail
For Base TTS Models

Critical Finding

LoRA (Low-Rank Adaptation) works for text-domain fine-tuning but fundamentally fails for building base TTS models. This was validated through direct experimentation with OuteTTS + Unsloth.

Why This Fails for Audio

1. New Modality, Not New Style

Creating a TTS base model isn't “adapting” text knowledge — it's learning an entirely new modality. LoRA's low-rank constraint (rank 16–64) cannot capture acoustic physics, timing, and speaker characteristics.

2. Embedding Layer Cannot Be LoRA-Adapted

Audio tokens (4,096 SNAC codes per codebook) require new embeddings trained from scratch. LoRA doesn't apply to embedding layers meaningfully.

3. Output Head Mismatch

The LM head must learn when to switch between text and audio prediction. Cross-modal attention patterns cannot form in low-rank space.

4. Distribution Shift

LoRA tries to stay “close” to the original model. Audio token prediction is very far from text prediction — LoRA's constraint prevents the necessary large weight changes.

Approach Comparison

| Approach | Trainable Params | GPU Memory | Quality |
|---|---|---|---|
| LoRA r=16 | ~10M (0.3%) | 8GB | Poor |
| LoRA r=64 | ~40M (1.3%) | 12GB | Mediocre |
| Full FT (1B) | 1B (100%) | 24–40GB | Good |
| Full FT (3B) | 3B (100%) | 40–80GB | Excellent |
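The LoRA rows can be reproduced from first principles. A sketch assuming adapters on the four attention projections of LLaMA 3.2 3B (hidden size 3072, 28 layers, grouped-query KV projections of width 1024 — dimensions from the public model config, the helper itself is ours):

```python
# Reproducing the LoRA parameter counts in the table. Each adapted weight
# W (d_out x d_in) gains low-rank factors A (r x d_in) and B (d_out x r),
# i.e. r * (d_in + d_out) extra parameters per matrix.
def lora_params(rank, layers=28, hidden=3072, kv_dim=1024):
    per_layer = (
        rank * (hidden + hidden)    # q_proj: 3072 -> 3072
        + rank * (hidden + kv_dim)  # k_proj: 3072 -> 1024 (GQA)
        + rank * (hidden + kv_dim)  # v_proj: 3072 -> 1024 (GQA)
        + rank * (hidden + hidden)  # o_proj: 3072 -> 3072
    )
    return per_layer * layers

print(lora_params(16), lora_params(64))  # ~9.2M and ~36.7M -> the ~10M / ~40M rows
```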
Section 8

OptimumLLM Audio 1.0
Architecture Design

Design goals:

Quality over speed: prioritize natural, expressive output
Open weights: full model published on HuggingFace
English focus: US English + British English for v1.0
Multi-speaker: built-in voices + zero-shot cloning
Emotion-aware: implicit + explicit emotion control
Colab-trainable: feasible on A100 40GB

Architecture Diagram

                    ┌────────────────────────────────────────────────┐
                    │                INPUT FORMATTING                │
                    │  <|text_start|> Hello, welcome <|text_end|>    │
                    │  <|emotion:friendly|>                          │
                    │  <|speaker_start|> [SNAC ref] <|speaker_end|>  │
                    │  <|audio_start|>                               │
                    └───────────────────────┬────────────────────────┘
                                            │
                                            ▼
┌────────────────────────────────────────────────────────────────────────┐
│                        LLaMA 3.2 3B Backbone                          │
│  ┌──────────────┐                                                      │
│  │  Extended     │   ~140K vocab (128K text + 12K audio)              │
│  │  Embedding    │                                                      │
│  └──────┬───────┘                                                      │
│  ┌──────▼───────┐                                                      │
│  │  Transformer  │   28 layers, 3072 hidden dim                       │
│  │  Blocks ×28   │   GQA attention, RoPE, RMSNorm, SwiGLU            │
│  └──────┬───────┘                                                      │
│  ┌──────▼───────┐                                                      │
│  │  Extended     │   Logits over full vocab (text + audio)            │
│  │  LM Head      │                                                      │
│  └──────┬───────┘                                                      │
└─────────┼──────────────────────────────────────────────────────────────┘
          │
          ▼  (autoregressively generates SNAC tokens)
┌──────────────────┐
│  De-interleave   │  → CB1 @ 12Hz, CB2 @ 24Hz, CB3 @ 47Hz
└────────┬─────────┘
         ▼
┌──────────────────┐
│  SNAC Decoder    │  Neural codec → 24kHz WAV
└──────────────────┘

Figure 4 — OptimumLLM Audio 1.0 architecture: Extended LLaMA 3.2 3B with SNAC tokenization.

Vocabulary Extension

| Category | Count | Token Format | Purpose |
|---|---|---|---|
| SNAC CB1 codes | 4,096 | <|snac1_0|> to <|snac1_4095|> | Coarse audio (12Hz) |
| SNAC CB2 codes | 4,096 | <|snac2_0|> to <|snac2_4095|> | Mid audio (24Hz) |
| SNAC CB3 codes | 4,096 | <|snac3_0|> to <|snac3_4095|> | Fine audio (47Hz) |
| Special tokens | ~20 | <|audio_start|>, etc. | Structure |
| Emotion tokens | ~15 | <|emotion:happy|>, etc. | Emotion control |
| Speaker tokens | ~10 | <|speaker_start|>, etc. | Speaker conditioning |
| Total new | ~12,333 | | |
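The SNAC rows above can be enumerated programmatically. A sketch (helper name ours) that generates the 12,288 codec token strings in the table's format — in practice these would then be passed to the tokenizer's vocabulary-extension routine:

```python
# Enumerate the new SNAC audio tokens: 3 codebooks x 4,096 codes each,
# using the <|snacN_i|> string format from the table above.
def snac_token_strings(codebooks=3, codes=4096):
    return [
        f"<|snac{cb}_{i}|>"
        for cb in range(1, codebooks + 1)
        for i in range(codes)
    ]

tokens = snac_token_strings()
print(len(tokens))              # 12288
print(tokens[0], tokens[-1])    # <|snac1_0|> <|snac3_4095|>
```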

Built-in Speaker Voices

| Voice | Gender · Accent | Style |
|---|---|---|
| male_us_01 | Male · US English | Warm, authoritative |
| male_us_02 | Male · US English | Young, energetic |
| female_us_01 | Female · US English | Professional, clear |
| female_us_02 | Female · US English | Warm, conversational |
| male_uk_01 | Male · British (RP) | Formal, polished |
| male_uk_02 | Male · British | Casual, friendly |
| female_uk_01 | Female · British (RP) | Professional |
| female_uk_02 | Female · British | Natural, warm |

Inference Code Example

python
from optimumllm_audio import OptimumTTS

model = OptimumTTS.from_pretrained("optimumai/OptimumLLM-Audio-1.0")

# Generate with built-in speaker
wav = model.generate(
    text="Welcome to OptimumAI.",
    speaker="male_us_01",
    emotion="friendly"
)

# Voice cloning from audio prompt
wav = model.generate(
    text="This is my cloned voice.",
    audio_prompt="reference_voice.wav",  # 10-30 seconds
    emotion="excited"
)

# Two-speaker dialogue
wav = model.generate_dialogue(
    script=[
        ("speaker1", "Welcome to the show!"),
        ("speaker2", "Thanks for having me."),
    ],
    speaker1="male_us_01",
    speaker2="custom_voice.wav",
)
Section 9

Audio Datasets
13 datasets evaluated for training

| Dataset | Hours | Speakers | Sample Rate | Emotion | Accent | Quality | Priority |
|---|---|---|---|---|---|---|---|
| LibriTTS-R | 585 | 2,456 | 24kHz | No | US | ★★★★★ | P0 |
| VCTK | 44 | 110 | 48kHz | No | British+ | ★★★★★ | P0 |
| HiFi-TTS | 292 | 10 | 44.1kHz | No | US | ★★★★★ | P1 |
| Expresso | 40 | 4 | 48kHz | Yes | US | ★★★★★ | P0 |
| RAVDESS | 7 | 24 | 48kHz | Yes | US | ★★★★☆ | P1 |
| EmoV-DB | 7 | 4 | 16kHz | Yes | US/UK | ★★★☆☆ | P2 |
| GigaSpeech | 10,000 | Many | 16kHz | No | US | ★★★☆☆ | P1 |
| MLS (en) | 44,000 | 5,000+ | 16kHz | No | US/UK | ★★★☆☆ | P2 |
| LJSpeech | 24 | 1 | 22kHz | No | US | ★★★★☆ | P2 |

Recommended Training Mix

Phase 1: Large-Scale Pretraining (~1,050 hours)

LibriTTS-R (460h) + GigaSpeech (300h) + HiFi-TTS (292h)

Phase 2: Quality Fine-Tuning (~184 hours)

LibriTTS-R clean (100h) + VCTK (44h) + Expresso (40h)

Phase 3: Emotion & Speaker Alignment (~54 hours)

Expresso (40h) + RAVDESS (7h) + EmoV-DB (7h)
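The phase totals can be sanity-checked directly from the per-dataset hours listed above (the dict layout and phase keys are ours):

```python
# Per-phase training mix, hours per source taken from the recommended mix.
PHASES = {
    "phase1_pretrain": {"LibriTTS-R": 460, "GigaSpeech": 300, "HiFi-TTS": 292},
    "phase2_quality":  {"LibriTTS-R clean": 100, "VCTK": 44, "Expresso": 40},
    "phase3_emotion":  {"Expresso": 40, "RAVDESS": 7, "EmoV-DB": 7},
}

totals = {name: sum(mix.values()) for name, mix in PHASES.items()}
print(totals)  # {'phase1_pretrain': 1052, 'phase2_quality': 184, 'phase3_emotion': 54}
```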

Section 10

Data Preparation
Audio preprocessing pipeline

Audio Format Requirements

| Parameter | Requirement | Reason |
|---|---|---|
| Sample Rate | 24,000 Hz | SNAC operates at 24kHz |
| Channels | Mono | Speech is mono; stereo wastes tokens |
| Format | WAV (16-bit PCM) | Lossless; MP3 introduces artifacts |
| Loudness | −23 LUFS (±1 dB) | Consistent volume across speakers |
| Max Duration | 30 seconds | Fits in LLaMA context window |
| Min Duration | 1 second | Shorter clips lack context |
| SNR | ≥20 dB | Reject noisy samples |
Preprocessing Code
python
import torch, torchaudio
import pyloudnorm as pyln

def preprocess_audio(input_path, output_path, target_sr=24000):
    waveform, sr = torchaudio.load(input_path)
    
    # Convert to mono
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    
    # Resample to 24kHz
    if sr != target_sr:
        waveform = torchaudio.transforms.Resample(sr, target_sr)(waveform)
    
    # Loudness normalization to -23 LUFS
    audio_np = waveform.squeeze().numpy()
    meter = pyln.Meter(target_sr)
    loudness = meter.integrated_loudness(audio_np)
    audio_np = pyln.normalize.loudness(audio_np, loudness, -23.0)
    
    waveform = torch.from_numpy(audio_np).unsqueeze(0).clamp(-1, 1)
    torchaudio.save(output_path, waveform, target_sr)
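Alongside the waveform transform, a metadata gate enforcing the requirements table might look like this — a sketch, and the clip-dict keys are hypothetical rather than a fixed schema:

```python
# Reject clips that violate the audio format requirements table.
# The dictionary keys here are illustrative, not a fixed schema.
def passes_filters(clip):
    return (
        clip["sample_rate"] == 24000          # SNAC operates at 24kHz
        and clip["channels"] == 1             # mono only
        and 1.0 <= clip["duration_s"] <= 30.0 # 1s-30s window
        and clip["snr_db"] >= 20.0            # reject noisy samples
        and abs(clip["lufs"] - (-23.0)) <= 1.0  # -23 LUFS +/- 1
    )

ok = {"sample_rate": 24000, "channels": 1, "duration_s": 12.5,
      "snr_db": 28.0, "lufs": -23.4}
print(passes_filters(ok))  # True
```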
Section 11

Training Pipeline
End-to-end, 4 stages

Stage 1: Vocabulary Extension & Initialization

Extend LLaMA vocab with ~12K audio tokens, initialize embeddings.

LR: N/A · Data: N/A · Est: 5 min

Stage 2: Large-Scale Pretraining

Teach the model text-to-audio token generation on diverse data.

LR: 5e-5 · Data: 1,000h · Est: 72–120h

Stage 3: High-Quality Fine-Tuning

Refine on curated, highest-quality audio data.

LR: 1e-5 · Data: 184h · Est: 24–40h

Stage 4: Speaker & Emotion Alignment

Align speaker prompts and emotion tags for accurate control.

LR: 5e-6 · Data: 54h · Est: 12–20h

Hardware Requirements

| GPU | VRAM | Feasible? | Strategy |
|---|---|---|---|
| A100 40GB | 40 GB | ✅ Yes | BF16 + gradient checkpointing + ZeRO-2 |
| L4 24GB | 24 GB | ⚠️ Tight | ZeRO-3 + CPU offload, batch=1 |
| T4 16GB | 16 GB | ❌ No | Cannot fit 3B model |
| V100 16GB | 16 GB | ❌ No | Same limitation |
Cost Estimate

Total training time: ~108–180 hours on A100. Colab Pro ≈ $0.50–1.50/hour → $54–270 total.
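The cost bounds follow directly from the two ranges (a trivial check; variable names ours):

```python
# Cost bounds implied above: 108-180 A100-hours at $0.50-$1.50 per hour.
low = 108 * 0.50    # best case: fewest hours, cheapest rate
high = 180 * 1.50   # worst case: most hours, priciest rate
print(low, high)    # 54.0 270.0
```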

Section 12

Inference Architecture
Generation, cloning, and emotion control

Emotion Control at Inference

| Emotion | Pitch | Speed | Energy | Breathiness | Use Case |
|---|---|---|---|---|---|
| neutral | Normal | Normal | Normal | Normal | Default narration |
| happy | +15% | +10% | +20% | Low | Positive announcements |
| sad | −10% | −15% | −20% | High | Somber content |
| angry | +20% | +5% | +30% | Low | Confrontational |
| excited | +20% | +20% | +30% | Low | Sports commentary |
| whisper | −20% | −10% | −40% | Very high | Intimate, secretive |
| calm | −5% | −10% | −15% | Medium | Meditation |

Voice Cloning Quality vs Reference Length

| Reference Length | Quality | Use Case |
|---|---|---|
| 5 seconds | Fair | Quick testing |
| 10 seconds | Good | General use |
| 15 seconds | Very good | Recommended |
| 30 seconds | Excellent | Best quality |
| 60+ seconds | Diminishing returns | Not needed |
Section 13

Hosting on HuggingFace

bash
# Install HuggingFace CLI
pip install huggingface_hub

# Login & create repo
huggingface-cli login
huggingface-cli repo create OptimumLLM-Audio-1.0 --type model

# Upload model files
huggingface-cli upload optimumai/OptimumLLM-Audio-1.0 \
    ./optimum-tts/checkpoints/final/ \
    --include "*.safetensors" "*.json" "*.txt" "*.md"
Section 14

Evaluation & Benchmarks

Benchmark Comparison Target

| Metric | Kokoro (82M) | OuteTTS (1B) | Orpheus (3B) | OptimumLLM Target |
|---|---|---|---|---|
| MOS | 3.7 | 3.5 | 4.2 | ≥ 4.0 |
| WER | 4% | 6% | 3% | ≤ 5% |
| Speaker Sim | N/A | 0.78 | 0.87 | ≥ 0.85 |
| PESQ | 3.3 | 3.1 | 3.8 | ≥ 3.5 |
| RTF | 0.05x | 0.8x | 1.2x | ≤ 1.5x |
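Of the metrics above, WER is the easiest to state precisely: word-level edit distance divided by reference length. A minimal implementation (ours, for sanity checks; production evaluation would use a tested library):

```python
# Word Error Rate: Levenshtein distance over words / reference word count.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # classic dynamic-programming edit distance
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("welcome to optimum ai", "welcome to optimal ai"))  # 0.25
```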
Section 15

Roadmap & Future Work

v1.0 — Current
Architecture design (LLaMA 3.2 3B + SNAC)
Data pipeline implementation
SNAC tokenization of training data
Stage 2–4: Training pipeline
Evaluation & benchmarking
HuggingFace deployment
Built-in speakers (4 US + 4 UK English)
v1.1 — Planned
Streaming inference
Paralinguistic tags: [laugh], [sigh]
ONNX/TensorRT export
Voice conversion mode
v2.0 — Future
Multilingual support (10+ languages)
Two-stage architecture
Turbo variant (350M)
Natural language voice description
Real-time streaming for agents
Section 16

References

Key Papers

[20] VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (Wang et al., 2023)
[21] SoundStorm: Efficient Parallel Audio Generation (Borsos et al., 2023)
[22] NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec (Ju et al., 2024)
[23] StyleTTS 2: Towards Human-Level TTS through Style Diffusion (Li et al., 2023)
[24] Voicebox: Text-Guided Multilingual Universal Speech Generation (Le et al., 2023)
Section 17

Appendices

Appendix A: Token Format Specification
text
# Full input format for OptimumLLM Audio 1.0

<|text_start|>
[BPE-tokenized text content]
<|text_end|>

<|emotion:{emotion_tag}|>

<|speaker_start|>
[SNAC tokens from reference audio - optional]
<|speaker_end|>

<|audio_start|>
[Generated SNAC tokens in flat-interleaved format]
[CB1_0] [CB2_0a] [CB2_0b] [CB3_0a] [CB3_0b] [CB3_0c] [CB3_0d]
[CB1_1] [CB2_1a] [CB2_1b] [CB3_1a] [CB3_1b] [CB3_1c] [CB3_1d]
...
<|audio_end|>
Appendix B: SNAC Quick Test
python
# pip install snac torchaudio   (run in a shell before executing this script)

import torch, torchaudio
from snac import SNAC

snac = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").to("cuda")

# 1 second sine wave at 440 Hz
t = torch.linspace(0, 1, 24000).unsqueeze(0)
audio = torch.sin(2 * torch.pi * 440 * t)

codes = snac.encode(audio.unsqueeze(0).to("cuda"))
print(f"CB1: {codes[0].shape}")  # ~12 tokens at 12 Hz
print(f"CB2: {codes[1].shape}")  # ~24 tokens at 24 Hz
print(f"CB3: {codes[2].shape}")  # ~47 tokens at 47 Hz

reconstructed = snac.decode(codes)
torchaudio.save("test_snac.wav", reconstructed.squeeze(0).cpu(), 24000)
Appendix D: Glossary
| Term | Definition |
|---|---|
| AR | Autoregressive — generating one token at a time |
| NAR | Non-autoregressive — generating all tokens in parallel |
| BPE | Byte-Pair Encoding — text tokenization method |
| CB | Codebook — a vocabulary of discrete audio codes |
| CFG | Classifier-Free Guidance — generation control technique |
| DAC | Descript Audio Codec |
| F0 | Fundamental frequency (pitch) |
| FSDP | Fully Sharded Data Parallel |
| GQA | Grouped Query Attention |
| MOS | Mean Opinion Score (1–5 subjective quality) |
| PESQ | Perceptual Evaluation of Speech Quality |
| RoPE | Rotary Position Embedding |
| RVQ | Residual Vector Quantization |
| SNAC | Multi-Scale Neural Audio Codec |
| STOI | Short-Time Objective Intelligibility |
| TTS | Text-to-Speech |
| WER | Word Error Rate |
| ZeRO | Zero Redundancy Optimizer (DeepSpeed) |