Tokenization

How text is converted into tokens that LLMs process.

Overview

Tokenization is the first step in the LLM pipeline: raw text is segmented into discrete tokens drawn from a fixed vocabulary, and each token is mapped to an integer ID that the model actually consumes.

Key Concepts

Tokenization Strategies

Byte-Pair Encoding (BPE)

  • Most common approach (GPT series, LLaMA)
  • Starts from a base alphabet (often raw bytes) and repeatedly merges the most frequent adjacent symbol pair into a new token; see the training sketch below
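
A minimal sketch of the BPE merge loop, in the spirit of the Sennrich et al. (2016) reference implementation; the toy corpus and the merge budget are illustrative only:

    import re
    from collections import Counter

    def get_pair_counts(vocab):
        # Count adjacent symbol pairs across the corpus, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        # Concatenate the pair wherever it occurs, matching only at symbol boundaries.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
        return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

    # Toy corpus: words as space-separated symbols, mapped to their frequencies.
    vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

    for step in range(10):  # the merge budget is what sets the final vocabulary size
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        print(f"merge {step + 1}: {best[0]} + {best[1]}")

Each learned merge becomes a new vocabulary entry; at inference time the same merges are replayed in order on unseen text.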

WordPiece

  • Used by BERT and its derivatives
  • Like BPE, but merges are chosen to maximize training-data likelihood rather than raw pair frequency; inference segments each word greedily, longest match first (sketched below)
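
A sketch of the inference-time segmentation only (training, which selects pieces by likelihood, is more involved). The three-piece vocabulary is a hypothetical toy, not BERT's real one:

    def wordpiece_tokenize(word, vocab, unk="[UNK]"):
        # Greedy longest-match-first; word-internal pieces carry the "##" prefix.
        tokens, start = [], 0
        while start < len(word):
            end, piece = len(word), None
            while start < end:  # shrink the window until a known piece matches
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate
                if candidate in vocab:
                    piece = candidate
                    break
                end -= 1
            if piece is None:  # no piece matched: the whole word is unknown
                return [unk]
            tokens.append(piece)
            start = end
        return tokens

    vocab = {"un", "##aff", "##able"}  # hypothetical toy vocabulary
    print(wordpiece_tokenize("unaffable", vocab))  # ['un', '##aff', '##able']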

SentencePiece

  • Language-agnostic: treats input as a raw character stream, so no whitespace pre-tokenization is required
  • Encodes spaces as a meta symbol (▁) and supports both BPE and unigram LM algorithms (usage sketched below)
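
A usage sketch assuming the sentencepiece Python package; "corpus.txt" and the "tok" model prefix are placeholder names:

    import sentencepiece as spm

    # Train a small BPE model directly on raw text (writes tok.model / tok.vocab).
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="tok", vocab_size=8000, model_type="bpe"
    )

    sp = spm.SentencePieceProcessor(model_file="tok.model")
    pieces = sp.encode("Hello, world!", out_type=str)  # e.g. ['▁Hello', ',', '▁world', '!']
    ids = sp.encode("Hello, world!")                   # the corresponding integer IDs
    print(pieces, ids)
    print(sp.decode(ids))  # whitespace is restored from the ▁ markers

Because spaces are carried inside the tokens themselves (as ▁), decoding is lossless: decode(encode(x)) reproduces the normalized input exactly.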

Vocabulary Size

  • Typical sizes: 32K (LLaMA 2), ~50K (GPT-2), ~100K (GPT-4's cl100k_base)
  • Trade-off between compression and expressivity: a larger vocabulary packs text into fewer tokens, but it inflates the embedding and output layers and leaves rare tokens under-trained; see the comparison sketch below
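
One way to observe the compression half of the trade-off, assuming the tiktoken package: encode the same sentence under GPT-2's ~50K-entry vocabulary and the ~100K-entry cl100k_base vocabulary and compare token counts.

    import tiktoken

    text = "Tokenization converts raw text into discrete tokens."

    for name in ("gpt2", "cl100k_base"):  # ~50K vs. ~100K vocabulary entries
        enc = tiktoken.get_encoding(name)
        ids = enc.encode(text)
        print(f"{name}: {len(ids)} tokens (vocab size {enc.n_vocab})")

The larger vocabulary typically needs fewer tokens for the same text, at the cost of bigger embedding and output matrices.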