Data Tokenization
Transform raw data into machine-processable units that power intelligence.
Data tokenization is the foundation of modern AI. It converts text, code, and multimodal inputs into numerical representations that GPUs can process in parallel. Understanding this pipeline, from capture to inference, reveals how AI systems actually work and where optimization opportunities exist.
The Full Pipeline
Data Capture & Preparation
Before tokenization, raw data must be cleaned and structured:
| Stage | What Happens | Why It Matters |
|---|---|---|
| Collection | Scraping, APIs, uploads, sensors | Determines data quality ceiling |
| Cleaning | Remove duplicates, fix encoding, filter noise | Garbage in = garbage out |
| Preprocessing | Normalize formats, chunk documents, extract structure | Enables consistent tokenization |
| Quality Filtering | Remove low-quality, toxic, or irrelevant content | Shapes model behavior and safety |
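As a rough illustration of the cleaning and preprocessing stages, the sketch below normalizes encoding, drops exact duplicates by content hash, and chunks long documents. Function names and the chunk size are illustrative; production pipelines add language detection, quality scoring, and near-duplicate detection.

```python
import hashlib
import unicodedata

def clean(doc: str) -> str:
    """Normalize encoding and strip control characters (Cleaning stage)."""
    doc = unicodedata.normalize("NFC", doc)
    return "".join(ch for ch in doc if ch.isprintable() or ch in "\n\t")

def dedupe(docs: list[str]) -> list[str]:
    """Drop exact duplicates via content hashing (Cleaning stage)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def chunk(doc: str, max_chars: int = 2000) -> list[str]:
    """Split long documents into fixed-size pieces (Preprocessing stage)."""
    return [doc[i:i + max_chars] for i in range(0, len(doc), max_chars)]

corpus = dedupe([clean(d) for d in ["Raw doc one.", "Raw doc one.", "Another doc."]])
chunks = [piece for doc in corpus for piece in chunk(doc)]
print(len(corpus), "unique docs,", len(chunks), "chunks")
```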
Data Sources for AI Training
| Source Type | Examples | Considerations |
|---|---|---|
| Web Crawls | Common Crawl, custom scrapes | Scale vs quality tradeoff |
| Curated Datasets | Wikipedia, arXiv, GitHub | High quality, limited scope |
| Synthetic Data | LLM-generated, simulations | Scaling without new sources |
| Proprietary Data | Internal docs, customer interactions | Competitive moat, privacy concerns |
| User Feedback | RLHF, thumbs up/down, edits | Alignment signal |
Tokenization Deep Dive
Tokenization converts text into discrete units (tokens) that models can process.
How Tokenizers Work
A tokenizer splits raw text into pieces drawn from a fixed vocabulary and maps each piece to an integer ID; those IDs, not the characters themselves, are what the model consumes.
Key Concepts
| Concept | Description | Impact |
|---|---|---|
| Token | Unit of text (word, subword, character) | Granularity of understanding |
| Vocabulary | All tokens a model recognizes (typically 32K-100K+) | Coverage vs efficiency |
| Token ID | Integer mapping for each token | What the model actually sees |
| Embedding | Dense vector representation (768-4096+ dims) | Semantic meaning encoded |
| Context Window | Max tokens processable at once (4K-128K+) | Working memory limit |
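A toy illustration of how these concepts connect, assuming a made-up four-entry vocabulary and 8-dimensional embeddings (real models use the far larger sizes in the table):

```python
import numpy as np

# Toy vocabulary mapping tokens to integer IDs (real vocabularies have 32K-100K+ entries).
vocab = {"<unk>": 0, "data": 1, "token": 2, "ization": 3}

# Embedding table: one dense vector per token ID (real models use 768-4096+ dimensions).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))

tokens = ["data", "token", "ization"]
token_ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]  # what the model actually sees
embeddings = embedding_table[token_ids]                     # dense vectors fed to the first layer

print(token_ids)          # [1, 2, 3]
print(embeddings.shape)   # (3, 8)
```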
Tokenization Algorithms
| Algorithm | Used By | Approach |
|---|---|---|
| BPE (Byte Pair Encoding) | GPT, Claude | Iteratively merge frequent char pairs |
| WordPiece | BERT, DistilBERT | Similar to BPE, different merge criteria |
| SentencePiece | T5, LLaMA | Language-agnostic; works on raw text without pre-tokenization |
| Tiktoken | OpenAI models | Optimized BPE implementation |
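A minimal sketch of the BPE idea from the table: count adjacent symbol pairs across a tiny corpus and merge the most frequent pair, repeating for a few rounds. Real tokenizers learn tens of thousands of merges and handle bytes, whitespace, and special tokens; the corpus and merge count here are illustrative.

```python
from collections import Counter

def most_frequent_pair(words: dict[tuple[str, ...], int]) -> tuple[str, str]:
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus as symbol sequences with frequencies; start from individual characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):  # three merge rounds
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words))
```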
Token Efficiency Matters
"Tokenization" as one token vs "Token" + "ization" as two tokens
Fewer tokens means:
- More content fits in context window
- Faster inference
- Lower API costs
- Better long-range coherence
Analyze tokenization:
- tik-tokenizer - Visualize how text becomes tokens
- bbycroft.net - Interactive LLM architecture explorer
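If tiktoken is installed (`pip install tiktoken`), a few lines show how different strings split under OpenAI's `cl100k_base` encoding; other tokenizers will split differently.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Tokenization", "internationalization", "The quick brown fox"]:
    ids = enc.encode(text)                      # token IDs the model would see
    pieces = [enc.decode([i]) for i in ids]     # the text each ID corresponds to
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```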
GPU Processing
Tokens flow through neural network layers on GPU hardware.
Why GPUs?
| Property | CPU | GPU |
|---|---|---|
| Cores | 8-64 powerful cores | 1000s of simpler cores |
| Parallelism | Sequential optimization | Massive parallelism |
| Memory Bandwidth | ~100 GB/s | ~1-3 TB/s |
| Best For | Complex branching logic | Matrix math (attention, FFN) |
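A quick way to feel this difference, assuming PyTorch is installed and a CUDA GPU is available for the second measurement; the matrix size is illustrative.

```python
import time
import torch  # assumes PyTorch; the GPU branch requires a CUDA device

def time_matmul(device: str, n: int = 4096) -> float:
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()   # make sure setup is finished before timing
    start = time.perf_counter()
    _ = a @ b                      # the matrix math that dominates attention and FFN layers
    if device == "cuda":
        torch.cuda.synchronize()   # wait for the asynchronous GPU kernel to complete
    return time.perf_counter() - start

print(f"CPU: {time_matmul('cpu'):.3f}s")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.3f}s")
```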
The Transformer Architecture
Each token passes through a stack of repeated blocks, each combining multi-head attention and a feed-forward network with residual connections and layer normalization.
Attention Mechanism
The core innovation that enables context understanding:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
Q = Query (what am I looking for?)
K = Key (what do I contain?)
V = Value (what do I contribute?)
Every token attends to every other token, creating O(n^2) complexity, which is why context windows are expensive.
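The formula maps directly to a few lines of NumPy. The shapes here are toy-sized, and the (n, n) score matrix is exactly where the O(n^2) cost appears.

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n, n): every token scored against every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of value vectors

n, d_k = 4, 8                                           # 4 tokens, 8-dim head (toy sizes)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)   # (4, 8) -- one updated representation per token
```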
Memory & Compute Bottlenecks
| Bottleneck | Cause | Mitigation |
|---|---|---|
| KV Cache | Storing attention keys/values | Quantization, sliding window |
| Attention Compute | O(n^2) with sequence length | Flash Attention, sparse attention |
| Memory Bandwidth | Moving weights to compute units | Tensor parallelism, batching |
| Model Size | Billions of parameters | Quantization (FP16, INT8, INT4) |
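The KV cache row is easy to quantify with back-of-the-envelope arithmetic. The sketch below assumes a generic 7B-class configuration with full multi-head attention at FP16; models using grouped-query attention or quantized caches need much less.

```python
def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_value: int = 2) -> int:
    """Rough KV-cache footprint: keys + values for every layer, head, and position."""
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_value

# Illustrative 7B-class configuration (32 layers, 32 heads, 128-dim heads) at FP16.
gb = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=32_000, batch=1) / 1e9
print(f"~{gb:.1f} GB of KV cache for one 32K-token sequence")
```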
AI Consumption: Inference
Inference is how trained models generate outputs.
Inference Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Greedy Decoding | Always pick highest probability token | Deterministic, fast |
| Temperature Sampling | Scale logits before sampling | Control randomness |
| Top-K Sampling | Sample from K most likely tokens | Bounded creativity |
| Top-P (Nucleus) | Sample from smallest set with cumulative P | Dynamic vocabulary |
| Beam Search | Track multiple hypotheses | Translation, structured output |
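A sketch combining several of these strategies over a toy logit vector: temperature 0 falls back to greedy decoding, `top_k` keeps the K most likely tokens, and `top_p` keeps the smallest set reaching cumulative probability P. Filter ordering and edge cases vary across real implementations.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0,
                      rng: np.random.Generator = np.random.default_rng()) -> int:
    """Pick the next token ID from raw logits using the strategies above."""
    if temperature == 0:                               # greedy decoding
        return int(np.argmax(logits))
    logits = logits / temperature                      # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                    # most likely first
    if top_k > 0:
        order = order[:top_k]                          # top-k: keep the K most likely tokens
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1    # top-p: smallest set with cumulative mass >= p
    order = order[:cutoff]
    kept = probs[order] / probs[order].sum()           # renormalize over the kept tokens
    return int(rng.choice(order, p=kept))

logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0])         # toy vocabulary of 5 tokens
print(sample_next_token(logits, temperature=0.8, top_k=3, top_p=0.9))
```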
Inference Optimization
| Technique | Speedup | Tradeoff |
|---|---|---|
| Quantization | 2-4x | Slight quality loss |
| Speculative Decoding | 2-3x | Requires draft model |
| Continuous Batching | 2-10x | Infrastructure complexity |
| KV Cache Optimization | 1.5-2x | Memory management |
| Flash Attention | 2-4x | Kernel implementation |
Inference Cost Drivers
Cost scales roughly with (Input Tokens + Output Tokens) x Model Size x Compute Time
Factors:
- Context length (quadratic attention cost)
- Output length (sequential generation)
- Model parameters (memory + compute)
- Batch efficiency (utilization)
- Hardware (GPU type, availability)
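In practice providers bill per token, so a per-request estimate reduces to token counts times per-million prices. A minimal sketch, with illustrative token counts and prices:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Rough API cost estimate in dollars from token counts and per-million-token prices."""
    return (input_tokens / 1e6) * input_price_per_m + (output_tokens / 1e6) * output_price_per_m

# Example: a 20K-token prompt with a 1K-token answer at $2.50/M input and $10/M output.
print(f"${estimate_cost(20_000, 1_000, 2.50, 10.00):.3f} per request")
```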
RAG: Retrieval-Augmented Generation
RAG grounds LLM outputs in external knowledge, reducing hallucination and enabling current information access.
RAG Architecture
Documents are chunked and embedded into a vector store; at query time the query is embedded, the most similar chunks are retrieved (and optionally reranked), and the results are injected into the prompt before generation.
RAG Components
| Component | Options | Considerations |
|---|---|---|
| Chunking | Fixed size, semantic, recursive | Balance context vs precision |
| Embeddings | OpenAI, Cohere, open-source | Dimension, quality, cost |
| Vector DB | Pinecone, Weaviate, Chroma, pgvector | Scale, features, hosting |
| Retrieval | Dense, sparse, hybrid | Recall vs precision |
| Reranking | Cross-encoder, ColBERT | Quality vs latency |
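A self-contained sketch of the retrieval half of RAG: chunk, embed, index, and rank by cosine similarity. The `embed` function here is a hash-seeded random placeholder, so retrieval quality is meaningless; a real system would call an embedding model and a vector database from the table above.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: hash-seeded random unit vector. A real system calls an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def chunk(doc: str, size: int = 200) -> list[str]:
    return [doc[i:i + size] for i in range(0, len(doc), size)]

# Index: chunk documents, embed each chunk, keep vectors in memory (a vector DB at toy scale).
documents = ["Tokenization converts text into discrete units...",
             "GPUs excel at parallel matrix math..."]
chunks = [c for d in documents for c in chunk(d)]
index = np.stack([embed(c) for c in chunks])

# Retrieve: embed the query, rank chunks by cosine similarity, prepend the top hits to the prompt.
query = "Why are GPUs used for inference?"
scores = index @ embed(query)
top = [chunks[i] for i in np.argsort(scores)[::-1][:2]]
prompt = "Answer using only this context:\n" + "\n".join(top) + f"\n\nQuestion: {query}"
print(prompt)
```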
RAG Optimization
| Challenge | Solution |
|---|---|
| Poor retrieval | Better chunking, hybrid search, reranking |
| Context overflow | Summarization, filtering, hierarchical retrieval |
| Outdated info | Incremental indexing, freshness scoring |
| Hallucination | Citation requirements, confidence thresholds |
| Latency | Caching, async retrieval, smaller models |
Context & Progress Disclosure Strategies
How to effectively use limited context windows and communicate processing state.
Context Management Strategies
| Strategy | Description | Best For |
|---|---|---|
| Sliding Window | Keep recent N tokens, drop oldest | Streaming, chat |
| Summarization | Compress history into summary | Long conversations |
| Hierarchical | Summary + recent detail | Best of both worlds |
| RAG Integration | Retrieve relevant history | Large knowledge bases |
| Tool Use | Offload to external memory | Complex workflows |
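A minimal sliding-window implementation, using whitespace word counts as a stand-in for real token counts; a production version would use the model's tokenizer and usually pin the system message.

```python
def trim_history(messages: list[str], budget: int,
                 count_tokens=lambda text: len(text.split())) -> list[str]:
    """Sliding window: keep the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):             # walk backwards from the newest message
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))                # restore chronological order

history = ["System: You are helpful.", "User: long question ...",
           "Assistant: long answer ...", "User: follow-up question?"]
print(trim_history(history, budget=12))
```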
Progress Disclosure Patterns
For long-running AI tasks, users need visibility:
| Pattern | Implementation | User Experience |
|---|---|---|
| Streaming | Token-by-token output | Immediate feedback |
| Chunked Updates | Periodic progress messages | Balanced latency |
| Status Indicators | "Thinking...", "Searching..." | Process transparency |
| Partial Results | Show intermediate outputs | Validate direction |
| Confidence Signals | Uncertainty indicators | Trust calibration |
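A sketch of streaming plus status indicators, with a placeholder generator standing in for a provider's streaming API:

```python
import time
from typing import Iterator

def fake_token_stream(answer: str) -> Iterator[str]:
    """Placeholder generator standing in for a streaming model API."""
    for word in answer.split():
        time.sleep(0.05)                       # simulate per-token generation latency
        yield word + " "

def run_with_progress(question: str) -> None:
    print("Searching...", flush=True)          # status indicators before any tokens arrive
    print("Thinking...", flush=True)
    for token in fake_token_stream("Streaming shows partial results immediately."):
        print(token, end="", flush=True)       # token-by-token streaming for immediate feedback
    print()

run_with_progress("What is streaming?")
```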
Effective Prompting for Context Efficiency
- Front-load important context - Models attend more to beginning and end
- Use structured formats - JSON, XML, markdown for clear parsing
- Explicit instructions - Don't assume implicit understanding
- Few-shot examples - Show desired output format
- Chain of thought - "Think step by step" for complex reasoning
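Putting several of these guidelines together, a prompt might look like the following; the product, ticket, and labels are invented purely for illustration.

```python
# Key context first, structured sections, explicit instructions,
# one few-shot example, and a chain-of-thought cue.
prompt = """You are a support triage assistant.

## Context (most important information first)
Product: Acme Sync v2.3. Known issue: uploads fail on files over 2 GB.

## Instructions
Classify the ticket as `bug`, `feature_request`, or `question`. Respond in JSON only.

## Example
Ticket: "Can you add dark mode?"
Answer: {"category": "feature_request"}

## Ticket
"My 3 GB upload keeps failing."

Think step by step, then output only the JSON answer.
"""
print(prompt)
```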
Context Window Economics
| Model Class | Context | Input Cost | Output Cost |
|---|---|---|---|
| GPT-4o | 128K | $2.50/M | $10/M |
| Claude Opus 4 | 200K | $15/M | $75/M |
| Claude Sonnet 4 | 200K | $3/M | $15/M |
| Gemini 1.5 Pro | 1M+ | $1.25/M | $5/M |
| Open Source | 8K-128K | Compute cost | Compute cost |
Prices approximate and subject to change
Monitoring Protocol
Track the evolving AI infrastructure landscape.
Key Metrics
| Metric | What It Indicates | Where to Find |
|---|---|---|
| Tokens/Second | Inference throughput | Benchmarks, provider dashboards |
| Time to First Token | Response latency | API monitoring |
| Cost per 1M Tokens | Economic efficiency | Provider pricing pages |
| Context Window Size | Memory capacity | Model announcements |
| Benchmark Scores | Model capability | Papers, leaderboards |
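Time to first token and tokens per second can be measured around any streaming response. The sketch below times a dummy generator; in practice you would wrap your provider's streaming iterator.

```python
import time
from typing import Iterable

def measure_stream(token_stream: Iterable[str]) -> dict[str, float]:
    """Measure time-to-first-token and throughput for any streaming token source."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()   # latency until the first token arrives
        count += 1
    total = time.perf_counter() - start
    return {"ttft_s": first_token_at - start, "tokens_per_s": count / total}

def dummy_stream():
    """Stand-in for a streaming API response."""
    for _ in range(50):
        time.sleep(0.02)
        yield "tok"

print(measure_stream(dummy_stream()))
```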
Trends to Watch
- Context window expansion (1M+ tokens becoming standard)
- Inference cost reduction (roughly 10x cheaper year over year)
- Multimodal tokenization (unified text/image/audio)
- Mixture of Experts scaling (sparse activation)
- On-device inference (mobile, edge deployment)
Sources to Monitor
Benchmarks & Research:
- Hugging Face Leaderboards - Model comparisons
- Papers With Code - Latest research
- arXiv cs.CL - NLP papers
Infrastructure:
- Artificial Analysis - LLM performance benchmarks
- GPU availability trackers - Hardware access
- Provider status pages - Reliability metrics
Voices to Follow:
- Andrej Karpathy (AI education, Tesla/OpenAI alum)
- Jim Fan (NVIDIA, embodied AI)
- Sasha Rush (Cornell, efficient transformers)
- Research teams: Anthropic, OpenAI, DeepMind, Meta AI
Learn More
- Attention Is All You Need - Original transformer paper
- The Illustrated Transformer - Visual explanation
- LLM Visualization - Interactive architecture explorer
- What Is ChatGPT Doing? - Stephen Wolfram deep dive