Data Tokenization
Transform raw data into machine-processable units that power intelligence.
Data tokenization is the foundation of modern AI. It converts text, code, and multimodal inputs into numerical representations that GPUs can process in parallel. Understanding this pipeline, from capture to inference, reveals how AI systems actually work and where optimization opportunities exist.
The Full Pipeline
Data Capture & Preparation
Before tokenization, raw data must be cleaned and structured:
| Stage | What Happens | Why It Matters |
|---|---|---|
| Collection | Scraping, APIs, uploads, sensors | Determines data quality ceiling |
| Cleaning | Remove duplicates, fix encoding, filter noise | Garbage in = garbage out |
| Preprocessing | Normalize formats, chunk documents, extract structure | Enables consistent tokenization |
| Quality Filtering | Remove low-quality, toxic, or irrelevant content | Shapes model behavior and safety |
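As a rough illustration of the cleaning and preprocessing stages, the sketch below normalizes encoding, drops exact duplicates by content hash, and chunks long documents. Function names and the chunk size are illustrative; production pipelines add language detection, quality scoring, and near-duplicate detection.

```python
import hashlib
import unicodedata

def clean(doc: str) -> str:
    """Normalize encoding and strip control characters (Cleaning stage)."""
    doc = unicodedata.normalize("NFC", doc)
    return "".join(ch for ch in doc if ch.isprintable() or ch in "\n\t")

def dedupe(docs: list[str]) -> list[str]:
    """Drop exact duplicates via content hashing (Cleaning stage)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def chunk(doc: str, max_chars: int = 2000) -> list[str]:
    """Split long documents into fixed-size pieces (Preprocessing stage)."""
    return [doc[i:i + max_chars] for i in range(0, len(doc), max_chars)]

corpus = dedupe([clean(d) for d in ["Raw doc one.", "Raw doc one.", "Another doc."]])
chunks = [piece for doc in corpus for piece in chunk(doc)]
print(len(corpus), "unique docs,", len(chunks), "chunks")
```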
Data Sources for AI Training
| Source Type | Examples | Considerations |
|---|---|---|
| Web Crawls | Common Crawl, custom scrapes | Scale vs quality tradeoff |
| Curated Datasets | Wikipedia, arXiv, GitHub | High quality, limited scope |
| Synthetic Data | LLM-generated, simulations | Scaling without new sources |
| Proprietary Data | Internal docs, customer interactions | Competitive moat, privacy concerns |
| User Feedback | RLHF, thumbs up/down, edits | Alignment signal |
Tokenization Deep Dive
Tokenization converts text into discrete units (tokens) that models can process.
How Tokenizers Work
A tokenizer splits raw text into pieces drawn from a fixed vocabulary and maps each piece to an integer ID; those IDs, not the characters themselves, are what the model consumes.
Key Concepts
| Concept | Description | Impact |
|---|---|---|
| Token | Unit of text (word, subword, character) | Granularity of understanding |
| Vocabulary | All tokens a model recognizes (typically 32K-100K+) | Coverage vs efficiency |
| Token ID | Integer mapping for each token | What the model actually sees |
| Embedding | Dense vector representation (768-4096+ dims) | Semantic meaning encoded |
| Context Window | Max tokens processable at once (4K-128K+) | Working memory limit |
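A toy illustration of how these concepts connect, assuming a made-up four-entry vocabulary and 8-dimensional embeddings (real models use the far larger sizes in the table):

```python
import numpy as np

# Toy vocabulary mapping tokens to integer IDs (real vocabularies have 32K-100K+ entries).
vocab = {"<unk>": 0, "data": 1, "token": 2, "ization": 3}

# Embedding table: one dense vector per token ID (real models use 768-4096+ dimensions).
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 8))

tokens = ["data", "token", "ization"]
token_ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]  # what the model actually sees
embeddings = embedding_table[token_ids]                     # dense vectors fed to the first layer

print(token_ids)          # [1, 2, 3]
print(embeddings.shape)   # (3, 8)
```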
Tokenization Algorithms
| Algorithm | Used By | Approach |
|---|---|---|
| BPE (Byte Pair Encoding) | GPT, Claude | Iteratively merge frequent char pairs |
| WordPiece | BERT, DistilBERT | Similar to BPE, different merge criteria |
| SentencePiece | T5, LLaMA | Language-agnostic; works on raw text without pre-tokenization |
| Tiktoken | OpenAI models | Optimized BPE implementation |
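A minimal sketch of the BPE idea from the table: count adjacent symbol pairs across a tiny corpus and merge the most frequent pair, repeating for a few rounds. Real tokenizers learn tens of thousands of merges and handle bytes, whitespace, and special tokens; the corpus and merge count here are illustrative.

```python
from collections import Counter

def most_frequent_pair(words: dict[tuple[str, ...], int]) -> tuple[str, str]:
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Corpus as symbol sequences with frequencies; start from individual characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):  # three merge rounds
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words))
```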
Token Efficiency Matters
"Tokenization" as one token vs "Token" + "ization" as two tokens
Fewer tokens means:
- More content fits in context window
- Faster inference
- Lower API costs
- Better long-range coherence
Analyze tokenization:
- tik-tokenizer - Visualize how text becomes tokens
- bbycroft.net - Interactive LLM architecture explorer
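If tiktoken is installed (`pip install tiktoken`), a few lines show how different strings split under OpenAI's `cl100k_base` encoding; other tokenizers will split differently.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["Tokenization", "internationalization", "The quick brown fox"]:
    ids = enc.encode(text)                      # token IDs the model would see
    pieces = [enc.decode([i]) for i in ids]     # the text each ID corresponds to
    print(f"{text!r}: {len(ids)} tokens -> {pieces}")
```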
GPU Processing
Tokens flow through neural network layers on GPU hardware.
Why GPUs?
| Property | CPU | GPU |
|---|---|---|
| Cores | 8-64 powerful cores | 1000s of simpler cores |
| Parallelism | Sequential optimization | Massive parallelism |
| Memory Bandwidth | ~100 GB/s | ~1-3 TB/s |
| Best For | Complex branching logic | Matrix math (attention, FFN) |
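A quick way to feel this difference, assuming PyTorch is installed and a CUDA GPU is available for the second measurement; the matrix size is illustrative.

```python
import time
import torch  # assumes PyTorch; the GPU branch requires a CUDA device

def time_matmul(device: str, n: int = 4096) -> float:
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()   # make sure setup is finished before timing
    start = time.perf_counter()
    _ = a @ b                      # the matrix math that dominates attention and FFN layers
    if device == "cuda":
        torch.cuda.synchronize()   # wait for the asynchronous GPU kernel to complete
    return time.perf_counter() - start

print(f"CPU: {time_matmul('cpu'):.3f}s")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.3f}s")
```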
The Transformer Architecture
Each token passes through a stack of repeated blocks, each combining multi-head attention and a feed-forward network with residual connections and layer normalization.
Attention Mechanism
The core innovation that enables context understanding:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
Q = Query (what am I looking for?)
K = Key (what do I contain?)
V = Value (what do I contribute?)
Every token attends to every other token, creating O(n^2) complexity, which is why context windows are expensive.
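The formula maps directly to a few lines of NumPy. The shapes here are toy-sized, and the (n, n) score matrix is exactly where the O(n^2) cost appears.

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n, n): every token scored against every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of value vectors

n, d_k = 4, 8                                           # 4 tokens, 8-dim head (toy sizes)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)   # (4, 8) -- one updated representation per token
```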
Memory & Compute Bottlenecks
| Bottleneck | Cause | Mitigation |
|---|---|---|
| KV Cache | Storing attention keys/values | Quantization, sliding window |
| Attention Compute | O(n^2) with sequence length | Flash Attention, sparse attention |
| Memory Bandwidth | Moving weights to compute units | Tensor parallelism, batching |
| Model Size | Billions of parameters | Quantization (FP16, INT8, INT4) |
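The KV cache row is easy to quantify with back-of-the-envelope arithmetic. The sketch below assumes a generic 7B-class configuration with full multi-head attention at FP16; models using grouped-query attention or quantized caches need much less.

```python
def kv_cache_bytes(layers: int, heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_value: int = 2) -> int:
    """Rough KV-cache footprint: keys + values for every layer, head, and position."""
    return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_value

# Illustrative 7B-class configuration (32 layers, 32 heads, 128-dim heads) at FP16.
gb = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=32_000, batch=1) / 1e9
print(f"~{gb:.1f} GB of KV cache for one 32K-token sequence")
```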
AI Consumption: Inference
Inference is how trained models generate outputs.
Inference Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Greedy Decoding | Always pick highest probability token | Deterministic, fast |
| Temperature Sampling | Scale logits before sampling | Control randomness |
| Top-K Sampling | Sample from K most likely tokens | Bounded creativity |
| Top-P (Nucleus) | Sample from smallest set with cumulative P | Dynamic vocabulary |
| Beam Search | Track multiple hypotheses | Translation, structured output |
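A sketch combining several of these strategies over a toy logit vector: temperature 0 falls back to greedy decoding, `top_k` keeps the K most likely tokens, and `top_p` keeps the smallest set reaching cumulative probability P. Filter ordering and edge cases vary across real implementations.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0,
                      rng: np.random.Generator = np.random.default_rng()) -> int:
    """Pick the next token ID from raw logits using the strategies above."""
    if temperature == 0:                               # greedy decoding
        return int(np.argmax(logits))
    logits = logits / temperature                      # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                    # most likely first
    if top_k > 0:
        order = order[:top_k]                          # top-k: keep the K most likely tokens
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1    # top-p: smallest set with cumulative mass >= p
    order = order[:cutoff]
    kept = probs[order] / probs[order].sum()           # renormalize over the kept tokens
    return int(rng.choice(order, p=kept))

logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0])         # toy vocabulary of 5 tokens
print(sample_next_token(logits, temperature=0.8, top_k=3, top_p=0.9))
```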
Inference Optimization
| Technique | Speedup | Tradeoff |
|---|---|---|
| Quantization | 2-4x | Slight quality loss |
| Speculative Decoding | 2-3x | Requires draft model |
| Continuous Batching | 2-10x | Infrastructure complexity |
| KV Cache Optimization | 1.5-2x | Memory management |
| Flash Attention | 2-4x | Kernel implementation |
Inference Cost Drivers
Cost scales roughly with (Input Tokens + Output Tokens) x Model Size x Compute Time
Factors:
- Context length (quadratic attention cost)
- Output length (sequential generation)
- Model parameters (memory + compute)
- Batch efficiency (utilization)
- Hardware (GPU type, availability)
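In practice providers bill per token, so a per-request estimate reduces to token counts times per-million prices. A minimal sketch, with illustrative token counts and prices:

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Rough API cost estimate in dollars from token counts and per-million-token prices."""
    return (input_tokens / 1e6) * input_price_per_m + (output_tokens / 1e6) * output_price_per_m

# Example: a 20K-token prompt with a 1K-token answer at $2.50/M input and $10/M output.
print(f"${estimate_cost(20_000, 1_000, 2.50, 10.00):.3f} per request")
```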
RAG: Retrieval-Augmented Generation
RAG grounds LLM outputs in external knowledge, reducing hallucination and enabling current information access.
RAG Architecture
Documents are chunked and embedded into a vector store; at query time the query is embedded, the most similar chunks are retrieved (and optionally reranked), and the results are injected into the prompt before generation.
RAG Components
| Component | Options | Considerations |
|---|---|---|
| Chunking | Fixed size, semantic, recursive | Balance context vs precision |
| Embeddings | OpenAI, Cohere, open-source | Dimension, quality, cost |
| Vector DB | Pinecone, Weaviate, Chroma, pgvector | Scale, features, hosting |
| Retrieval | Dense, sparse, hybrid | Recall vs precision |
| Reranking | Cross-encoder, ColBERT | Quality vs latency |
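A self-contained sketch of the retrieval half of RAG: chunk, embed, index, and rank by cosine similarity. The `embed` function here is a hash-seeded random placeholder, so retrieval quality is meaningless; a real system would call an embedding model and a vector database from the table above.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: hash-seeded random unit vector. A real system calls an embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def chunk(doc: str, size: int = 200) -> list[str]:
    return [doc[i:i + size] for i in range(0, len(doc), size)]

# Index: chunk documents, embed each chunk, keep vectors in memory (a vector DB at toy scale).
documents = ["Tokenization converts text into discrete units...",
             "GPUs excel at parallel matrix math..."]
chunks = [c for d in documents for c in chunk(d)]
index = np.stack([embed(c) for c in chunks])

# Retrieve: embed the query, rank chunks by cosine similarity, prepend the top hits to the prompt.
query = "Why are GPUs used for inference?"
scores = index @ embed(query)
top = [chunks[i] for i in np.argsort(scores)[::-1][:2]]
prompt = "Answer using only this context:\n" + "\n".join(top) + f"\n\nQuestion: {query}"
print(prompt)
```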
RAG Optimization
| Challenge | Solution |
|---|---|
| Poor retrieval | Better chunking, hybrid search, reranking |
| Context overflow | Summarization, filtering, hierarchical retrieval |
| Outdated info | Incremental indexing, freshness scoring |
| Hallucination | Citation requirements, confidence thresholds |
| Latency | Caching, async retrieval, smaller models |
Context & Progress Disclosure Strategies
How to effectively use limited context windows and communicate processing state.
Context Management Strategies
| Strategy | Description | Best For |
|---|---|---|
| Sliding Window | Keep recent N tokens, drop oldest | Streaming, chat |
| Summarization | Compress history into summary | Long conversations |
| Hierarchical | Summary + recent detail | Best of both worlds |
| RAG Integration | Retrieve relevant history | Large knowledge bases |
| Tool Use | Offload to external memory | Complex workflows |
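A minimal sliding-window implementation, using whitespace word counts as a stand-in for real token counts; a production version would use the model's tokenizer and usually pin the system message.

```python
def trim_history(messages: list[str], budget: int,
                 count_tokens=lambda text: len(text.split())) -> list[str]:
    """Sliding window: keep the most recent messages that fit within the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):             # walk backwards from the newest message
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))                # restore chronological order

history = ["System: You are helpful.", "User: long question ...",
           "Assistant: long answer ...", "User: follow-up question?"]
print(trim_history(history, budget=12))
```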
Progress Disclosure Patterns
For long-running AI tasks, users need visibility:
| Pattern | Implementation | User Experience |
|---|---|---|
| Streaming | Token-by-token output | Immediate feedback |
| Chunked Updates | Periodic progress messages | Balanced latency |
| Status Indicators | "Thinking...", "Searching..." | Process transparency |
| Partial Results | Show intermediate outputs | Validate direction |
| Confidence Signals | Uncertainty indicators | Trust calibration |
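A sketch of streaming plus status indicators, with a placeholder generator standing in for a provider's streaming API:

```python
import time
from typing import Iterator

def fake_token_stream(answer: str) -> Iterator[str]:
    """Placeholder generator standing in for a streaming model API."""
    for word in answer.split():
        time.sleep(0.05)                       # simulate per-token generation latency
        yield word + " "

def run_with_progress(question: str) -> None:
    print("Searching...", flush=True)          # status indicators before any tokens arrive
    print("Thinking...", flush=True)
    for token in fake_token_stream("Streaming shows partial results immediately."):
        print(token, end="", flush=True)       # token-by-token streaming for immediate feedback
    print()

run_with_progress("What is streaming?")
```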
Effective Prompting for Context Efficiency
- Front-load important context - Models attend more to beginning and end
- Use structured formats - JSON, XML, markdown for clear parsing
- Explicit instructions - Don't assume implicit understanding
- Few-shot examples - Show desired output format
- Chain of thought - "Think step by step" for complex reasoning
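Putting several of these guidelines together, a prompt might look like the following; the product, ticket, and labels are invented purely for illustration.

```python
# Key context first, structured sections, explicit instructions,
# one few-shot example, and a chain-of-thought cue.
prompt = """You are a support triage assistant.

## Context (most important information first)
Product: Acme Sync v2.3. Known issue: uploads fail on files over 2 GB.

## Instructions
Classify the ticket as `bug`, `feature_request`, or `question`. Respond in JSON only.

## Example
Ticket: "Can you add dark mode?"
Answer: {"category": "feature_request"}

## Ticket
"My 3 GB upload keeps failing."

Think step by step, then output only the JSON answer.
"""
print(prompt)
```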
Context Window Economics
| Model Class | Context | Input Cost | Output Cost |
|---|---|---|---|
| GPT-4o | 128K | $2.50/M | $10/M |
| Claude Opus 4 | 200K | $15/M | $75/M |
| Claude Sonnet 4 | 200K | $3/M | $15/M |
| Gemini 1.5 Pro | 1M+ | $1.25/M | $5/M |
| Open Source | 8K-128K | Compute cost | Compute cost |
Prices approximate and subject to change
Monitoring Protocol
Track the evolving AI infrastructure landscape.
Key Metrics
| Metric | What It Indicates | Where to Find |
|---|---|---|
| Tokens/Second | Inference throughput | Benchmarks, provider dashboards |
| Time to First Token | Response latency | API monitoring |
| Cost per 1M Tokens | Economic efficiency | Provider pricing pages |
| Context Window Size | Memory capacity | Model announcements |
| Benchmark Scores | Model capability | Papers, leaderboards |
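Time to first token and tokens per second can be measured around any streaming response. The sketch below times a dummy generator; in practice you would wrap your provider's streaming iterator.

```python
import time
from typing import Iterable

def measure_stream(token_stream: Iterable[str]) -> dict[str, float]:
    """Measure time-to-first-token and throughput for any streaming token source."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()   # latency until the first token arrives
        count += 1
    total = time.perf_counter() - start
    return {"ttft_s": first_token_at - start, "tokens_per_s": count / total}

def dummy_stream():
    """Stand-in for a streaming API response."""
    for _ in range(50):
        time.sleep(0.02)
        yield "tok"

print(measure_stream(dummy_stream()))
```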
Trends to Watch
- Context window expansion (1M+ tokens becoming standard)
- Inference cost reduction (roughly 10x cheaper year over year)
- Multimodal tokenization (unified text/image/audio)
- Mixture of Experts scaling (sparse activation)
- On-device inference (mobile, edge deployment)
Sources to Monitor
Benchmarks & Research:
- Hugging Face Leaderboards - Model comparisons
- Papers With Code - Latest research
- arXiv cs.CL - NLP papers
Infrastructure:
- Artificial Analysis - LLM performance benchmarks
- GPU availability trackers - Hardware access
- Provider status pages - Reliability metrics
Voices to Follow:
- Andrej Karpathy (AI education, Tesla/OpenAI alum)
- Jim Fan (NVIDIA, embodied AI)
- Sasha Rush (Cornell, efficient transformers)
- Research teams: Anthropic, OpenAI, DeepMind, Meta AI
Learn More
- Attention Is All You Need - Original transformer paper
- The Illustrated Transformer - Visual explanation
- LLM Visualization - Interactive architecture explorer
- What Is ChatGPT Doing? - Stephen Wolfram deep dive