From Text to Tokens: How Language Models Read
Date:
This talk introduced tokenization, the process by which language models convert raw text into sequences of discrete numeric tokens before any computation can occur. The core methods examined were Byte Pair Encoding (BPE) and WordPiece, two subword tokenization algorithms that build a vocabulary by iteratively merging character pairs: BPE merges the most frequent pair, while WordPiece merges the pair with the highest likelihood-based score. A step-by-step algorithmic walkthrough of both approaches illustrated how subword tokenization balances coverage against sequence length, avoiding the fundamental drawbacks of word-level methods (unknown words, vocabulary explosion) and character-level methods (excessively long sequences, no linguistic structure). Tokenization Slides. BPE Notebook. WordPiece Notebook.
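To make the merge loop concrete, here is a minimal BPE training sketch. It is a toy illustration, not the code from the notebooks linked above: the `train_bpe` name, the `</w>` end-of-word marker, and the sample corpus are assumptions for demonstration.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy sketch)."""
    # Each word becomes a tuple of characters plus an end-of-word marker
    # (an assumed convention here), so merges never cross word boundaries.
    vocab = Counter()
    for word in corpus:
        vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # BPE rule: most frequent pair wins
        merges.append(best)

        # Rewrite the working vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

if __name__ == "__main__":
    corpus = ["low", "low", "lower", "newest", "newest", "widest"]
    print(train_bpe(corpus, 5))  # e.g. [('e', 's'), ('es', 't'), ...]
```

Swapping the selection line for a likelihood-style ranking, e.g. scoring each pair by `pairs[(a, b)] / (count[a] * count[b])` over individual symbol counts, yields the WordPiece-style criterion: it favors pairs that co-occur more often than their parts' standalone frequencies would predict, rather than raw frequency alone.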