Maya Scolastica

What is Tokenization? – Karpathy Series

What is tokenization? It is a process that breaks down text into bite-sized pieces called tokens. But this seemingly simple step hides surprising complexity, impacting everything from language model performance to AI safety.

Tokenization is a critical component of large language models (LLMs) that often goes overlooked, but it is essential to understand its intricacies and impact on model performance. In this blog post, we will explore the process of tokenization, its complexities, and how it affects various aspects of LLMs.

In this Karpathy series, we are learning about tokenization, an integral part of LLMs. You can watch his video here: https://www.youtube.com/watch?v=zduSFxRajkE

You can also see a full transcript of the video here: https://app.meeting.ai/en/m/01hr6w1prketj92b61eqmd9fs9 

What is Tokenization?

At its core, tokenization is about breaking down text into smaller units called tokens. In the simplest case, as demonstrated in Andrej Karpathy's "Let's Build GPT from Scratch" video, this can be done at the character level. Each unique character in the input text is assigned an integer ID, creating a vocabulary. The text is then converted into a sequence of these token IDs, which serve as the input to the language model.
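To make this concrete, here is a minimal sketch of a character-level tokenizer in Python; the variable names are illustrative rather than taken from the video:

```python
# Character-level tokenization: every unique character gets an integer ID.
text = "hello world"

# Build the vocabulary from the unique characters in the text.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> token ID
itos = {i: ch for ch, i in stoi.items()}       # token ID -> char

encode = lambda s: [stoi[c] for c in s]        # string -> list of token IDs
decode = lambda ids: "".join(itos[i] for i in ids)

tokens = encode("hello")
print(tokens)           # [3, 2, 4, 4, 5] for this particular vocabulary
print(decode(tokens))   # "hello"
```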

However, state-of-the-art LLMs use more sophisticated tokenization schemes, such as the byte pair encoding (BPE) algorithm, which operates at the sub-word level. BPE iteratively merges the most frequently occurring pairs of bytes into new tokens, building up a vocabulary of common character sequences. This allows the model to represent text more efficiently, as frequent combinations of characters can be represented by a single token.
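If you have OpenAI's tiktoken package installed, you can see sub-word tokenization in action with an already-trained BPE vocabulary (the exact splits depend on the learned merges):

```python
# Inspect the sub-word tokens produced by a trained BPE tokenizer.
# Assumes the `tiktoken` package: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's ~50K-token BPE vocabulary

ids = enc.encode("Tokenization hides surprising complexity.")
print(ids)
# Frequent character sequences map to single tokens, rarer words get split:
print([enc.decode([i]) for i in ids])
# e.g. ['Token', 'ization', ...] -- the exact splits depend on the learned merges
```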

Complexities and Challenges of Tokenization

While the basic idea of tokenization is straightforward, applying it to diverse real-world text reveals many complexities and challenges, and these issues can have significant impacts on the performance and behavior of LLMs.

In particular, the core BPE algorithm runs into several problems when it meets real-world text (illustrated in the snippet after the list):

  • Case sensitivity: The same word capitalized differently (e.g. "egg", "Egg", "EGG") gets assigned totally different token IDs, even though semantically they represent the same concept. The model has to learn this equivalence from data.
  • Multi-byte characters: Characters outside the basic ASCII set, such as the emoji 😊 or Korean text like 안녕하세요, are encoded as multiple bytes in UTF-8 and tend to get split by the tokenizer into many tokens. This "bloats" the sequence length and reduces the context the model can attend to.
  • Numbers: Digits get grouped in inconsistent ways depending on which byte pairs were merged during training. So "677" could be tokenized as [6, 77] while "804" becomes [80, 4]. This makes numerical understanding harder for the model.
  • Whitespace & code: Indentation in Python code wastes tokens on individual spaces, leaving less context for the actual code. This was an issue in GPT-2 but improved in GPT-4.
  • Special tokens: Models designate specific token IDs as control codes, like "end of text." If these appear in the input they can trigger unexpected behavior.
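Here is a small snippet, again assuming the tiktoken package, that makes a few of these issues visible:

```python
# Making a few of the issues above visible. Assumes the `tiktoken` package.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

# Case sensitivity: the "same" word gets completely different token IDs.
for word in ["egg", "Egg", "EGG"]:
    print(word, enc.encode(word))

# Multi-byte characters: one emoji or Korean syllable is several UTF-8 bytes,
# so short non-English strings can turn into surprisingly many tokens.
print(len("😊".encode("utf-8")))          # 4 bytes
print(len("안녕하세요".encode("utf-8")))    # 15 bytes (3 bytes per syllable)
print(len(enc.encode("안녕하세요")))        # typically far more tokens than "hello"

# Numbers: digit grouping falls out of whichever pairs happened to get merged.
print(enc.encode("677"), enc.encode("804"))
```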

Many of these issues stem from the fact that BPE relies on the simple statistical co-occurrence of bytes without considering their semantic meaning. Improving tokenization to be more semantically-aware is an important challenge for advancing LLMs.

Implementing BPE Tokenization

To truly understand the nuts and bolts of tokenization, it's instructive to implement the BPE algorithm from scratch. The key steps are:

  1. Encode the input text into bytes using UTF-8.
  2. Iteratively find the most frequent byte pair and merge it into a new token, updating the vocabulary.
  3. Tokenize text by splitting it into bytes and greedily applying the learned merge rules.
  4. Decode by mapping token IDs back to their byte representations.

The core data structures are the merges dictionary, which maps byte pairs to token IDs, and the vocab dictionary for looking up bytes from token IDs. OpenAI's official GPT-2 tokenizer code provides a reference implementation against which to compare.
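Here is a compact, from-scratch sketch of those four steps. It follows the spirit of Karpathy's walkthrough, but the function names and the toy training text are our own:

```python
# A compact BPE sketch following the four steps above; names are illustrative.
from collections import Counter

def get_pair_counts(ids):
    """Count how often each adjacent pair of tokens occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, vocab_size):
    """Learn `vocab_size - 256` merges on top of the 256 raw byte values."""
    ids = list(text.encode("utf-8"))           # step 1: UTF-8 bytes
    merges = {}                                # (id, id) -> new token ID
    vocab = {i: bytes([i]) for i in range(256)}
    for new_id in range(256, vocab_size):      # step 2: iterative merging
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]     # most frequent pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
    return merges, vocab

def encode(text, merges):
    """Step 3: split into bytes, then apply merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():        # dicts preserve insertion order
        ids = merge(ids, pair, new_id)
    return ids

def decode(ids, vocab):
    """Step 4: map token IDs back to bytes and decode as UTF-8."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

merges, vocab = train("low lower lowest lowly low low", vocab_size=260)
ids = encode("lowest", merges)
print(ids, "->", decode(ids, vocab))
```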

Importantly, the tokenizer is trained separately from the LLM itself, often on a much larger and more diverse dataset. This helps build a robust subword vocabulary that can handle the wide variety of text the model may encounter.

The Evolution of Tokenization

As LLMs have advanced, so too have their tokenization schemes. The GPT-4 model increased its vocabulary size to around 100K tokens, up from 50K in GPT-2. This allows for shorter token sequences to represent the same amount of text. GPT-4 also made improvements to whitespace handling and case sensitivity.
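You can see the effect of the larger vocabulary and better whitespace handling by comparing the two tokenizers side by side (assuming the tiktoken package, which ships both encodings):

```python
# Comparing sequence lengths under the GPT-2 and GPT-4 tokenizers.
# Assumes the `tiktoken` package.
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")         # ~50K-token vocabulary
gpt4 = tiktoken.get_encoding("cl100k_base")  # ~100K-token vocabulary used by GPT-4

text = "    def hello_world():\n        print('Hello, world!')"
print(len(gpt2.encode(text)))  # typically more tokens: indentation splits into single spaces
print(len(gpt4.encode(text)))  # typically fewer: runs of spaces merge into single tokens
```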

However, the fundamental limitations of BPE remain. Some researchers are exploring ways to eliminate tokenization entirely and train models directly on raw bytes. This would require rethinking the Transformer architecture to handle much longer sequences efficiently.

Other promising directions include adapting tokenization to non-text modalities like images and video, using "token-free" approaches based on vector quantization, and dynamically expanding the vocabulary with new "magic tokens" to embed specialized knowledge.

Vocabulary Size and Model Surgery

The choice of vocabulary size is an important hyperparameter in tokenization. Increasing the vocabulary size leads to denser input representations but also increases computational cost and may lead to undertrained parameters. Finding the right balance is crucial for optimal performance.

In addition to tokens derived from byte pairs, LLMs often include special tokens to delimit documents, conversations, or introduce special functionality. Adding special tokens requires model surgery, such as extending the embedding matrix and output projection layer. This is a common operation when fine-tuning models for specific tasks like chat interfaces.
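As a rough sketch of what that surgery looks like in practice, here is how it is commonly done with the Hugging Face transformers library; the added token name is just an illustration:

```python
# Adding a special token and resizing the model's embeddings to match.
# Assumes the Hugging Face `transformers` library; the token name is illustrative.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register a new special token (e.g. a turn delimiter for a chat interface).
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>"]}
)

# "Model surgery": grow the input embedding matrix (and the tied output
# projection) so the new token ID has trainable parameters.
model.resize_token_embeddings(len(tokenizer))

print(num_added, len(tokenizer), model.get_input_embeddings().weight.shape)
```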

Conclusion

Tokenization is a complex and often overlooked aspect of LLMs that has significant implications for model performance, cross-lingual ability, code handling, and more. Understanding the intricacies of tokenization algorithms like BPE and the impact of vocabulary size is essential for developing and fine-tuning LLMs effectively.

As Andrej Karpathy notes, "eternal glory" awaits whoever can obviate the need for this fiddly preprocessing step. But until that day comes, a deep understanding of tokenization in all its complexity is essential for anyone building or working with language models. By shedding light on this often-overlooked aspect of LLMs, we can develop better, more robust, and more semantically-aware models to power the next generation of language AI.

While tokenization may not be the most glamorous part of LLMs, it is a critical component that deserves close attention. The specific way text is converted into tokens has far-reaching effects on model performance, linguistic coverage, and even safety and alignment. As we continue to push the boundaries of what is possible with language models, we must not overlook this unsung hero of LLMs.
