
đź“š Part 2 - Understanding LLMs From 0 To 1

Your weekly deep dive on the latest technical topic in AI you should know about.

AlphaSignal

Hey,

Welcome to our deep-dive series, where we explore a different AI topic in depth each week.

We’re back with part 2 of our Understanding LLMs series. As a quick recap, last week we learned:

  1. How LLMs/Machine Learning (ML) models process text via text vectorization.

  2. What tokenizers are.

  3. The need for building a vocabulary of tokens that a model can recognize.

  4. How tokens/documents are converted to vectors.

Today we’ll dive into the principles of the tokenizer used by the GPT family of LLMs.

Let’s get into it!

Reading time: 5 min 31 sec

DEEP DIVE
Part 2 - Sub-words and Byte Pair Encoding

Last week, we discussed simple rule-based tokenizers that divide text into words using spaces and punctuation. While commonly used until recently, they have a major limitation: they can't recognize new words not in their vocabulary. These unknown words, called out-of-vocabulary (OOV) tokens, get mapped to a single special token, losing distinct information.

As a result, all OOV tokens share the same numerical representation or "embedding" in natural language processing (NLP) models. This method has its flaws, as it fails to capture the uniqueness of different unfamiliar words, limiting the model's ability to understand and analyze text effectively.

Word Level Tokenizers
To understand this better, let's see how spaCy, which uses a rule-based tokenizer, deals with OOV tokens. In the example below there are two OOV words, both of which are mapped to the same embedding.


import spacy
import numpy as np

# Load spaCy model
spacy_model = spacy.load("en_core_web_lg")

# Process text with spaCy
text = "This contains 2 outofvocabulary words ohnoooo"

token_embs = [(t.text, t.vector)
              for t in spacy_model(text) if t.is_oov]

# Extract OOV tokens
oov_tokens = [t[0] for t in token_embs]

print(f"OOV tokens: {oov_tokens}")
# Output: OOV tokens: ['outofvocabulary', 'ohnoooo']

# Extract size of embeddings
embedding_size = token_embs[0][1].size

# Print size of embeddings
print(f"Embedding size: {embedding_size}")

# Output: Embedding size: 300
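
Since both words are out of vocabulary, spaCy falls back to the same default vector for each of them, so the model literally cannot tell them apart. A quick sketch, continuing from the snippet above, confirms this:

# Both OOV tokens map to the exact same embedding
emb_a, emb_b = token_embs[0][1], token_embs[1][1]
print(np.array_equal(emb_a, emb_b))
# Output: True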


Sub-Word Tokenizers
GPT-2's tokenizer is different from spaCy's rule-based version. In spaCy, unknown words are mapped to a single "OOV" token. GPT-2, however, uses sub-words for better flexibility.

To show this, let's examine GPT-2's special tokens and how it deals with out-of-vocabulary (OOV) words.


from transformers import GPT2Tokenizer

# Initialize tokenizer
gpt_tk = GPT2Tokenizer.from_pretrained("gpt2")

# Sample text
txt = "This contains 2 outofvocabulary words ohnoooo"

# Tokenize
tkns = gpt_tk.tokenize(txt)

# Get token IDs
ids = gpt_tk.convert_tokens_to_ids(tkns)

# Count tokens mapped to the unknown-token id
oov_ct = sum(tok_id == gpt_tk.unk_token_id for tok_id in ids)

print(f"OOV count: {oov_ct}")
# Output: OOV count: 0

Notice, there are zero OOV tokens. This happens because GPT-2 breaks unknown words into known sub-words: "outofvocabulary" becomes "out of voc abulary" and "ohnoooo" becomes "oh n oooo."
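
You can verify the split yourself by tokenizing the two words with the gpt_tk tokenizer from the snippet above. Note that GPT-2 marks a leading space with a special 'Ġ' character, so a word appearing mid-sentence may carry that prefix on its first piece; the pieces in the comments below simply echo the splits described above.

# Inspect the sub-word pieces GPT-2 produces for the two unknown words
for word in ["outofvocabulary", "ohnoooo"]:
    print(word, "->", gpt_tk.tokenize(word))
# Expected output (roughly):
# outofvocabulary -> ['out', 'of', 'voc', 'abulary']
# ohnoooo -> ['oh', 'n', 'oooo']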

So, why is this better? Sub-word tokenization allows GPT-2 to make sense of new words by breaking them into pieces it understands. This means each new word gets a unique representation based on its sub-words, unlike in spaCy where all unknown words share the same 'OOV' representation.

Word or Sub-Word?
Remember that each token maps to a unique embedding (vector) which contains information about the meaning (broadly speaking) of that token.

While breaking a word into its component sub-words is better than mapping it to an OOV token, we still sacrifice guaranteed uniqueness for each word’s embedding. With composition, some information about the precise meaning of a word can be lost.

For a model specialized for a particular domain like programming, it is better to keep meaningful keywords intact in the vocabulary than to break them down into sub-words.

Let's examine this by comparing GPT-2 with StarCoder, an open-source equivalent of GitHub Copilot.


# GPT-2 example (gpt_tk is the GPT-2 tokenizer initialized earlier)
gpt_elif = ' '.join(gpt_tk.tokenize('elif'))
print(f"GPT-2 tokenizes 'elif': {gpt_elif}")
# Output: GPT-2 tokenizes 'elif': el if

# Initialize StarCoder tokenizer
from transformers import AutoTokenizer as AT
sc_tokenizer = AT.from_pretrained("bigcode/starcoder")

# Tokenize 'elif' with StarCoder
sc_elif = ' '.join(sc_tokenizer.tokenize('elif'))

# Print result
print(f"StarCoder tokenizes 'elif': {sc_elif}")
# Output: StarCoder tokenizes 'elif': elif


You can see that StarCoder, a model specialized for programming, represents elif as a single token, whereas GPT-2 breaks it into two sub-words.

Byte Pair Encoding
How did the tokenizer decide to break elif down into el + if, and not into e + l + if or eli + f?

This is where an algorithm called Byte Pair Encoding (BPE) comes into the picture. As a reminder, a vocabulary can be built from the top-k most frequent tokens in our training data.

BPE helps us build a vocabulary made up of words and sub-words based on their frequencies. It also produces merge rules that dictate how sub-words can be merged together to form another token in the model's vocabulary. This is how the algorithm works:

  1. The vocabulary is initialized with all the individual Unicode characters present in the training corpus. For the sake of simplicity, let's assume we start with the 26 letters of the alphabet and represent each token in the vocabulary with a unique id.


import string

# Create vocabulary
vocab = list(string.ascii_lowercase)

# Assign unique IDs to tokens
id2token = {ind: alpha for ind, alpha in enumerate(vocab)}
token2id = {alpha: ind for ind, alpha in enumerate(vocab)}

  2. A rule-based tokenizer can then be used to identify all the unique words and their counts in the training corpus.



from collections import Counter

# Corpus
corpus = ["i like ai", "ai is cool"]

# Count how often each word appears
words = []
for text in corpus:
    for word in text.split():
        words.append(word)
token_counts = Counter(words)

print(dict(token_counts))
# Output: {'i': 1, 'like': 1, 'ai': 2, 'is': 1, 'cool': 1}

  3. We calculate how often each pair of adjacent letters appears in our training data. At first, our vocabulary only has single letters, so we focus on letter pairs. The outcome might look like this:


# Adjacent-pair counts for our toy corpus
pair_counts = {('l', 'i'): 1, ('i', 'k'): 1, ('k', 'e'): 1,
               ('a', 'i'): 2, ('i', 's'): 1, ('c', 'o'): 1,
               ('o', 'o'): 1, ('o', 'l'): 1}
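
Rather than counting by hand, here is a minimal sketch of how these pair counts can be computed, reusing the Counter import and token_counts from the earlier snippet (the helper name get_pair_counts is ours, not a library function):

# Count adjacent character pairs across all words,
# weighted by how often each word occurs in the corpus
def get_pair_counts(token_counts):
    pairs = Counter()
    for word, count in token_counts.items():
        chars = list(word)
        for a, b in zip(chars, chars[1:]):
            pairs[(a, b)] += count
    return pairs

print(get_pair_counts(token_counts).most_common(1))
# Output: [(('a', 'i'), 2)]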

The most frequent pair, "ai", is added to the vocabulary. All occurrences of this new token in the corpus are now represented via its new token id, rather than as the token ids of “a” and “i” individually.


# Add the new token and give it the next available id
vocab.append("ai")
new_id = len(vocab) - 1  # ids 0-25 already belong to the single letters
id2token[new_id] = "ai"
token2id["ai"] = new_id

  4. A merge rule is created. In this instance, the rule would be to always merge the letters a and i whenever they occur next to each other in an incoming document.


# Start an (initially empty) dict of merge rules and record the new rule
merge_rules = {}
merge_rules.update({("a", "i"): "ai"})
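
As a small illustration, here is how such a rule could be applied to an incoming word the model has never seen (the helper apply_merges is our own simplification; real BPE tokenizers apply the learned merges in the order they were learned):

# Repeatedly apply merge rules to a new word until none match
def apply_merges(word, merge_rules):
    toks = list(word)
    changed = True
    while changed:
        changed = False
        for i in range(len(toks) - 1):
            if (toks[i], toks[i + 1]) in merge_rules:
                toks[i:i + 2] = [merge_rules[(toks[i], toks[i + 1])]]
                changed = True
                break
    return toks

print(apply_merges("aim", merge_rules))
# Output: ['ai', 'm'] -- an unseen word still gets a meaningful split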

  5. A nice way to visualize the effect of a new token added to the vocabulary is by seeing how the corpus is represented before and after the addition of the new token. I'll use the | (pipe) symbol to indicate the boundaries of each token. Notice how all occurrences of ai are now treated as one token.


corpus_before_adding_ai = [
    "|i| |l|i|k|e| |a|i|",
    "|a|i| |i|s| |c|o|o|l|",
]
corpus_after_adding_ai = [
    "|i| |l|i|k|e| |ai|",
    "|ai| |i|s| |c|o|o|l|",
]

  6. Repeat steps 3 to 5 until the vocabulary reaches a user-defined size. This bottom-up approach to constructing a vocabulary ensures that the most frequent sub-words and words are always represented with their own unique embedding. The presence of all individual characters in the vocabulary ensures that, in the worst case, we can always reconstruct any OOV word by just combining its individual characters.
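
To tie steps 3 to 5 together, here is a compact sketch of the whole training loop on our toy corpus. The function name train_bpe and the parameter num_merges are illustrative choices of ours, not part of any library:

from collections import Counter

def train_bpe(token_counts, num_merges):
    # Every word starts out as a list of single characters (step 1)
    splits = {word: list(word) for word in token_counts}
    merge_rules = {}
    for _ in range(num_merges):
        # Step 3: count adjacent token pairs across the corpus
        pairs = Counter()
        for word, count in token_counts.items():
            toks = splits[word]
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Pick the most frequent pair and record its merge rule (step 4)
        (best_a, best_b), _ = pairs.most_common(1)[0]
        merge_rules[(best_a, best_b)] = best_a + best_b
        # Apply the merge everywhere in the corpus (step 5)
        for word, toks in splits.items():
            merged, i = [], 0
            while i < len(toks):
                if (i + 1 < len(toks)
                        and (toks[i], toks[i + 1]) == (best_a, best_b)):
                    merged.append(best_a + best_b)
                    i += 2
                else:
                    merged.append(toks[i])
                    i += 1
            splits[word] = merged
    return merge_rules

learned_rules = train_bpe(token_counts, num_merges=3)
print(list(learned_rules.items())[0])
# Output: (('a', 'i'), 'ai') -- the first merge matches the walkthrough;
# later merges depend on how ties between equally frequent pairs break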

Wrap Up
That’s all for this week’s deep-dive folks! Today we learned:

  1. What sub-words are and why we need them.

  2. How an algorithm called Byte Pair Encoding (BPE) can be used to create a vocabulary made up of words and sub-words.

References

  • Intro to tokenizers: https://huggingface.co/learn/nlp-course/chapter2/4?fw=pt

  • Different type of tokenizers: https://huggingface.co/docs/transformers/tokenizer_summary

  • Paper that popularized the use of BPE as a tokenization strategy: Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units" (arXiv:1508.07909)

Pramodith is a contributing writer at AlphaSignal and AI Engineer at LinkedIn with expertise in Natural Language Processing, Computer Vision, and Reinforcement Learning. A graduate of the Georgia Institute of Technology, he has a strong foundation in Conversational AI. Feel free to connect and reach out.
