LLM Tokenizers
Tokens are the basic units of text in an LLM: a word, part of a word, or a punctuation mark. Most generative AI models split the input into tokens and generate output one token at a time. Different models can split the same sentence into different tokens. This is due to three major factors:
1. Tokenisation method
2. Parameters and special tokens used to initialise the tokenizer
3. Dataset the LLM is trained on
Let us look at a specific text and compare how different popular models split it into tokens.
text = """
Harsha is a Data Scientist currently working as a Senior Consultant at Deloitte in Hyderabad, India.
He has <span id="yearsofexp">10</span> years of data science and machine learning experience.
He built solutions for companies such as Walmart, Pepsico, Rolls-Royce, Dr-Reddys, Tesco, and Tata AIA.
He is proficient in Python, R, SQL, Tableau, PySpark and cloud platforms such as Azure and Google Cloud.
<script type="text/javascript">
// Define a function called diff_years that calculates the difference in years between two given dates (dt2 and dt1)
function diff_years(dt2, dt1)
{
// Calculate the difference in milliseconds between the two dates
var diff = (dt2.getTime() - dt1.getTime()) / 1000;
// Convert the difference from milliseconds to days
diff /= (60 * 60 * 24);
// Calculate the approximate number of years by dividing the difference in days by the average number of days in a year (365.25)
return Math.abs(Math.round(diff / 365.25));
}
dt1 = new Date(2017, 11, 4); // October 4 2017
dt2 = new Date(); // Today
document.getElementById("yearsofexp").innerHTML = diff_years(dt2, dt1)
</script>
"""
The above text contains the first few lines of my website. It includes conversational text, many company and technology names, numbers and JavaScript code. Let us see how the tokenizers of different LLMs split it.
from transformers import AutoTokenizer

# List of RGB background colors used to highlight consecutive tokens
colors_list = ['102;194;165', '252;141;98', '141;160;203', '231;138;195', '166;216;84', '255;217;47']

def show_tokens(sentence, tokenizer_name):
    # Function that takes a sentence and a tokenizer name and prints the tokens
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    for i, token in enumerate(token_ids):
        # Black text on a cycling background color, then reset the style
        print(f'\x1b[0;30;48;2;{colors_list[i % len(colors_list)]}m' +
              tokenizer.decode(token) +
              '\x1b[0m', end=' ')
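The escape codes used in show_tokens can be tried in isolation, without downloading any model. The following is a minimal sketch: colorize_tokens is a hypothetical helper (not part of transformers) that applies the same ANSI sequence to a list of token strings. Here 0;30 selects black text, 48;2;R;G;B sets a 24-bit background color, and \x1b[0m resets the style.

```python
# Hypothetical helper for illustration: wraps each token in an ANSI background color.
COLORS = ['102;194;165', '252;141;98', '141;160;203']

def colorize_tokens(tokens, colors=COLORS):
    """Return one string with each token on a cycling background color."""
    pieces = []
    for i, token in enumerate(tokens):
        color = colors[i % len(colors)]  # cycle through the palette
        # 0;30 -> black text, 48;2;R;G;B -> RGB background, \x1b[0m -> reset
        pieces.append(f'\x1b[0;30;48;2;{color}m{token}\x1b[0m')
    return ' '.join(pieces)

if __name__ == "__main__":
    print(colorize_tokens(['wal', '##mart', 'and', 'azure']))
```

Terminals and notebooks that do not support 24-bit color will show the raw escape codes or ignore the colors.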
BERT base model (uncased)
Google's BERT model was introduced in 2018 and was pre-trained in a self-supervised fashion (no human labels) with two objectives:
1. Masked language modelling: The model has to predict words that have been masked out of a sentence
2. Next sentence prediction: The model has to predict whether one sentence follows another
The uncased model lowercases the text, so it does not differentiate between uppercase and lowercase letters.
show_tokens(text, "bert-base-uncased")
We can see that the tokenizer drops new lines and the spaces used for code indentation, so it is not well suited to code. Because new lines are missing, it can also be difficult to understand chat logs and similar texts. Some words such as "Walmart" are split into two tokens, wal and ##mart, where the '##' prefix indicates that the token continues the previous one. It does not recognise company names such as "Deloitte" and "Tesco" or common technology names such as "Tableau" and "PySpark" as whole words.
The output starts with a [CLS] token, used for classification tasks, and ends with a [SEP] separator token. The other special tokens are [PAD] for padding, [UNK] for unknown text and [MASK] for masking (used during training).
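The behaviour described above (lowercasing, lost new lines, '##' continuation pieces, and the [CLS], [SEP] and [UNK] tokens) can be sketched with a greedy longest-match subword splitter. This is a simplified illustration, not BERT's actual implementation: the function names and the tiny vocabulary are invented for the example.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk]  # nothing fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

def encode(sentence, vocab, lowercase=True):
    """Optionally lowercase, split on whitespace, and add [CLS] and [SEP]."""
    if lowercase:
        sentence = sentence.lower()
    tokens = ["[CLS]"]
    for word in sentence.split():  # split() also discards new lines
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens

# Toy vocabulary invented for this example (not BERT's real vocabulary).
TOY_VOCAB = {"he", "built", "solutions", "for", "wal", "##mart"}

if __name__ == "__main__":
    print(encode("He built solutions\nfor Walmart", TOY_VOCAB))
    print(encode("He built", TOY_VOCAB, lowercase=False))
```

With lowercase=False the word "He" is no longer found in the lowercase toy vocabulary and falls back to [UNK], which hints at why cased and uncased models need different vocabularies.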
BERT base model (cased)
This model is similar to the BERT uncased model but it differentiates between uppercase and lowercase letters.
show_tokens(text, "bert-base-cased")
Words such as "Consultant", which is a single token (consultant) in the uncased model, are split into Consult and ##ant in the cased model.
GPT-2
OpenAI's GPT-2 is a large language model built to predict the next word given all the previous words. It was trained on 40GB of unlabeled text in a self-supervised fashion, and its largest version has 1.5 billion parameters.
show_tokens(text, "gpt2")
New lines are now represented by tokens, and capitalisation is preserved. Company names such as Pepsico, Rolls-Royce and Deloitte, along with common technology names such as Tableau and PySpark, are represented using multiple tokens. Each space used for code indentation is represented by its own token, and the final space is part of the next token. These whitespace tokens can be useful for reading and generating code with indentation.
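The observation that the final space becomes part of the next token can be illustrated with the pre-splitting step that GPT-2-style tokenizers apply before BPE. The sketch below uses a simplified, ASCII-only approximation of such a regular expression; the pattern and the function name pre_tokenize are assumptions made for this example, not the exact pattern shipped with GPT-2.

```python
import re

# Simplified ASCII-only approximation of a GPT-2-style pre-tokenization pattern.
# Each alternative keeps an optional leading space attached to the word that follows.
PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+")

def pre_tokenize(text):
    """Split text into chunks without discarding any whitespace."""
    return PATTERN.findall(text)

if __name__ == "__main__":
    chunks = pre_tokenize("He built\n  code")
    print(chunks)
    # Joining the chunks restores the original text exactly.
    print("".join(chunks) == "He built\n  code")
```

In the example the new line and the first indentation space form one chunk, while the last space before "code" travels with the word. A real tokenizer would then apply its learned BPE merges inside each chunk.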
Flan-T5
Google's Flan-T5 is an encoder-decoder model that is language-independent and trained with a combination of supervised and unsupervised training. It was built for various tasks such as translation, question answering and sequence classification.
show_tokens(text, "google/flan-t5-base")
The lack of newline and whitespace tokens would make it difficult to work with code. It is also unable to represent various characters and replaces them with the unknown token <unk>.
GPT-4
OpenAI's GPT-4 is a successor to GPT-2 (whose largest version had 1.5 billion parameters) and is one of the most popular LLMs.
show_tokens(text, "Xenova/gpt-4")
The GPT-4 tokenizer represents four spaces as a single token, which helps in understanding and writing code. It has different tokens for different combinations of spaces and tabs. It generally uses fewer tokens to represent most words.
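The effect of having dedicated tokens for runs of spaces can be sketched with a toy function that splits the indentation of a line into the longest runs a vocabulary offers. The function split_indent and the run lengths used here are invented for illustration and do not reflect the real GPT-4 vocabulary.

```python
def split_indent(line, run_lengths=(1,)):
    """Split leading spaces into the longest runs available in a toy vocabulary."""
    stripped = line.lstrip(" ")
    remaining = len(line) - len(stripped)  # number of leading spaces
    # A single space is always available so that any indentation can be covered.
    lengths = sorted(set(run_lengths) | {1}, reverse=True)
    tokens = []
    for length in lengths:
        while remaining >= length:
            tokens.append(" " * length)
            remaining -= length
    return tokens + ([stripped] if stripped else [])

if __name__ == "__main__":
    line = "    return x"
    print(len(split_indent(line)))             # one token per space
    print(len(split_indent(line, (1, 2, 4))))  # a four-space run is one token
```

With only single-space tokens, four spaces of indentation cost four tokens. Once a four-space run is in the vocabulary, the same indentation costs one token, which leaves more of the context window for the code itself.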
StarCoder 2
StarCoder 2 is a 15 billion parameter model focused on generating code.
show_tokens(text, "bigcode/starcoder2-15b")
It represents numbers like 2024 with four tokens, one for each digit, which leads to a better representation of numbers and mathematics. Similar to GPT-4, it encodes runs of whitespace as single tokens. Simple words and common nouns, however, are often represented by several tokens.
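The one-token-per-digit behaviour can be sketched as a pre-splitting rule that isolates every digit before any further tokenization. This is a minimal illustration under that assumption; split_digits is a hypothetical helper and not StarCoder 2's actual tokenizer code.

```python
import re

def split_digits(text):
    """Pre-split text so that every digit becomes its own piece."""
    # \d       -> exactly one digit per piece
    # [^\d\s]+ -> runs of non-digit, non-space characters
    # \s+      -> runs of whitespace
    return re.findall(r"\d|[^\d\s]+|\s+", text)

if __name__ == "__main__":
    print(split_digits("Date(2017, 11, 4)"))
```

Because every number is built from the same ten digit pieces, the model never has to learn separate tokens for 2017, 2018 and so on.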
Galactica
Facebook's Galactica is an LLM focused on scientific knowledge and trained on many scientific papers. Its primary uses are citation prediction, scientific QA, mathematical reasoning, summarization, document generation, molecular property prediction and entity extraction.
show_tokens(text, "facebook/galactica-125m")
Of all the examples here, this tokenizer produces the most tokens and uses multiple tokens to represent JavaScript, Walmart, Hyderabad, Azure, PySpark, etc.
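Claims about which tokenizer produces the most tokens are easy to check by counting. The sketch below assumes each tokenizer is wrapped as a plain function from text to a list of tokens; compare_token_counts is a hypothetical helper, and the two toy tokenizers only stand in for real ones so that the example runs without downloading models.

```python
def compare_token_counts(text, tokenizers):
    """Return (name, token count) pairs sorted from most to fewest tokens."""
    counts = {name: len(tokenize(text)) for name, tokenize in tokenizers.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Toy stand-ins: one token per character versus one token per whitespace-separated word.
TOY_TOKENIZERS = {
    "characters": list,
    "whitespace words": str.split,
}

if __name__ == "__main__":
    for name, count in compare_token_counts("He built solutions", TOY_TOKENIZERS):
        print(f"{name}: {count}")
```

To compare the real models, each entry could instead wrap a loaded tokenizer and return tokenizer(text).input_ids, as show_tokens does.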
Phi-3 (and Llama-2)
Microsoft's Phi-3 reuses the tokenizer of Llama 2 and adds a number of special tokens. Phi-3 models are trained using supervised fine-tuning, proximal policy optimization and direct preference optimization, and are primarily built for precise instruction adherence.
Additional special tokens such as <|user|>, <|assistant|> and <|system|> are added for dealing with chat and conversations. <|endoftext|> is another token that denotes the end of the text.
show_tokens(text, "microsoft/Phi-3-mini-4k-instruct")
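To see why chat special tokens are useful, the sketch below flattens a conversation into a single string in which each turn is introduced by its role token. The layout is an assumption made for illustration and is not Phi-3's exact chat template; format_chat is a hypothetical helper, and only the special tokens named above are used.

```python
# Role markers and end marker taken from the special tokens listed above.
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_OF_TEXT = "<|endoftext|>"

def format_chat(messages):
    """Flatten a list of {'role', 'content'} messages into one prompt string."""
    parts = []
    for message in messages:
        # Each turn starts with its role token so the model can tell speakers apart.
        parts.append(ROLE_TOKENS[message["role"]] + "\n" + message["content"])
    return "\n".join(parts) + END_OF_TEXT

if __name__ == "__main__":
    chat = [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello"},
    ]
    print(format_chat(chat))
```

Because each role marker is a single token in the vocabulary, the model sees an unambiguous boundary between turns instead of ordinary text that merely looks like a label.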
Different tokenizers split the same words in different ways, and this is driven largely by three factors:
1. Tokenization method: The most popular is Byte Pair Encoding (BPE)
2. Tokenizing parameters: Vocabulary size, special tokens and capitalisation
3. Domain of the data: The tokenizer's behaviour depends on the data it is trained on, which is chosen based on the specific use case it is built for. We can see in the above examples how the same words are tokenized differently by tokenizers built for code, text, chat, etc.
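The first two factors can be made concrete with a toy version of Byte Pair Encoding training: start from single characters and repeatedly merge the most frequent adjacent pair. This is a minimal sketch of the general idea, not the training code of any particular model. The function names are invented, and num_merges plays the role of the vocabulary size parameter.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its frequency; return the most common adjacent pair."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to num_merges merges from a whitespace-separated corpus."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break  # every word is already a single symbol
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

if __name__ == "__main__":
    merges, words = train_bpe("low low low lower lowest", num_merges=2)
    print(merges)
    print(words)
```

The learned merges depend entirely on the corpus, which is why a tokenizer trained on code ends up with different tokens from one trained on scientific papers or chat logs.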
References
Jay Alammar and Maarten Grootendorst, Hands-On Large Language Models.