
LLM Tokenizers

Tokens are the basic units of text in an LLM: a word, part of a word, or a punctuation mark. Most generative AI models split the input text into tokens and generate output one token at a time. Different models can split the same sentence into different tokens, driven by three major factors:
1. Tokenization method
2. Parameters and special tokens used to initialise the tokenizer
3. Dataset the LLM is trained on
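
For a quick illustration of how these factors play out, here is a minimal sketch (assuming the Hugging Face transformers library is installed) that tokenizes the same sentence with two different tokenizers:

from transformers import AutoTokenizer

sentence = "Harsha is a Data Scientist in Hyderabad"

# The same sentence is split differently by different tokenizers
for name in ["bert-base-uncased", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, "->", tokenizer.tokenize(sentence))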

Let us look at a specific text and compare how different popular models split it into tokens.

text = """
Harsha is a Data Scientist currently working as a Senior Consultant at Deloitte in Hyderabad, India. 

He has <span id="yearsofexp">10</span> years of data science and machine learning experience.

He built solutions for companies such as Walmart, Pepsico, Rolls-Royce, Dr-Reddys, Tesco, and Tata AIA. 

He is proficient in Python, R, SQL, Tableau, PySpark and cloud platforms such as Azure and Google Cloud.

<script type="text/javascript">
    // Define a function called diff_years that calculates the difference in years between two given dates (dt2 and dt1)
    function diff_years(dt2, dt1) 
    {
      // Calculate the difference in milliseconds between the two dates
      var diff = (dt2.getTime() - dt1.getTime()) / 1000;
      // Convert the difference from milliseconds to days
      diff /= (60 * 60 * 24);
      // Calculate the approximate number of years by dividing the difference in days by the average number of days in a year (365.25)
      return Math.abs(Math.round(diff / 365.25));
    }
    dt1 = new Date(2017, 11, 4); // October 4 2017
    dt2 = new Date(); // Today
    document.getElementById("yearsofexp").innerHTML = diff_years(dt2, dt1)
  </script>

"""

The text above contains the first few lines of my website. It includes conversational text, several company and technology names, numbers, and JavaScript code. Let us see how different LLMs split it into tokens.

from transformers import AutoTokenizer

# List of background colors used to highlight the tokens
colors_list = ['102;194;165', '252;141;98', '141;160;203', '231;138;195', '166;216;84', '255;217;47']

def show_tokens(sentence, tokenizer_name):
    # Function that takes a sentence and a tokenizer name and prints each token on a colored background
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    token_ids = tokenizer(sentence).input_ids
    for i, token in enumerate(token_ids):
        print(f'\x1b[0;30;48;2;{colors_list[i % len(colors_list)]}m' +
              tokenizer.decode(token) +
              '\x1b[0m', end=' ')

BERT base model (uncased)

Google's BERT model was introduced in 2018 and was pre-trained in a self-supervised fashion (no human labels) with two objectives:
1. Masked language modelling: predict words that have been masked out within a sentence
2. Next sentence prediction: predict whether one sentence follows another
The uncased model does not differentiate between uppercase and lowercase letters.
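
As a small aside, the masked language modelling objective can be seen directly by asking BERT to fill in a [MASK] token. This is a hedged sketch using the transformers fill-mask pipeline (the model is downloaded on first use):

from transformers import pipeline

# The fill-mask pipeline demonstrates the masked language modelling objective
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Harsha works as a senior [MASK] at Deloitte."):
    print(prediction["token_str"], round(prediction["score"], 3))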

show_tokens(text, "bert-base-uncased")

We can see that the model ignores the new lines and spaces used in the code, so it is not well suited to code. The absence of newline tokens can also make it harder to work with chat logs and similar text. Some words such as "Walmart" are split into two tokens, wal and ##mart, with the '##' prefix indicating a partial (subword) token. It does not have single tokens for company names such as "Deloitte" and "Tesco" or for common technology names such as "Tableau" and "PySpark".

The output starts with a [CLS] token, used for classification tasks, and ends with a [SEP] separator token. Other special tokens used by the model are [PAD] for padding, [UNK] for unknown tokens, and [MASK] for masking (used during training).
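
These behaviours can be checked directly on the tokenizer; here is a minimal sketch (the exact splits come from the model's vocabulary):

from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Words outside the vocabulary are broken into '##' partial tokens
print(bert_tokenizer.tokenize("Walmart"))

# The special tokens described above
print(bert_tokenizer.special_tokens_map)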

BERT base model (cased)

This model is similar to the BERT uncased model but it differentiates between uppercase and lowercase letters.

show_tokens(text, "bert-base-cased")

Words such as "Consultant", which the uncased model keeps as the single token consultant, are split into partial tokens (Consult followed by a '##' continuation piece) in the cased model.
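
To compare the two models on a single word, here is a small sketch (the code simply prints whatever splits each vocabulary produces):

from transformers import AutoTokenizer

word = "Consultant"

# Compare how the cased and uncased vocabularies split the same word
for name in ["bert-base-uncased", "bert-base-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, "->", tokenizer.tokenize(word))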

GPT 2

OpenAI's GPT-2 is a large language model built to predict the next word given all the previous words. It was trained on 40 GB of text in an unsupervised fashion (no labels) and has 1.5 billion parameters.

show_tokens(text, "gpt2")

New lines are now represented in the tokenizer, and capitalisation is preserved. Company names such as Pepsico, Rolls-Royce and Deloitte are represented using multiple tokens, as are common technology names such as Tableau and PySpark. The spaces used for code indentation are represented by one token each, with the final space absorbed into the token of the following character. These whitespace tokens are useful when generating and reading indented code.
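
Here is a minimal sketch to see this behaviour (in GPT-2's byte-level BPE output, a leading space is shown as 'Ġ' and a newline as 'Ċ'):

from transformers import AutoTokenizer

gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Indentation and newlines are kept as explicit tokens
print(gpt2_tokenizer.tokenize("    diff /= (60 * 60 * 24);\n"))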

Flan-T5

Google's Flan-T5 is an encoder-decoder model that is language-independent and was trained with a combination of supervised and unsupervised training. It was built for a variety of tasks such as translation, question answering and sequence classification.

show_tokens(text, "google/flan-t5-base")

The lack of newline and whitespace tokens makes it difficult to work with code. It is also unable to represent some characters and falls back to its unknown token for them.
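
A short sketch to confirm this behaviour (the code only prints what the tokenizer returns):

from transformers import AutoTokenizer

t5_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# Newlines and indentation are not preserved
print(t5_tokenizer.tokenize("line one\n    line two"))

# The token used for characters outside the vocabulary
print(t5_tokenizer.unk_token)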

GPT 4

GPT-4 is a much larger successor in the GPT series and one of the most widely used LLMs. Its tokenizer is available on the Hugging Face Hub as Xenova/gpt-4.

show_tokens(text, "Xenova/gpt-4")


The GPT-4 tokenizer represents four spaces as a single token, which helps it understand and write code. It has distinct tokens for different combinations of whitespace and tabs, and it generally uses fewer tokens to represent most words.
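
Here is a hedged comparison sketch that counts the tokens each tokenizer needs for one indented line of the JavaScript above:

from transformers import AutoTokenizer

indented_line = "      return Math.abs(Math.round(diff / 365.25));"

# GPT-4's whitespace tokens typically make indented code cheaper to encode
for name in ["gpt2", "Xenova/gpt-4"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, "->", len(tokenizer(indented_line).input_ids), "tokens")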

Starcoder 2

StarCoder 2 is a 15-billion-parameter model focused on generating code.

show_tokens(text, "bigcode/starcoder2-15b")


It represents numbers such as 2024 with one token per digit, which leads to a better representation of numbers and mathematics. Similar to GPT-4, it encodes runs of whitespace as single tokens. On the other hand, some simple words and common nouns are split across multiple tokens.
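
A minimal sketch to inspect the digit handling (the exact splits depend on each vocabulary, so the code just prints them):

from transformers import AutoTokenizer

number = "2024"

# StarCoder 2 tends to split numbers into individual digits
for name in ["gpt2", "bigcode/starcoder2-15b"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, "->", tokenizer.tokenize(number))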

Galactica

Facebook's Galactica is an LLM focused on scientific knowledge and trained on a large corpus of scientific papers. Its primary uses are citation prediction, scientific QA, mathematical reasoning, summarization, document generation, molecular property prediction and entity extraction.

show_tokens(text, "facebook/galactica-125m")


Of all the tokenizers shown here, this one produces the largest number of tokens for our text, using multiple tokens for words such as javascript, Walmart, Hyderabad, Azure and PySpark.
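
To make this comparison concrete, here is a short sketch (reusing the text variable defined at the top of this post) that counts the tokens each tokenizer produces:

from transformers import AutoTokenizer

# Count the tokens each tokenizer produces for the same text
for name in ["Xenova/gpt-4", "bert-base-uncased", "facebook/galactica-125m"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {len(tokenizer(text).input_ids)} tokens")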

Phi-3 (and Llama-2)

Microsoft's Phi-3 reuses the tokenizer of Llama 2 and adds a number of special tokens. Phi-3 models are trained with supervised fine-tuning, proximal policy optimization and direct preference optimization, and are primarily built for precise instruction adherence.
Additional special tokens such as <|user|>, <|assistant|> and <|system|> are added for dealing with chat and conversations, and <|endoftext|> denotes the end of the text.
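
Here is a hedged sketch showing where these chat tokens appear, using the tokenizer's built-in chat template (apply_chat_template is part of the transformers tokenizer API):

from transformers import AutoTokenizer

phi3_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

messages = [
    {"role": "user", "content": "How many years of experience does Harsha have?"},
    {"role": "assistant", "content": "About 10 years."},
]

# Render the conversation as text to see the special chat tokens in place
print(phi3_tokenizer.apply_chat_template(messages, tokenize=False))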

show_tokens(text, "microsoft/Phi-3-mini-4k-instruct")


Different tokenizers split the same words in different ways, driven largely by three factors:
1. Tokenization method: the most popular is Byte Pair Encoding (BPE)
2. Tokenizer parameters: vocabulary size, special tokens and capitalisation (vocabulary sizes are compared in the sketch below)
3. Domain of the data: a tokenizer's behaviour depends on the data it was trained on, which is chosen for the specific use case it was built for. The examples above show how the same words are tokenized differently by tokenizers built for code, text, chat, etc.
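
As a closing sketch, vocabulary sizes can be compared directly:

from transformers import AutoTokenizer

model_names = [
    "bert-base-uncased",
    "gpt2",
    "google/flan-t5-base",
    "Xenova/gpt-4",
    "bigcode/starcoder2-15b",
    "facebook/galactica-125m",
    "microsoft/Phi-3-mini-4k-instruct",
]

# Vocabulary size is one of the tokenizer parameters that drives the differences above
for name in model_names:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(f"{name}: vocabulary size = {tokenizer.vocab_size}")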

References

Hands-On Large Language Models by Jay Alammar, Maarten Grootendorst
