With Andrej Karpathy


The Tokenizer is a necessary and pervasive component of Large Language Models (LLMs), where it translates between strings and tokens (text chunks). Tokenizers are a completely separate stage of the LLM pipeline: they have their own training sets, training algorithms (Byte Pair Encoding), and after training implement two fundamental functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI. In the process, we will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We’ll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
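
Since the lecture builds exactly these two functions, a minimal sketch may help fix ideas before watching: byte-level BPE, with the get_stats/merge helper names mirroring minbpe, trained on the toy string from the minbpe README (three merges). This is an illustration of the algorithm under those assumptions, not the lecture's verbatim code.

```python
def get_stats(ids):
    """Count occurrences of each consecutive token pair."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    """Replace every occurrence of `pair` in `ids` with new token `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# --- training: start from raw UTF-8 bytes, repeatedly merge the top pair ---
text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # tokens 0..255 are the raw bytes
merges = {}                       # (int, int) -> new token id
for i in range(3):                # 3 merges, as in the minbpe README example
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)
    merges[pair] = 256 + i
    ids = merge(ids, pair, 256 + i)
print(ids)  # [258, 100, 258, 97, 99]

# --- decode(): token ids -> string, via a bytes vocabulary ---
vocab = {i: bytes([i]) for i in range(256)}
for (p0, p1), idx in merges.items():  # dict insertion order = merge order
    vocab[idx] = vocab[p0] + vocab[p1]

def decode(ids):
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")

# --- encode(): string -> token ids, applying merges in training order ---
def encode(s):
    ids = list(s.encode("utf-8"))
    while len(ids) >= 2:
        stats = get_stats(ids)
        pair = min(stats, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no more learned merges apply
        ids = merge(ids, pair, merges[pair])
    return ids

print(decode(encode("aaabdaaabac")) == "aaabdaaabac")  # True
```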

Chapters:
00:00:00 intro: Tokenization, GPT-2 paper, tokenization-related issues
00:05:50 tokenization by example in a Web UI (tiktokenizer)
00:14:56 strings in Python, Unicode code points
00:18:15 Unicode byte encodings, ASCII, UTF-8, UTF-16, UTF-32
00:22:47 daydreaming: deleting tokenization
00:23:50 Byte Pair Encoding (BPE) algorithm walkthrough
00:27:02 starting the implementation
00:28:35 counting consecutive pairs, finding most common pair
00:30:36 merging the most common pair
00:34:58 training the tokenizer: adding the while loop, compression ratio
00:39:20 tokenizer/LLM diagram: it is a completely separate stage
00:42:47 decoding tokens to strings
00:48:21 encoding strings to tokens
00:57:36 regex patterns to force splits across categories
01:11:38 tiktoken library intro, differences between GPT-2/GPT-4 regex (see the sketch after this list)
01:14:59 GPT-2 encoder.py released by OpenAI walkthrough
01:18:26 special tokens, tiktoken handling of, GPT-2/GPT-4 differences
01:25:28 minbpe exercise time! write your own GPT-4 tokenizer
01:28:42 sentencepiece library intro, used to train Llama 2 vocabulary
01:43:27 how to set vocabulary size? revisiting gpt.py transformer
01:48:11 training new tokens, example of prompt compression
01:49:58 multimodal [image, video, audio] tokenization with vector quantization
01:51:41 revisiting and explaining the quirks of LLM tokenization
02:10:20 final recommendations
02:12:50 ??? :)
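
For the tiktoken chapters above, the GPT-2 vs. GPT-4 comparison can be reproduced directly, assuming the tiktoken package is installed (pip install tiktoken). "gpt2" and "cl100k_base" are the encoding names for GPT-2 and GPT-4 respectively; this is a usage sketch, not lecture code.

```python
import tiktoken

enc_gpt2 = tiktoken.get_encoding("gpt2")         # GPT-2 encoding
enc_gpt4 = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding

print(enc_gpt2.n_vocab, enc_gpt4.n_vocab)  # 50257 vs. ~100k tokens

s = "    hello world!!!"
print(enc_gpt2.encode(s))  # typically more, smaller tokens
print(enc_gpt4.encode(s))  # fewer tokens; whitespace is grouped more aggressively
print(enc_gpt4.decode(enc_gpt4.encode(s)) == s)  # encode/decode round-trips: True
```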

Exercises:
- Advised flow: reference this document and try to implement the steps yourself before I give away the partial solutions in the video. The full solutions, if you're getting stuck, are in the minbpe code: https://github.com/karpathy/minbpe/bl (a usage sketch follows)
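
For reference, the trained tokenizer you end up with has roughly this interface. The sketch below follows the toy example in the minbpe README and assumes the repo is installed or importable:

```python
from minbpe import BasicTokenizer

tokenizer = BasicTokenizer()
text = "aaabdaaabac"
tokenizer.train(text, 256 + 3)  # 256 byte tokens, then 3 merges
print(tokenizer.encode(text))   # [258, 100, 258, 97, 99]
print(tokenizer.decode([258, 100, 258, 97, 99]))  # "aaabdaaabac"
```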
