Newsletter 6: Decoding GenAI: Your Easy-Peasy Guide to Understanding the Future
Did you know that if Google engineers had not produced two specific pieces of research in the last decade, and if a team of Ukrainian-, Russian- and British-born scientists working in Canada had not won a computer vision competition, we would not have seen this explosion in the potential and use of Generative AI?
In this letter, I am going to try to write a plain-English primer on Generative AI and the principles that form its basis.
I am inspired to write this not only because of a recurrent theme in the comments from members of this community but also, selfishly, to try to emulate the Feynman Technique: if I can express it so that even a child can understand it, then I can be sure I understand it clearly myself.
I by no means intend to insult your intelligence; clearly there are no children in this community. If you already know everything there is to know, please skip ahead. After this primer, with all of us established at the same baseline, subsequent newsletters will dig into specific areas and explain how these technologies might affect various businesses.
The point: Zero to GPT-4: A Sprint in AI's Evolution
Three seemingly unrelated pieces of research combined to bring this amazing potential to life. Serendipitously, it also took the widespread availability of GPUs, chips built primarily for gaming graphics cards, a market in which Nvidia was the leader.
First, in 2012, an AI architecture called AlexNet won the ImageNet Large Scale Visual Recognition Challenge on September 30, 2012. AlexNet was designed by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, Krizhevsky's Ph.D. advisor at the University of Toronto. It was so much better than anything else that it caught everyone's attention, especially Google's, which supported and later hired the researchers. You may have heard of Hinton, now colloquially referred to as the "Godfather of AI". Alex's key insight was to train his AI model on GPUs, which cut training time dramatically, ratcheted up accuracy, and produced unprecedented results.
Second, in 2013, a team of researchers led by Tomáš Mikolov at Google built something called word2vec: they read in millions of items from Google News, replaced words with tokens, and organized those tokens in relation to one another. As humans, we do this too. For example, when we see someone holding their hand and crying, we conclude that they must be hurt; so we know crying, pain and hurt are correlated. Sometimes the hurt is not physical, and we can tell the difference; that comes from experience. We use a similar idea to figure out how close one place is to another using the geographic coordinate system of latitudes and longitudes. Another example: if we read a novel about an imaginary place and time, we start forming correlations between characters, places and events; that's what makes for a good story. Except we humans are limited in how many correlations we can ingest, depending on how much attention we are paying, and we tend to forget with time; we are human after all. Computers have no such encumbrance. They diligently form these correlations in dimensions and ranges that are incomprehensible to us, and as long as computing power and training data are available, they keep going until they run out of one or the other, or the model is done.
So imagine a computer powered by GPUs replacing every word in the English language with a numerical token, and then storing those tokens with all of their respective relationships based on how they appear in its training datasets. Since words are now just numerical representations, computers can use their core competence, numerical computation, to represent these associations easily and retrieve them quickly. Imagine a giant multidimensional word cloud with every word in your language represented, trained by reading the entire internet!!
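If you like seeing an idea in code, here is a tiny Python sketch of that intuition. The words and vectors below are made up purely for illustration; real word2vec embeddings have hundreds of dimensions learned from billions of words, but the way "closeness" is measured (cosine similarity) is the same basic idea.

```python
import numpy as np

# Toy word vectors. In a real word2vec model these have hundreds of
# dimensions and are learned from billions of words; the numbers below
# are made up purely for illustration.
vectors = {
    "crying": np.array([0.90, 0.80, 0.10]),
    "hurt":   np.array([0.85, 0.85, 0.15]),
    "banana": np.array([0.10, 0.05, 0.90]),
}

def cosine_similarity(a, b):
    # How "close" two word vectors are: 1.0 means pointing the same way.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["crying"], vectors["hurt"]))    # high: related concepts
print(cosine_similarity(vectors["crying"], vectors["banana"]))  # low: unrelated concepts
```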
Third, in 2017, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin, again researchers at Google, came up with a technique that lets a model weigh the importance of different words, or tokens, in a sequence while processing that sequence. By applying "attention", a common human failing but a strength of computers, this method captures dependencies between words much more effectively. The model was called a "Transformer", introduced in a now-famous paper titled "Attention Is All You Need". Stacked in layers, these transformers each pay attention to only a particular slice of the input and downplay the rest, leaving it to the next layer. Imagine the rows of people again, this time sorting mixed coins: the first row pulls out only dimes, the next one nickels, and so on, until the last row can neatly stack the coins and bills and probably also tally the total amount of money. Definitely faster and more efficient than one person doing it coin by coin, because each row has multiple attention heads and there are multiple rows.
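For the technically curious, here is a minimal Python sketch of the core attention calculation (scaled dot-product self-attention). The token vectors are random stand-ins, and a real Transformer learns separate query, key and value projections that I skip here to keep the sketch short.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens):
    # Scaled dot-product self-attention: every token scores every other token,
    # the scores become weights, and each token becomes a weighted blend of the rest.
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)   # how much each token "attends to" each other token
    weights = softmax(scores)                 # each row sums to 1
    return weights @ tokens, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))              # 4 tokens, each an 8-number vector (toy values)
blended, weights = self_attention(tokens)
print(weights.round(2))                       # a 4x4 table of "who pays attention to whom"
```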
Ok, now put word2vec and the transformer together, stick them on top of speedy GPUs, give them the whole internet to read, and voila! The model can pretty much guess what a set of responses to a set of questions should be, or generate an image using a similar principle.
This is the basis of Generative AI: extremely accurate guesswork that is then improved through use in a reinforcement learning system. It does this for words, for pixels, for code, or really any set of data that is naturally created by human brains and is available as a training repository.
So how does this concoction of ideas actually work? Well, the transformer works in a two-step process: an attention step followed by a feed-forward step. When presented with a prompt, the transformer in its attention step looks around and discovers relationships by paying attention to how the tokens (aka words) appear. Remember also that these transformers are stacked on top of each other, and each layer has many attention heads, like a row of workers, each performing the one operation it is paying attention to. So each attention head discovers and records relationships in parallel. For example, one layer may look for verbs, another makes sense of prepositions, and the next guesses the contextual use of nouns (a bank as a financial institution versus the bank of a river). As these operations occur, the model builds up "hidden states" that it saves and uses as it passes the work from layer to layer, like tricks of the trade that a vast array of workers might use while working in rows, not easily understandable to an outside observer.
This is followed by a feed-forward step, where the model thinks about the information from the attention step and predicts the next word. The AI therefore reasons its way to the next word through a sequence of probabilities, and this feed-forward step draws on the imaginary giant token cloud it built from its training data.
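Here is an equally toy sketch of one transformer "layer" doing that two-step, attention then feed-forward, and then scoring a tiny made-up vocabulary to guess the next word. The weights are random, i.e. untrained, so the guess is meaningless; the point is only the shape of the computation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
d_model, d_ff = 8, 32                          # tiny sizes purely for illustration
vocabulary = ["the", "cat", "sat", "on", "mat"]

# Random weights stand in for what training would actually learn.
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
W_out = rng.normal(size=(d_model, len(vocabulary)))

tokens = rng.normal(size=(4, d_model))         # 4 prompt tokens, already turned into vectors

# Attention step: each token looks around and blends in what it finds.
weights = softmax(tokens @ tokens.T / np.sqrt(d_model))
attended = weights @ tokens

# Feed-forward step: "think about" what attention found, then score every word.
hidden = np.maximum(0, attended @ W1) @ W2
next_word_probs = softmax(hidden[-1] @ W_out)        # probabilities over the tiny vocabulary
print(vocabulary[int(np.argmax(next_word_probs))])   # the (untrained, so random) guess
```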
All of this runs on the turbocharged processing of GPUs, in unimaginably large-scale parallelism. In fact, all of this looking around and guessing is just mathematical operations, which can be done in a straightforward way because everything is now just a numerical token. Math wins!!
OK, now to demystify the acronyms we often hear. First, ChatGPT. The GPT we chat with stands for Generative Pre-trained Transformer; thank you, Ashish et al from Google! This is called a large language model (LLM), basically a type of language model notable for its ability to achieve general-purpose language understanding and generation using the techniques I described above. One thing to add here: IBM had done work in the 1980s on building something called a "language model", which also used probability to guess a series of words based on the words given so far. You have gotten used to this by now through a variety of applications, including autocomplete, speech recognition, machine translation, optical character recognition, handwriting recognition and grammar induction, examples of which we all encounter in our daily lives (curse you, Apple autocomplete!). However, the LLMs took a very different approach and significantly improved the outcomes using the ideas from word2vec and Transformers, plus the additional oomph from GPUs, which only came of age as video games became more and more capable and gamers wanted extremely realistic graphics. That was the growth story for Nvidia. A perfect confluence of happenings! Many LLMs have now come out: Meta has LLaMA and Code Llama, Google has PaLM, and so on; some are proprietary and some open source. And the race is on.
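For contrast with the LLM approach, here is a toy, count-based "language model" in Python, roughly in the spirit of those older statistical and autocomplete-style systems (not a faithful reproduction of any real one, just an illustration): it guesses the next word purely from how often word pairs appeared in the text it has seen.

```python
from collections import Counter, defaultdict

# A toy, count-based "language model": guess the next word purely from how
# often word pairs appear in a tiny corpus. Autocomplete-style models work
# in roughly this spirit, at vastly larger scale.
corpus = "the cat sat on the mat the cat ate the fish".split()

followers = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    followers[current_word][next_word] += 1

def predict_next(word):
    # Return the most frequent follower of `word`, if we have seen it before.
    counts = followers.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # 'cat' -- it followed 'the' most often in the corpus
print(predict_next("cat"))  # 'sat' -- ties are broken by whichever pair was seen first
```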
Let's talk about some scale and some numbers. After the Transformer work at Google in 2017, it took only a year for OpenAI to build its first GPT model in 2018. By March of 2023 they had already released GPT-4, the most well known of the Gen AI models. As Ron Burgundy said, "that escalated quickly!" In terms of scale, GPT-3 has 96 layers, each with 96 attention heads, so it performs 96x96=9,216 attention operations every time it guesses a new word. Each word in the word cloud is represented by 12,288 numbers to store all context and relationships. The hidden context is stored in hidden layers that can each hold almost 50,000 numbers, and we don't know exactly why or how, because that is a trick of its trade. Given the mind-bogglingly large scale (GPT-3 has hundreds of billions of parameters), we struggle to understand how it all works, and active research is ongoing to untangle it.
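A quick back-of-the-envelope check on those numbers. My assumption here is that the "almost 50,000" comes from the feed-forward layer being four times the embedding size, which is the usual Transformer convention:

```python
# Back-of-the-envelope arithmetic using the GPT-3 figures quoted above.
layers, heads, d_model = 96, 96, 12288

attention_ops_per_token = layers * heads  # 9,216 attention operations for every new word
ffn_width = 4 * d_model                   # 49,152: the "almost 50,000 numbers" per hidden layer,
                                          # assuming the usual 4x feed-forward convention
print(f"{attention_ops_per_token:,}")     # 9,216
print(f"{ffn_width:,}")                   # 49,152
```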
How did the giant word cloud get created in the first place?
Cue the training montage with rousing “motivational” music!
The LLM consumes the training data: Reddit, Twitter, code from GitHub or Stack Overflow, whatever it can get from the entire internet, plus other digitized offline information. It does this through a nifty two-step. First it does a forward pass and produces a guess at the end of its layers, like a series of hints passed through each layer of the network until the last layer reveals the guess. Just like the rows of humans, the first guess out of the last layer is probably random garbage. Then it does a very clever thing: it walks backwards intelligently through a technique called backpropagation and checks where in the network the "hint" started going off course, like finding the offending worker and giving them some tough feedback! Everywhere it finds the hints going astray, it makes adjustments. Since this is all mathematical, it does so by changing the weights at each node in each layer, again using mathematics, calculus to be specific. Given the size of the dataset (500 billion words in the case of GPT-3), this forwards-and-backwards is a gargantuan mathematical enterprise on the order of hundreds of billions of operations. With the horsepower from these powerful chips, instead of taking years it took only a few months. The result was the magical imaginary word cloud they could then use to answer our prompts in an amazingly accurate way. GPT-4 is even more powerful, and so on and so forth.
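If you want to see that forward-pass/backward-pass loop in code, here is a minimal PyTorch sketch. The tiny model and the random "next word" targets are placeholders of my own; real LLM training is essentially this same loop run over hundreds of billions of tokens on thousands of GPUs.

```python
import torch
import torch.nn as nn

# A minimal sketch of the loop described above: forward pass (make a guess),
# measure how wrong it was, backward pass (backpropagation), adjust the weights.
vocab_size, d_model = 100, 32
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),  # token id -> vector
    nn.Linear(d_model, vocab_size),     # vector -> a score for every possible next token
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

current_tokens = torch.randint(0, vocab_size, (64,))  # toy training batch
next_tokens = torch.randint(0, vocab_size, (64,))     # the "right answers" it should learn

for step in range(100):
    logits = model(current_tokens)       # forward pass: the guess
    loss = loss_fn(logits, next_tokens)  # how far off was it?
    optimizer.zero_grad()
    loss.backward()                      # backpropagation: walk backwards, assign blame
    optimizer.step()                     # nudge every weight to do a little better
```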
Now imagine training an LLM only on the data for one appliance, from one company, in one location. The task becomes trivial by comparison, yet hugely powerful for that appliance company. The same goes for, say, legal information, medical diagnosis, or training to take a test, which is why you hear about LLMs acing those tests. Keep in mind that the data used to train them only runs up to a certain date and consists of information created largely by humans. This information inherently describes the world we live in, in our own words, and so LLMs also begin to understand not just language but our world as described by us, making them smarter than a typical tween or many teenagers, or at least appearing so to us.
There you have it, that is my best attempt at a simple English explainer of GenAI, especially the wildly popular LLMs. In a later newsletter, I can dig into applications, newer developments and potential.
There is no Counterpoint this time, just some concerns:
Every new technology is born with enthusiasm; the hype cycle kicks in and the techno-optimists say, "what could go wrong?" Well, you know we always screw around, and then we hit the "find out" phase of our lived experience!
I will leave the ecological and macroeconomic implications for a later newsletter. Let's just look at what we talked about in this letter.
First, we have the real problem of bias. Unfortunately, humans are horribly flawed and riddled with perspectives, opinions and biases. As an example, we know that LLMs crawled Reddit, which can be a cesspool of hate, bias and dangerous opinion, so the LLMs learned that too. Tuning now tries to filter out offensive material, but the model learned at the massive scale I discussed before, so it is highly unlikely that such themes didn't also make it in. We already hear examples of people discovering our worst tendencies showing up in the outputs of these models. It is an area of active research, and it is acknowledged that bias is real.
Then there are hallucinations. Hey, all of us have mistaken a rope for a snake, haven't we? Our tendency to hallucinate also made it into the models, so they will make stuff up, inventing facts and events that never happened, just like some of us. But whom do you call BS on when the output is the creation of a machine?
Then there is attribution and copyright. These models learned from other people's work, and even though they may not plagiarize directly, how do we acknowledge and compensate the creators whose work was used in training? The legal battles have only just begun.
Lastly, there is the unnerving part of all of this. The Godfather of AI, Professor Hinton, is on record saying that AI models are becoming so capable that they will become smarter than us in the foreseeable future, achieving Artificial General Intelligence, or AGI. I always imagine Neo in The Matrix and the Terminator when I think of this. I may not have an attention span, but I do have a vivid imagination!! Here is a tantalizing reason why I believe it's possible. We humans have a special capability, honed over evolution, for social cognition. You may know some people on the autism spectrum (or many a low-EQ executive) who lack it, so you kind of understand what I am talking about. This is essentially the capacity to reason about others' mental states, and it is called Theory of Mind (ToM); yes, you should Google it! GPT-4 is reportedly about 90% accurate on ToM tasks. We don't know how to train this capacity deliberately (ahem, see the reference to executives above), yet given the training methods and the enormous corpus of data, this capability arose in GPT-4 autonomously; we have no idea how! So if we keep heading down this path, what else could we find that we don't understand!! Also, large corporations and governments have already advanced this further, and what haven't they told us?
The Aside:
As we navigate the complexities of Generative AI, we can't overlook the seminal contributions of Alan Turing, born on June 23, 1912. Turing, a British mathematician and logician, laid the groundwork for both computer science and artificial intelligence. His cryptographic work during World War II at Bletchley Park played a crucial role in deciphering the German Enigma code, thereby significantly shortening the war. Post-war, Turing introduced the concept of the Turing Machine, a theoretical framework that has shaped our understanding of computational processes.
Arguably one of Turing's most enduring contributions to AI is the Turing Test, introduced in his 1950 paper, "Computing Machinery and Intelligence." The test serves as a benchmark for machine intelligence, posing the question: Can a machine imitate human behavior indistinguishably? If a machine could converse with a human via text and the human couldn't determine whether they were interacting with a machine or another human, the machine would "pass" the Turing Test. This idea has deeply influenced AI research, acting as an aspirational goal for natural language processing and human-like cognition in machines.
Turing's life was tragically cut short in 1954 after a prosecution for "gross indecency" due to his homosexuality, which was criminalized in Britain at the time. Faced with the grim choices of imprisonment or chemical castration, he opted for the latter, leading to his death by cyanide poisoning at 41.
As we probe the frontiers of Generative AI, it's essential to remember that the innovations in this field are built on the pillars established by pioneers like Turing. Moreover, his life serves as a cautionary tale of how societal prejudices can stifle brilliance. Turing's legacy continues to provoke thought on both technological and ethical grounds, making him an enduring figure in the dialogue about AI's past, present, and future.
Take care of yourself,
-abhi