
Do you know Large Language Models?

Large Language Models, or LLMs, seem very new, as the term rose to popularity with ChatGPT (2022), but language models have in fact existed for decades (at least since Claude Shannon's 1948 paper "A Mathematical Theory of Communication"). The language models of that era were not large, but the main idea is still the same.

"A Mathematical Theory of Communication" introduced the method of n-grams, in which the probability of the next word, drawn from a vocabulary, is calculated from the preceding n−1 words of the sentence (a window of n words in total, hence "n-gram"). It rests on the notion that the context (the neighbouring words) carries information about the word in question. If you are familiar with Convolutional Neural Networks, they apply the same idea to image pixels.
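As a rough illustration, here is a minimal bigram (n = 2) model in Python; the toy corpus is made up for the example, and a real model would be estimated from far more text:

```python
from collections import Counter

# Toy corpus; purely illustrative.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigrams (pairs of adjacent words) and how often each word
# appears as the first element of a pair.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def next_word_probs(word):
    """Estimate P(next word | word) from bigram counts."""
    return {w2: count / unigrams[word]
            for (w1, w2), count in bigrams.items() if w1 == word}

print(next_word_probs("the"))  # → {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

After "the", the model predicts "cat" half the time, because half of the occurrences of "the" in the corpus are followed by "cat" — that relative frequency is the whole trick.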

In the 1980s, the concept of Recurrent Neural Networks (RNNs) was introduced: a series of mathematical neurons connected sequentially predicts an output for a sequential input (each input term fed in at each step). But this means predicting a single output/token/class for a whole sequence of inputs. What if we want a sentence or a phrase as the output? Around 2014, the idea of "encoder-decoder" networks was introduced for Machine Translation. The encoder network processes an input sentence, and its result is fed into the decoder network. The decoder generates words one by one, taking each word it has just generated as input for predicting the next. This became one of the most important ideas in Natural Language Processing, under the name "auto-regression": models that consume their own generated output to predict what comes next are called "auto-regressive" models.
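The decoding loop can be sketched in a few lines. The `transitions` table below is a made-up stand-in for a trained decoder network; the point is only the loop structure, where each step's output becomes the next step's input:

```python
import random

# Hypothetical next-token table standing in for a trained decoder;
# "<s>" and "</s>" mark the start and end of the sequence.
transitions = {
    "<s>": ["the"], "the": ["cat", "mat"],
    "cat": ["sat"], "sat": ["on"], "on": ["the"], "mat": ["</s>"],
}

def generate(max_len=10):
    """Auto-regressive generation: each step consumes the model's
    own previous output to pick the next token."""
    tokens = ["<s>"]
    for _ in range(max_len):
        nxt = random.choice(transitions[tokens[-1]])
        if nxt == "</s>":
            break
        tokens.append(nxt)
    return " ".join(tokens[1:])

print(generate())  # e.g. "the cat sat on the mat"
```

Replace the table lookup with a neural network and the tokens with words from a real vocabulary, and you have the skeleton of every auto-regressive language model since.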

Many NLP tasks, such as Machine Translation, Text Summarization, and Sentiment Analysis, were possible with deep encoder-decoder RNNs built from LSTM and GRU cells (cell types that give RNNs a memory of past inputs or generated outputs over very long sequences). Still, NLP lagged behind Computer Vision, where deep CNNs were effective and efficient. The text generated even by deep RNNs was not good enough, even for tasks like language translation.

The Transformer model, from "Attention Is All You Need" (NeurIPS 2017):
https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf


Fast-forward to 2017, the year that changed everything: Google published the first paper on Transformer models, "Attention Is All You Need". It introduced attention layers in encoder-decoder models and named the resulting architecture (with attention layers) the Transformer. An attention layer assigns a weight to every term in the sequence, and those weights are used when processing each next term. Sounds easy, right? Look at the picture of the Transformer model above, from that first paper. What happened next was easy to predict: models with more and more attention layers stacked on top of each other were created. This cracked the field of NLP wide open: the bigger the model, the better the outcome.
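The core computation is simpler than the diagram suggests. Here is a minimal sketch of scaled dot-product attention, the building block the paper stacks into multi-head layers, written with plain Python lists; a real implementation would use tensor libraries and learned projection matrices:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention.
    queries, keys, values: lists of d-dimensional vectors (lists of floats).
    For each query, score every key, softmax the scores into weights,
    and return the weighted average of the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

For example, a query that points strongly at the first key pulls the output toward the first value: `attention([[100.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])` returns a vector very close to `[10.0, 0.0]`.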

Year after year, models like BERT (2018), GPT (2018), GPT-2 (2019), GPT-3 (2020), GPT-3.5 (also known as ChatGPT, 2022), Bard (2023), and GPT-4 (2023) came into existence. We don't know the architecture of the most recent models, but the foundation is still the same. These models have billions of parameters (except BERT, which has hundreds of millions) and are trained on enormous amounts of data.

The capabilities of GPT-3.5, or ChatGPT, have been tested widely, and they are like nothing before. It's a know-it-all, but it does make mistakes. It can provide incorrect information, since the data it was trained on was taken from the internet, which, as you might know, can be deceptive. GPT-4, released in March 2023, improved things even further: better summarization, better question answering.



Now the question is: where does it end? Training bigger and bigger models is extremely expensive. The electricity used to train GPT-3, with its 175 billion parameters, was about 1,000 MWh, i.e. one million kWh ("units"); at roughly 600 units a month, that would power an average home for about 140 years. The race between big tech companies to build a better language model has become very harmful to the environment.

But is there any limit to this style of model? Or is it limited only by computation power and energy costs? And is this the path to AGI (Artificial General Intelligence)?


Well, of course this is not the path to AGI. Predicting the next word in a sentence is not something 'intelligence' can be collapsed into. LLMs are probably just good writing assistants.

These Large-Language-Models might merely be finishing our .... 




P.S.  ChatGPT couldn't finish mine.



