Large Language Models (LLMs) seem very new, since the term only became popular with ChatGPT (2022), but language models have in fact been around since Claude Shannon's 1948 paper "A Mathematical Theory of Communication". The language models of that time were not large, but the main idea is still the same.
"A Mathematical Theory of Communication" introduced the method of n-grams, in which the probability of the next word, chosen from a vocabulary list, is calculated from the few words that precede it (the 'n' in n-gram is the size of that word window). It rests on the notion that the context (the neighbouring words) carries information about the word under consideration. If you are familiar with Convolutional Neural Networks, they use the same idea, but for image pixels instead of words.
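To make the n-gram idea concrete, here is a minimal sketch of a bigram model (n = 2) in Python. The toy corpus is made up for illustration; a real model would count over a huge text collection.

```python
from collections import defaultdict, Counter

# Toy corpus, purely for illustration.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows a given word (bigram counts).
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Probability distribution over the next word, given the previous one."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

print(next_word_probs("the"))
# "the" is followed by cat (2x), mat (1x), fish (1x),
# so "cat" gets probability 0.5 and the others 0.25 each.
```

A larger n would condition on more preceding words, at the cost of needing far more data to see each n-gram often enough.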
In the 1980s, the concept of Recurrent Neural Networks (RNNs) was introduced: a network that processes a sequential input one term at a time, each term fed into the network in turn, to predict an output. But that means predicting a single output/token/class for a whole sequence of inputs. What if we want a sentence or a phrase as the output? In 2014, the "encoder-decoder" idea was introduced for Machine Translation. The encoder network processes the input sentence and passes the result to the decoder network. The decoder then generates words one by one, feeding each word it generates back in to predict the next one. This became one of the most important ideas in Natural Language Processing, and it got a name: models that consume their own generated output to predict what comes next are called "Auto-Regressive" models.
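The auto-regressive loop can be sketched in a few lines. Here `toy_model` is a hypothetical stand-in for a trained decoder network; the loop structure (append your own output, feed it back in) is the part that matters.

```python
def toy_model(tokens):
    # Hypothetical next-token rule standing in for a real neural network.
    rules = {"<start>": "the", "the": "cat", "cat": "sat", "sat": "<end>"}
    return rules.get(tokens[-1], "<end>")

def generate(max_len=10):
    tokens = ["<start>"]
    for _ in range(max_len):
        nxt = toy_model(tokens)   # predict from everything generated so far
        if nxt == "<end>":        # stop when the model says the sentence is done
            break
        tokens.append(nxt)        # feed the model's own output back in
    return tokens[1:]

print(generate())  # ['the', 'cat', 'sat']
```

Every decoder discussed below, up to and including ChatGPT, generates text with essentially this loop.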
Many NLP tasks like Machine Translation, Text Summarization, and Sentiment Analysis were possible with deep encoder-decoder RNNs built from LSTM and GRU cells (cell types that give RNNs a memory of past inputs and generated outputs over very long sequences). Still, NLP lagged behind Computer Vision, where deep CNNs were already effective and efficient. The text generated even by deep RNNs was not good enough, even for tasks like language translation.
Fast-forward to 2017, the year that changed everything: Google published the first paper on Transformer models, "Attention Is All You Need". It introduced the idea of Attention layers in encoder-decoder models, and called the resulting architecture (with attention layers) the Transformer. An attention layer assigns a weight to every term of the sequence, indicating how relevant that term is when processing the current one. Sounds easy... right? Look at the picture of the Transformer model above, from the original paper. What happened next was easy to predict: models with more and more attention layers stacked on top of each other. This opened a crack in the field of NLP: the bigger the model, the better the outcome.
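Here is a minimal sketch of the scaled dot-product attention from the paper, using NumPy. The input embeddings are random and only for illustration; real Transformers also learn projection matrices for the queries, keys, and values, which this sketch omits.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: each position's output is a weighted
    # average of all values V, with weights given by how well that
    # position's query matches every key.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) relevance scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Three tokens with 4-dimensional embeddings (random, for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, w = attention(x, x, x)   # self-attention: Q = K = V = x
print(w)                      # how much each token "attends" to each other token
```

The weight matrix `w` is exactly the "weights assigned to each term" described above, computed for all pairs of positions at once, which is what lets Transformers process a sequence in parallel instead of one step at a time like an RNN.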
Year after year, models like BERT (2018), GPT (2018), GPT-2 (2019), GPT-3 (2020), GPT-3.5 (the model behind ChatGPT, 2022), Bard (2023), and GPT-4 (2023) came into existence. We don't know the architecture of the most recent models, but the foundation is still the same. These models have billions of parameters (except BERT) and are trained on enormous amounts of data.
The capabilities of GPT-3.5, or ChatGPT, have been tested widely, and they are like nothing before. It's a know-it-all, but it does make mistakes. It sometimes gives incorrect information, because the data it was trained on was taken from the internet, which, as you might know, can be deceptive. GPT-4, released in March 2023, improved things even further: better summarization, better question answering.
Now the question is: where does it end? Training bigger and bigger models is extremely expensive. The electricity used to train GPT-3, with its 175 billion parameters, was about 1,000 MWh (1 million kWh; an average home uses around 600 kWh a month). The race between big tech companies to build a better language model has become very harmful to the environment.
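A quick back-of-the-envelope check of that figure, using only the numbers quoted above:

```python
# Numbers from the text above (rough estimates, not official figures).
TRAINING_KWH = 1_000_000    # ~1,000 MWh quoted for training GPT-3
HOME_KWH_PER_MONTH = 600    # average household consumption from the text

home_months = TRAINING_KWH / HOME_KWH_PER_MONTH
print(f"{home_months:.0f} home-months, i.e. about {home_months / 12:.0f} home-years")
# One training run is roughly 1,667 home-months of electricity,
# or one home powered for well over a century.
```

And that is a single training run of a single model, before counting failed experiments or the energy spent serving the model to users.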
But is there any limit to this style of model, or is it limited only by computation power and energy costs? And is this the path to AGI (Artificial General Intelligence)?
Well, of course this is not the path to AGI. Predicting the next word in a sentence is not what 'intelligence' can be collapsed into. LLMs are probably just good writing assistants.
These Large-Language-Models might merely be finishing our ....
P.S. ChatGPT couldn't finish mine.
