Artificial intelligence can undoubtedly be considered the most trending topic of 2023, and among all artificial intelligence products, ChatGPT is the most popular.
We all know ChatGPT for its ability to understand complex text and provide correct, accurate answers. However, few people know how this popular artificial intelligence was created and how it works.
ChatGPT is a chatbot built on artificial intelligence that lets us hold conversations much like everyday conversation. Its language model can answer a wide range of questions and help us with tasks such as writing emails or articles, or even coding.
But how did ChatGPT manage to understand questions and provide accurate answers? Recently, an author at Towards Data Science examined this question in detail and shared the results. In this article, we share a translation of those findings with you.
ChatGPT's accurate and correct answers are the result of advanced technology and years of research. How ChatGPT works can be complex, so in this article we walk through the details of this chatbot.
To that end, we first introduce large language models. Next, we outline the GPT-3 training mechanism and examine the learning from human feedback that has led to ChatGPT's impressive performance. To learn more about ChatGPT, stay with us until the end of the article.
Getting to know large language models
A large language model (LLM) is a type of machine learning model trained to interpret human language. LLMs rely on vast datasets and the technological infrastructure needed to process large amounts of text.
Today, with advances in technology and computing power, LLMs have become far more capable than in the past; as the input dataset and parameter space grow, so do an LLM's capabilities and functions.
An earlier standard method for training language models was to predict the next word in a sequence using a long short-term memory (LSTM) network. LSTMs can work with sequential data such as text and audio.
In this training approach, the model examines the words before and after a gap and, based on that context, fills the gap with the most appropriate word. The process is repeated many times so the model learns to generate accurate predictions.
This training takes two forms: Next Token Prediction (NTP) and Masked Language Modeling (MLM). In both, the model must choose the best word to fill in a blank; the difference is where the blank sits.
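As a rough illustration, the sketch below contrasts the two objectives on a toy sentence. The vocabulary and the toy_score function are made-up placeholders standing in for a real language model, not part of any actual training code.

```python
# Toy contrast between NTP and MLM; toy_score is a placeholder, not a real model.
sentence = ["ChatGPT", "can", "answer", "many", "questions"]
vocab = ["questions", "bananas", "emails"]

def toy_score(context, candidate):
    """Stand-in for a model that scores how well a candidate word fits a context."""
    return float(len(candidate))  # nonsense heuristic, for illustration only

# Next Token Prediction (NTP): the blank is always at the END of the sequence.
ntp_context = sentence[:-1]                              # "ChatGPT can answer many ___"
ntp_choice = max(vocab, key=lambda w: toy_score(ntp_context, w))

# Masked Language Modeling (MLM): the blank can sit ANYWHERE in the sequence.
mlm_context = sentence[:2] + ["[MASK]"] + sentence[3:]   # "ChatGPT can [MASK] many questions"
mlm_choice = max(vocab, key=lambda w: toy_score(mlm_context, w))

print(ntp_choice, mlm_choice)
```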
Limitations of training with LSTM
Training with LSTM also has limitations. Consider this example:
Ali is … about studying. (enthusiastic / opposed)
If you are asked to fill in the blank with the correct word, you first need to know something about “Ali,” because people’s interests differ. If you know that Ali likes studying, you choose “enthusiastic.”
However, the model cannot weigh the words correctly; it may treat “studying” as more important than “Ali” in this phrase. In that case, since many people dislike studying and doing homework, the model chooses “opposed.”
Moreover, in this model the input data is processed item by item, sequentially, rather than as a complete set; as a result, an LSTM's ability to understand and process complex relationships between words and meanings is limited.
Transformer model
In response to this issue, in 2017 a team from Google Brain introduced the transformer model. Unlike LSTMs, transformers can process all of the input data simultaneously.
Transformers also use a mechanism called self-attention. Self-attention measures the relevance of the components of a set of data to one another in order to build a more accurate representation of the whole.
With the help of this mechanism, transformers can examine the different components of a sentence or phrase more precisely and understand how they relate to each other. This feature enables transformers to understand and process much larger datasets.
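To make the idea concrete, here is a minimal sketch of scaled dot-product self-attention in Python with NumPy. The embedding size, the random projection matrices, and the self_attention helper are illustrative assumptions; in a real transformer these projections are learned during training.

```python
import numpy as np

def self_attention(X, rng):
    """X: (sequence_length, d_model) matrix of token embeddings."""
    d_model = X.shape[1]
    # Learned projections in a real model; random placeholders here.
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Every token attends to every other token at the same time.
    scores = Q @ K.T / np.sqrt(d_model)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # each output row mixes information from the whole sequence

tokens = np.random.default_rng(1).normal(size=(5, 8))           # 5 tokens, 8-dimensional embeddings
print(self_attention(tokens, np.random.default_rng(0)).shape)   # (5, 8)
```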
Mechanism of self-attention in GPT
OpenAI, the company behind ChatGPT, has not built only this one chatbot; since 2018, it has been developing a family of models called the Generative Pre-trained Transformer (GPT).
The first model was named GPT-1, and its improved successors, GPT-2 and GPT-3, were released in 2019 and 2020. More recently, in 2022, OpenAI unveiled its newest models, InstructGPT and ChatGPT.
While the change from GPT-1 to GPT-2 was not much of a technological leap, GPT-3 brought significant changes. Improvements in computational efficiency enabled GPT-3 to train on more data than GPT-2 and build a more diverse knowledge base, so the third version could perform a much wider variety of tasks.
All GPT models use a transformer architecture, with an encoder to process the input sequence and a decoder to produce the output sequence.
Both the encoder and the decoder use a multi-head self-attention mechanism that allows the model to examine and analyze different parts of the sequence. To do this, self-attention converts tokens (pieces of text, such as words, sentences, or other groups of text) into vectors that represent the importance of each token within the phrase.
The encoder also uses masked language modeling to understand the relationships between words and provide better answers. In addition, the multi-head self-attention mechanism used in GPT repeats this weighting process several times rather than scoring the words only once, which allows the model to grasp subtler concepts and more complex relationships in the input data.
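As a rough sketch of that "repeat the process several times" idea, the snippet below reuses the self_attention helper from the earlier sketch and runs it as several heads, each drawing its own (here random, in practice learned) projections. The head count and dimensions are illustrative assumptions.

```python
import numpy as np

def multi_head_attention(X, num_heads=4):
    rng = np.random.default_rng(42)
    # Each head draws different projections, so each pass weighs the tokens differently.
    heads = [self_attention(X, rng) for _ in range(num_heads)]
    concat = np.concatenate(heads, axis=-1)               # (seq_len, num_heads * d_model)
    W_o = rng.normal(size=(concat.shape[1], X.shape[1]))  # output projection (learned in practice)
    return concat @ W_o                                   # back to (seq_len, d_model)

tokens = np.random.default_rng(1).normal(size=(5, 8))
print(multi_head_attention(tokens).shape)  # (5, 8)
```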
Problems and limitations of GPT-3
Although GPT-3 significantly improved natural language processing (the processing of the languages humans use), this advanced version still had problems and limitations. For example, GPT-3 has difficulty understanding users' instructions correctly and cannot always help them as it should. In addition, GPT-3 sometimes produces incorrect or fabricated information.
Another critical issue is that the model cannot properly explain its behavior, so users do not know how GPT-3 reached its conclusions and decisions. The third version also lacks adequate filters and may produce offensive or hurtful content. These are the problems that OpenAI tried to fix in the following versions.
ChatGPT and the stages of its development
To solve the problems of GPT-3 and improve the overall performance of standard LLMs, OpenAI introduced the InstructGPT language model, which later became ChatGPT.
InstructGPT brought huge improvements over previous OpenAI models, and its new approach of using human feedback in the training process produced much better outputs. This training method, called reinforcement learning from human feedback (RLHF), plays a vital role in understanding human goals and expectations when answering questions.
OpenAI's creation of this training approach and development of ChatGPT involved three general steps, which we explain below.
Step 1: Supervised Fine-Tuning (SFT) model
In the early stages of improving GPT-3, OpenAI hired 40 contractors to create supervised training datasets for the model to learn from. The input prompts were collected from real user requests logged by OpenAI. The SFT model, GPT-3.5, was then fine-tuned on this dataset.
After collecting the prompts, OpenAI asked participants to identify and categorize how users phrase their requests and questions. The OpenAI team tried to maximize the diversity of the dataset, and any data containing personally identifiable information was removed. This investigation identified three main ways of phrasing requests:
- Direct requests; for example, "Tell me about a topic."
- Few-shot requests, which are more complex; for example, "Based on the two example stories I sent, write a new story on the same topic."
- Continuation requests, which ask for something to be continued; for example, "Finish this story based on the introduction."
Finally, combining the prompts logged in the OpenAI database with those handwritten by the participants produced about 13,000 input/output samples for training the model.
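As a minimal sketch of what supervised fine-tuning on such prompt/response demonstrations can look like, the snippet below fine-tunes a small public GPT-2 model with a standard causal language modeling loss. The model choice, the toy prompt/response pairs, and the hyperparameters are illustrative assumptions, not OpenAI's actual setup.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Toy stand-in for the 13,000 prompt/response demonstrations.
pairs = [
    ("Tell me about transformers.", "A transformer is a neural network that ..."),
    ("Finish this story: Once upon a time,", "there was a model that loved to read."),
]
texts = [prompt + "\n" + response + tokenizer.eos_token for prompt, response in pairs]
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for step in range(3):  # a few illustrative steps; real fine-tuning runs much longer
    # Standard causal LM objective: predict each next token of the demonstration.
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss {out.loss.item():.3f}")
```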
Step 2: Reward model
After the SFT training in the first step, the model could provide more appropriate responses to user requests. However, it was still incomplete and needed to improve; that improvement was made possible by the reward model and reinforcement learning.
In this approach, the model tries to find the choices that produce the best result in different situations. In reinforcement learning, the model is rewarded for appropriate choices and behavior and penalized for wrong ones. Through these rewards and penalties, the SFT model learns to generate the best outputs for the input data.
To use reinforcement learning, we need a reward model that determines which outputs are rewarded and which are penalized. To train the reward model, OpenAI showed participants 4 to 9 outputs of the SFT model for each input and asked them to rank them from best to worst. This ranking created a way to measure the SFT model's performance and continuously improve it.
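A common way to turn such human rankings into a training signal, in the spirit of the InstructGPT paper, is a pairwise ranking loss: for every pair of ranked outputs, the preferred one should receive a higher scalar reward. The sketch below is a hypothetical illustration; the reward scores are placeholders standing in for a real reward model's outputs.

```python
import itertools
import torch
import torch.nn.functional as F

def reward_model_loss(scores_in_rank_order):
    """scores_in_rank_order: reward scores for one prompt's outputs, best-ranked first."""
    losses = []
    for i, j in itertools.combinations(range(len(scores_in_rank_order)), 2):
        preferred, rejected = scores_in_rank_order[i], scores_in_rank_order[j]
        # -log sigmoid(r_preferred - r_rejected): pushes the preferred output's score higher.
        losses.append(-F.logsigmoid(preferred - rejected))
    return torch.stack(losses).mean()

# Example: a labeler ranked 4 SFT outputs; a reward model scored them like this.
scores = torch.tensor([1.2, 0.7, -0.1, -0.5], requires_grad=True)
loss = reward_model_loss(scores)
loss.backward()  # in training, these gradients would update the reward model's parameters
print(round(loss.item(), 3))
```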
Step 3: Reinforcement learning model
In the third step, after the reward model was built, random prompts were fed to the model, which tried to produce the outputs that earn the most reward. The outputs are reviewed and scored by the reward model developed in the second step, and those scores are fed back to the model to improve its performance.
The method used to update the model as each response is generated is Proximal Policy Optimization (PPO), developed in 2017 by OpenAI co-founder John Schulman and his team.
PPO also includes a Kullback-Leibler (KL) divergence penalty, which is essential in this model. In reinforcement learning, a model can sometimes learn to game its reward system to achieve a desired outcome, generating patterns that score highly despite not being sound outputs.
The KL penalty addresses this problem: it keeps a high score from being the only criterion for an output, by preventing the model's outputs from drifting too far from those produced by the SFT model in the first step.
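As a rough sketch of how such a penalty can be folded into the reward signal during the RL step, the snippet below subtracts a per-token KL estimate (the gap between the current policy's log-probabilities and the frozen SFT model's) from the reward model's score. The beta coefficient, the log-probability values, and the penalized_reward helper are illustrative assumptions, not OpenAI's actual implementation.

```python
import torch

def penalized_reward(reward_model_score, policy_logprobs, sft_logprobs, beta=0.02):
    """Subtract a KL estimate from the scalar reward-model score."""
    # The sum of log pi(token) - log pi_SFT(token) over the generated tokens
    # approximates how far the policy has drifted from the SFT model.
    kl_estimate = (policy_logprobs - sft_logprobs).sum()
    return reward_model_score - beta * kl_estimate

# Example: the reward model liked this response (score 2.1), but the policy has
# drifted noticeably from the SFT model, so the effective reward shrinks a bit.
score = torch.tensor(2.1)
policy_logprobs = torch.tensor([-1.0, -0.8, -1.2])  # placeholder per-token log-probs
sft_logprobs = torch.tensor([-1.5, -1.4, -1.6])
print(penalized_reward(score, policy_logprobs, sft_logprobs).item())
```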
Model evaluation
After the main steps of building and training the model are complete, a series of tests is run to determine whether the new model performs better than the previous one. This evaluation consists of three parts.
First, the model's overall performance and its ability to understand and follow the user's instructions were evaluated. According to the results of the experiments, participants preferred InstructGPT's outputs to GPT-3's roughly 85% of the time.
Second, the new model was better at providing truthful information: with the help of PPO, its outputs contained more correct and accurate information. Finally, InstructGPT's tendency to produce, or its ability to avoid, inappropriate, derogatory, and malicious content was also investigated.
Surveys showed that the new model significantly reduces inappropriate content. Interestingly, though, when the model was deliberately asked to produce inappropriate responses, its outputs were considerably more offensive than those of GPT-3.
By the end of the evaluation phase, InstructGPT had recorded significant improvements, and its approach went on to power the popular ChatGPT chatbot. If you have more questions about how ChatGPT was developed and how it works, read the official article published by OpenAI.