What Is Reinforcement Learning And How Does It Work?

Machine learning is one of the main branches of artificial intelligence. Its many subcategories are usually grouped into three main paradigms: supervised learning, unsupervised learning, and reinforcement learning.

Supervised and unsupervised learning are closely related, since both learn from a fixed dataset, while reinforcement learning works differently from both.

What is reinforcement learning?

Machine learning refers to the science of designing machines that can perform assigned tasks automatically by learning from data samples and previous experience, without every action being explicitly programmed. Machine learning algorithms are divided into three primary groups: supervised, unsupervised, and reinforcement learning.

Reinforcement learning is a science of decision-making: it is about learning the optimal behavior in an environment in order to obtain the maximum reward. This optimal behavior is learned by interacting with the environment and observing how it responds, similar to the way children explore the world around them and learn the actions that help them achieve their goals.

Without a supervisor, the learner (model) must discover on its own which actions yield the maximum reward. This discovery proceeds by trial and error.

The reward depends on the quality of the actions performed and may arrive immediately or with a delay. A poor reward indicates that the model should improve the quality of its actions.

Reinforcement learning is a compelling paradigm in artificial intelligence, since models based on it can learn to do things without a supervisor that would generally not be feasible otherwise. In his book “Psychology: The Science of Mind and Behaviour,” Richard Gross defines learning as acquiring or modifying knowledge, behavior, skills, values, or performance.

Based on the above definition, reinforcement learning is an essential branch of machine learning in which an agent learns how to behave in an environment by performing actions and observing their results. The agent learns by trial and error and tries to collect the most reward through the actions it performs in the environment. In general, reinforcement learning is used to solve reward-based problems.
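Trial-and-error learning can be illustrated with a small sketch. The example below assumes a hypothetical "slot machine" problem in which each action pays a reward of 1 with a fixed probability unknown to the agent. The function name `run_bandit`, the parameter `epsilon` (how often the agent tries a random action), and the probabilities are illustrative assumptions, not part of the article.

```python
import random


def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error on a hypothetical multi-armed bandit."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)       # how often each action was tried
    estimates = [0.0] * len(reward_probs)  # average reward seen per action
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: pick the action that currently looks best.
            action = max(range(len(estimates)), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_bandit([0.2, 0.5, 0.8])
print(estimates, counts)
```

With enough steps, the agent typically ends up choosing the action with the highest payout probability most often, even though it was never told which action is best.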

As noted above, machine learning algorithms fall into three main groups: supervised, unsupervised, and reinforcement learning. Supervised learning relies on feedback that indicates whether a prediction is right or wrong, while unsupervised learning requires no feedback; the algorithm tries to group the data based on its underlying structure.

Reinforcement learning is similar to supervised learning in that it receives feedback, but the feedback is not provided for every input or state. Intelligent models are generally developed to improve their performance or behavior. Figure 1 shows the three machine learning models and the functional differences between them.

Figure 1

In supervised learning, a dataset with the desired labels is provided to the model, so that a function can calculate the error of each prediction.

Supervision occurs when a prediction is made and the error (actual vs. desired output) is fed back to adjust the model and drive learning.

In unsupervised learning, the dataset does not contain the desired output; hence, there is no way to supervise performance. Instead, the function tries to divide the dataset into classes so that each class contains a part of the dataset with common characteristics.

In reinforcement learning, the algorithm tries to learn which actions to take in which states in order to reach a target state. In this paradigm, the learning agent receives feedback in the form of a reward or penalty after its actions are evaluated.

Based on this definition, reinforcement learning is a general framework for learning problems that require sequential decisions, and it offers a mathematically grounded way of solving them. For example, to find a good policy, value-based methods such as Q-learning estimate how well an action fits a given state.

Feedback is not necessarily provided for every action; a reward is given only when warranted, for example when a task has been done well. Policy-based techniques, on the other hand, directly learn which action to take in each situation, without estimating how well each action fits the state. An important point to note is that the reward-and-punishment method is inspired by human, sequential approaches to education.
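To make the value-based idea concrete, here is a minimal sketch of a single update in tabular Q-learning, a common value-based method. It assumes a tiny hypothetical problem with two actions; the names `q_table`, `q_update`, `greedy_action`, `alpha` (learning rate), and `gamma` (discount rate) are illustrative choices and do not come from any particular library.

```python
from collections import defaultdict

# Hypothetical action set for illustration.
ACTIONS = ["left", "right"]

# Q-table: maps (state, action) to the estimated long-term reward.
q_table = defaultdict(float)


def q_update(state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update for a single observed transition."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])


def greedy_action(state):
    """Value-based policy: pick the action with the highest estimate."""
    return max(ACTIONS, key=lambda a: q_table[(state, a)])


# One observed transition: moving right from "s0" paid a reward of 1.
q_update("s0", "right", reward=1.0, next_state="s1")
print(q_table[("s0", "right")])  # 0.5
print(greedy_action("s0"))       # right
```

The policy here is never stored directly; it is derived from the value estimates. A policy-based method would instead learn the mapping from state to action itself.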

Now let’s examine each model and explore its critical approaches and algorithms.

On what basis do reinforcement learning algorithms work?

Most likely, you have played video games such as Call of Duty, Battlefield, or similar titles. While playing, you repeat the same cycle: you assess the situation, make a decision, act, and finally evaluate what you did to see whether you made the right decisions.

This iterative process helps you gain experience based on what you’ve done and realize what you did well and poorly. This approach will help you gradually get better at playing the game. This repetitive process of doing things is not limited to playing video games; we use the same pattern in most daily activities.

Reinforcement learning uses the same process to train machines: the agent learns by trial and error and tries to obtain the maximum reward by performing actions in its environment.

Suppose we have a store and hire an employee. This employee can perform various tasks, such as contacting customers and increasing sales, in return for a commission. Now imagine that this employee is the agent in our hypothetical store. The agent works in the company, so we can think of the company as the environment.

The agent is in a state. Every time it performs an action in the environment, its state changes and it enters a new state. Every action also brings a reward or a punishment. For example, if the employee has a fully successful sales day as planned, he receives a commission; if he does something wrong and sales do not go as expected, he receives none.

In this example, the agent continuously learns how to provide the best service. Over the course of this process, he picks up which tasks and actions lead to rewards, and his performance gradually improves toward the final goal.

Now let’s map the above example onto reinforcement learning.

In reinforcement learning, an agent exists in an environment and can act in it, just like us humans. The agent tries to maximize its rewards. Every action it takes has consequences: the result of each action is either a positive reward or a negative one (a punishment).

Over time, the agent learns from these results how to act more profitably; therefore, we can say that reinforcement learning is feedback-based learning. In the world of artificial intelligence, an intelligent agent is an autonomous entity that receives information about its environment through sensors, acts on it through actuators, and directs its activity toward achieving goals.

Intelligent agents use acquired knowledge or learning to achieve their goals. These agents may be simple or complex. Before implementing a deep reinforcement learning agent, researchers need a solid grasp of topics such as the different approaches to reinforcement learning, the idea of reward, and what “deep” means in deep reinforcement learning, in order to design and develop an accurate model.

Reinforcement learning has its origin in the way humans interact with their environment and learn from their own experience. The key concept in the reinforcement learning paradigm is that the agent completes the learning process by interacting with the environment and receiving rewards for its actions.

Suppose you have no knowledge of fire and you approach a flame. The flame is warm, which feels good, so you conclude that fire is a positive thing. Then you try to touch the fire, and your hand burns. Now you understand that fire is positive only at a proper distance, where you can enjoy its heat, and that getting too close will burn you.

This is how humans learn through interacting with the environment. Reinforcement learning is a computational approach in which the agent likewise learns by doing, or, more precisely, by taking actions.

Reinforcement learning and machine learning

More precisely, reinforcement learning is one of the essential learning paradigms, in which an agent learns to reach a goal in an uncertain and complex environment. In reinforcement learning, the model faces conditions similar to a video game. The computer uses trial and error to find a solution to a problem. To get the machine to do what the programmer wants, its actions are rewarded or penalized, and the goal of the machine is to maximize the total reward it receives. Although the programmer specifies the reward policy (the rules of the game), the programmer gives the model no hints about how to play. The machine must figure out how to use the result of each action to achieve the final goal.

Agent, state, and environment

Suppose an agent learns to play a video game like Super Mario by working through examples (Figure 2). The steps that a reinforcement learning model goes through to become skilled at this game are as follows:

  1.  The agent receives state S0 from the environment (in this example, the first frame (state) of the Super Mario game (environment)).
  2.  Based on state S0, the agent performs action A0, for example moving to the right.
  3.  The environment moves to a new frame, state S1.
  4.  The environment gives the agent reward R1.

Figure 2

This reinforcement learning loop produces a repeating sequence of state, action, and reward. The agent’s goal is to maximize the expected cumulative reward.
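The loop of state, action, and reward can be sketched in a few lines. The example below is only an illustration under stated assumptions: instead of Super Mario it uses a hypothetical corridor of five cells with the goal in the last cell, and the names `CorridorEnv`, `run_episode`, `always_right`, and `random_policy` are invented for this sketch.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor with cells 0..4, goal at cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial state S0

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (state, reward, done)."""
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run the state -> action -> reward loop until the episode ends."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # agent picks an action for the state
        state, reward, done = env.step(action)  # environment returns new state and reward
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    return 1


def random_policy(state):
    return random.choice([-1, 1])


print(run_episode(CorridorEnv(), always_right))  # 1.0
```

Swapping `always_right` for `random_policy` shows the same loop driven by pure trial and error; a learning agent would sit between these two extremes and improve its policy from the rewards it collects.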

What is the reward hypothesis?

Why is the agent’s goal to maximize the cumulative reward? Because reinforcement learning is built on the reward hypothesis: every goal can be described as the maximization of the expected cumulative reward. To achieve the best behavior, reinforcement learning must therefore maximize this quantity. The cumulative reward at each time step t can be written as:

G_t = R_{t+1} + R_{t+2} + \dots

which is equal to:

G_t = \sum_{k=0}^{\infty} R_{t+k+1}

A subtle point about this sum is that rewards are uncertain, so the calculation rests on probability. Rewards that come sooner (at the beginning of the game) are more likely to be obtained, because they are more predictable than long-term future rewards. The following example illustrates this:

In this hypothetical example, the agent is a tiny mouse and its opponent is a cat. The goal is for the mouse to eat as much cheese as possible before the cat eats the mouse. As shown in Figure 3, the mouse is more likely to eat the cheese near itself than the cheese near the cat (the closer to the cat, the greater the risk).

Figure 3

As a consequence, the reward near the cat is discounted even if it is larger (more cheese), because it is hard to reach and the agent cannot be sure of eating it. To discount the rewards, we proceed as follows:

  • We define a discount rate called gamma, between 0 and 1. The larger the gamma, the smaller the discount, so the learning agent gives more weight to long-term rewards. Conversely, the smaller the gamma, the greater the discount, so the agent pays more attention to short-term rewards. The discounted expected cumulative reward is calculated with the following formula:

    G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots

  • Put simply, each reward is discounted by gamma raised to the power of the time step. As the time step increases, the cat gets closer to the mouse, making future rewards less likely.
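The effect of gamma can be checked with a short calculation. The sketch below assumes two hypothetical reward sequences for the mouse: one piece of cheese right away, or three pieces that are only reached on the fifth step. The function name `discounted_return` and the reward values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Sum the rewards, weighting the k-th future reward by gamma**k (k = 0, 1, ...)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


# Hypothetical reward sequences: one cheese now vs. three cheeses after four empty steps.
near_cheese = [1, 0, 0, 0, 0]
far_cheese = [0, 0, 0, 0, 3]

for gamma in (1.0, 0.9, 0.5):
    print(gamma, discounted_return(near_cheese, gamma), discounted_return(far_cheese, gamma))
```

With gamma = 1 there is no discount, so the far cheese is worth 3 and wins. With gamma = 0.5 the far cheese is worth only 3 × 0.5^4 = 0.1875, less than the 1 earned from the near cheese, so a short-sighted agent prefers the immediate reward.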

Episodic or continuing tasks

A task is an instance of a reinforcement learning problem. An episode, in turn, is one self-contained sequence of events within a task. In reinforcement learning, there are two types of tasks: episodic and continuing.

In an episodic task, there is a starting point and an ending point (a terminal state), which together delimit an episode: a list of states, actions, rewards, and new states. For example, in our Super Mario game, an episode starts when a new Mario is launched and ends when Mario is killed or reaches the end of the stage.

In a continuing task, the interaction goes on forever, with no terminal state. The agent must learn to choose the best actions while it keeps interacting with the environment. A concrete example is an agent in charge of continuously monitoring stock changes. For this task, there is no start or end point; the agent keeps working until an operator decides to stop it.
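The difference between the two kinds of task shows up in how the interaction loop ends. The following is a rough sketch with invented names (`run_episodic`, `run_continuing`, `step_fn`, `should_stop`); the step functions are placeholders standing in for a real environment.

```python
import itertools


def run_episodic(step_fn, max_steps=1000):
    """Episodic task: loop until the environment signals a terminal state."""
    steps = 0
    done = False
    while not done and steps < max_steps:
        done = step_fn(steps)  # step_fn returns True when the episode is over
        steps += 1
    return steps


def run_continuing(step_fn, should_stop):
    """Continuing task: no terminal state; run until someone stops the agent."""
    for t in itertools.count():
        if should_stop(t):
            return t
        step_fn(t)


# Hypothetical episode: the end of the stage is reached on the 10th step.
episode_length = run_episodic(lambda t: t == 9)

# Hypothetical continuing task: the agent is halted from outside after 25 steps.
steps_before_stop = run_continuing(lambda t: None, should_stop=lambda t: t >= 25)

print(episode_length, steps_before_stop)  # 10 25
```

In the episodic case the stopping signal comes from the environment itself, while in the continuing case it comes from outside the task.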

Key terms in reinforcement learning

If you want to focus on the reinforcement learning paradigm, it helps to be familiar with some essential terms in this field. Because the number of terms is large, we mention only a few essential ones below:

  • Agent: The algorithm or model that performs actions and learns from them over time.
  • Environment: The world the agent interacts with and in which it performs its tasks.
  • Action: What the agent does in the environment, that is, its moves and reactions.
  • Reward: The result of an action. Every action yields a reward, which can be positive or negative.
  • State: The current situation of the agent in the environment. Actions performed by the agent can change the state, as in the Super Mario game mentioned earlier.
  • Policy: The strategy that determines which action the agent takes in each state in order to achieve the required result.
  • Value Function: This function tells the agent how much cumulative reward it can expect to receive starting from each state.
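Several of these terms can be seen together in a small sketch. The example assumes a hypothetical chain of three states followed by a terminal state, where stepping into the terminal state pays a reward of 1. The names `policy`, `transition`, `evaluate_policy`, and the constant `GAMMA` are illustrative assumptions; the value function is estimated by repeatedly applying the rule "value = reward + gamma × value of the next state".

```python
# Hypothetical deterministic chain: states 0, 1, 2 and a terminal state 3.
# Moving "right" from state 2 into the terminal state pays a reward of 1.
STATES = [0, 1, 2]
TERMINAL = 3
GAMMA = 0.9  # discount rate (assumed value for this sketch)

# Policy: a mapping from each state to the action the agent takes there.
policy = {0: "right", 1: "right", 2: "right"}


def transition(state, action):
    """Environment model: return (next_state, reward) for a state-action pair."""
    next_state = state + 1 if action == "right" else max(state - 1, 0)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def evaluate_policy(policy, sweeps=50):
    """Estimate V(s): the discounted return from s when following the policy."""
    values = {s: 0.0 for s in STATES + [TERMINAL]}
    for _ in range(sweeps):
        for s in STATES:
            next_state, reward = transition(s, policy[s])
            values[s] = reward + GAMMA * values[next_state]
    return values


values = evaluate_policy(policy)
print(values)
```

Under these assumptions the values come out to roughly 0.81, 0.9, and 1.0 for states 0, 1, and 2: the closer a state is to the reward, the higher its value.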

Last word

Reinforcement learning is one of the most advanced paradigms of machine learning and artificial intelligence, with great potential to revolutionize the world of information technology.

Reinforcement learning is also regarded as an effective way to give machines a form of creativity, because finding new and innovative ways to perform tasks is itself a form of creativity. For this reason, reinforcement learning may be the next step in the development of artificial intelligence.