Reinforcement Learning: A step towards Artificial General Intelligence

Yes, It's PacMan

Unlike most machine learning algorithms, which depend on pre-collected data, Reinforcement Learning is a machine learning method that learns from rewards received during the training of the algorithm.

Reinforcement learning is the closest we have come to mimicking the human ability to learn from experience. It is best applied to situations where an algorithm has to make decisions based on its environment.

Artificial General Intelligence (AGI) is the idea of a machine that can perform any intellectual task a human being can. Though we are far from achieving this feat as of now, AGI has long been one of the most sought-after goals in computer science. Today, most AI applications are problem-specific and fall well short of the ambitions of AGI. But with reinforcement learning, progress towards AGI becomes a real possibility.

AGI is closely related to meta-learning: the ability of a single algorithm to learn multiple tasks. The algorithm's job is to learn how to learn, and to generalize that ability to acquire new skills, the way humans do.

The closest we have come to this is DeepMind's IMPALA architecture, which can handle 30 different challenging tasks requiring various aspects of learning, memory, and navigation. Under the hood, IMPALA uses Reinforcement Learning. This method of learning has already produced some remarkable achievements, such as endowing an AI with emotion-like behaviour and mastering complex games like Go and poker.

So, How Does Reinforcement Learning (RL) Work?

Basically, RL is concerned with four elements: the environment, the agent, states, and actions. The environment is the surrounding in which a decision has to be taken. The agent is the one who takes the decision, and a state is a configuration of the environment. Each action by the agent moves the environment to a new state, and the agent receives a 'reward' for each action, which tells it whether that action was good or bad.

To put it more simply: say you (the agent) want to exit a room with a single door (the environment) as quickly as possible. Every step (action) you take towards the door (reaching a new state) gives you a positive reward, and every step you take away from the door gives a negative reward. Even standing still earns negative rewards, since your goal is to finish the task quickly. In other words, an agent takes an action in its present state in the environment, reaches a new state, and receives a reward for the state it has reached.
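The loop described above can be sketched in a few lines of Python. The room length, reward values, and function names here are illustrative assumptions, not any fixed API:

```python
# A minimal sketch of the agent-environment loop from the room example.
# Assumptions: a 1-D room, the door at position 0, and a hand-written
# reward of +1 for stepping toward the door and -1 otherwise.

def step(position, action):
    """Apply an action ('toward' or 'away') and return (new_state, reward, done)."""
    new_position = position - 1 if action == "toward" else position + 1
    reward = 1 if new_position < position else -1
    done = new_position == 0  # the agent reached the door
    return new_position, reward, done

state = 5            # the agent starts 5 steps away from the door
total_reward = 0
while True:
    action = "toward"                      # a trivial policy: always walk to the door
    state, reward, done = step(state, action)
    total_reward += reward
    if done:
        break

print(total_reward)  # 5 steps, each rewarded +1, so 5
```

A real RL agent would not be handed this policy; it would discover it by trying actions and observing the rewards.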

In recent years, there have been many interesting advances in this area. DeepMind's AlphaZero defeated champions of chess, shogi, and Go. Although these games are hard for humans, who can study for years and never master them, they are comparatively tractable for a computer: in games like chess and shogi, the number of states in which the pieces can be placed is finite. The real power of Reinforcement Learning shows in OpenAI Five, a team of five neural-network-based RL agents that defeated some of the world's top players in DOTA 2. DOTA 2 is a very different game from chess and Go: it has an effectively unbounded number of states, and gameplay can be strategized in countless ways. Although Five achieved this feat under some restrictions, it remains a remarkable one.

Neural Network-Based Reinforcement Learning Models

Traditionally, RL algorithms were based on dynamic programming and memoization, but neural networks have since played an important role in the advancement of Reinforcement Learning. Neural networks learn to map state-action pairs to rewards: they use coefficients to approximate the function relating inputs to outputs, and learning consists of finding the right coefficients, or weights, by iteratively adjusting them along gradients that promise less error. A combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) is often used to make sense of the agent's environment and state.
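The "adjust weights along the gradient that reduces error" idea can be shown with a toy linear approximator (a stand-in for a full neural network; the features and target are made up for illustration):

```python
# Toy illustration of function approximation in RL: weights are nudged
# along the gradient that reduces the error between the predicted and
# observed reward for a state-action pair.

def predict(weights, features):
    """Linear approximation of the value of a state-action pair."""
    return sum(w * f for w, f in zip(weights, features))

def update(weights, features, target, lr=0.1):
    """One gradient step on the squared error (prediction - target)^2."""
    error = predict(weights, features) - target
    return [w - lr * error * f for w, f in zip(weights, features)]

weights = [0.0, 0.0]
features = [1.0, 0.5]   # features describing one state-action pair
target = 1.0            # the reward actually observed

for _ in range(100):
    weights = update(weights, features, target)

print(round(predict(weights, features), 3))  # converges to 1.0
```

A deep network does the same thing with many layers of weights, which lets it approximate far more complicated state-to-value functions.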

Agent to play Mario

When used for a supervised problem like image classification, a CNN tries to classify a given image into a category. In Reinforcement Learning, however, an image represents a state of the agent, and the CNN's job is to rank the actions the agent can possibly perform in that state (as in the image above).

Algorithms for Deep Reinforcement Learning

ATARI games

Deep Q Network (DQN)

In 2015, DQN was able to defeat humans in Atari games, but it didn't do well when the complexity of the games increased.

Seaquest game

Taking the example of the game 'Seaquest': from raw images alone, DQN learns to read scores, shoot enemies, rescue divers, and even come up with a strategy, all by itself.

DQN is based on the concept of Q-learning, which estimates an action-value function Q(s, a) that tells how good it is to take a given action in a particular state. Classical Q-learning is a dynamic programming approach: it stores the Q values for all possible state-action pairs in a table. This approach has a shortcoming: when the state and action spaces are huge, the table becomes enormous, making the memory and computation requirements too high. DQN was introduced to solve this problem. It trains a network that stores the experience in the weights of its connections, eliminating the large storage requirement.
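The tabular Q-learning that DQN replaces can be sketched as follows. The toy environment (three states in a row, the goal at the right end) is an assumption for illustration; the update line is the standard Q-learning rule:

```python
# Tabular Q-learning: every (state, action) pair gets its own table entry,
# which is exactly what makes the memory cost explode for large state spaces.
import random

ALPHA, GAMMA = 0.5, 0.9
ACTIONS = ["left", "right"]
Q = {(s, a): 0.0 for s in range(3) for a in ACTIONS}

def step(state, action):
    nxt = min(state + 1, 2) if action == "right" else max(state - 1, 0)
    reward = 1.0 if nxt == 2 else 0.0   # reward only for reaching the goal
    return nxt, reward

for _ in range(500):                    # 500 training episodes
    state = 0
    while state != 2:
        action = random.choice(ACTIONS)  # explore uniformly
        nxt, reward = step(state, action)
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt

print(max(ACTIONS, key=lambda a: Q[(0, a)]))  # "right": head toward the goal
```

With three states the table has only six entries; for an Atari screen the number of possible states is astronomical, which is why DQN swaps the table for a network.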

DQN architecture

In the DQN architecture, the 4 previous video frames are fed to convolutional layers followed by fully connected layers, which compute a Q value for each action.
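Alongside this network, DQN trains from stored experience, as noted above. A common way to organise that storage is an experience-replay buffer; the sketch below shows the idea, with all names and sizes chosen for illustration rather than taken from any particular implementation:

```python
# Experience replay: transitions are stored and later sampled in random
# minibatches, which decorrelates consecutive frames during training.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buffer = ReplayBuffer()
for t in range(100):                          # stand-in for the interaction loop
    buffer.push(t, t % 4, 0.0, t + 1, False)

batch = buffer.sample(32)
print(len(batch))  # 32 transitions drawn at random for one training step
```

Each sampled batch is what the convolutional network above is trained on.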

Asynchronous Advantage Actor-Critic (A3C)

The A3C algorithm was developed by Google's DeepMind and has made DQN all but obsolete. It is faster, simpler, more robust, and achieves much better scores on standard RL tasks.

A3C high-level architecture

But how was the term 'Asynchronous Advantage Actor-Critic' derived?

Asynchronous: In A3C, there is a global network and multiple worker agents, each with its own set of network parameters. Each agent interacts with its own copy of the environment at the same time. Because of this, the experiences of the agents differ, and the final experience pool is more diverse.

Actor-Critic: The network in A3C estimates both a value function V(s) (how good a certain state is to be in) and a policy π(s) (a set of action probabilities). The agent uses the value function as a critic to update the policy (the actor), which works much better than other policy gradient methods.

Advantage: A3C also uses the 'advantage', the difference between the Q-value Q(s, a) and the value function V(s). It tells the agent not just how good its actions were, but how much better they turned out than expected.
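A small numeric illustration of the advantage term, A(s, a) = Q(s, a) - V(s), with made-up values:

```python
# The advantage compares each action's value against the critic's baseline
# estimate of the state, so "better than expected" becomes a positive number.

V_s = 2.0                                       # critic's estimate of the state
Q_sa = {"up": 3.0, "down": 1.5, "stay": 2.0}    # action values (illustrative)

advantage = {a: q - V_s for a, q in Q_sa.items()}
print(advantage)  # {'up': 1.0, 'down': -0.5, 'stay': 0.0}
# 'up' turned out better than expected; 'down' worse; 'stay' exactly as expected.
```

Scaling the policy gradient by this signed quantity, rather than by the raw return, is what reduces variance in the actor's updates.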

Applications of Reinforcement Learning

Although Reinforcement Learning became known for its gaming bots for chess, Go, and DOTA 2, today it is used in a variety of problems with huge economic significance. Some of its applications are listed below:

Thermal Soaring

Thermal soaring means using thermals, or upward air currents, to gain altitude, so that more energy can be saved over the flying range. This is a difficult skill for a machine to master, as it requires responding to subtle environmental cues. But in 2016, Gautam Reddy, Antonio Celani, Terrence J. Sejnowski, and Massimo Vergassola successfully identified soaring policies that were effective even in strong atmospheric turbulence.

Resource management in Networking

Deep Reinforcement Learning can be used to allocate and schedule computer resources automatically, minimizing the average job slowdown. Because this allocation depends heavily on understanding the workload and environment, it is difficult to design heuristic methods that work efficiently by hand.

Stock Market Prediction

Reinforcement Learning agents are now being used to trade in financial markets. Since RL is based on accumulating reward, we define the reward by hand: making a profit earns reward, and taking a loss costs it. The idea is to train agents to make trading decisions that maximize this reward.
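The hand-written profit-and-loss reward described above can be sketched like this. The prices and the trivial momentum policy are invented purely for illustration; a real agent would learn the policy rather than follow a rule:

```python
# Reward per step = position held * price change, i.e. realised profit or loss.

prices = [100.0, 101.0, 99.5, 102.0]   # toy price series
position = 0                           # shares held: 1 = long, 0 = flat
total_reward = 0.0

for t in range(1, len(prices)):
    reward = position * (prices[t] - prices[t - 1])  # profit (+) or loss (-)
    total_reward += reward
    # toy policy: go long after an up move, go flat after a down move
    position = 1 if prices[t] > prices[t - 1] else 0

print(round(total_reward, 2))  # -1.5: this naive policy loses money here
```

The RL agent's job is to find a policy whose cumulative reward on such a series is as high as possible.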

Traffic Light Control

Traffic congestion is often caused by unsynchronized traffic lights. This problem has been tackled with a Reinforcement Learning-based multi-agent system: researchers used DQN to learn the Q values of the {state, action} pairs. This RL-based method was found to be superior to existing methods and shows great promise for designing traffic systems.


Robotics

OpenAI and DeepMind have been developing RL models for autonomous robots. Robots are trained to walk, jump, and perform other physical acts by themselves, without any external help. They are even being trained with RL to overcome injuries and obstacles in their path. RL has also found its use in research on autonomous vehicles and drones.

Web System Configuration

A web system can have more than 100 configurable parameters, and tuning them normally requires a skilled operator and numerous trial-and-error tests. RL has been used to autonomously reconfigure the parameters of multi-tier web systems in VM-based dynamic environments.

Advertisements for Online Marketing

Market researchers and experts have been developing different analytics and digital tools to understand user behaviour closely. Here, RL is used to determine which ad campaigns will get the most attention online. It treats the problem as a K-armed bandit and uses a Markov decision process to decide which ad will get the most clicks. Each ad is posted online for a short period of time to gather an initial user response, on the basis of which the RL agent makes its decisions.
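The K-armed-bandit framing can be sketched with an epsilon-greedy rule: each "arm" is an ad, each pull is an impression, and the click rates below are made-up numbers for this example:

```python
# Epsilon-greedy bandit for ad selection: mostly show the ad with the best
# estimated click rate, but explore a random ad 10% of the time.
import random
random.seed(0)

true_ctr = [0.02, 0.05, 0.03]   # hidden click-through rates of 3 ads
clicks = [0, 0, 0]
shows = [0, 0, 0]
EPSILON = 0.1

def choose():
    if random.random() < EPSILON:         # explore: show a random ad
        return random.randrange(3)
    # exploit: show the ad with the best estimated click rate so far
    rates = [clicks[i] / shows[i] if shows[i] else 0.0 for i in range(3)]
    return max(range(3), key=lambda i: rates[i])

for _ in range(10000):                    # 10,000 impressions
    ad = choose()
    shows[ad] += 1
    if random.random() < true_ctr[ad]:    # simulate whether the user clicks
        clicks[ad] += 1

print(shows)  # over many impressions, the best ad tends to win most traffic
```

The balance between exploring new ads and exploiting the current best one is exactly the trade-off the text describes.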


Optimizing Chemical Reactions

RL is also used to optimize chemical reactions, using an LSTM-based policy function. The RL agent models the reaction as a Markov Decision Process characterized by {S, A, P, R}, where S is the set of experimental conditions, A is the set of all possible actions that can change the state of the reaction, P is the transition probability from the current experimental condition to the next, and R, a function of the state, is the reward.
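The {S, A, P, R} tuple can be written out concretely for a toy two-condition "reaction" (the states, actions, and probabilities here are placeholders, not real chemistry):

```python
# A Markov Decision Process spelled out as plain data structures.

S = ["low_temp", "high_temp"]                  # experimental conditions
A = ["heat", "cool"]                           # actions that change the state
P = {                                          # transition probabilities P(s' | s, a)
    ("low_temp", "heat"): {"high_temp": 0.9, "low_temp": 0.1},
    ("low_temp", "cool"): {"low_temp": 1.0},
    ("high_temp", "heat"): {"high_temp": 1.0},
    ("high_temp", "cool"): {"low_temp": 0.9, "high_temp": 0.1},
}
R = {"low_temp": 0.0, "high_temp": 1.0}        # reward as a function of the state

# sanity check: each transition distribution must sum to 1
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in P.values())
print("MDP with", len(S), "states and", len(A), "actions")
```

The agent's policy then maps each condition in S to an action in A so as to maximize the cumulative reward R.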


© 2020  by The AI Sorcery

