Reinforcement Learning
So far we have seen two major paradigms of machine learning: supervised and unsupervised learning. Supervised learning involves learning with labeled data, whereas unsupervised methods learn from unlabeled data and seek to find hidden structure in the data.
Reinforcement learning (RL) is neither supervised nor unsupervised; it addresses the learning problem by maximizing a reward signal. Unlike the other two paradigms, RL involves a continuous trade-off between exploration and exploitation. RL could also be called action-based learning.

Figure-1:
A short history
Reinforcement learning has its roots in two entirely independent areas. One stream is inspired by natural learning mechanisms, learning by trial and error, and originated in the psychology of animal learning and training. It was used by the Russian physiologist Ivan Petrovich Pavlov in the 1890s to train his dogs. Animals adjust their actions based on reward and punishment stimuli received from the environment. Reinforcement learning mechanisms also operate in the human brain, where the dopamine neurotransmitter in the basal ganglia acts as a reinforcement signal that favors learning at the level of the neuron. Reinforcement learning implies a cause-and-effect relationship between actions and reward or punishment. It implies goal-directed behavior, at least insofar as the agent has an understanding of reward versus lack of reward or punishment.

The other developmental stream originated due to problems in optimal control theory and their solutions using value functions and dynamic programming. This method for solving optimization problems involves an actor or agent that interacts with its environment and modifies its actions, or control policies, based on stimuli received in response to its actions. Algorithms and policies are constructed on the idea that effective control decisions must be remembered, by means of a reinforcement signal, such that they become more likely to be used a second time. Learning is based on real-time evaluative information from the environment. 
Reinforcement learning is closely connected from a theoretical point of view with both adaptive control and optimal control methods.

Actions, Policies and States
In certain learning situations a single output action of the learner is not important or sufficient. Instead, the output of the learning system is a sequence of actions.
These actions are taken based on action policies. A good policy results in improved rewards, and a bad policy results in increased penalties. In such cases a machine learning program should be able to assess the goodness of policies and learn from past sequences of good policies [Alpaydin]. Thus the system learns to generate a good policy sequence. Such learning methods are called reinforcement learning.

A reinforcement learning machine perceives the state of the environment as a vector of features and executes actions in every state. Different actions bring different rewards and punishments. An action results in changing the state of the environment.

The goal of a reinforcement learning algorithm is to learn a good policy or a sequence of policies. A policy is a function f (similar to the target function in supervised learning) that takes the feature vector of a state as input and outputs an optimal action to execute in that state. 

Supervised Learning and RL, a short comparison
There are many situations where we don't know the correct answers that supervised learning requires. For example, in a flight control system, the input feature vector would be the set of all sensor readings at a given time, and the answer would be how the flight control surfaces should move during the next millisecond. A supervised model, e.g., a neural network, cannot learn to fly the plane unless there is a set of known answers for all possible flying conditions.

Figure-2:
Learning all flying conditions

Reinforcement learning offers a different and more general learning approach. RL combines the fields of dynamic programming and supervised learning to yield powerful machine-learning systems. In RL, the learner's task is to achieve a given goal through trial-and-error interactions with its environment. This form of machine intelligence has the potential to solve problems that were previously unsolvable.

Reinforcement learning solves particular kinds of problems where decision making is sequential and the goal is long-term, such as game playing, robotics, resource management, and logistics.






Figure-3:

The general framework of RL
Reinforcement learning (RL) is a general framework where an agent learns to perform actions in an environment so as to maximize a reward. The action is optimal if it maximizes the expected average reward. The two main components are the environment, which represents the problem to be solved, and the agent, which represents the learning algorithm.

Figure-4:

The agent and environment continuously interact with each other. At each time step, the agent takes an action in the environment based on its policy π(at|st), where st is the current observation from the environment, and receives a reward rt+1 and the next observation st+1 from the environment. The goal is to improve the policy so as to maximize the sum of rewards (the return). While exploiting its past experience, the agent also explores in order to select better actions in the future (note that these need not always be the current best actions).
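This interaction loop can be written as a short sketch. The snippet below is only illustrative: the `env` and `agent` objects, with their `reset`, `step`, `act`, and `learn` methods, are hypothetical placeholders for whatever concrete environment and learning algorithm are being used.

```python
# A minimal sketch of the agent-environment interaction loop described above.
def run_episode(env, agent, max_steps=1000):
    """Run one episode and return the sum of rewards (the return)."""
    state = env.reset()                                  # initial observation s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                        # sample a_t from the policy pi(a_t | s_t)
        next_state, reward, done = env.step(action)      # environment returns r_{t+1}, s_{t+1}
        agent.learn(state, action, reward, next_state)   # improve the policy from experience
        total_reward += reward
        state = next_state
        if done:                                         # episode ends (goal reached or failure)
            break
    return total_reward
```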

In reinforcement learning, the learner is a decision-making agent that takes actions in an environment and receives reward (or penalty) for its actions in trying to solve a problem. After a set of trial-and error runs, it should learn the best policy, which is the sequence of actions that maximize the total reward.

Figure-5:
A child learns to generate a correct sequence of buttons using a socially assistive robot that monitors task performance and EEG signals

Reinforcement Learning Example

Chess game
Consider the game of chess. A chess piece, or chessman, is any of the six different types of movable objects used on a chessboard to play the game. The chess board has 8x8 squares, and each player has 16 pieces. The rules of chess (also known as the laws of chess) govern the play of the game. Each type of chess piece has its own method of movement, and a piece moves to a vacant square except when capturing an opponent's piece. The goal of the game is to checkmate (threaten with inescapable capture) the opponent's king. The game is won by making a good sequence of moves against the opponent.
Suppose a child is playing chess against a computer and there is no teacher. The child is the decision-making agent and only knows the basic rules of piece movement. The only feedback is whether you win or lose the game. At any time t, the state of the environment, denoted by st, is any one of a set of possible states of the game; in this example, the state is the configuration of piece positions on the board. The decision maker has a set of possible actions: the legal movements of pieces on the chess board. Once an action at is chosen and taken, the state changes. After taking several actions and receiving the reward rt+1, the agent would like to assess the individual actions it took in the past and find the moves that led it to win the reward, so that it can record and recall them later on.

(Note: The Shannon number, named after Claude Shannon, is a conservative lower bound (not an estimate) of the game-tree complexity (number of possible games) of chess: 10^120, based on an average of about 10^3 possibilities for a pair of moves consisting of a move for White followed by one for Black, and a typical game lasting about 40 such pairs of moves. As a comparison, the number of atoms in the observable universe, to which it is often compared, is roughly estimated to be 10^80.)


Main Components of Reinforcement Learning
In reinforcement learning, the learner is a decision-making agent that takes actions in an environment and receives reward (or penalty) for its actions in trying to solve a problem.
The other main component is the environment.


Figure-6:

The agent (actor + critic) takes an action that changes the state of the environment, and the environment returns a reward. A critic differs from a teacher in that it does not tell us what to do but only how well the actor has been doing in the past; it never informs us in advance. The feedback from the critic may be scarce, and when it comes it may come late.


Basic setting of RL
After a set of trial-and-error runs, the learner should learn the best policy, which is the sequence of actions that maximizes the total reward. This method is called "learning with a critic," as opposed to learning with a teacher, which we have in supervised learning.
After taking several actions and getting the reward, the individual actions taken so far are assessed to find the moves that led to the reward, so that they can be recorded and recalled later on. For example, a rat learns and remembers the sequence of turns that leads it to the cheese as a reward.

Environment and its state
An environment state is one of a set of possible states—for example, the state of the chess board or the position of a rat in a maze.

Figure-7:

A sequence in Chess


Figure-8:
Rat sequence in a maze

Sequence of action by the agent
The decision maker has a set of actions possible: legal movement of pieces on the chess board, or the rat moves inside the maze in many possible directions without hitting the walls, and so forth.

Reward for the agent
Once an action is chosen and taken, the state changes. A sequence of actions, state changes, and feedback in the form of rewards or punishments follows, eventually resulting in a solution to the learning task.


The reward defines the problem and is necessary for the agent to learn. The agent learns the best sequence of actions to solve a problem, where "best" is quantified as the sequence of actions that has the maximum cumulative reward.

This is the setting of reinforcement learning.



Figure-9:
Final reward for the Rat

More details of components of RL

States: The set of possible states of the environment is denoted by S = {s}. The state describes the current situation. For the cat in the following figure, the current state is the sitting position. For a robot that is learning to walk, the state is the position of its two legs. In a chess game, a state is the current configuration of all the pieces on the board.
Figure-10:
State change for Cat

Action: The set of possible actions is A = {a}, where a: s → s′. This means an action is a mapping from one state to another state. An action is what an agent can do in a particular state s belonging to the set of states S; for the cat, it can decide to walk. There are typically finitely many (or a fixed range of) actions an agent can take.
Given the state, i.e., the positions of its two legs, a robot can take steps within a certain distance; for example, a robot stride might only range from, say, 0.01 meter to 1 meter. For the chess program, the actions are the legal moves of the pieces on the board in the current state.

Reward Signal
The reward function R(·) produces the reward signal, a scalar value that describes the feedback from the environment for the agent's action. The reward signal defines whether the agent's actions are successful or not.

Figure-11:
Reward as a function

The sole objective of an RL agent is to maximize the reward signal. This forms the basis for altering the policy. The reward signal can also be a stochastic function of the state of the environment and the actions taken.


A reinforcement learner generates an internal value for intermediate states or actions in terms of how good they are at leading to the goal and to the real reward. Once such an internal reward mechanism is learned, the agent can take local actions to maximize it. The final solution to the task requires a sequence of actions that maximizes the reward.


Figure-12:
Reward after a sequence of actions

Learning from reward and the credit assignment problem
The reward feedback does not tell the agent directly which action to take. Rather, it indicates how valuable certain sequences of states and actions are. The agent has to discover the right sequence of actions to optimize the reward over time. Choosing the right action for an agent is traditionally the subject of control theory, and RL is thus often discussed in the context of optimal control.
Reward learning introduces several challenges. For example, in typical circumstances reward is only received after a long sequence of actions. The problem is then how to assign the credit for the reward to specific actions. This is the temporal credit assignment problem. In some distributed systems there is, in addition, a spatial credit assignment problem which is the problem of how to assign the appropriate credit when different parts of a system contributed to a specific outcome or which state and action combinations should be given credit for the outcome.


Figure-13:

Value Function
The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. This is different from reward because value defines what is good in the long run, while rewards determine the immediate, intrinsic desirability of environmental states. A state that fetches a low immediate reward can still have high value if it is followed by other states that yield high rewards. The version of reinforcement learning used as a machine learning method these days concerns itself with long-term rewards and not just the immediate reward. The long-term reward is learned as the agent interacts with the environment through many trials and errors. The value of a state can be seen as an estimate of the probability of achieving the goal. In the following figure the state is the ball location.



Figure-14:
The state value function is depicted as contours on golf ground

The idea of reinforcement learning is to use the reward feedback to build up a value function that reflects the expected future payoff of visiting certain states and taking certain actions. The value function is used to decide which action to take and which states to visit; this decision rule is called a policy.

Policy: π(S) → A

A policy defines the way a learning agent behaves at a given time. It is a mapping from the perceived state of the environment to the probabilities of selecting each possible action in that state. A policy specifies a possible action at a certain state, and an agent learns to select the optimal action for every state. This is the core characteristic of an RL agent, since it determines the agent's behaviour. In general, policies are stochastic, and hence each action is associated with a probability value. The optimal policy π*(s) → a* is the policy that maximizes the expected reward R(s).
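As a rough illustration of a stochastic policy and its greedy counterpart, the toy snippet below stores a probability for each action in each state. The state and action names are invented for the example and do not come from any particular environment.

```python
import numpy as np

# A toy stochastic policy: for each state it stores a probability for every action.
policy = {
    "start":   {"left": 0.2, "right": 0.8},
    "hallway": {"left": 0.5, "right": 0.5},
}

def sample_action(state):
    """Draw an action according to the probabilities pi(a | s)."""
    actions, probs = zip(*policy[state].items())
    return np.random.choice(actions, p=probs)

def greedy_action(state):
    """A deterministic (greedy) policy just picks the most probable action."""
    return max(policy[state], key=policy[state].get)
```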

Figure-15:
Good vs Bad policy

Actor and Critic
The actor takes the state as input and outputs the best action. It essentially controls how the agent behaves by learning the optimal policy (policy-based). The critic, on the other hand, evaluates the action by computing the value function (value-based). The two models participate in the learning process and both get better at their own role as time passes. The result is that the overall architecture learns to solve the problem more efficiently than either method separately.

Figure-16:

Model (Optional)
In some RL systems a model is used to mimic the behavior of the environment, or to allow inferences to be made about how the environment will behave. For a given environment the model might predict the next state and the next reward.

Models are used for planning, so that we can decide a course of action by considering possible future states before they actually occur. Methods for solving reinforcement learning problems that use models and planning are called model-based methods. Model-free methods are simpler; they are explicitly trial-and-error learning methods.

Figure-17:


Some aspects of reinforcement learning are closely related to search and planning issues in artificial intelligence. AI search algorithms generate a satisfactory trajectory through a graph of states. Planning operates in a similar manner, but typically within a construct with more complexity than a graph, in which states are represented by compositions of logical expressions instead of atomic symbols. These AI algorithms are less general than reinforcement-learning methods in that they require a predefined model of state transitions and, with a few exceptions, assume determinism. On the other hand, reinforcement learning, at least in the kind of discrete cases for which theory has been developed, assumes that the entire state space can be enumerated and stored in memory, an assumption to which conventional search algorithms are not tied.

Reinforcement learning and Markov Process
To formalize the ideas of reward feedback and value functions we start with simple processes where the transitions of the environment to new states depend only on the current state. A process with such a characteristic is called a Markov process.

Markov Models
Named after Andrey Markov, Markov models are a way of defining probability distributions over stochastic sequences or randomly changing systems. They can be used for stochastic dynamic systems whose state transitions occur from one state to another at every time step. As an example, if you made a Markov chain model of a baby's behaviour, you might include "playing", "eating", "sleeping", and "crying" as states, which together with other behaviours could form a 'state space': the set of all possible configurations of the system. Besides the state space, a Markov model describes the probability of hopping, or "transitioning", from one state to any other state, e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first.
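The baby example can be sketched as a small Markov chain. The transition probabilities below are made up purely for illustration; only the structure matters: a row-stochastic transition matrix, and a next state that depends only on the current one.

```python
import numpy as np

# A toy Markov chain over the baby's behaviour states mentioned above.
# Each row sums to 1 and gives the probability of moving from the row state
# to the column state.
states = ["playing", "eating", "sleeping", "crying"]
P = np.array([
    [0.6, 0.2, 0.1, 0.1],   # from "playing"
    [0.3, 0.3, 0.3, 0.1],   # from "eating"
    [0.1, 0.1, 0.7, 0.1],   # from "sleeping"
    [0.2, 0.3, 0.2, 0.3],   # from "crying"
])

def simulate(start="playing", steps=5, rng=np.random.default_rng(0)):
    """Simulate the chain: the next state depends only on the current one."""
    i = states.index(start)
    trajectory = [start]
    for _ in range(steps):
        i = rng.choice(len(states), p=P[i])
        trajectory.append(states[i])
    return trajectory

print(simulate())   # e.g. ['playing', 'playing', 'eating', ...]
```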

Markov Property and Markov Processes
The Markov philosophy is that

"the future of a stochastic process is independent of the past, given the exact state of the present."

This means that, given the present state, the conditional probability distribution of future states depends only on the present state and is independent of the sequence of previous states that preceded it. This is also called the first-order Markov assumption.


Figure-18:


A Markov decision Process (MDP)
A Markov decision process (MDP) helps us make decisions in a stochastic environment. An MDP is a sequential decision process in which decisions are made at stages of a process, i.e., at states of the environment as it evolves through time. MDPs are used to frame the problem of learning from interaction to achieve a goal. The agent and the environment interact continually: the agent selects and executes actions, and the environment responds to these actions by presenting new situations to the agent. The goal is to find a policy, a map that gives the optimal action in each state of the environment. The most important characteristic of an MDP is that the state transition and reward functions depend only on the current state and the applied action. Almost all RL problems can be formalized as MDPs.
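A tiny, hand-made MDP might be encoded as below. The states, actions, probabilities, and rewards are all invented for illustration; the point is the (state, action) → (probability, next state, reward) structure.

```python
# A minimal MDP to illustrate the structure described above.
# For each state and action we list (probability, next_state, reward) triples.
mdp = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],   # stochastic transition
    },
    "s1": {
        "stay": [(1.0, "s1", 0.0)],
        "go":   [(1.0, "goal", 10.0)],
    },
    "goal": {},   # terminal state: no actions available
}
```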

MDP and simple planning
An MDP is more powerful than simple planning. Simple planning just follows the plan once the best strategy has been found. An MDP allows optimal actions to be taken even if something goes wrong along the way.

Environment is observable
Formally, an MDP describes an environment for reinforcement learning in which the environment is fully observable. (This is not always true; there are situations where it is only partially observable.) In this context, many reinforcement learning algorithms utilize dynamic programming techniques.

Solving MDPs with Dynamic Programming

MDPs are tools for modeling sequential decision problems. To solve MDPs we need dynamic programming, more specifically the Bellman equation. Dynamic programming provides methods for solving optimal decision problems by working backward through time. It is an offline solution technique that cannot be implemented online in a forward-in-time fashion.

It is a method that divides a problem into simpler sub-problems that are easier to solve; in essence, it is a divide-and-conquer strategy.

Figure-19:

Dynamic programming was introduced by Richard Bellman in the 1950s to solve optimization problems. It is a broad class of efficient optimization algorithms that transform a complex problem into a sequence of interrelated simpler problems similar to the original problem. These methods can also be used for making inferences. The general approach requires solving many smaller sub-problems that recur many times, pre-computing the solutions to those sub-problems, storing them, and using them to compute the solutions to larger problems.

The Bellman equation is the starting point for developing a family of reinforcement learning algorithms for finding optimal policies by using causal experiences received stage-wise forward in time.

The learning model is composed of a reward function R for an action and a state transition function P that yields the new state. When the reward for the next state-action pair is known, the MDP can be solved through a dynamic programming method.
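As a sketch of how dynamic programming solves an MDP when R and P are known, the snippet below runs value iteration (one standard Bellman-equation-based DP method) on the toy `mdp` dictionary defined earlier. It is a minimal illustration, not a production solver.

```python
# A sketch of value iteration on the toy `mdp` defined above. gamma is the discount factor.
def value_iteration(mdp, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in mdp}                      # initialise all state values to zero
    while True:
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:                        # terminal state keeps value 0
                continue
            # Bellman backup: best action value given the current estimate V
            best = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:                            # stop when the values have converged
            return V

print(value_iteration(mdp))
```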


Dynamic Programming for mathematical optimization and computer programming
Dynamic programming is both a mathematical optimization method and a computer programming method; both follow the same divide-and-conquer mechanism. In mathematics it is most often used as an optimization tool.

In programming it is often implemented with recursion and is used on problems such as finding the shortest path in a graph or generating sequences. Computer programmers use the term memoization for improving the performance of divide-and-conquer algorithms by storing the results of sub-problems that have already been solved.
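A minimal memoization example: caching the results of recurring sub-problems turns an exponential-time recursion into a linear-time one. Fibonacci is used here purely as the standard toy case.

```python
from functools import lru_cache

# Memoization: cache the results of sub-problems so each one is solved only once.
@lru_cache(maxsize=None)
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(80))   # fast; the naive recursion would take an astronomical number of calls
```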

Q - learning
However, in most cases we cannot precisely specify R and P. If these functions are not known, Q-learning is an algorithm that can be used to solve MDPs with unknown reward and transition functions.

The main idea of Q-learning is to "explore" all possibilities of state-action pairs and estimate the long-term reward that will be received by applying an action in a state. A quality value Q(s,a) is assigned to action a taken at state s. Eventually, under some restrictions, Q-learning converges to the optimal actions.


Q – Function and Q - learning
Assume that the value of the Q-function at time step t is represented by Q(st,at), which is initially zero, i.e., Q(st,at) = 0. The scalar reward of the action is r(st,at) ≥ 0. We can then set Q(st,at) = r(st,at).


Q-learning is based on the notion of a Q-function. The Q-function (a.k.a. the state-action value function) of a policy π, Qπ(s,a), measures the expected return, or discounted sum of rewards, obtained from state s by taking action a first and following policy π thereafter. We define the optimal Q-function, denoted by Q*(s,a), as the maximum return that can be obtained starting from some state observation s, taking an action a, and following the optimal policy thereafter.

Constant reward value
If the rewards are deterministic, the reward is always constant for a particular choice of action. We can then choose different actions and store the estimated value Q(st,at) for all pairs of st and at. We choose the action a* with the maximum Q value, i.e.,

a* = argmaxa Q(st, a)
Winning a task requires a sequence of good actions. 

Stochastic reward
If the rewards are stochastic, we get a different reward value each time we choose the same action a in a given state s. The probability distribution of rewards is denoted by the conditional probability p(r|s,a).

In such a case we define the estimated value Q(st,at) of the action at time t; for simplicity of notation we write Qt(s,a). The estimated Q value is the average of all rewards, or the expected value of the rewards r, received for action a up to time t. This value is updated by the following equation:

Qt+1(st,at) = Qt(st,at) + α [ rt+1 + γ maxa Qt(st+1, a) − Qt(st,at) ]

where α is the learning rate and rt+1 is the reward received for action at taken from state st, resulting in state st+1 at time t+1. For all other pairs, Qt+1(s,a) = Qt(s,a) for (s,a) ≠ (st,at). The factor γ is the discount factor for the infinite-horizon discounted reward problem, in which, starting from any state, we try to maximize the expected value of the infinite-horizon discounted reward

E[ rt+1 + γ rt+2 + γ^2 rt+3 + … ] = E[ Σk=0..∞ γ^k rt+k+1 ]
This update rule (the delta rule) is equivalent to

New Estimate ← Old Estimate + α [ Target − Old Estimate ]
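Putting this update rule together with exploratory action selection gives the usual tabular Q-learning sketch below. The `env` object and its `reset`, `step`, and `actions` methods are hypothetical placeholders, as in the earlier interaction loop.

```python
import random
from collections import defaultdict

# A sketch of tabular Q-learning using the update rule written out above.
def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                        # Q[(state, action)], initially 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration over the available actions
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            if done:
                target = r                        # no future reward after a terminal state
            else:
                target = r + gamma * max(Q[(s2, x)] for x in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # new = old + alpha * (target - old)
            s = s2
    return Q
```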

The learning agent learns the best sequence of actions to solve a problem, where "best" is quantified as the sequence of actions that has the maximum cumulative reward. The accumulated reward is computed as the sum of the rewards received along the sequence,

R = r1 + r2 + … + rT = Σt=1..T rt

(with the discount factor γ applied in the infinite-horizon case, as above).
This is the setting of Q learning for RL.

From the update equation, it can be shown that the expected value of Qt+1(s,a) converges to the expected value of the reward under p(r|s,a) as t tends to infinity. The feedback, in the form of a reward, generally occurs only when the complete sequence is carried out.

Reinforcement learning is also known as "learning with a critic." The critic informs us only whether we are doing right or wrong. The feedback from the critic is scarce, and when it comes, it comes late; this leads to the credit assignment problem. A Markov decision process is used to model the agent that generates the sequence of actions.

 



Q value Function and Policies

The Q-function (a.k.a. the state-action value function) of a policy π, Qπ(s,a), measures how valuable a state s is under the policy π for the different actions a. It measures the expected return, or discounted sum of rewards, obtained from state s by taking action a first and following policy π thereafter. We define the optimal Q-function Q*(s,a) as the maximum return that can be obtained starting from observation s, taking action a, and following the optimal policy π* thereafter. The optimal Q-function obeys the following Bellman optimality equation:

Q*(s,a) = E[ r + γ maxa′ Q*(s′,a′) ]
where E[·] denotes the statistical expectation. This means that the maximum return from state s and action a is the sum of the immediate reward r and the return (discounted by γ) obtained by following the optimal policy thereafter until the end of the episode (i.e., the maximum reward from the next state s′). The expectation is computed over both the distribution of immediate rewards r and the possible next states s′.

The basic idea behind Q-learning is to use the Bellman optimality equation as an iterative update,

Qt+1(s,a) ← E[ r + γ maxa′ Qt(s′,a′) ]

It can be shown that this converges to the optimal Q-function, i.e., Qt → Q* as t → ∞.

 

The performance of Q-learning depends on visiting all state-action pairs in order to learn the correct Q-values. This can be easily achieved with a small number of states. In the real world, however, the number of states can be very large, particularly when there are multiple agents in the system. For example, in a maze game, a robot has at most 1,000 states (locations); this grows to 1,000,000 when it is in a game against another robot, where the state represents the joint location of two robots (1,000 x 1,000).
When the state space is large, it is not efficient to wait until we visit all state-action pairs. A faster way to learn is the Monte Carlo method. In statistics, the Monte Carlo method derives an average through repeated sampling. In reinforcement learning, the Monte Carlo method is used to derive Q-values after repeatedly seeing the same state-action pair. It sets the Q-value Q(s,a) as the average reward observed after many visits to the same state-action pair (s,a). This method removes the need for a learning rate or a discount rate; it depends only on large numbers of simulations. Due to its simplicity, this method has become very popular. It was used by AlphaGo, which played many games against itself to learn the best moves.
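A minimal sketch of this Monte Carlo averaging idea: keep a running sum and count of the returns observed for each state-action pair and use their ratio as the Q-value. The episode format assumed here (a list of visited state-action pairs plus a final return) is a simplification for illustration.

```python
from collections import defaultdict

# Monte Carlo estimate of Q(s, a): the average return observed after visits to (s, a).
returns_sum = defaultdict(float)
returns_count = defaultdict(int)
Q = defaultdict(float)

def update_from_episode(episode, final_return):
    """`episode` is a list of (state, action) pairs visited before the reward arrived."""
    for s, a in episode:
        returns_sum[(s, a)] += final_return
        returns_count[(s, a)] += 1
        Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
```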

Another way to reduce the number of states is by using a neural network, where the inputs are states and outputs are actions, or Q-values associated with each action. A deep neural network has the power to dramatically simplify the representation of states through hidden layers.

Deep Q-Learning
For most problems, it is impractical to represent the Q-function as a table containing values for each combination of s and a. Instead, we train a function approximator, such as a neural network with parameters θ, to estimate the Q-values, i.e. Q(s,a;θ) ≈ Q*(s,a).
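A minimal sketch of such a function approximator, written here with Keras (one possible framework choice, not a requirement): the network takes the state feature vector as input and outputs one Q-value per action. The layer sizes, `state_dim`, and `n_actions` are placeholders chosen for illustration.

```python
import numpy as np
import tensorflow as tf

# A small Q-network: input is the state feature vector, output is Q(s, a; theta)
# for every action a. state_dim and n_actions depend on the environment.
state_dim, n_actions = 4, 2
q_net = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(n_actions),          # linear outputs: one Q-value per action
])

state = np.random.randn(1, state_dim).astype("float32")   # a batch with a single state
q_values = q_net(state).numpy()                            # shape (1, n_actions)
action = int(q_values.argmax(axis=1)[0])                   # greedy action
```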

Figure-20:

The steps involved in reinforcement learning using a deep Q-learning network (DQN) are:

  1. All the past experiences are stored in memory (the replay buffer). 
  2. The next action is determined by the maximum output of the Q-network. 
  3. The loss function is the mean squared error between the predicted Q-value and the target Q-value Q*. This can be done by minimizing the following loss at each step t:

Lt(θt) = E(s,a,r,s′)∼ρ(·) [ ( yt − Q(s,a;θt) )^2 ]
This is basically a regression problem. However, the target (actual) values yt are not known in advance, since we are dealing with a reinforcement learning problem.

Going back to the Q-value update equation derived from the Bellman equation, the target is

yt = r + γ maxa′ Q(s′,a′; θt−1)

Here, yt is called the TD (temporal difference) target, and yt − Q(s,a;θt) is called the TD error. ρ(·) represents the behaviour distribution, the distribution over transitions (s, a, r, s′) collected from the environment.

Note that the parameters from the previous iteration θ(t−1) are fixed and not updated. In practice we use a snapshot of the network parameters from a few iterations ago instead of the last iteration. This copy is called the target network.


Q-learning is an off-policy algorithm: it learns about the greedy policy while using a different behaviour policy to act in the environment and collect data. This behaviour policy is usually an epsilon-greedy (ϵ-greedy) policy that selects the greedy action with probability 1−ϵ and a random action with probability ϵ, to ensure good coverage of the state-action space.
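An epsilon-greedy behaviour policy is only a few lines; a sketch is shown below, where `q_values` is assumed to be a list with one Q-value per action.

```python
import random

# Epsilon-greedy action selection: explore with probability epsilon, otherwise exploit.
def epsilon_greedy(q_values, epsilon=0.1):
    """`q_values` is a list of Q-values, one per action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit: greedy action
```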

Experience Replay
To avoid computing the full expectation in the DQN loss, we can minimize it using stochastic gradient descent. If the loss is computed using just the last transition (s, a, r, s′), this reduces to standard Q-learning. For Atari games, the whole game screen can be mapped by a convolutional neural network to determine Q-values.

The Atari DQN work introduced a technique called Experience Replay to make the network updates more stable. At each time step of data collection, the transitions are added to a circular buffer called the replay buffer. Then during training, instead of using just the latest transition to compute the loss and its gradient, we compute them using a mini-batch of transitions sampled from the replay buffer. This has two advantages: better data efficiency by reusing each transition in many updates, and better stability using uncorrelated transitions in a batch.
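A replay buffer of this kind can be sketched with a bounded deque and uniform random sampling, as below. This is a minimal illustration, not the exact implementation used in the Atari DQN work.

```python
import random
from collections import deque

# A minimal replay buffer: a circular buffer of transitions from which
# uncorrelated mini-batches are sampled for each training step.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)    # old transitions drop out automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```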


References:
  1. Richard S. Sutton and Andrew G. Barto, Reinforcement Learning An Introduction, Second edition, The MIT Press. 
  2. Ethem Alpaydin, Introduction to Machine Learning, 2nd Ed., The MIT Press. 
  3. Kevin P. Murphy, Machine Learning A Probabilistic Perspective, The MIT Press. 
  4. Mance E. Harmon - Wright Laboratory, Stephanie S. Harmon - Wright State University, Reinforcement Learning: A Tutorial. 
  5. Frank L. Lewis, Draguna Vrabie, Kyriakos G. Vamvoudakis, Reinforcement Learning and Feedback Control, IEEE CONTROL SYSTEMS MAGAZINE, December 2012. 
  6. Andrew, Markov Decision Processes: Making Decision in the Presence of Uncertainty, cs.cmu.edu. 
  7. Introduction to Reinforcement Learning, tensorflow.org. 
  8. Reinforcement Learning Explained, oreilly.com


Image Credits
Figure-1: prakhartechviz.blogspot.com
Figure-2: semanticscholar.org
Figure-3: slideshare.net
Figure-4: hk.istem.ai
Figure-5: researchgate.net
Figure-6: media.springernature.com
Figure-7: chessable.com
Figure-8: thumbs.dreamstime.com
Figure-9: conductscience.com
Figure-10: guru99.com
Figure-11: gzwq.github.io
Figure-12: leonardoaraujosantos.gitbooks.io
Figure-13: image.slidesharecdn.com
Figure-14: RL an Introduction Sutton and Barto
Figure-15: riptutorial.com
Figure-16: kinstacdn.com
Figure-17: oreilly.com
Figure-18: present5.com
Figure-19: image.slidesharecdn.com
Figure-20: s3-ap-south-1.amazonaws.com

