Reinforcement Learning
So far we have seen two major paradigms of machine learning: supervised and unsupervised learning. Supervised learning involves learning from labeled data, whereas unsupervised methods learn from unlabeled data and seek to find hidden structure in the data.
Reinforcement learning (RL) is neither supervised nor unsupervised; it addresses the learning problem by maximizing a reward signal. Unlike the other two paradigms, RL involves a continuous trade-off between exploration and exploitation. RL could also be called action-based learning.
Reinforcement learning has its roots in two largely independent areas. The first is inspired by natural learning mechanisms: learning by trial and error, which originated in the psychology of animal learning and training. It was used by the Russian physiologist Ivan Petrovich Pavlov in the 1890s to train his dogs. Animals adjust their actions based on reward and punishment stimuli received from the environment. Reinforcement learning mechanisms also operate in the human brain, where the neurotransmitter dopamine in the basal ganglia acts as a reinforcement signal that favors learning at the level of the neuron. Reinforcement learning implies a cause-and-effect relationship between actions and reward or punishment. It implies goal-directed behavior, at least insofar as the agent has an understanding of reward versus lack of reward or punishment.
The other developmental stream originated from problems in optimal control theory and their solutions using value functions and dynamic programming. This method for solving optimization problems involves an actor, or agent, that interacts with its environment and modifies its actions, or control policies, based on stimuli received in response to its actions. Algorithms and policies are built on the idea that effective control decisions must be remembered, by means of a reinforcement signal, so that they become more likely to be used again. Learning is based on real-time evaluative feedback from the environment.
Reinforcement learning is closely connected from a theoretical point of view with both adaptive control and optimal control methods.
Actions, Policies and
States
In certain learning situations, a single output action of the learner is not important or sufficient on its own. Instead, the output of the learning system is a sequence of actions. These actions are taken according to an action policy. A good policy results in higher rewards and a bad policy results in higher penalties. In such cases, a machine learning program should be able to assess the goodness of policies and learn from past sequences of good policies [Alpaydin]. Thus the system learns to generate a good sequence of actions. Such learning methods are called reinforcement learning.
A reinforcement learning machine perceives the state of its environment as a vector of features and executes actions in each state. Different actions bring different rewards and punishments, and an action changes the state of the environment.
The goal of a reinforcement learning algorithm is to learn a good policy. A policy is a function f (similar to the target function in supervised learning) that takes the feature vector of a state as input and outputs an optimal action to execute in that state.
Supervised Learning and RL, a short comparison
There are many situations where we don't know the correct answers that supervised learning requires. For example, in a flight control system, the input feature vector would be the set of all sensor readings at a given time, and the answer would be how the flight control surfaces should move during the next millisecond. Supervised models (e.g., a neural network) can't learn to fly the plane unless there is a set of known answers for all possible flying conditions.
Reinforcement learning offers a different and more general learning approach. RL combines the fields of dynamic programming and supervised learning to yield powerful machine-learning systems. In RL, the learner's task is to achieve a given goal through trial-and-error interactions with its environment. This form of machine intelligence has the potential to solve problems that were previously unsolvable.
Reinforcement learning suits problems where decision making is sequential and the goal is long-term, such as game playing, robotics, resource management, or logistics.
The General Framework of RL
Reinforcement learning (RL) is a general framework where an
agent learns to perform actions in an environment so as to maximize a reward. The
action is optimal if it maximizes the expected average reward. The two main components are the environment, which
represents the problem to be solved, and the agent, which represents the
learning algorithm.
The agent and environment continuously interact with each other. At each time step, the agent takes an action on the environment based on its policy π(a_t | s_t), where s_t is the current observation from the environment, and receives a reward r_{t+1} and the next observation s_{t+1} from the environment. The goal is to improve the policy so as to maximize the sum of rewards (the return). By exploiting its past experience, the agent explores in order to select better actions in the future (note that these need not always be the current best).
In reinforcement learning, the learner is a decision-making agent that takes actions in an environment and receives reward (or penalty) for its actions in trying to solve a problem. After a set of trial-and-error runs, it should learn the best policy, which is the sequence of actions that maximizes the total reward.
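This interaction loop can be sketched in a few lines of Python. The `env.reset()`/`env.step()` interface (in the style of common RL toolkits) and the random agent below are illustrative assumptions, not part of the original text.

```python
import random

class RandomAgent:
    """A placeholder agent that picks actions uniformly at random."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)   # pi(a_t | s_t), here uniform

    def learn(self, state, action, reward, next_state):
        pass                                 # a real agent would update its policy here

def run_episode(env, agent, max_steps=100):
    """One episode of the agent-environment interaction loop."""
    state = env.reset()                               # initial observation s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(state)                     # choose a_t from the policy
        next_state, reward, done = env.step(action)   # environment returns r_{t+1}, s_{t+1}
        agent.learn(state, action, reward, next_state)
        total_reward += reward                        # accumulate the return
        state = next_state
        if done:
            break
    return total_reward
```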
For example, a child can learn to generate a correct sequence of button presses with the help of a socially assistive robot that monitors task performance and EEG signals.
Reinforcement Learning Example
Chess game
Consider the game of chess. A chess piece, or chessman, is any of the six types of movable objects used on a chessboard to play the game. The chessboard has 8x8 squares, and each player has 16 pieces. The rules of chess (also known as the laws of chess) govern the play of the game. Each type of chess piece has its own method of movement; a piece moves to a vacant square except when capturing an opponent's piece. The goal of the game is to checkmate (threaten with inescapable capture) the opponent's king. The game is won by making a good sequence of moves against the opponent.
Suppose a child is playing chess against a computer and there is no teacher. The child is the decision-making agent and knows only the basic rules of piece movement. The only feedback is whether the game is won or lost. At any time t, the state of the environment, denoted by s_t, is one of a set of possible states of the game; in this example, it is the configuration of the pieces on the board. The decision maker has a set of possible actions: the legal moves of the pieces on the chessboard. Once an action a_t is chosen and taken, the state changes. After taking several actions and receiving the reward r_{t+1}, the agent would like to assess the individual actions it took in the past and find the moves that led to the reward, so that it can record and recall them later.
(Note: The Shannon number, named after Claude Shannon, is a conservative lower bound (not an estimate) on the game-tree complexity (number of possible games) of chess, equal to 10^120. It is based on an average of about 10^3 possibilities for a pair of moves, one move for White followed by one for Black, and a typical game lasting about 40 such pairs of moves. For comparison, the number of atoms in the observable universe is roughly estimated to be 10^80.)
Main Components of Reinforcement
Learning
In reinforcement learning, the learner is a decision-making agent that takes actions in an environment and receives reward (or penalty) for its actions in trying to solve a problem. The other main component is the environment.
The agent (actor + critic) takes an action that changes the state of the environment, and the environment returns a reward. A critic differs from a teacher in that it does not tell us what to do but only how well the actor has been doing in the past; it never informs us in advance. The feedback from the critic may be scarce, and when it comes, it may come late.
Basic setting of RL
After a set of trial-and-error runs, the learner should learn the best policy, which is the sequence of actions that maximizes the total reward. This method is called "learning with a critic," as opposed to the learning with a teacher that we have in supervised learning.
After taking several actions and receiving the reward, the individual actions taken in the past are assessed to find the moves that led to the reward, so that they can be recorded and recalled later. For example, a rat learns and remembers the sequence of moves that leads to the cheese reward.
Environment and its state
An environment state is one of a set of possible states, for example, the configuration of the chessboard or the position of a rat in a maze.
Sequence of actions by the agent
The decision maker has a set of possible actions: the legal moves of the pieces on the chessboard, the directions in which the rat can move inside the maze without hitting the walls, and so forth.
Reward for the agent
Once an action is chosen and taken, the state changes. A sequence of actions, state changes, and feedback in the form of reward or punishment follows, leading to a solution of the learning task. The reward defines the problem and is necessary if the agent is to learn. The agent learns the best sequence of actions to solve a problem, where "best" is quantified as the sequence of actions that has the maximum cumulative reward.
More details on the components of RL
States: The set of possible states of the environment is denoted by S = {s}. The state describes the current situation. For the cat in the following figure, the current state is the sitting position. For a robot that is learning to walk, the state is the position of its two legs. In a chess game, a state is the current positions of all the pieces on the board.
Action: The set of possible actions is A = {a}, where an action a: s → s′ is a mapping from one state to another. An action is what an agent can do in a particular state s belonging to the set of states S; the cat, for example, can decide to walk. There is typically a finite set (or a fixed range) of actions an agent can take. Given the state, i.e., the positions of its two legs, a robot can take steps within a certain distance; a robot's stride might be limited to, say, 0.01 meter to 1 meter. For the chess program, the actions are the legal moves of the pieces on the board in the current state.
Reward Signal
Reward function: R(·) produces the reward signal, a scalar value that describes the feedback from the environment for the agent's action. The reward signal defines whether the agent's actions are successful or not.
The sole objective of an RL agent is to maximize the reward signal; this forms the basis for altering the policy. Reward signals can also be stochastic functions of the state of the environment and the actions taken.
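For concreteness, a reward function for the rat-in-a-maze example used later in this text could be as simple as the following sketch; the maze layout and the reward values are invented for illustration.

```python
CHEESE = (3, 3)              # goal cell in a hypothetical 4x4 maze
WALLS = {(1, 1), (2, 1)}     # cells the rat should not bump into

def reward(state, action, next_state):
    """Scalar feedback R(s, a, s'): +1 for reaching the cheese,
    -1 for hitting a wall, 0 otherwise (values chosen for illustration)."""
    if next_state == CHEESE:
        return 1.0
    if next_state in WALLS:
        return -1.0
    return 0.0
```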
Reinforcement learning learns to generate an internal value for the intermediate states or actions in terms of how good they are in leading to the goal and getting to the real reward. Once such an internal reward mechanism is learned, the agent can just take the local actions to maximize it. The final solution to the task requires a sequence of actions that maximize the reward.
Learning
from reward and the credit assignment problem
The reward feedback does not tell the agent directly which action to take. Rather, it indicates how valuable certain sequences of states and actions are. The agent has to discover the right sequence of actions to optimize the reward over time. Choosing the right action for an agent is traditionally the subject of control theory, and RL is thus often discussed in the context of optimal control.
Reward learning introduces several challenges. For example, in typical circumstances the reward is only received after a long sequence of actions. The problem is then how to assign the credit for the reward to specific actions. This is the temporal credit assignment problem. In some distributed systems there is, in addition, a spatial credit assignment problem: how to assign the appropriate credit when different parts of a system contributed to a specific outcome, or which state and action combinations should be given credit for the outcome.
Value Function
The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. This differs from reward: value defines what is good in the long run, while rewards determine the immediate, intrinsic desirability of environmental states. A state that fetches a low immediate reward can still have high value if it is followed by other states that yield high rewards. Reinforcement learning as used today concerns itself with long-term rewards, not just the immediate reward. The long-term reward is learned as the agent interacts with its environment through many trials and errors. The value of a state can be seen as an estimate of the probability of reaching the goal; in the following figure, the state is the ball's location.
The idea of reinforcement learning is to use the reward feedback to build up a value function that reflects the expected future payoff of visiting certain states and taking certain actions. The value function is used to decide which action to take and which states to visit; this decision rule is called a policy.
Policy: Π(S) → A
A policy defines the way a learning agent behaves at a given time. It is a mapping from the perceived state of the environment to the probabilities of selecting each possible action in that state; in other words, a policy specifies which action to take in a given state. An agent learns to select the optimal action for every state. This is the core characteristic of an RL agent, since it determines the agent's behaviour. In general, policies are stochastic, and hence each action is associated with a probability value. The optimal policy, Π*(s) → a*, is the policy that maximizes the expected reward.
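A stochastic policy of this kind can be written as a per-state probability table. The states, actions, and probabilities below are invented for illustration (loosely following the cat example).

```python
import random

# Hypothetical stochastic policy pi(a|s): each state maps to a
# probability distribution over the available actions.
policy = {
    "sitting": {"walk": 0.6, "stay": 0.3, "meow": 0.1},
    "walking": {"walk": 0.5, "sit": 0.4, "meow": 0.1},
}

def select_action(state):
    """Sample an action according to pi(. | state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

# A deterministic optimal policy pi*(s) would instead return
# argmax_a Q*(s, a) for every state.
```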
Actor and Critic
The actor takes the state as input and outputs the best action. It essentially controls how the agent behaves by learning the optimal policy (policy-based). The critic, on the other hand, evaluates the action by computing the value function (value-based). The two models participate in the learning process and both improve at their own roles as time passes. The result is that the overall architecture learns to solve the problem more efficiently than either method separately.
Model (Optional)
In some RL systems a model is used to mimic the behavior of the environment, or to allow inferences to be made about how the environment will behave. For a given state and action, the model might predict the next state and the next reward.
Models are used for planning, so that we can decide on a course of action by considering possible future states before they actually occur. Methods for solving reinforcement learning that use models and planning are called model-based methods. Model-free methods are simpler and are explicitly trial-and-error learning methods.
Some aspects of
reinforcement learning are closely related to search and planning issues in
artificial intelligence. AI search algorithms generate a satisfactory
trajectory through a graph of states. Planning operates in a similar manner,
but typically within a construct with more complexity than a graph, in which
states are represented by compositions of logical expressions instead of atomic
symbols. These AI algorithms are less general than the reinforcement-learning
methods, in that they require a predefined model of state transitions, and with
a few exceptions assume determinism. On the other hand, reinforcement learning,
at least in the kind of discrete cases for which theory has been developed,
assumes that the entire state space can be enumerated and stored in memory--an
assumption to which conventional search algorithms are not tied.
Reinforcement
learning and Markov Process
To formalize the ideas of reward
feedback and value functions we start with simple processes where the
transitions of the environment to new states depend only on the current state.
A process with such a characteristic is called a Markov process.
Markov Models
Named after Andrey Markov, Markov models can be used to define probability distributions over stochastic sequences or randomly changing systems. They can be used to model stochastic dynamic systems whose state transitions occur from one state to another at every time step. As an example, if you made a Markov chain model of a baby's behaviour, you might include "playing," "eating," "sleeping," and "crying" as states, which together with other behaviours could form a "state space": the set of all possible configurations of the system. Besides the state space, a Markov model describes the probability of hopping, or "transitioning," from one state to any other state, e.g., the chance that a baby currently playing will fall asleep in the next five minutes without crying first.
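A minimal sketch of such a Markov chain for the baby example; the transition probabilities below are invented for illustration.

```python
import numpy as np

states = ["playing", "eating", "sleeping", "crying"]
# One-step transition matrix: P[i, j] = probability of moving from state i to state j.
P = np.array([
    [0.6, 0.1, 0.2, 0.1],   # from playing
    [0.3, 0.3, 0.3, 0.1],   # from eating
    [0.1, 0.1, 0.7, 0.1],   # from sleeping
    [0.2, 0.2, 0.2, 0.4],   # from crying
])

def simulate(start, steps, rng=np.random.default_rng(0)):
    """Generate a state sequence; the next state depends only on the current one."""
    i = states.index(start)
    trajectory = [start]
    for _ in range(steps):
        i = rng.choice(len(states), p=P[i])
        trajectory.append(states[i])
    return trajectory

print(simulate("playing", 5))   # e.g. a short trajectory of baby states
```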
Markov Property and Markov
Processes
The Markov philosophy is that "the future of a stochastic process is independent of the past, given the exact state of the present." This means that, given the present state, the conditional distribution of future states depends only on the present state and is independent of the sequence of states that preceded it. This is also called the first-order Markov assumption.
A Markov Decision Process (MDP)
A Markov Decision Process (MDP) helps us make decisions in a stochastic environment. An MDP is a sequential decision process in which decisions are made at stages of a process, or at states of the environment, as it evolves through time. MDPs are used to frame the problem of learning from interaction to achieve a goal. The agent and the environment interact continually: the agent selects and implements actions, and the environment responds to these actions by presenting new situations to the agent. The goal is to find a policy, which is a map that gives the optimal action in each state of the environment. The most important characteristic of an MDP is that the state transition and reward functions depend only on the current state and the applied action. Almost all RL problems can be formalized as MDPs.
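As a toy illustration, an MDP can be written down as a table of transition probabilities and rewards. The two states, two actions, and all numbers below are invented for illustration.

```python
# P[(s, a)] lists (probability, next_state, reward) triples: the distribution of
# (s', r) depends only on the current state s and the chosen action a (Markov property).
states  = ["s0", "s1"]
actions = ["stay", "move"]

P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "move"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    ("s1", "stay"): [(1.0, "s1", 0.5)],
    ("s1", "move"): [(0.9, "s0", 0.0), (0.1, "s1", 0.5)],
}
```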
MDP and simple planning
An MDP is more powerful than simple planning. Simple planning just follows the plan after the best strategy has been found, whereas an MDP allows taking optimal actions even if something goes wrong along the way.
Environment is observable
Formally, an MDP is used to describe an environment for reinforcement learning where the environment is fully observable. (This is not always true; there are situations where it is only partially observable.) The environment is typically stated in the form of a Markov decision process (MDP), and in this context many reinforcement learning algorithms utilize dynamic programming techniques.
Solving MDPs with Dynamic Programming
MDPs are tools for modeling sequential decision problems. To solve MDPs we need dynamic programming, more specifically the Bellman equation. Dynamic programming provides methods for solving optimal decision problems by working backward through time; it is an offline solution technique that cannot be implemented online in a forward-in-time fashion. It divides a problem into simpler sub-problems that are easier to solve; it is essentially a divide-and-conquer strategy.
Dynamic programming was introduced by Richard Bellman in the 1950s to solve optimization problems. It is a broad class of efficient optimization algorithms that transforms a complex problem into a sequence of interrelated simpler problems similar to the original. These methods can also be used for making inferences. The general approach is to solve many smaller sub-problems that recur many times, pre-computing the solutions to these sub-problems, storing them, and using them to compute the solutions to larger problems.
The Bellman equation is the starting point for developing a family of reinforcement learning algorithms that find optimal policies using causal experience received stage-wise, forward in time.
The learning model is composed of a reward function R for an action and a state transition function P that yields the new state. If we know the reward and the next state for each state-action pair, the MDP can be solved with a dynamic programming method.
Dynamic Programming for mathematical optimization and computer programming
Dynamic programming is both a mathematical optimization method and a computer programming method; both follow the same divide-and-conquer mechanism. In mathematics it is often used as an optimization tool. In programming it is often implemented with recursion and is used on problems such as finding the shortest path in a graph or generating sequences. Computer programmers use a technique called memoization, which improves the performance of divide-and-conquer algorithms by storing the results of sub-problems that have already been solved.
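A standard illustration of memoization (the Fibonacci example and the use of Python's built-in `functools.lru_cache` are ours, not from the text):

```python
from functools import lru_cache

# Without caching, this recursion recomputes the same sub-problems
# exponentially many times; storing their results makes it linear.
@lru_cache(maxsize=None)
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(50))   # answered instantly thanks to the stored sub-problem results
```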
Q-Learning
However, in most cases we cannot precisely predict R and P. If these functions are not known, Q-Learning is an algorithm that can be used to solve MDPs with unknown reward and transition functions.
The main idea of Q-Learning is to "explore" all possibilities of state-action pairs and estimate the long-term reward that will be received by applying an action in a state. A quality value Q(s,a) is assigned to the action a taken in the state s. Given some restrictions, Q-Learning eventually converges to the optimal actions.
Q-Function and Q-Learning
Assume that the value of the Q-function at time step t is represented by Q(s_t, a_t), which is initially zero, i.e., Q(s_t, a_t) = 0. The scalar reward of the action is r(s_t, a_t) ≥ 0. We can then set Q(s_t, a_t) = r(s_t, a_t).
Q-Learning is based on the notion of a Q-function. The Q-function (a.k.a the state-action value function) of a policy “π”, Qπ(s,a), measures the expected return or discounted sum of rewards obtained from state "s" by taking action "a" first and following policy π thereafter. We define the optimal Q-function denoted by Q*(s,a) as the maximum return that can be obtained starting from some state observation “s”, taking an action "a" and following the optimal policy thereafter.
Constant reward value
If the rewards are deterministic, the reward is always the same for a particular choice of action in a given state. We can then choose different actions and store the estimated value Q(s_t, a_t) for all pairs (s_t, a_t). We choose the action a* with the maximum Q value, i.e., a* = argmax_a Q(s_t, a). Winning a task requires a sequence of good actions.
Stochastic reward
If the rewards are stochastic, we get a different reward each time we choose the same action a in a given state s. The probability distribution of the rewards is denoted by the conditional probability p(r|s,a). In such a case we define the estimated value Q(s_t, a_t) of the action at time t; for simplicity of notation we write Q_t(s,a). The estimated Q value is the average of all rewards, i.e., the expected value of the rewards r received for action a until time t. This value is updated by the following equation:
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α [ r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t) ]
where α is the learning rate and r_{t+1} is the reward received for action a_t taken from s_t, resulting in state s_{t+1} at time t+1. For (s,a) ≠ (s_t, a_t), Q_{t+1}(s,a) = Q_t(s,a). The factor γ is the discount factor for the infinite-horizon discounted-reward problem, in which, starting from any state, we try to maximize the expected value of the infinite-horizon discounted reward.
This update rule (the delta rule) is equivalent to
New Estimate ← Old Estimate + α [Target − Old Estimate]
The learning agent learns the best sequence of actions to solve a problem, where "best" is quantified as the sequence of actions that has the maximum cumulative reward. The accumulated reward (the return) is computed as
G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … (i.e., the sum over k ≥ 0 of γ^k r_{t+k+1})
This is the setting of Q-learning for RL.
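A minimal sketch of this tabular Q-learning update (the delta rule above), assuming discrete, hashable states and actions; the learning rate and discount factor values are arbitrary.

```python
from collections import defaultdict

Q = defaultdict(float)   # Q(s, a), implicitly initialised to zero
alpha = 0.1              # learning rate
gamma = 0.9              # discount factor

def q_update(s, a, r, s_next, actions):
    """One Q-learning step: new estimate = old estimate + alpha * (target - old estimate)."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```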
As per the update equation, it can be shown that the expected value of Q_{t+1}(s,a) converges to the expected reward under p(r|s,a) as t tends to infinity. The feedback, in the form of a reward, generally occurs only when the complete sequence is carried out.
Reinforcement learning is also known as "learning with a critic." The critic informs us only whether we are doing right or wrong. The feedback from the critic is scarce and, when it comes, it often comes late; this leads to the credit assignment problem.
A Markov decision process is used to model the agent that generates the sequence of actions.
Q-Value Function and Policies
The Q-function (a.k.a. the state-action value function) of a policy π, Q^π(s,a), measures how valuable a state s is under the policy π for the different actions a. It measures the expected return, or discounted sum of rewards, obtained from state s by taking action a first and following policy π thereafter. We define the optimal Q-function Q*(s,a) as the maximum return that can be obtained starting from observation s, taking action a, and following the optimal policy π* thereafter. The optimal Q-function obeys the following Bellman optimality equation:
Q*(s, a) = E[ r + γ max_{a′} Q*(s′, a′) ]
where E[·] symbolises statistical expectation. This means that the maximum return from state s and action a is the sum of the immediate reward r and the return (discounted by γ) obtained by following the optimal policy thereafter until the end of the episode (i.e., the maximum reward from the next state s′). The expectation is computed over both the distribution of immediate rewards r and the possible next states s′.
The basic idea behind Q-Learning is to use the Bellman optimality equation as an iterative update,
Q_{t+1}(s, a) ← E[ r + γ max_{a′} Q_t(s′, a′) ]
and it can be shown that this converges to the optimal Q-function, i.e., Q_t → Q* as t → ∞.
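When the transition model is known, the iterative Bellman update above can be applied directly; this is Q-value iteration, whereas Q-learning proper replaces the expectation with sampled transitions. The sketch below reuses the hypothetical (probability, next_state, reward) model format from the earlier toy MDP and assumes P contains an entry for every state-action pair.

```python
def q_value_iteration(states, actions, P, gamma=0.9, n_iters=100):
    """Repeatedly apply the Bellman optimality update to a known model P."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iters):
        Q_new = {}
        for (s, a), outcomes in P.items():
            # Expectation over next states and rewards, then bootstrap with max over actions.
            Q_new[(s, a)] = sum(
                p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
                for p, s2, r in outcomes
            )
        Q = Q_new
    return Q
```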
The performance of Q-learning depends on visiting all state-action pairs in order to learn the correct Q-values. This is easily achieved with a small number of states. In the real world, however, the number of states can be very large, particularly when there are multiple agents in the system. For example, in a maze game a robot has at most 1,000 states (locations); this grows to 1,000,000 when it plays against another robot, where the state represents the joint location of the two robots (1,000 x 1,000).
When the state space is large, it is not efficient to wait until we visit all state-action pairs. A faster way to learn is the Monte Carlo method. In statistics, the Monte Carlo method derives an average through repeated sampling. In reinforcement learning, the Monte Carlo method is used to derive Q-values after repeatedly seeing the same state-action pair: it sets the Q-value Q(s,a) to the average reward observed after many visits to the same state-action pair (s,a). This method removes the need for a learning rate or a discount rate and depends only on large numbers of simulations. Due to its simplicity, this method has become very popular. It was used by AlphaGo, which played many games against itself to learn about the best moves.
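A minimal first-visit Monte Carlo sketch of this averaging idea; the episode format (a list of (state, action, reward) triples from one complete game) is an assumption for illustration.

```python
from collections import defaultdict

returns_sum = defaultdict(float)
returns_count = defaultdict(int)
Q = defaultdict(float)

def mc_update(episode):
    """First-visit Monte Carlo: Q(s, a) is the average return observed after (s, a)."""
    seen = set()
    for t, (s, a, _) in enumerate(episode):
        if (s, a) in seen:
            continue                                  # count only the first visit
        seen.add((s, a))
        G = sum(r for _, _, r in episode[t:])         # (undiscounted) return from step t
        returns_sum[(s, a)] += G
        returns_count[(s, a)] += 1
        Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
```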
Another way to deal with a large number of states is to use a neural network, where the inputs are states and the outputs are actions, or the Q-values associated with each action. A deep neural network has the power to dramatically simplify the representation of states through its hidden layers.
Deep Q-Learning
For most
problems, it is impractical to represent the Q-function as a table containing values for each
combination of s and a. Instead,
we train a function approximator, such as a neural network with
parameters θ, to estimate the Q-values, i.e. Q(s,a;θ) ≈ Q*(s,a).
The steps involved in reinforcement learning with a deep Q-learning network (DQN) are:
- All the past experiences are stored by the agent in memory.
- The next action is determined by the maximum output of the Q-network.
- The loss function is the mean squared error between the predicted Q-value and the target Q-value Q*. This is done by minimizing the following loss at each step t:
L_t(θ_t) = E_{s,a,r,s′ ~ ρ(·)} [ (y_t − Q(s, a; θ_t))^2 ]
This is basically a regression problem. The target values y_t are not known in advance, since we are dealing with a reinforcement learning problem. Going back to the Q-value update derived from the Bellman equation, we use
y_t = r + γ max_{a′} Q(s′, a′; θ_{t−1})
Here, y_t is called the TD (temporal difference) target, and y_t − Q(s, a; θ_t) is called the TD error. ρ(·) represents the behaviour distribution, the distribution over transitions (s, a, r, s′) collected from the environment.
Note that the parameters from the previous iteration, θ_{t−1}, are fixed and not updated. In practice we use a snapshot of the network parameters from a few iterations ago instead of the last iteration. This copy is called the target network.
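As a concrete illustration, a mini-batch version of this loss might be computed as in the sketch below (written in PyTorch, which the text does not specify); the tensor shapes, the `done` flag handling, and the network interfaces are assumptions for illustration.

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Mean-squared TD error on a batch of transitions (s, a, r, s', done)."""
    s, a, r, s_next, done = batch                    # a: int64 action indices, done: 0/1 floats
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)      # Q(s, a; theta_t)
    with torch.no_grad():                            # target-network parameters stay fixed
        max_next = target_net(s_next).max(dim=1).values
        y = r + gamma * (1.0 - done) * max_next      # TD target y_t
    return nn.functional.mse_loss(q_sa, y)           # average of (y_t - Q(s, a; theta_t))^2
```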
Q-Learning is an off-policy algorithm: it learns about the greedy policy while using a different behaviour policy for acting in the environment and collecting data. This behaviour policy is usually an epsilon (ϵ)-greedy policy that selects the greedy action with probability 1−ϵ and a random action with probability ϵ, to ensure good coverage of the state-action space.
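A sketch of such an ϵ-greedy behaviour policy, assuming the tabular Q dictionary from the earlier Q-learning sketch:

```python
import random

def epsilon_greedy(Q, state, actions, eps=0.1):
    """Random action with probability eps (explore), greedy action otherwise (exploit)."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```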
Experience Replay
To avoid computing the full expectation in the DQN loss, we can minimize it using stochastic gradient descent. If the loss is computed using just the last transition (s, a, r, s′), this reduces to standard Q-Learning. For Atari games, the whole game screen can be mapped by a convolutional neural network to Q-values.
The Atari DQN work introduced a technique called Experience Replay to make the network updates more stable. At each time step of data collection, the transitions are added to a circular buffer called the replay buffer. Then during training, instead of using just the latest transition to compute the loss and its gradient, we compute them using a mini-batch of transitions sampled from the replay buffer. This has two advantages: better data efficiency by reusing each transition in many updates, and better stability using uncorrelated transitions in a batch.
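A replay buffer of this kind can be sketched in a few lines; the capacity and batch size below are arbitrary illustrative values.

```python
import random

class ReplayBuffer:
    """Circular buffer of transitions (s, a, r, s', done) for experience replay."""
    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.buffer = []
        self.pos = 0

    def add(self, transition):
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.pos] = transition   # overwrite the oldest transition
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size=32):
        """Uniformly sample a mini-batch of (mostly) uncorrelated transitions."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```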
References:
- Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, Second Edition, The MIT Press.
- Ethem Alpaydin, Introduction to Machine Learning, 2nd Edition, The MIT Press.
- Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press.
- Mance E. Harmon (Wright Laboratory) and Stephanie S. Harmon (Wright State University), Reinforcement Learning: A Tutorial.
- Frank L. Lewis, Draguna Vrabie, and Kyriakos G. Vamvoudakis, Reinforcement Learning and Feedback Control, IEEE Control Systems Magazine, December 2012.
- Andrew, Markov Decision Processes: Making Decisions in the Presence of Uncertainty, cs.cmu.edu.
- Introduction to Reinforcement Learning, tensorflow.org.
- Reinforcement Learning Explained, oreilly.com.
Image Credits
Figure-1: prakhartechviz.blogspot.com
Figure-2: semanticscholar.org
Figure-3: slideshare.net
Figure-4: hk.istem.ai
Figure-5: researchgate.net
Figure-6: media.springernature.com
Figure-7: chessable.com
Figure-8: thumbs.dreamstime.com
Figure-9: conductscience.com
Figure-10: guru99.com
Figure-11: gzwq.github.io
Figure-12: leonardoaraujosantos.gitbooks.io
Figure-13: image.slidesharecdn.com
Figure-14: RL: An Introduction, Sutton and Barto
Figure-15: riptutorial.com
Figure-16: kinstacdn.com
Figure-17: oreilly.com
Figure-18: present5.com
Figure-19: image.slidesharecdn.com
Figure-20: s3-ap-south-1.amazonaws.com