| tags | python,numpy,neural-network,reinforcement-learning |
|---|---|
| mathjax | true |
The Bellman Equation defines the value $$V^\pi(s)$$ of a state $$s$$ under a policy $$\pi$$ as the sum of

- the immediate reward $$r(s, a)$$ received when executing action $$a$$ in state $$s$$ following policy $$\pi$$
- the value $$V^\pi(s')$$ of the state $$s'$$ reached by executing the policy step
This gives the following bootstrap state value evaluation rule

$$V^\pi(s) = r(s, a) + V^\pi(s') \quad \text{with} \quad a = \pi(s)$$

for deterministic environments.
For stochastic environments, in which the same action $$a$$ executed in the same state $$s$$ may lead to different next states $$s'$$, the expectation over the state transition probabilities $$p(s' \mid s, a)$$ has to be taken:

$$V^\pi(s) = r(s, a) + \sum_{s'} p(s' \mid s, a) \, V^\pi(s')$$

The next state value $$V^\pi(s')$$ gets weighted by the probability of actually reaching state $$s'$$.
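As an illustration (an addition, not part of the original post), here is a minimal NumPy sketch of this evaluation rule, assuming a hypothetical three-state environment with the policy already baked into the transition matrix `P` and reward vector `R`:

```python
import numpy as np

# Hypothetical 3-state example with the policy already applied:
# P[s, s'] = p(s' | s, a) and R[s] = r(s, a) for a = pi(s).
P = np.array([[0.8, 0.2, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing with zero reward
R = np.array([1.0, 0.0, 0.0])

# Fixed-point iteration of V(s) = r(s, a) + sum_s' p(s'|s,a) * V(s')
V = np.zeros(3)
for _ in range(200):
    V = R + P @ V

print(V)  # converges towards [5.0, 0.0, 0.0]
```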
Because the value of each state depends on the value of its successor state, unrolling this recursion shows that the state value is the sum of all future rewards collected while executing the given policy.
To prevent infinite state values, the sum of future rewards usually gets discounted by a factor $$\gamma \in [0, 1)$$:

$$V^\pi(s_t) = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}$$

so the bootstrap evaluation rule becomes

$$V^\pi(s) = r(s, a) + \gamma \, V^\pi(s')$$
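As a quick numerical illustration (an addition, not from the original post), the discounted return of a short, hypothetical reward sequence:

```python
import numpy as np

gamma = 0.9
rewards = np.array([1.0, 0.0, 2.0, 1.0])  # hypothetical rewards r_{t+1}, r_{t+2}, ...

# Discounted return: sum_k gamma^k * r_{t+k+1}
G = np.sum(gamma ** np.arange(len(rewards)) * rewards)
print(G)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```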
The Q state-action value (state-action pair quality) related to state $$s$$ and action $$a$$ is defined analogously as

$$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')$$

which can be transformed into an iterative temporal difference update rule with the TD error $$r(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a)$$ and the learning rate $$\alpha$$:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
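A minimal tabular sketch of this update rule (an illustrative addition; the table sizes and the environment interaction are assumed, not taken from the original post):

```python
import numpy as np

n_states, n_actions = 5, 2           # hypothetical problem size
Q = np.zeros((n_states, n_actions))  # tabular Q function
alpha, gamma = 0.1, 0.9              # learning rate and discount factor

def td_update(s, a, r, s_next):
    """Apply one temporal difference step of the Q update rule above."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error

# Example step: in state 0, action 1 yielded reward 1.0 and led to state 3.
td_update(0, 1, 1.0, 3)
```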
The state transition probabilities are a property of the environment and do not have to be modelled explicitly by a TD learning algorithm. They are modelled implicitly (they emerge) through the averaging effect of many TD steps on the same state-action pairs, which possibly lead to different next states.
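This averaging effect can be made visible with a small simulation (an illustrative addition with hypothetical numbers): repeatedly applying the TD step to one state-action pair whose next state is sampled by the environment drives the estimate toward the true expectation, even though the transition probabilities never appear in the update itself.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, gamma = 0.05, 0.9

# Hypothetical stochastic transition: the same action leads to a next state
# of value 0.0 with probability 0.7 or value 10.0 with probability 0.3.
V_next = np.array([0.0, 10.0])
p = np.array([0.7, 0.3])

q = 0.0
for _ in range(5000):
    s_next = rng.choice(2, p=p)                       # sampled by the environment
    q += alpha * (1.0 + gamma * V_next[s_next] - q)   # TD step, p never referenced

print(q)  # fluctuates around 1.0 + 0.9 * (0.7*0.0 + 0.3*10.0) = 3.7
```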
Several temporal difference learning algorithms are based on the Bellman Equation in this way; please see the Gridworld Examples for details.