Deriving the Bellman equation for value and Q functions
Now let us see how to derive Bellman equations for value and Q functions.
You can skip this section if you are not interested in mathematics; however, the math will be super intriguing.
First, we define $\mathcal{P}_{ss'}^a$ as the transition probability of moving from state $s$ to $s'$ while performing an action $a$:
$$\mathcal{P}_{ss'}^a = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$$
We define $\mathcal{R}_{ss'}^a$ as the expected reward received by moving from state $s$ to $s'$ while performing an action $a$:
$$\mathcal{R}_{ss'}^a = \mathbb{E}\left[r_{t+1} \mid s_t = s, s_{t+1} = s', a_t = a\right]$$
We also write the discounted expected return from the next state $s'$ as:
$$\gamma\, \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_{t+1} = s'\right] \tag{5}$$
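As a concrete picture of these two quantities, here is a minimal sketch that stores the transition probabilities and expected rewards of a small MDP as NumPy arrays. The two-state, two-action MDP and all of its numbers are hypothetical, chosen only for illustration; the array names `P` and `R` and the index order `[s, a, s_next]` are assumptions, not notation from the text.

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions (illustrative numbers only).
# P[s, a, s_next] is the probability of reaching s_next after taking action a in state s.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # from state 0: action 0, action 1
    [[0.5, 0.5], [0.0, 1.0]],   # from state 1: action 0, action 1
])

# R[s, a, s_next] is the expected reward for the transition (s, a, s_next).
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 0.0], [0.0, -1.0]],
])

# For every (s, a) pair, the probabilities over next states must sum to 1.
row_sums = P.sum(axis=2)
print(row_sums)
```

Checking the row sums is a cheap sanity test that each state-action pair really defines a probability distribution over next states.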
We know that the value function can be represented as:
$$V^\pi(s) = \mathbb{E}_\pi\left[R_t \mid s_t = s\right]$$
$$V^\pi(s) = \mathbb{E}_\pi\left[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s\right]$$
We can rewrite our value function by taking the first reward out:
$$V^\pi(s) = \mathbb{E}_\pi\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_t = s\right] \tag{6}$$
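The step of taking the first reward out can be checked numerically. The sketch below assumes a short, finite list of rewards (a hypothetical sequence standing in for the infinite sum) and a discount factor of 0.9, and compares the full discounted return with the first reward plus the discounted return of the remaining rewards.

```python
gamma = 0.9  # assumed discount factor

# Hypothetical rewards r_{t+1}, r_{t+2}, ... (finite here for illustration).
rewards = [1.0, 0.0, 2.0, -1.0, 3.0]

# Full discounted return: sum over k of gamma^k * r_{t+k+1}.
full_return = sum(gamma**k * r for k, r in enumerate(rewards))

# Same return with the first reward taken out:
# r_{t+1} + gamma * (sum over k of gamma^k * r_{t+k+2}).
first_reward, remaining = rewards[0], rewards[1:]
remaining_return = sum(gamma**k * r for k, r in enumerate(remaining))
split_return = first_reward + gamma * remaining_return

print(full_return, split_return)
```

Both numbers agree up to floating-point rounding, which is all the rewriting step claims.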
The expectation in the value function specifies the expected return when we are in state $s$ and choose actions according to policy $\pi$.
So, we can write the expectation of the first reward explicitly by summing over all possible actions and next states as follows:
$$\mathbb{E}_\pi\left[r_{t+1} \mid s_t = s\right] = \sum_a \pi(s, a) \sum_{s'} \mathcal{P}_{ss'}^a \mathcal{R}_{ss'}^a$$
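The double sum over actions and next states can be made concrete with a small computation. This is a sketch on a hypothetical two-state, two-action MDP; the arrays `P`, `R`, and `pi` and all of their values are assumptions made for illustration.

```python
import numpy as np

# Hypothetical MDP (illustrative numbers only), indexed as [s, a, s_next].
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 0.0], [0.0, -1.0]],
])

# Hypothetical stochastic policy: pi[s, a] is the probability of action a in state s.
pi = np.array([
    [0.5, 0.5],
    [0.9, 0.1],
])

n_states, n_actions, _ = P.shape

# Explicit triple loop: sum over a of pi(s, a) * sum over s_next of P * R.
expected_reward = np.zeros(n_states)
for s in range(n_states):
    for a in range(n_actions):
        for s_next in range(n_states):
            expected_reward[s] += pi[s, a] * P[s, a, s_next] * R[s, a, s_next]

# The same sum written as a single einsum contraction over a and s_next (index t).
expected_reward_vec = np.einsum("sa,sat,sat->s", pi, P, R)

print(expected_reward)
```

The loop mirrors the equation term by term; the `einsum` line is the vectorized form of the same sum and is what the later sketches would use.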
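The same expansion applies to the remaining discounted rewards. In the RHS, we replace $\mathcal{R}_{ss'}^a$ with the discounted expected return from the next state, given by equation (5), as follows:
$$\sum_a \pi(s, a) \sum_{s'} \mathcal{P}_{ss'}^a \, \gamma\, \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_{t+1} = s'\right]$$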
Similarly, in the LHS, we replace $r_{t+1}$ with the remaining discounted rewards, using equation (2), as follows:
$$\mathbb{E}_\pi\left[\gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_t = s\right]$$
So, our final expectation equation becomes:
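$$\mathbb{E}_\pi\left[\gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_t = s\right] = \sum_a \pi(s, a) \sum_{s'} \mathcal{P}_{ss'}^a \, \gamma\, \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_{t+1} = s'\right] \tag{7}$$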
Now we will substitute our expectation (7) in value function (6) as follows:
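$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} \mathcal{P}_{ss'}^a \left[\mathcal{R}_{ss'}^a + \gamma\, \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_{t+1} = s'\right]\right]$$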
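By equation (6) applied at state $s'$, the inner expectation $\mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_{t+1} = s'\right]$ is exactly $V^\pi(s')$, so we can substitute it, and our final value function looks like the following: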
$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} \mathcal{P}_{ss'}^a \left[\mathcal{R}_{ss'}^a + \gamma V^\pi(s')\right]$$
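One way to see the Bellman equation for the value function at work is to use it as an update rule and apply it repeatedly until the values stop changing (iterative policy evaluation). The sketch below is a minimal version under stated assumptions: the MDP, the policy, the discount factor of 0.9, and the helper name `evaluate_policy` are all hypothetical. The result is cross-checked against a direct linear solve of the same equation.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Apply the Bellman equation as an update until V changes by less than tol."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iters):
        # R[s, a, s_next] + gamma * V[s_next], broadcast over all (s, a, s_next).
        target = R + gamma * V[None, None, :]
        # Sum over actions (weighted by pi) and next states (weighted by P).
        V_new = np.einsum("sa,sat,sat->s", pi, P, target)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V

# Hypothetical MDP and policy (illustrative numbers only), indexed as [s, a, s_next].
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 0.0], [0.0, -1.0]],
])
pi = np.array([
    [0.5, 0.5],
    [0.9, 0.1],
])
gamma = 0.9

V = evaluate_policy(P, R, pi, gamma=gamma)

# Cross-check: the Bellman equation is linear in V, so it can also be solved directly.
P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state transitions under pi
r_pi = np.einsum("sa,sat,sat->s", pi, P, R)  # expected immediate reward under pi
V_exact = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)

print(V, V_exact)
```

The iterative version scales to larger state spaces where a direct solve is too expensive; the linear solve is included here only as a check that both routes satisfy the same equation.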
In very similar fashion, we can derive a Bellman equation for the Q function; the final equation is as follows:
$$Q^\pi(s, a) = \sum_{s'} \mathcal{P}_{ss'}^a \left[\mathcal{R}_{ss'}^a + \gamma \sum_{a'} \pi(s', a')\, Q^\pi(s', a')\right]$$
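The relationship between the two Bellman equations can be checked numerically. The sketch below assumes the standard form in which the next action is weighted by the policy probabilities; the MDP, the policy, and the discount factor are hypothetical values chosen for illustration. It computes the state values, derives the Q values from them, and confirms that the Q values reproduce themselves under a one-step backup.

```python
import numpy as np

# Hypothetical MDP and policy (illustrative numbers only), indexed as [s, a, s_next].
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
])
R = np.array([
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 0.0], [0.0, -1.0]],
])
pi = np.array([
    [0.5, 0.5],
    [0.9, 0.1],
])
gamma = 0.9  # assumed discount factor

# State values under pi, obtained by solving the linear Bellman equation directly.
P_pi = np.einsum("sa,sat->st", pi, P)
r_pi = np.einsum("sa,sat,sat->s", pi, P, R)
V = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)

# Q(s, a): expected reward plus discounted value of the next state.
Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])

# Averaging Q over the policy's action probabilities should give back V.
V_from_Q = np.sum(pi * Q, axis=1)

# One-step backup of Q using only Q itself (next actions weighted by pi).
Q_backup = np.einsum("sat,sat->sa", P, R + gamma * V_from_Q[None, None, :])

print(Q)
```

If `Q_backup` matches `Q`, the Q values are a fixed point of the Bellman equation for the Q function under this policy.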
Now that we have Bellman equations for both the value and Q functions, we will see how to find the optimal policies.