Markov Decision Processes 03: optimal policy variation

This post is going to follow up on what was discussed here, and I will show you how changing the value of the discount factor may affect what the optimal policy is. For that, consider the MDP defined below:

This MDP is inspired by the example I have been using; I added an extra state and stripped the MDP of the majority of its transitions so we can focus on what is essential here.

From the MDP above there is a clear contender for the title of optimal policy:

$$\begin{align} &\pi(H) = E\\ &\pi(T) = D\\ &\pi(TU) = D\end{align}$$

Assume that when we enter the MDP there is a $50\%$ chance we start Hungry and a $50\%$ chance we start Thirsty. If we use the policy $\pi$, then the expected reward is

$$E[R | \pi] = 0.5E[R_T | \pi] + 0.5E[R_H | \pi]$$

where $E[R_s | \pi]$ is the expected reward we get by following policy $\pi$ starting from state $s$. From my previous post we know that $E[R_T | \pi] = E[R_H | \pi] = \frac{1}{1-\gamma}$ and hence

$$E[R | \pi] = \frac{1}{1-\gamma}.$$
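To make this concrete, here is a minimal Python sketch of that computation. It assumes, as in the previous post, that following $\pi$ earns a reward of $1$ on every step, so the return from either starting state is the geometric series $1 + \gamma + \gamma^2 + \cdots$; the state names and the 50/50 start distribution come from the text above, and the function names are just illustrative.

```py
def discounted_return(gamma: float, horizon: int = 10_000) -> float:
    """Discounted return of a trajectory that earns reward 1 on every step.

    This is the truncated geometric series 1 + gamma + gamma**2 + ...,
    which for gamma < 1 converges to 1 / (1 - gamma).
    """
    return sum(gamma ** t for t in range(horizon))


def expected_reward(gamma: float) -> float:
    """E[R | pi] = 0.5 * E[R_T | pi] + 0.5 * E[R_H | pi].

    Under the candidate policy pi, both starting states (Hungry and
    Thirsty) yield reward 1 per step, so both terms equal 1 / (1 - gamma)
    and so does their average.
    """
    start_distribution = {"Hungry": 0.5, "Thirsty": 0.5}
    return sum(p * discounted_return(gamma) for p in start_distribution.values())


if __name__ == "__main__":
    for gamma in (0.5, 0.9, 0.99):
        print(gamma, expected_reward(gamma), 1 / (1 - gamma))
```

Running it prints, for each value of $\gamma$, the simulated expectation next to the closed-form $\frac{1}{1-\gamma}$, and the two agree.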