Policy Iteration as One Step of Newton's Method

A discrete-time dynamical system is defined by the system equation

\[s_{t+1} = f(s_t, a_t, w_t, t),\]

where \(s_{t}\in \mathcal{S}\) is the system state at time \(t\), \(a_t \in \mathcal{A}\) is the action, and \(w_t \in \mathcal{W}\) is a random disturbance, i.e., a realization of the random variable \(W_t\). Actions are decided by a control policy \(\pi: \mathcal{S} \rightarrow \mathcal{A}\) and the disturbances are sampled from a probability distribution \(P(\cdot \mid s_t, a_t)\).

Each state transition is associated with a reward \(r(s_t, a_t) \in \mathbb{R}\), and a policy is said to be optimal if it maximizes the expected cumulative discounted reward \(\mathbb{E}_{(W_t)_{t=1,2,...,N}}\left[\sum_{t=1}^{N}\gamma^{t-1}r(s_t, a_t)\right],\) where \(N\) is the time horizon (possibly infinite) and \(\gamma \in [0,1]\) is a discount factor.
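As a concrete illustration of this objective, the sketch below estimates the expected discounted return of a fixed policy by Monte Carlo on a small, randomly generated finite MDP. The MDP itself and names such as `n_states`, `P`, `r`, and `rollout_return` are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite MDP, used only for illustration.
n_states, n_actions, horizon, gamma = 4, 2, 50, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s' | s, a)
r = rng.random((n_states, n_actions))                             # r(s, a)
policy = rng.integers(n_actions, size=n_states)                   # pi: S -> A (fixed, arbitrary)

def rollout_return(s0: int, n_rollouts: int = 2000) -> float:
    """Monte Carlo estimate of E[sum_t gamma^(t-1) r(s_t, a_t)] from s0 under `policy`.

    The disturbance W_t is realized implicitly by sampling s_{t+1} from P(. | s_t, a_t).
    """
    total = 0.0
    for _ in range(n_rollouts):
        s, g = s0, 0.0
        for t in range(horizon):
            a = policy[s]
            g += gamma**t * r[s, a]
            s = rng.choice(n_states, p=P[s, a])
        total += g
    return total / n_rollouts

print(rollout_return(s0=0))
```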

The Bellman equation relates a policy \(\pi\) to the value function \(V^{\pi}\), which is defined as

\[V^{\pi}(s_t) \triangleq \mathbb{E}_W\left[r(s_t, \pi(s_t)) + \gamma V^{\pi}(f(s_{t}, \pi(s_t), w_t))\right],\]

for all \(s_t \in \mathcal{S}\).

Since any policy that maximizes the expected cumulative discounted reward also satisfies

\[\pi^{\star}(s_t) \in \arg\max_{a_t\in \mathcal{A}}\mathbb{E}_W\left[r(s_t, a_t) + \gamma V^{\pi^{\star}}(s_{t+1}) \mid s_t, a_t\right],\]

the Bellman equation effectively provides an alternative optimality condition (necessary and sufficient). (Remarks: we restrict ourselves to finite state and action spaces, which ensures that a maximizer of the equation above always exists and is why we write \(\max\) instead of \(\sup\); if the system model \(f\) or the reward function \(r\) is time-dependent, or if the time horizon \(N\) is finite, then \(\pi^{\star}\) and \(V^{\star}\) may be non-stationary, denoted \(\pi^{\star}_t\) and \(V^{\star}_t\).)
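To make the optimality condition concrete, the following sketch enumerates all deterministic stationary policies of a tiny, randomly generated MDP, evaluates each one exactly, and checks that the best policy is greedy with respect to its own value function. The MDP and all names are illustrative assumptions; brute-force enumeration is only feasible here because the example is tiny.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical tiny MDP (illustrative only).
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.random((n_states, n_actions))                             # r(s, a)
idx = np.arange(n_states)

def policy_value(pi: np.ndarray) -> np.ndarray:
    """Exact V^pi, obtained by solving the Bellman equation V = r_pi + gamma * P_pi V."""
    return np.linalg.solve(np.eye(n_states) - gamma * P[idx, pi], r[idx, pi])

# Enumerate all deterministic stationary policies and keep the one whose value
# dominates in every state (for finite MDPs such a policy exists).
best_pi, best_V = None, None
for pi in itertools.product(range(n_actions), repeat=n_states):
    V = policy_value(np.array(pi))
    if best_V is None or np.all(V >= best_V - 1e-12):
        best_pi, best_V = np.array(pi), V

# Bellman optimality condition: pi* is greedy with respect to its own value function.
Q = r + gamma * P @ best_V                            # Q[s, a] = E[r(s, a) + gamma V*(s')]
print(np.allclose(Q[idx, best_pi], Q.max(axis=1)))    # True
```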

Dynamic Programming

Dynamic programming algorithms, e.g., value iteration and policy iteration, use the Bellman equation to obtain an optimal policy through successive approximations of \(V^{\pi^{\star}}\).

Let \(T\) and \(T_{\pi}\) denote Bellman operators defined as

\[(TV)(s) \triangleq \max_{a \in \mathcal{A}}\mathbb{E}_W\left[r(s, a) + \gamma V(f(s, a, W))\right]\] \[(T_{\pi}V)(s) \triangleq \mathbb{E}_W\left[r(s, \pi(s)) + \gamma V(f(s, \pi(s), W))\right]\]

for all \(s \in \mathcal{S}\). Using these operators, value iteration can be defined through the recursion \(V_{k+1} = TV_{k}\). Similarly, policy iteration can be defined through the following two equations

\[\pi_{k+1}(s) \in \arg\max_{a \in \mathcal{A}}\mathbb{E}_W\left[r(s, a) + \gamma V^{\pi_{k}}(f(s, a, W))\right]\] \[V^{\pi_{k}} = T_{\pi_{k}}V^{\pi_{k}}\]

for all \(s \in \mathcal{S}\) and \(k=1,2,...\).
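The sketch below spells out \(T\) and \(T_{\pi}\), one value-iteration update, and one policy-iteration update (improvement followed by exact evaluation) for a hypothetical tabular MDP; the arrays `P` and `r` and the helper names are assumptions of the example, not notation from the text.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical tabular MDP (illustrative only).
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.random((n_states, n_actions))                             # r(s, a)
idx = np.arange(n_states)

def T(V: np.ndarray) -> np.ndarray:
    """Bellman optimality operator: (TV)(s) = max_a E[r(s, a) + gamma * V(s')]."""
    return (r + gamma * P @ V).max(axis=1)

def T_pi(V: np.ndarray, pi: np.ndarray) -> np.ndarray:
    """Bellman operator of a fixed policy: (T_pi V)(s) = r(s, pi(s)) + gamma * E[V(s')]."""
    return r[idx, pi] + gamma * P[idx, pi] @ V

def improve(V: np.ndarray) -> np.ndarray:
    """Policy improvement: pi(s) in argmax_a E[r(s, a) + gamma * V(s')]."""
    return (r + gamma * P @ V).argmax(axis=1)

def evaluate(pi: np.ndarray) -> np.ndarray:
    """Policy evaluation: solve the linear fixed-point equation V = T_pi V."""
    return np.linalg.solve(np.eye(n_states) - gamma * P[idx, pi], r[idx, pi])

V0 = np.zeros(n_states)
print("value-iteration update T V_0:", T(V0))

pi1 = improve(V0)                 # greedy policy with respect to V_0
V_pi1 = evaluate(pi1)             # its exact value function
print("policy-iteration update V^{pi_1}:", V_pi1)
print("fixed-point residual:", np.max(np.abs(T_pi(V_pi1, pi1) - V_pi1)))  # ~0
```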

If \(|\mathcal{S}| < \infty\), \(|\mathcal{A}| < \infty\), and either \(\gamma \in [0,1)\) or \(N < \infty\), then the value functions produced by value iteration satisfy

\[\lim_{k\rightarrow \infty}V_{k} = V^{\pi^{\star}}\]

and the policies produced by policy iteration satisfy

\[\lim_{k\rightarrow \infty}\pi_{k} = \pi^{\star}.\]
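A quick numerical sanity check of these limits on a hypothetical discounted MDP (\(\gamma < 1\), finite \(\mathcal{S}\) and \(\mathcal{A}\)): value iteration is run until the update is negligible, policy iteration until the policy stops changing, and the resulting value functions are compared. The MDP, seed, and tolerance are arbitrary choices made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical discounted MDP (illustrative only).
n_states, n_actions, gamma = 5, 3, 0.95
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.random((n_states, n_actions))                             # r(s, a)
idx = np.arange(n_states)

# Value iteration: V_{k+1} = T V_k until the update is negligible.
V = np.zeros(n_states)
for _ in range(10_000):
    V_next = (r + gamma * P @ V).max(axis=1)
    done = np.max(np.abs(V_next - V)) < 1e-12
    V = V_next
    if done:
        break

# Policy iteration: alternate improvement and exact evaluation until the policy is stable.
pi = np.zeros(n_states, dtype=int)
while True:
    V_pi = np.linalg.solve(np.eye(n_states) - gamma * P[idx, pi], r[idx, pi])
    pi_next = (r + gamma * P @ V_pi).argmax(axis=1)
    if np.array_equal(pi_next, pi):
        break
    pi = pi_next

print(np.max(np.abs(V - V_pi)))   # ~0: both converge to V^{pi*}
```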

Relation Between Policy Iteration and Newton’s Method

Policy iteration can be viewed as a vector-space version of Newton's method for solving the Bellman optimality equation \(TV = V\), whose solution is \(V^{\pi^{\star}}\). To see this, note that the policies produced by policy iteration satisfy \(T_{\pi_{k}}V^{\pi_{k-1}} = TV^{\pi_{k-1}}\), where \(T_{\pi_{k}}\) is affine (its fixed point is found by solving a linear system) while \(T\) is non-linear. Hence, policy iteration effectively linearizes the Bellman operator around \(V^{\pi_{k-1}}\) and then solves for the fixed point of the linearization, analogous to Newton's method.

To demonstrate the connection to Newton's method, we will show that the sequence of value functions \(V_{1},V_{2}, ...\) produced by repeated application of the policy improvement and policy evaluation equations can be expressed through a recursion of the same form as Newton's equation for finding a root \(f(x^{\star})=0\) of a differentiable function \(f(x)\) (here \(f\) denotes a generic function, not the system dynamics):

\[x_n = x_{n-1} - \frac{f(x_{n-1})}{\nabla f(x_{n-1})}.\]
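For reference, here is a minimal sketch of this scalar recursion, applied to the illustrative function \(f(x) = x^2 - 2\), whose positive root is \(\sqrt{2}\):

```python
# Newton recursion x_n = x_{n-1} - f(x_{n-1}) / f'(x_{n-1}) for the
# illustrative function f(x) = x^2 - 2 (a generic f, not the system dynamics).
def f(x: float) -> float:
    return x**2 - 2.0

def df(x: float) -> float:
    return 2.0 * x

x = 1.0
for _ in range(6):
    x -= f(x) / df(x)

print(x)   # ~1.4142135623..., i.e. sqrt(2)
```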

We start from the policy evaluation step and write it in a vectorized form:

\[V_{k} = \mathbf{r}_{\pi_{k-1}} + \gamma \mathbf{P}_{\pi_{k-1}}V_{k}\\ \implies V_{k} - \gamma \mathbf{P}_{\pi_{k-1}}V_{k} = \mathbf{r}_{\pi_{k-1}} \\ \implies \left(\mathbf{I} - \gamma \mathbf{P}_{\pi_{k-1}}\right) V_{k} = \mathbf{r}_{\pi_{k-1}} \\ \implies V_{k} = \left(\mathbf{I} - \gamma \mathbf{P}_{\pi_{k-1}}\right)^{-1}\mathbf{r}_{\pi_{k-1}}\]

where \(\mathbf{I}\) is the identity matrix, \(\pi_{k-1}\) is the policy obtained from the policy improvement equation (i.e., the greedy policy with respect to \(V_{k-1}\)), the components of the vector \(\mathbf{r}_{\pi_{k-1}}\) are \(r(s, \pi_{k-1}(s))\) for the different states, and \(\mathbf{P}_{\pi_{k-1}}\) is the matrix of state-transition probabilities induced by \(\pi_{k-1}\).
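As a sketch of this vectorized evaluation step (on a hypothetical random MDP, with a fixed policy `pi` standing in for \(\pi_{k-1}\)), the closed-form solve \(\left(\mathbf{I} - \gamma \mathbf{P}_{\pi}\right)^{-1}\mathbf{r}_{\pi}\) is compared below against repeated application of \(T_{\pi}\):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical MDP; `pi` stands in for pi_{k-1} in the text.
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.random((n_states, n_actions))                             # r(s, a)
pi = rng.integers(n_actions, size=n_states)                       # arbitrary fixed policy

idx = np.arange(n_states)
P_pi = P[idx, pi]   # state-transition matrix induced by pi
r_pi = r[idx, pi]   # reward vector under pi

# Closed-form policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi.
V_solve = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# The same fixed point is reached by repeatedly applying T_pi.
V_iter = np.zeros(n_states)
for _ in range(2000):
    V_iter = r_pi + gamma * P_pi @ V_iter

print(np.max(np.abs(V_solve - V_iter)))   # ~0
```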

Next, we rewrite the above vector equation in terms of the Bellman residual operator \(BV \triangleq TV - V\). Since \(\pi_{k-1}\) is greedy with respect to \(V_{k-1}\), we have \(TV_{k-1} = T_{\pi_{k-1}}V_{k-1} = \mathbf{r}_{\pi_{k-1}} + \gamma \mathbf{P}_{\pi_{k-1}} V_{k-1}\), and hence \(BV_{k-1} = \mathbf{r}_{\pi_{k-1}} - \left(\mathbf{I} - \gamma \mathbf{P}_{\pi_{k-1}}\right)V_{k-1}\). Substituting this into the expression for \(V_{k}\) above gives

\[V_{k} = V_{k-1} + \left(\mathbf{I} - \gamma \mathbf{P}_{\pi_{k-1}}\right)^{-1}BV_{k-1}\]

Now let \(\nabla_{V_{k-1}}\) denote a vector-space generalization of the gradient operator, i.e., the Jacobian with respect to \(V_{k-1}\). We then obtain:

\[\nabla_{V_{k-1}}BV_{k-1} = \nabla_{V_{k-1}}(TV_{k-1}-V_{k-1}) \\ = \nabla_{V_{k-1}}\left(\mathbf{r}_{\pi_{k-1}} + \gamma \mathbf{P}_{\pi_{k-1}} V_{k-1} - V_{k-1}\right) \\ = \gamma \mathbf{P}_{\pi_{k-1}} - \mathbf{I}.\]

Then we have that

\[\left(\mathbf{I} - \gamma \mathbf{P}_{\pi_{k-1}}\right)^{-1}= -\left(\nabla_{V_{k-1}}BV_{k-1}\right)^{-1}.\]

Plugging this into the recursion \(V_{k} = V_{k-1} + \left(\mathbf{I} - \gamma \mathbf{P}_{\pi_{k-1}}\right)^{-1}BV_{k-1}\) derived above, we obtain

\[V_{k} = V_{k-1} - \left(\nabla_{V_{k-1}}BV_{k-1}\right)^{-1}BV_{k-1},\]

which is of the same form as Newton’s equation.
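The sketch below checks both ingredients of this argument numerically on a hypothetical random MDP: the Jacobian of \(B\) at \(V_{k-1}\), estimated by finite differences, is compared against \(\gamma \mathbf{P}_{\pi_{k-1}} - \mathbf{I}\), and one policy-iteration update is compared against the Newton step \(V_{k-1} - \left(\nabla_{V_{k-1}}BV_{k-1}\right)^{-1}BV_{k-1}\). All array names and the MDP itself are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical tabular MDP; V_prev plays the role of V_{k-1} in the text.
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.random((n_states, n_actions))                             # r(s, a)
idx = np.arange(n_states)

def B(V: np.ndarray) -> np.ndarray:
    """Bellman residual operator: BV = TV - V."""
    return (r + gamma * P @ V).max(axis=1) - V

V_prev = rng.random(n_states)
pi = (r + gamma * P @ V_prev).argmax(axis=1)   # greedy policy at V_prev, i.e. pi_{k-1}
P_pi, r_pi = P[idx, pi], r[idx, pi]

# 1) Finite-difference Jacobian of B at V_prev (valid where the greedy policy
#    does not change under the perturbation) vs. the closed form gamma * P_pi - I.
eps = 1e-6
J_fd = np.column_stack([(B(V_prev + eps * e) - B(V_prev - eps * e)) / (2 * eps)
                        for e in np.eye(n_states)])
J = gamma * P_pi - np.eye(n_states)
print(np.max(np.abs(J_fd - J)))                # ~0

# 2) One policy-iteration update (improve, then evaluate exactly) vs. one Newton step.
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)   # V_k
V_newton = V_prev - np.linalg.solve(J, B(V_prev))
print(np.max(np.abs(V_pi - V_newton)))         # ~0: the two updates coincide
```

The second comparison agrees to floating-point precision because, as derived above, the two updates are algebraically identical; the finite-difference check is only approximate and assumes the greedy policy does not change under the small perturbation.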