Can you please upload a solution for the exam?

Thanks

Can the Moed A exam and its solution be uploaded?

Thanks

Stability implies that the optimal control has finite cost, but it does not guarantee that we can reach any state. (For example, we always move to x=0.)

We want to reach x=0, and you should think of it as if we "normalized" the system so that the origin (x=0) is the desired operating point.

In question 2 of Moed B you are told that the trajectories are generated using pi, and you are asked to run Monte Carlo to learn V of pi.
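For the mechanics of that estimation, here is a minimal first-visit Monte Carlo sketch on a made-up 4-state chain (the MDP, rewards, and gamma below are illustrative assumptions, not taken from the exam):

```python
GAMMA = 0.9  # assumed discount factor for this toy example

def run_episode():
    """Generate one trajectory [(state, reward), ...] under pi
    on a toy chain 0 -> 1 -> 2 -> 3, where pi always moves right
    and entering the terminal state 3 yields reward 1."""
    traj, s = [], 0
    while s != 3:
        r = 1.0 if s == 2 else 0.0  # reward for the step into the goal
        traj.append((s, r))
        s += 1                      # pi: always move right
    return traj

def mc_policy_evaluation(num_episodes=1000):
    """First-visit Monte Carlo estimate of V^pi: average the
    discounted return observed after each state's first visit."""
    returns = {s: [] for s in range(3)}
    for _ in range(num_episodes):
        traj = run_episode()
        G, first_visit = 0.0, {}
        for s, r in reversed(traj):
            G = r + GAMMA * G
            first_visit[s] = G      # overwritten until the first visit wins
        for s, G in first_visit.items():
            returns[s].append(G)
    return {s: sum(v) / len(v) for s, v in returns.items() if v}

V = mc_policy_evaluation()
# Here pi is deterministic, so V = {0: 0.81, 1: 0.9, 2: 1.0}
```

On the exam the trajectories are given to you rather than simulated, but the averaging of first-visit returns is the same.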

In the LQR lecture we defined controllability as a sufficient condition for solving the ARE equations.

Then we defined stability, which basically tells us whether our system will explode or not, depending on the eigenvalues under the proposed optimal solution.

Can someone explain how the two are related?

Is it that we can reach every state but then cannot stay there? That we will try to reach it but the system will be very unstable?

Also, it says that a good system is one where the eigenvalues have magnitude lower than 1, hence x_t goes to 0. Why is that good?

We want x_t to be a specific state and not zero.

Thanks!
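To make the two notions concrete, here is a small numerical sketch (the matrices and the gain K below are made up for illustration; K is a hypothetical stabilizing gain, not the LQR-optimal one). Controllability asks whether the inputs can steer the state anywhere; stability asks whether the closed loop x_{t+1} = (A - BK) x_t decays to the origin, which is exactly the "normalized" desired state:

```python
import numpy as np

# Illustrative 2-state discrete-time system x_{t+1} = A x_t + B u_t
A = np.array([[1.1, 0.3],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])

# Controllability: rank([B, AB]) == n means the input can reach any state.
C = np.hstack([B, A @ B])
controllable = np.linalg.matrix_rank(C) == A.shape[0]

# Stability of the feedback u_t = -K x_t: all eigenvalues of A - BK
# must have magnitude < 1 (discrete time), so that x_t -> 0.
K = np.array([[1.4, 1.0]])          # hypothetical gain placing eigenvalues at 0.5, 0.4
eigs = np.linalg.eigvals(A - B @ K)
stable = bool(np.all(np.abs(eigs) < 1))
```

Note that the open-loop A here has an eigenvalue 1.1 > 1 (unstable on its own), yet the system is controllable, so a feedback gain can pull the closed-loop eigenvalues inside the unit circle. That is the link: controllability is what lets a control law make the closed loop stable.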

In the exams you published there are questions that provide traces and ask us to compute the V or Q function via some method.

My question is, how do we know if the traces were produced on-policy or off-policy?

This dramatically changes the computation of the estimated Q/V function.

Thanks
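To see why it matters, here is one transition updated two ways (all numbers are toy values, chosen only for illustration): SARSA, the on-policy method, bootstraps on the next action the behavior policy actually took in the trace, while Q-learning, the off-policy method, bootstraps on the greedy action regardless of the trace:

```python
ALPHA, GAMMA = 0.5, 0.9  # assumed step size and discount

# Current Q table (toy values)
Q = {('s', 'a'): 0.0, ("s'", 'a1'): 2.0, ("s'", 'a2'): 5.0}

# Trace fragment: in s we took a, got r = 1, landed in s',
# and the behavior policy then took a1.
r, s_next, a_next = 1.0, "s'", 'a1'

# On-policy (SARSA) target: uses the action taken in the trace.
sarsa_target = r + GAMMA * Q[(s_next, a_next)]                         # 1 + 0.9*2 = 2.8
# Off-policy (Q-learning) target: uses the greedy action at s'.
qlearn_target = r + GAMMA * max(Q[(s_next, 'a1')], Q[(s_next, 'a2')])  # 1 + 0.9*5 = 5.5

sarsa_update = Q[('s', 'a')] + ALPHA * (sarsa_target - Q[('s', 'a')])
qlearn_update = Q[('s', 'a')] + ALPHA * (qlearn_target - Q[('s', 'a')])
```

Same trace, different targets (1.4 vs 2.75 here), which is exactly why the question must state which setting the traces come from.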

In exercise 2 we handle the 'wait' action for the first time, so when we estimate the Q value we should use different weights than the ones used for the 'harvest' action.

Equation 22: it is not required to assume that alpha > 1.

Thanks

When do we update the entry Q(5,5)?

Since it is the target (room 5), it seems it can be updated only when the episode starts in that state?

If it had stayed zero, we would never have reached Q values greater than 100.
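Right: the self-loop entry gets updated when an episode happens to start in room 5 and takes action 5. A one-step sketch of that update in the classic "rooms" setting (gamma = 0.8, reward 100 for any action entering the goal room, as in the standard version of that example; the table and alpha are illustrative):

```python
ALPHA, GAMMA = 1.0, 0.8  # alpha = 1 as in the classic tutorial version
R = {(1, 5): 100, (4, 5): 100, (5, 5): 100}  # rewards for entering room 5

Q = {}  # all entries default to 0

def q(s, a):
    """Look up Q(s, a), defaulting to 0 for unseen entries."""
    return Q.get((s, a), 0.0)

# Episode starting in room 5, taking the self-loop action 5 (s' = 5):
s, a, s_next = 5, 5, 5
best_next = max(q(5, 1), q(5, 4), q(5, 5))  # greedy value at s'
Q[(s, a)] = q(s, a) + ALPHA * (R[(s, a)] + GAMMA * best_next - q(s, a))
# First such update: Q(5,5) = 100 + 0.8 * 0 = 100
```

Repeating this update pushes Q(5,5) toward its fixed point 100 / (1 - 0.8) = 500, which is exactly how entries greater than 100 appear in the converged table.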

It holds since in the first round we start by pulling each arm one time.

Shouldn't we add this to the regret?

Hence the regret should have an extra term: $\sum_{i=1}^{n} \Delta_i$
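One way to see where those initial pulls live in the bound: in the decomposition $R_T = \sum_i \Delta_i \, \mathbb{E}[T_i(T)]$, the one initialization pull per arm is already counted inside $T_i$, so its contribution is exactly $\sum_i \Delta_i$. A tiny numeric sketch (arm means and pull counts are made up for illustration):

```python
mu = [0.9, 0.6, 0.5]               # arm means; arm 0 is optimal
delta = [max(mu) - m for m in mu]  # gaps: [0.0, 0.3, 0.4]
pulls = [10, 3, 2]                 # total pulls T_i; each count includes the 1 initial pull

# Regret decomposition: sum_i Delta_i * T_i
regret = sum(d * t for d, t in zip(delta, pulls))  # 0.3*3 + 0.4*2 = 1.7
# Contribution of the initialization round alone: one pull of each arm
init_part = sum(delta)                             # 0.3 + 0.4 = 0.7
```

So whether the $\sum_i \Delta_i$ term appears explicitly depends on whether the analysis bounds $T_i$ including or excluding the initialization; it is a constant (independent of $T$) either way.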

\begin{align} \exp\left( \frac{-2 \epsilon^2 m^2}{\sum_{i=1}^{m} (b_i - a_i)^2} \right) = \exp\left( \frac{-2 \epsilon^2 m^2}{\sum_{i=1}^{m} (1 - 0)^2} \right) = \exp\left( \frac{-2 \epsilon^2 m^2}{m} \right) = \exp\left( -2 \epsilon^2 m \right) \end{align}

where $m = T_i$ because the mean $\hat{\mu}_{i,T_i}$ is computed over $T_i$ samples.
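As a sanity check on that specialization (a sketch, not part of the course material): for samples bounded in [0,1], two-sided Hoeffding gives $P(|\hat\mu - \mu| \ge \epsilon) \le 2\exp(-2\epsilon^2 m)$, and an empirical frequency should stay below that bound:

```python
import math
import random

random.seed(0)
m, eps, mu = 50, 0.2, 0.5   # sample size, deviation, true mean of Uniform[0,1]
trials = 20000

# Two-sided Hoeffding bound with b_i - a_i = 1
bound = 2 * math.exp(-2 * eps**2 * m)   # 2 * exp(-4) ~ 0.037

hits = 0
for _ in range(trials):
    mean = sum(random.random() for _ in range(m)) / m
    if abs(mean - mu) >= eps:
        hits += 1
freq = hits / trials  # empirical P(|mean - mu| >= eps), well under the bound
```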
