In question 2 of Moed B, you are told that the trajectories were generated using pi, and asked to run Monte Carlo to learn V of pi.

In the exams you published there are questions that provide traces and ask us to compute the V or Q function via some method.

My question is: how do we know whether the traces were produced on-policy or off-policy?

This dramatically changes the computation of the estimated Q/V function.
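
To illustrate what I mean, here is a rough sketch of first-visit Monte Carlo for V, assuming each trace is a list of (state, action, reward) tuples. The function name `mc_first_visit_v`, the `target` and `behavior` probability tables, and the toy traces are all made up for this example and are not taken from the exam. When both tables are given, returns are reweighted with ordinary importance sampling; otherwise the traces are treated as on-policy.

```python
from collections import defaultdict

def mc_first_visit_v(episodes, gamma=1.0, target=None, behavior=None):
    """First-visit Monte Carlo estimate of V.

    episodes: list of traces, each a list of (state, action, reward).
    target, behavior: optional dicts mapping (state, action) -> probability
    (hypothetical tables). If both are given, each return is weighted by
    the ordinary importance sampling ratio; otherwise the traces are
    assumed to come from the policy being evaluated.
    """
    off_policy = target is not None and behavior is not None
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        G = 0.0    # discounted return from step t to the end
        rho = 1.0  # importance ratio from step t to the end
        first = {}
        # Walk backwards so G and rho accumulate from the end of the trace.
        for s, a, r in reversed(episode):
            G = r + gamma * G
            if off_policy:
                rho *= target[(s, a)] / behavior[(s, a)]
            # Overwritten by earlier steps, so the first visit survives.
            first[s] = rho * G
        for s, g in first.items():
            totals[s] += g
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Toy traces (assumed): the same data, read two different ways.
episodes = [
    [("A", "x", 0.0), ("B", "x", 1.0)],
    [("A", "y", 0.0), ("B", "x", 3.0)],
]
target = {("A", "x"): 1.0, ("A", "y"): 0.0, ("B", "x"): 1.0}
behavior = {("A", "x"): 0.5, ("A", "y"): 0.5, ("B", "x"): 1.0}

v_on = mc_first_visit_v(episodes)
v_off = mc_first_visit_v(episodes, target=target, behavior=behavior)
print(v_on["A"], v_off["A"])
```

With these toy traces the estimate for state A differs depending on which assumption is made, while the estimate for B stays the same.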

Thanks

