log on action and prob for off-policy evaluation #43
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##            master      #43      +/-   ##
===========================================
- Coverage    99.90%   99.90%   -0.01%
===========================================
  Files           55       56       +1
  Lines         7455     7366      -89
===========================================
- Hits          7448     7359      -89
  Misses           7        7
else:
    ope_reward = [ sum(p*float(R.eval(a)) for p,a in zip(P,A)) for P,A,R in zip(on_probs,log_actions,log_rewards) ]
    on_action, on_prob = zip(*[sample_actions(actions, probs) for actions, probs in zip(log_actions, on_probs)])
else:
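For context, the new expression amounts to scoring the learner by the expectation of the logged reward function under the learner's own pmf, and then drawing an on-policy action from that pmf. A minimal standalone sketch with toy names and data (not the coba internals):

```python
import random

def expected_reward(on_probs, actions, reward_fn):
    # Expectation of the reward function under the learner's pmf:
    # sum over the discrete actions of P(action) * reward(action).
    return sum(p * float(reward_fn(a)) for p, a in zip(on_probs, actions))

def sample_action(actions, probs, rng=random):
    # Draw one action from the learner's pmf and return it with its probability.
    i = rng.choices(range(len(actions)), weights=probs, k=1)[0]
    return actions[i], probs[i]

actions  = ['a', 'b', 'c']
on_probs = [0.2, 0.5, 0.3]
rewards  = {'a': 0.0, 'b': 1.0, 'c': 0.5}

print(expected_reward(on_probs, actions, rewards.get))  # 0.2*0 + 0.5*1 + 0.3*0.5 = 0.65
print(sample_action(actions, on_probs))                 # e.g. ('b', 0.5)
```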
How to do this for continuous actions?
For continuous actions we just need to call on_action,on_prob = predict(log_context, log_actions)[:2]
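As a sketch of that suggestion (the helper name is hypothetical, and the (action, prob, ...) return shape is taken from the comment above rather than checked against the library):

```python
def on_policy_action_and_prob(learner, log_context, log_actions):
    # For continuous action spaces there is no pmf to enumerate, so take the
    # action and its probability/density straight from the learner's prediction
    # and ignore any trailing info payload.
    on_action, on_prob = learner.predict(log_context, log_actions)[:2]
    return on_action, on_prob
```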
Maybe on line 246? I don't think we need separate processing for batched and non-batched. Man, I hate all this batched logic. It's all here for the neural network stuff we do, where backpropagation with mini-batches can give huge gains in computation time.
I tried to add support for continuous actions but am struggling to make some tests pass; see below.
if record_context: out['context'] = log_context
if record_actions: out['actions'] = log_actions
if record_rewards: out['rewards'] = log_rewards

out.update({k: interaction[k] for k in interaction.keys()-OffPolicyEvaluator.IMPLICIT_EXCLUDE})

if record_ope_loss: out['ope_loss'] = get_ope_loss(learner)
if record_ope_loss: out['ope_loss'] = get_ope_loss(learner) if not batched else [get_ope_loss(learner)] * len(log_context)
Make OPE loss work for batched evaluation
I = [self._get_pmf_index(p) for p in pred]
A = [ a[i] for a,i in zip(actions,I) ]
P = [ p[i] for p,i in zip(pred,I) ]
A, P = list(map(list, zip(*[sample_actions(a, p, self._rng) for a, p in zip(actions, pred)])))
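One possible shape for the sample_actions helper referenced here, assuming it draws an index from the pmf with the supplied rng and returns the chosen (action, probability) pair; the actual signature in the PR may differ:

```python
import random

def sample_actions(actions, probs, rng=random):
    # Walk the cumulative distribution and return the (action, probability)
    # pair at the first index whose cumulative mass exceeds a uniform draw.
    cum, draw = 0.0, rng.random()
    for action, prob in zip(actions, probs):
        cum += prob
        if draw <= cum:
            return action, prob
    return actions[-1], probs[-1]  # guard against floating-point drift
```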
Could remove the list(map(list, ...)) if it were okay to return a tuple instead of a list. Also, I can't seem to re-run the tests.
Totally agree. Are these the only tests you're having problems with? I've tried to bump down a lot of tests over time.
Sync upstream
raise Exception()
return 0.5

def predict(self, context, actions):
Struggling to make this test pass. The processing thinks it's of AX format and then always fills in Nones for the probability. I haven't worked with continuous actions before, and I'm not quite sure about all the different formats and the SafeLearner. Any advice, @mrucker?
The main change is for the off-policy evaluator to log the action and probability of the learning model rather than those of the logged data (which are identical for all models and not very useful when trying to compare different learners).
Introduced a helper function for sampling actions and used it in some other code places to avoid redundancy.
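A rough illustration of that change (hypothetical names and interfaces, not the evaluator's actual code): the evaluator records what the learner would have done on each logged interaction rather than echoing the logged behaviour policy.

```python
import random

def evaluate_on_policy(learner, logged_interactions, rng=random):
    # For each logged interaction, ask the learner for its pmf over the logged
    # actions, record the learner's own (on-policy) action and probability, and
    # score the learner by the expected logged reward under that pmf.
    for context, actions, reward_fn in logged_interactions:
        on_probs = learner.predict(context, actions)           # learner's pmf over the actions
        i        = rng.choices(range(len(actions)), weights=on_probs, k=1)[0]
        yield {
            'action'     : actions[i],       # learner's action, not the logged one
            'probability': on_probs[i],      # learner's probability, not the logged one
            'reward'     : sum(p * float(reward_fn(a)) for p, a in zip(on_probs, actions)),
        }
```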