
log on action and prob for off-policy evaluation #43

Merged

Conversation


@jonastim commented Aug 23, 2023

The main change is that the off-policy evaluator now logs the action and probability of the learning model rather than those of the logged data (which are identical for all models and not very useful when trying to compare different learners).

Introduced a helper function for sampling the actions and used it in some other code places to avoid redundancy.
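
For illustration, a minimal sketch of what such a sampling helper could look like (the name `sample_actions` and the optional `rng` argument mirror the code snippets further down; the actual coba implementation may differ):

```python
import random

def sample_actions(actions, probs, rng=random):
    """Sample one action from a discrete action set according to a PMF.

    Returns the sampled action together with the probability the learner
    assigned to it, so both can be logged for off-policy evaluation.
    """
    # Draw a single index weighted by the learner's probabilities.
    [index] = rng.choices(range(len(actions)), weights=probs, k=1)
    return actions[index], probs[index]
```

With a helper like this, the evaluator can record the learner's own `(action, probability)` pair for each interaction instead of the pair replayed from the logged data.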


codecov bot commented Aug 24, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (7499d8b) 99.90% compared to head (3c0686b) 99.90%.

❗ Current head 3c0686b differs from pull request most recent head dd3b34b. Consider uploading reports for the commit dd3b34b to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #43      +/-   ##
==========================================
- Coverage   99.90%   99.90%   -0.01%     
==========================================
  Files          55       56       +1     
  Lines        7455     7366      -89     
==========================================
- Hits         7448     7359      -89     
  Misses          7        7              
| Flag | Coverage Δ |
| --- | --- |
| ubuntu-latest | 99.90% <100.00%> (-0.01%) ⬇️ |
| unittest | 99.90% <100.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.


else:
    ope_reward = [ sum(p*float(R.eval(a)) for p,a in zip(P,A)) for P,A,R in zip(on_probs,log_actions,log_rewards) ]
    on_action, on_prob = zip(*[sample_actions(actions, probs) for actions, probs in zip(log_actions, on_probs)])
else:
jonastim (Contributor, Author):

How to do this for continuous actions?

mrucker (Collaborator):

For continuous actions we just need to call `on_action, on_prob = predict(log_context, log_actions)[:2]`.
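
Putting the two cases side by side, a rough per-interaction sketch (here `is_discrete` is just an illustrative flag, and the discrete branch assumes the prediction is a PMF over `log_actions` as in the snippet above; exact coba signatures may differ):

```python
if is_discrete:
    # Discrete: the prediction is a PMF over the logged action set, so sample from it.
    on_probs = predict(log_context, log_actions)
    on_action, on_prob = sample_actions(log_actions, on_probs)
else:
    # Continuous: the prediction already carries the chosen action and its probability/density.
    on_action, on_prob = predict(log_context, log_actions)[:2]
```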

mrucker (Collaborator):

Maybe on line 246? I don't think we need to have separate processing for batched and non-batched. Man I hate all this batched logic. It's all here for neural network stuff we do where backpropagation with mini-batches can give huge gains in computation time.

jonastim (Contributor, Author):

Tried to add support for continuous actions but I'm struggling to make some tests pass; see below.

if record_context: out['context'] = log_context
if record_actions: out['actions'] = log_actions
if record_rewards: out['rewards'] = log_rewards

out.update({k: interaction[k] for k in interaction.keys()-OffPolicyEvaluator.IMPLICIT_EXCLUDE})

-if record_ope_loss: out['ope_loss'] = get_ope_loss(learner)
+if record_ope_loss: out['ope_loss'] = get_ope_loss(learner) if not batched else [get_ope_loss(learner)] * len(log_context)
jonastim (Contributor, Author):

Make OPE loss work for batched evaluation

-I = [self._get_pmf_index(p) for p in pred]
-A = [ a[i] for a,i in zip(actions,I) ]
-P = [ p[i] for p,i in zip(pred,I) ]
+A, P = list(map(list, zip(*[sample_actions(a, p, self._rng) for a, p in zip(actions, pred)])))
jonastim (Contributor, Author):

Could remove the `list(map(list, ...))` wrapping if it was OK to return a tuple instead of a list.
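
For reference, the two variants being weighed here (illustrative only):

```python
# Current: zip(*...) yields tuples, which are converted back into lists.
A, P = list(map(list, zip(*[sample_actions(a, p, self._rng) for a, p in zip(actions, pred)])))

# If tuples were acceptable downstream, the conversion could simply be dropped.
A, P = zip(*[sample_actions(a, p, self._rng) for a, p in zip(actions, pred)])
```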

@jonastim commented:

I can't seem to re-run tests. It looks like these unrelated tests should be less sensitive:

======================================================================
FAIL: test_DM (coba.tests.test_environments_filters.OpeRewards_Tests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/coba/coba/coba/tests/test_environments_filters.py", line 2390, in test_DM
    self.assertAlmostEqual(new_interactions[0]['rewards'].eval('c'),.79699, places=4)
AssertionError: 0.7970473766326904 != 0.79699 within 4 places (5.7376632690453455e-05 difference)

======================================================================
FAIL: test_DM_action_not_hashable (coba.tests.test_environments_filters.OpeRewards_Tests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/coba/coba/coba/tests/test_environments_filters.py", line 2404, in test_DM_action_not_hashable
    self.assertAlmostEqual(new_interactions[0]['rewards'].eval(['c']),.79699, places=4)
AssertionError: 0.7970473766326904 != 0.79699 within 4 places (5.7376632690453455e-05 difference)

----------------------------------------------------------------------

@jonastim marked this pull request as ready for review August 24, 2023 21:55

mrucker commented Sep 4, 2023

> I can't seem to re-run tests. Looks like these unrelated tests should be less sensitive
>
> [traceback for test_DM and test_DM_action_not_hashable quoted above]

Totally agree. Are these the only tests you're having problems with? I've tried to bump down a lot of tests over time.
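
For instance, the assertions above could be made less sensitive by comparing fewer decimal places or by stating an explicit tolerance (a sketch of the idea, not necessarily the fix that landed):

```python
# Option 1: compare to three decimal places instead of four.
self.assertAlmostEqual(new_interactions[0]['rewards'].eval('c'), .79699, places=3)

# Option 2: make the tolerance explicit.
self.assertAlmostEqual(new_interactions[0]['rewards'].eval('c'), .79699, delta=1e-4)
```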

raise Exception()
return 0.5

def predict(self, context, actions):
jonastim (Contributor, Author):

Struggling to make this test pass.
The processing thinks it's of AX format and then always fills in Nones for the probability. I haven't worked with continuous actions before and I'm not quite sure about all the different formats and how they're handled in the SafeLearner.
Any advice, @mrucker?

[Screenshot: failing test output, 2023-11-06 12:32 PM]

@mrucker merged commit d9367a7 into VowpalWabbit:master Nov 19, 2023
0 of 3 checks passed