Confident Off-Policy Evaluation and Selection through Self-Normalized Importance Weighting
Csaba Szepesvári (University of Alberta and DeepMind)
Abstract: Off-policy evaluation is the problem of predicting the value of a policy given some batch of data. In the language of statistics, this is also called counterfactual estimation. Batch policy optimization refers to the problem of finding a good policy, again given some logged data. In this talk, I will consider the case of contextual bandits, give a brief (and incomplete) review of the approaches proposed in the literature, and explain why this problem is difficult. Then, I will describe a new approach based on self-normalized importance weighting. In this approach, a semi-empirical Efron-Stein concentration inequality is combined with Harris' inequality to arrive at non-vacuous high-probability value lower bounds, which can then be used in a policy selection phase. On a number of synthetic and real datasets, this new approach is found to be significantly superior to its main competitors, both in terms of the tightness of the confidence intervals and the quality of the policies chosen.
The talk is based on joint work with Ilja Kuzborskij, Claire Vernade and Andras Gyorgy.
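As a rough illustration of the self-normalized importance weighting estimator mentioned in the abstract, here is a minimal Python sketch for logged contextual-bandit data. It is not the speakers' implementation: the function names and the toy numbers are hypothetical, and the high-probability lower bound (the Efron-Stein and Harris' inequality step) is not reproduced. Only the point estimate is shown, next to the plain importance-weighted estimate for comparison.

```python
from typing import List, Sequence


def importance_weights(target_probs: Sequence[float],
                       behavior_probs: Sequence[float]) -> List[float]:
    """Return w_i = pi(a_i | x_i) / pi_b(a_i | x_i) for each logged action.

    target_probs[i]   : probability the evaluated policy assigns to the logged action
    behavior_probs[i] : probability the logging policy assigned to that action
    (Both names are illustrative assumptions, not taken from the talk.)
    """
    if len(target_probs) != len(behavior_probs):
        raise ValueError("probability sequences must have equal length")
    if any(q <= 0.0 for q in behavior_probs):
        raise ValueError("logged actions must have positive behavior probability")
    return [p / q for p, q in zip(target_probs, behavior_probs)]


def iw_estimate(weights: Sequence[float], rewards: Sequence[float]) -> float:
    """Plain importance-weighted estimate: (1/n) * sum_i w_i * r_i."""
    return sum(w * r for w, r in zip(weights, rewards)) / len(weights)


def sniw_estimate(weights: Sequence[float], rewards: Sequence[float]) -> float:
    """Self-normalized estimate: sum_i w_i * r_i / sum_i w_i."""
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all importance weights are zero")
    return sum(w * r for w, r in zip(weights, rewards)) / total


# Toy logged data (made-up values, for illustration only).
target_probs = [0.9, 0.1, 0.5, 0.5]
behavior_probs = [0.5, 0.5, 0.25, 0.25]
rewards = [1.0, 0.0, 1.0, 0.0]

weights = importance_weights(target_probs, behavior_probs)
print("IW estimate:  ", iw_estimate(weights, rewards))
print("SNIW estimate:", sniw_estimate(weights, rewards))
```

On this toy data the plain estimate is about 0.95 and the self-normalized one about 0.63. Because the self-normalized estimate is a weighted average of the observed rewards, it always stays within their range, which is one common motivation for normalizing by the sum of the weights rather than by the sample size.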
machine learning, mathematical physics, information theory, optimization and control, probability, statistics theory
Audience: researchers in the topic
( video )
Mathematics, Physics and Machine Learning (IST, Lisbon)
Series comments: To receive the series announcements, please register at:
mpml.tecnico.ulisboa.pt
mpml.tecnico.ulisboa.pt/registration
Zoom link: videoconf-colibri.zoom.us/j/91599759679
Organizers: Mário Figueiredo, Tiago Domingos, Francisco Melo, Jose Mourao*, Cláudia Nunes, Yasser Omar, Pedro Alexandre Santos, João Seixas, Cláudia Soares, João Xavier
*contact for this listing