Reproducibility in Learning
Jessica Sorrell (UCSD)
Abstract: Reproducibility is vital to ensuring scientific conclusions are reliable, but failures of reproducibility have been a major issue in nearly all scientific areas of study in recent decades. A key issue underlying the reproducibility crisis is the explosion of methods for data generation, screening, testing, and analysis, where, crucially, only the combinations producing the most significant results are reported. Such practices (also known as p-hacking, data dredging, and researcher degrees of freedom) can lead to erroneous findings that appear to be significant, but that don’t hold up when other researchers attempt to replicate them.
In this talk, we introduce a new notion of reproducibility for randomized algorithms. This notion ensures that with high probability, an algorithm returns exactly the same output when run with two samples from the same distribution. Despite the exceedingly strong demand of reproducibility, there are efficient reproducible algorithms for several fundamental problems in statistics and learning. We show that any statistical query algorithm can be made reproducible with a modest increase in sample complexity, and we use this to construct reproducible algorithms for finding approximate heavy-hitters and medians. Using these ideas, we give the first reproducible algorithm for learning halfspaces via a reproducible weak learner and a reproducible boosting algorithm. We initiate the study of lower bounds and inherent tradeoffs for reproducible algorithms, giving nearly tight sample complexity upper and lower bounds for reproducible versus non-reproducible SQ algorithms. Finally, we discuss connections to other well-studied notions of algorithmic stability, such as differential privacy.
Joint work with Russell Impagliazzo (UCSD), Rex Lei (UCSD), and Toniann Pitassi (Columbia University), to appear in STOC 2022.
computational complexitycomputational geometrycryptography and securitydiscrete mathematicsdata structures and algorithmsgame theorymachine learningquantum computing and informationcombinatoricsinformation theoryoptimization and controlprobability
Audience: researchers in the topic
( paper )
Series comments: Description: Theoretical Computer Science
People can register to a talk via our webpage sites.google.com/site/plustcs/livetalk , or subscribe to our calendar and mailing list at sites.google.com/site/plustcs/rss-feeds
| Organizers: | Clément Canonne*, Anindya De, Sumegha Garg, Gautam Kamath, Ilya Razenshteyn, Oded Regev, Tselil Schramm, Thomas Vidick, Erik Waingarten |
| *contact for this listing |
