Understanding Neural Net Training Dynamics in Tractable Slices

Dr. Cyril Zhang (Microsoft Research)

18-Feb-2022, 18:00-19:00

Abstract: The ability of deep neural networks to successfully train and generalize seems to evade precise characterization by classical theory from optimization and statistics. Through some of my recent work, I'll present some perspectives on how this theory-practice gap inconveniently manifests itself, and discuss how theoretically-grounded algorithm design can still deliver near-term improvements and new tools in this space:

- Self-stabilizing unstable learning rates. We investigate a lesser-known variant of accelerated gradient descent in convex optimization, which eschews Nesterov/Polyak momentum in favor of plain gradient descent with a carefully selected "fractal" schedule of unstable learning rates. We prove stronger stability properties for this "fractal Chebyshev schedule", and try it out on neural networks (see the first sketch after this list). [https://arxiv.org/pdf/2103.01338v2.pdf]

- Learning rate grafting. We propose an experiment which interpolates between optimizers (e.g. SGD and Adam), to better understand their training dynamics. We find that Adam isn't necessary to pretrain a state-of-the-art BERT model after all: instead, simply find its implicit (per-layer) learning rate schedule via grafting, then transfer this schedule to SGD (see the second sketch after this list). [https://openreview.net/pdf?id=FpKgG31Z_i9]

- Inductive biases of attention models. I'll briefly mention some very recent work, which begins to quantify the inductive biases of the self-attention-based models used in {NMT, GPT-{1,2,3}, BERT, AlphaFold, Codex, ...}. We propose a theoretical mechanism by which a bounded-norm Transformer can represent a concise circuit near the statistical limit. [https://arxiv.org/pdf/2110.10090.pdf]
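
To make the first bullet concrete, here is a minimal NumPy sketch (not the paper's exact construction) of a Chebyshev step-size schedule for plain gradient descent on a toy quadratic with curvature in [mu, L]. The step sizes are reciprocals of Chebyshev nodes, so some of them exceed the classical stability threshold 2/L; the permutation below is a bit-reversal ordering used here as an assumed stand-in for the paper's fractal ordering, and the function names and toy problem are illustrative.

```python
import numpy as np

def chebyshev_steps(mu, L, T):
    # Step sizes are reciprocals of the Chebyshev nodes on [mu, L];
    # several of them exceed the usual stability threshold 2/L.
    i = np.arange(1, T + 1)
    nodes = 0.5 * (L + mu) + 0.5 * (L - mu) * np.cos((2 * i - 1) * np.pi / (2 * T))
    return 1.0 / nodes

def bit_reversal_order(T):
    # Assumed stand-in for the paper's fractal permutation: classical
    # bit-reversal ordering, valid when T is a power of two.
    bits = int(np.log2(T))
    return np.array([int(format(k, f"0{bits}b")[::-1], 2) for k in range(T)])

def fractal_chebyshev_schedule(mu, L, T):
    eta = chebyshev_steps(mu, L, T)
    return eta[bit_reversal_order(T)]

# Toy check: plain GD on f(x) = 0.5 * x^T A x with eigenvalues in [mu, L].
rng = np.random.default_rng(0)
mu_, L_ = 0.1, 1.0
A = np.diag(np.linspace(mu_, L_, 16))
x = rng.normal(size=16)
for eta in fractal_chebyshev_schedule(mu_, L_, 64):
    x = x - eta * (A @ x)   # gradient of f is A x
print("final ||x||:", np.linalg.norm(x))
```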
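
The second bullet's grafting rule can be sketched per layer as: take the direction of one optimizer's step and rescale it to the norm of another optimizer's step. Below is a minimal NumPy illustration with hypothetical helper names (`adam_step`, `grafted_step`) and toy numbers; logging the per-layer magnitude norms over training is what recovers the "implicit learning rate schedule" mentioned above.

```python
import numpy as np

def adam_step(grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam update for one parameter tensor (the "magnitude" optimizer M).
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

def grafted_step(grad, adam_state, sgd_lr=0.1, eps=1e-16):
    # Graft(M = Adam, D = SGD) for one layer: SGD's direction, Adam's step norm.
    # Recording ||delta_m|| per layer over training yields the implicit
    # per-layer learning rate schedule that can later be replayed with SGD.
    delta_m = adam_step(grad, adam_state)   # magnitude comes from Adam
    delta_d = -sgd_lr * grad                # direction comes from plain SGD
    scale = np.linalg.norm(delta_m) / (np.linalg.norm(delta_d) + eps)
    return scale * delta_d

# Usage on a single layer's gradient (toy numbers):
g = np.array([0.5, -1.0, 0.25])
state = {"t": 0, "m": np.zeros_like(g), "v": np.zeros_like(g)}
print(grafted_step(g, state))
```

The grafting is applied per layer (one rescaling per parameter group), which is what exposes layerwise differences between the two optimizers rather than a single global learning rate.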

Bio: Cyril Zhang is a Senior Researcher at Microsoft Research NYC. He works on theory and algorithms for prediction and decision-making in dynamical systems, large-scale learning, and language modeling. He completed a Ph.D. in Computer Science from Princeton under the supervision of Prof. Elad Hazan, during which he was a Student Researcher at Google AI. Before that, he worked on Laplacian solvers and exoplanets at Yale.

machine learning, dynamical systems, optimization and control, data analysis, statistics and probability

Audience: researchers in the topic area


e-Seminar on Scientific Machine Learning

Series comments: Please sign up for our SciML Google group (https://groups.google.com/g/sciml) in order to receive updates and announcements!

Organizers: N. Benjamin Erichson*, Hessam Babaee, Soon Hoe Lim, Michael W. Mahoney
*contact for this listing
