BEGIN:VCALENDAR
VERSION:2.0
PRODID:researchseminars.org
CALSCALE:GREGORIAN
X-WR-CALNAME:researchseminars.org
BEGIN:VEVENT
SUMMARY:Dr. Cyril Zhang (Microsoft Research)
DTSTART:20220218T180000Z
DTEND:20220218T190000Z
DTSTAMP:20260423T021529Z
UID:SciML/3
DESCRIPTION:Title: <a href="https://researchseminars.org/talk/SciML/3/">Un
 derstanding Neural Net Training Dynamics in Tractable Slices</a>\nby Dr. C
 yril Zhang (Microsoft Research) as part of e-Seminar on Scientific Machine
  Learning\n\n\nAbstract\nThe ability of deep neural networks to successful
 ly train and generalize seems to evade precise characterization by classic
 al theory from optimization and statistics. Through some of my recent work
 \, I'll present some perspectives on how this theory-practice gap inconven
 iently manifests itself\, and discuss how theoretically grounded algorithm
  design can still deliver near-term improvements and new tools in this spa
 ce:\n\n- Self-stabilizing unstable learning rates. We investigate a lesser
 -known variant of accelerated gradient descent in convex optimization\, wh
 ich eschews Nesterov/Polyak momentum in favor of plain gradient descent wi
 th a carefully selected "fractal" schedule of unstable learning rates. We 
 prove stronger stability properties for this "fractal Chebyshev schedule"\
 , and try it out on neural networks. [https://arxiv.org/pdf/2103.01338v2.p
 df]\n\n- Learning rate grafting. We propose an experiment which interpolat
 es between optimizers (e.g. SGD and Adam)\, to better understand their tra
 ining dynamics. We find that Adam isn't necessary to pretrain a state-of-t
 he-art BERT model after all: instead\, simply find its implicit (per-layer
 ) learning rate schedule via grafting\, then transfer this schedule to SGD
 . [https://openreview.net/pdf?id=FpKgG31Z_i9]\n\n- Inductive biases of att
 ention models. I'll briefly mention some very recent work\, which begins t
 o quantify the inductive biases of the self-attention-based models used in
  {NMT\, GPT-{1\,2\,3}\, BERT\, AlphaFold\, Codex\, ...}. We propose a theo
 retical mechanism by which a bounded-norm Transformer can represent a conc
 ise circuit near the statistical limit. [https://arxiv.org/pdf/2110.10090.
 pdf]\n\nBio: Cyril Zhang is a Senior Researcher at Microsoft Research NYC.
  He works on theory and algorithms for prediction and decision-making in d
 ynamical systems\, large-scale learning\, and language modeling. He comple
 ted a Ph.D. in Computer Science at Princeton under the supervision of Pr
 of. Elad Hazan\, during which he was a Student Researcher at Google AI. Be
 fore that\, he worked on Laplacian solvers and exoplanets at Yale.\n
LOCATION:https://researchseminars.org/talk/SciML/3/
END:VEVENT
END:VCALENDAR
