Weight Decay Controls Implicit Regularization: Insights on Generalization and Sparsity
Tom Jacobs (CISPA Helmholtz Center for Information Security)
Abstract: Classical statistics teaches us that overparameterization causes overfitting, which prevents good generalization. Highly overparameterized neural network architectures, however, generalize surprisingly well. This is because training such models tends towards low-rank or sparse solutions, without requiring explicit constraints. This preference is known as implicit regularization, and it appears in a variety of contexts, including attention layers, LoRA, matrix sensing, and diagonal linear networks. Implicit regularization thus helps explain how neural networks avoid overfitting and generalize well.
In this work I will show how weight decay controls implicit regularization beyond its explicit role of constraining model capacity. For instance, it can move the implicit regularizer from $L_2$ to $L_1$, which leads to sparser models. This demonstrates that weight decay not only acts as an explicit constraint on the model, but also has an implicit effect. By turning weight decay off during training, only the implicit effect remains, which improves generalization overall. Beyond generalization, I use these insights to induce sparsity in deep neural networks, where the goal is to reduce model size and inference time by removing as many weights as possible. This results in a new method from our previous work, PILoT (Parametric Implicit Lottery Ticket): a sparsification method based on overparameterization and weight decay that exploits the transition of the implicit regularization from $L_2$ to $L_1$ to sparsify gradually, achieving high sparsity with a smaller performance drop.
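To make the $L_2$-to-$L_1$ transition concrete, here is a minimal NumPy sketch of sparse regression with a diagonal linear network (illustrative only, not code from the talk or from PILoT; all names and constants are mine). The effective weights are the Hadamard product $w = u \odot v$, and by AM-GM, weight decay on the factors satisfies $\frac{\lambda}{2}(\|u\|_2^2 + \|v\|_2^2) \ge \lambda \|u \odot v\|_1$, with equality when $|u| = |v|$: weight decay on the reparameterized model acts as an $L_1$ penalty on $w$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse regression: y = X @ w_true + noise, with only k nonzero coefficients.
n, d, k = 100, 50, 5
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:k] = 3.0
y = X @ w_true + 0.01 * rng.standard_normal(n)

# Diagonal linear network: reparameterize the weights as w = u * v.
# Weight decay (wd/2) * (|u|^2 + |v|^2) upper-bounds wd * |u * v|
# coordinate-wise (AM-GM), so it acts as an implicit L1 penalty on
# the effective weights w once |u| and |v| balance.
u = np.full(d, 0.1)
v = np.full(d, 0.1)
lr, wd = 1e-3, 1e-1

for _ in range(30_000):
    w = u * v
    grad_w = X.T @ (X @ w - y) / n    # gradient of the mean squared loss in w
    du = grad_w * v + wd * u          # chain rule through w = u * v, plus decay
    dv = grad_w * u + wd * v
    u -= lr * du
    v -= lr * dv

w = u * v
print("nonzero effective weights:", int((np.abs(w) > 1e-3).sum()), "| true:", k)
```

With balanced nonnegative initialization this sketch restricts $w$ to be nonnegative, which suffices here because the true signal is positive; a signed variant uses the standard parameterization $w = u \odot u - v \odot v$.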
Theoretically, we build on the connection between reparameterizations (a specific form of overparameterization) and mirror flows (a Riemannian gradient flow), and extend it to time-varying mirror flows. The mirror flow determines the implicit bias, and weight decay in turn controls the time-varying mirror flow.
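As background on the formalism (a standard sketch from the diagonal-linear-network literature, not a result specific to this talk), a mirror flow describes gradient flow on a reparameterized model through a potential $R$ on the effective parameters $\beta$; for the Hadamard parameterization with balanced initialization at scale $\alpha$, the potential is the hyperbolic entropy of Woodworth et al. (2020):

```latex
% Mirror flow on the effective parameters \beta:
% the mirror map \nabla R evolves along the negative loss gradient.
\[
  \frac{\mathrm{d}}{\mathrm{d}t}\, \nabla R\bigl(\beta(t)\bigr)
    = -\nabla \mathcal{L}\bigl(\beta(t)\bigr).
\]
% For \beta = u \odot v with balanced initialization u(0) = v(0) = \alpha \mathbf{1},
% the potential is the hyperbolic entropy (up to an additive constant):
\[
  R_\alpha(\beta) = \sum_{i=1}^{d}
    \Bigl[\, \beta_i \operatorname{arcsinh}\!\Bigl(\frac{\beta_i}{2\alpha^2}\Bigr)
      - \sqrt{\beta_i^2 + 4\alpha^4} \,\Bigr],
\]
% which behaves like a scaled \|\beta\|_1 as \alpha \to 0 and like
% \|\beta\|_2^2 / (4\alpha^2), up to a constant, as \alpha \to \infty.
% A time-varying mirror flow replaces the fixed potential R with R_t,
% which is how weight decay can move the implicit bias from L2 towards L1.
```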
machine learning, optimization and control
Audience: learners
Tropical mathematics and machine learning
Series comments: Tropical mathematics, machine learning, category theory and anything tech+math are welcome.
Organizer: Eric Dolores-Cuenca (contact for this listing)
