Learning rate warm-up
Learning rate warm-up is a technique used in training deep neural networks, especially large models such as transformers, in which the learning rate is gradually increased at the beginning of training. Starting with a small learning rate helps stabilize the initial phase of training, when the weights are still randomly initialized and the gradients can be large. Once this phase has passed, the learning rate is raised to speed up convergence.
Here's how learning rate warm-up is typically implemented:
Choose an Initial Learning Rate: Start with a small initial learning rate, often referred to as learning_rate_min.
Choose a Warm-up Period: Decide on the number of training steps or epochs during which the learning rate will be gradually increased. This is referred to as the "warm-up period."
Gradually Increase Learning Rate: Linearly increase the learning rate from learning_rate_min to your desired maximum learning rate, which is typically referred to as learning_rate_max, over the warm-up period. The increase is typically linear, but other schedules are possible.
Use the Maximum Learning Rate: After the warm-up period, continue training with the maximum learning rate for the remaining training steps or epochs.
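The steps above can be sketched as a small framework-independent function. This is an illustrative sketch only: the function name warmup_lr and the parameter warmup_steps are assumptions introduced here, and the schedule shown is the linear variant.

```python
def warmup_lr(step, warmup_steps, learning_rate_min, learning_rate_max):
    """Return the learning rate to use at a given training step.

    Hypothetical helper: linear warm-up from learning_rate_min to
    learning_rate_max over warmup_steps, then constant.
    """
    # After the warm-up period, stay at the maximum learning rate.
    if step >= warmup_steps:
        return learning_rate_max
    # Fraction of the warm-up period completed so far, in [0, 1).
    fraction = step / warmup_steps
    # Linear interpolation between the minimum and maximum rates.
    return learning_rate_min + fraction * (learning_rate_max - learning_rate_min)
```

Halfway through the warm-up period the function returns the midpoint of the two rates, and for every step at or beyond warmup_steps it returns learning_rate_max.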
Here's an example of how you might implement learning rate warm-up in PyTorch using a linear warm-up schedule:
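A minimal sketch of such a schedule follows. The model (a single nn.Linear layer), the SGD optimizer, the random training data, and the values chosen for learning_rate_min, learning_rate_max, warmup_steps, and total_steps are all placeholders assumed for illustration. Because LambdaLR multiplies the optimizer's base learning rate by the value the lambda returns, the optimizer is created with learning_rate_max and the lambda returns a factor that grows from learning_rate_min / learning_rate_max to 1.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

# Hypothetical hyperparameters, chosen only for illustration.
learning_rate_min = 1e-5
learning_rate_max = 1e-3
warmup_steps = 100
total_steps = 300

# Placeholder model and optimizer; the optimizer's base learning rate
# is the maximum, and the scheduler scales it down during warm-up.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate_max)


def warmup_factor(step):
    """Multiplier applied to learning_rate_max at a given step."""
    if step >= warmup_steps:
        return 1.0
    start = learning_rate_min / learning_rate_max
    # Linear ramp from `start` (step 0) to 1.0 (step == warmup_steps).
    return start + (1.0 - start) * step / warmup_steps


scheduler = LambdaLR(optimizer, lr_lambda=warmup_factor)

lr_history = []
for step in range(total_steps):
    # Record the learning rate used for this step.
    lr_history.append(scheduler.get_last_lr()[0])

    # Random data stands in for a real training batch.
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 1)
    loss = nn.functional.mse_loss(model(inputs), targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Advance the schedule once per training step.
    scheduler.step()
```

Here the scheduler is stepped once per optimizer step, so warmup_steps counts training steps rather than epochs. To warm up over epochs instead, scheduler.step() would be called once per epoch.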
In this example, we use the LambdaLR scheduler to implement a linear warm-up schedule. The learning rate increases linearly from learning_rate_min to learning_rate_max over the specified number of warm-up steps. After the warm-up period, training continues with the maximum learning rate.
Learning rate warm-up can help stabilize training and improve convergence, especially when training large models on challenging tasks. However, the specific warm-up schedule and parameters may vary depending on the model architecture, dataset, and training task, so it's often necessary to experiment with different settings to find what works best for your specific scenario.