Learning rate warm-up

Learning rate warm-up is a technique used in training deep neural networks, especially large models such as transformers, in which the learning rate is gradually increased at the beginning of training. The idea is to start with a small learning rate, which helps the model stabilize during the initial phase of training, when the weights are still near their random initialization and the gradients can be large and noisy. As training progresses, the learning rate is increased to speed up convergence.

Here's how learning rate warm-up is typically implemented:

  1. Choose an Initial Learning Rate: Start with a small initial learning rate, referred to here as learning_rate_min.

  2. Choose a Warm-up Period: Decide on the number of training steps or epochs during which the learning rate will be gradually increased. This is referred to as the "warm-up period."

  3. Gradually Increase the Learning Rate: Increase the learning rate from learning_rate_min to your desired maximum learning rate, referred to here as learning_rate_max, over the warm-up period. A linear ramp is the most common choice (a small helper sketching it follows this list), but other schedules are possible.

  4. Use the Maximum Learning Rate: After the warm-up period, continue training with the maximum learning rate for the remaining training steps or epochs.
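
As a sketch of step 3, the warmed-up learning rate at any training step is just a linear interpolation between learning_rate_min and learning_rate_max, clipped at the end of the warm-up period (the helper name warmed_up_lr below is purely illustrative):

def warmed_up_lr(step, warmup_steps, learning_rate_min, learning_rate_max):
    # Linear ramp from learning_rate_min to learning_rate_max, then constant
    progress = min(1.0, step / warmup_steps)
    return learning_rate_min + (learning_rate_max - learning_rate_min) * progress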

Here's an example of how you might implement learning rate warm-up in PyTorch using a linear warm-up schedule:

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

# Define the warm-up parameters
warmup_steps = 1000  # Number of warm-up steps
learning_rate_min = 1e-5  # Small initial learning rate
learning_rate_max = 0.1  # Maximum learning rate

# Scale factor for LambdaLR: ramps the learning rate linearly from
# learning_rate_min to learning_rate_max over warmup_steps, then holds it
def warmup_lambda(step):
    if step >= warmup_steps:
        return 1.0
    ramp = learning_rate_min + (learning_rate_max - learning_rate_min) * step / warmup_steps
    return ramp / learning_rate_max  # LambdaLR multiplies the base lr by this factor

# Create an optimizer (base lr = learning_rate_max) and the warm-up scheduler
# (model, dataloader, compute_loss, num_epochs, and num_epochs_after_warmup
# are assumed to be defined elsewhere)
optimizer = optim.Adam(model.parameters(), lr=learning_rate_max)
scheduler = LambdaLR(optimizer, lr_lambda=warmup_lambda)

# Training loop
for epoch in range(num_epochs):
    for batch in dataloader:
        # Forward and backward passes
        optimizer.zero_grad()
        loss = compute_loss(batch)
        loss.backward()
        optimizer.step()
        scheduler.step()  # Update the learning rate

# Continue training with the maximum learning rate
for epoch in range(num_epochs_after_warmup):
    for batch in dataloader:
        # Forward and backward passes
        optimizer.zero_grad()
        loss = compute_loss(batch)
        loss.backward()
        optimizer.step()

In this example, we use the LambdaLR scheduler to implement a linear warm-up schedule: the optimizer is created with learning_rate_max as its base learning rate, and the lambda function scales it so that the effective learning rate increases linearly from learning_rate_min to learning_rate_max over the specified number of warm-up steps. Once the warm-up period is over, the lambda returns 1.0, so training continues with the maximum learning rate.
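
As a quick sanity check (reusing the warmup_lambda, warmup_steps, and learning rate values defined above, with a throwaway parameter so no real model is needed), you can step the scheduler on its own and print the learning rate at a few points; it should read roughly 1e-5 at step 0, about 0.05 halfway through warm-up, and 0.1 from step 1000 onward:

dummy_param = torch.nn.Parameter(torch.zeros(1))
check_optimizer = optim.Adam([dummy_param], lr=learning_rate_max)
check_scheduler = LambdaLR(check_optimizer, lr_lambda=warmup_lambda)

for step in range(2000):
    if step in (0, 500, 1000, 1500):
        print(f"step {step}: lr = {check_optimizer.param_groups[0]['lr']:.6f}")
    check_optimizer.step()    # No gradients here, so this is a no-op
    check_scheduler.step()    # Advance the warm-up schedule by one step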

Learning rate warm-up can help stabilize training and improve convergence, especially when training large models on challenging tasks. However, the specific warm-up schedule and parameters may vary depending on the model architecture, dataset, and training task, so it's often necessary to experiment with different settings to find what works best for your specific scenario.
