The CAME Optimizer: Lightweight, Effective, But Not Perfect

Recently I’ve been experimenting with optimizers other than AdamW and its kin, using Mistral-7B-V0.3 as a base. Of these, one that showed a lot of promise was CAME, which offers memory overhead similar to Adafactor with convergence comparable to AdamW. In my usage I’ve found that it works well for most tasks; however, certain challenging datasets won’t converge when it is used as a drop-in replacement for AdamW.

What Is an Optimizer as It Relates to LLMs?

Say you want to finetune a pretrained language model. You’ve gathered a lot of very diverse data because you want your model to be able to handle a wide variety of tasks with rich prose. Now, consider that each of these data points is going to demand something from the model when gradient descent is applied. If you let every data point have its full say at every step, the model would be yanked in a different direction along every single one of its parameters. Sure, it’s possible the model would still converge to a good solution, but it would take a long time, be very noisy, and likely get stuck in a poor local minimum.

In simple terms, an optimizer supervises the process of updating the model weights and weighs what the current data point is demanding against what prior data points demanded. How this happens differs from optimizer to optimizer, but by and large they are all trying to smooth out the noise of individual data points and divine a direction that makes all of your data points happy.
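
To make that concrete, here is a minimal sketch of the simplest version of the idea: an exponential moving average of gradients (i.e. plain momentum) written as a single PyTorch step. The tensor names and hyperparameter values are mine, purely for illustration.

```python
import torch

def momentum_step(param, grad, ema_grad, lr=1e-4, beta=0.9):
    """One update step that smooths per-batch noise with an EMA of gradients.

    This is essentially SGD with momentum: the EMA remembers what previous
    batches wanted, so the current batch nudges the direction rather than
    dictating it outright.
    """
    ema_grad.mul_(beta).add_(grad, alpha=1 - beta)  # blend old direction with new gradient
    param.add_(ema_grad, alpha=-lr)                 # step along the smoothed direction
    return param, ema_grad
```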

What is CAME?

CAME (Confidence-guided Adaptive Memory Efficient Optimization) is a clever optimizer that aims for Adam-like convergence while using far less memory, much like Adafactor. What makes it special is how it measures ‘confidence’ in each update: it looks at the deviation between the moving average of updates and the current update, taking smaller steps when confidence is low and larger steps when confidence is high. It saves memory by using matrix factorization to compress the second-moment estimates, dropping the requirement from O(nm) to O(n+m) per weight matrix, which is huge when you’re squeezing large models onto limited GPUs. Check out the original paper if you want the technical details.
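
To illustrate the mechanism, here is a rough sketch of a single CAME step for one 2-D weight matrix, based on my reading of the paper. The variable names are mine, and bias correction and update clipping are omitted, so treat it as a description of the idea rather than a reference implementation.

```python
import torch

def came_step_sketch(W, G, state, lr=2e-5,
                     beta1=0.9, beta2=0.999, beta3=0.9999,
                     eps1=1e-30, eps2=1e-16):
    """Sketch of one CAME update for weight matrix W (shape n x m) with gradient G.

    `state` holds zero-initialized tensors: "row" (n,), "col" (m,),
    "m" (n, m), "row_s" (n,), "col_s" (m,).
    """
    sq = G * G + eps1

    # Factorized second moment (Adafactor-style): store only row/column stats,
    # O(n + m) instead of O(n * m).
    state["row"].mul_(beta2).add_(sq.mean(dim=1), alpha=1 - beta2)
    state["col"].mul_(beta2).add_(sq.mean(dim=0), alpha=1 - beta2)
    V = torch.outer(state["row"], state["col"]) / state["row"].mean()

    U = G / V.sqrt()                                 # preconditioned update
    state["m"].mul_(beta1).add_(U, alpha=1 - beta1)  # EMA of updates

    # Confidence: how far the current update strays from its moving average,
    # again kept in factorized form.
    inst = (U - state["m"]) ** 2 + eps2
    state["row_s"].mul_(beta3).add_(inst.mean(dim=1), alpha=1 - beta3)
    state["col_s"].mul_(beta3).add_(inst.mean(dim=0), alpha=1 - beta3)
    S = torch.outer(state["row_s"], state["col_s"]) / state["row_s"].mean()

    # Small deviation (high confidence) -> small S -> larger effective step.
    W.add_(state["m"] / S.sqrt(), alpha=-lr)
    return W, state
```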

Where CAME Succeeded For Me

I’ve tried CAME for instruct finetuning as well as for lighter style LoRAs, where the memory savings were particularly valuable. For more typical use cases, like finetuning on the Dans PersonalityEngine datamix, CAME does provide AdamW-like convergence while using significantly less memory.

Where I Had Issues

Recently I’ve been working on a novel dataset that is proving to be a bit of a challenge for CAME. The dataset consists of sections of text where the first half is low quality and the second half transitions to high quality while still covering the same topic. The low quality portion has a loss mask applied to it so that the model only learns to produce the high quality text despite the preceding low quality context. The issue is that CAME struggles to converge on this dataset, while AdamW is able to learn from it, albeit slowly. More information on the dataset can be found in a prior post.
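
For context, the loss mask itself is nothing exotic. In a Hugging Face-style training loop it just means setting the label ids for the low quality tokens to -100 so that cross entropy ignores them; the boundary argument below is hypothetical, since how the split point is marked depends on the dataset.

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def mask_low_quality_prefix(input_ids: torch.Tensor, boundary: int) -> torch.Tensor:
    """Build labels that only train on the high quality second half.

    `boundary` is the token index where the low quality prefix ends
    (hypothetical; depends on how the dataset marks the transition).
    """
    labels = input_ids.clone()
    labels[:boundary] = IGNORE_INDEX  # the prefix stays visible as context but contributes no loss
    return labels
```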

Why CAME Might Be Struggling Here

First off, CAME has several parameters that can be adjusted, including β₁, β₂, and β₃, which control the exponential moving averages for the update, the factorized second moment, and the confidence estimate respectively. While I’ve left these at their defaults (as they worked for my other training runs), there’s likely room for improvement for this specific task.
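
For reference, this is roughly how the optimizer gets wired up with the came-pytorch reference implementation. The beta and eps values shown are the defaults I recall from its README, so double-check against the current release; the learning rate and weight decay are just placeholders.

```python
# Assumes the `came-pytorch` package (the authors' reference implementation);
# verify the exact signature and defaults against its README.
from came_pytorch import CAME

optimizer = CAME(
    model.parameters(),              # `model` is whatever you are finetuning
    lr=1e-5,                         # placeholder; tune for your run
    betas=(0.9, 0.999, 0.9999),      # β₁ update EMA, β₂ second moment, β₃ confidence EMA
    eps=(1e-30, 1e-16),
    weight_decay=0.01,
)
```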

Additionally, I did not adjust my other hyperparameters when switching from AdamW to CAME. I suspect that a batch size of 32, combined with the nature of the dataset, is working against the optimizer in ways it cannot handle effectively. CAME’s confidence mechanism relies on measuring the deviation between the exponential moving average of updates and the current update, and the dataset’s unusual structure might produce update patterns that cause CAME to estimate confidence poorly. Increasing the batch size might smooth out these patterns and give CAME a better chance to learn.
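
If you want to try a larger effective batch without more VRAM, gradient accumulation is the usual trick. A bare-bones sketch, assuming a Hugging Face-style model that returns a `.loss`; the loop structure and names are illustrative, not my actual training code:

```python
accum_steps = 4  # effective batch size = per-device batch size * accum_steps

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(**batch).loss / accum_steps  # scale so accumulated grads average out
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()       # the optimizer now sees one smoother, larger-batch gradient
        optimizer.zero_grad()
```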

The factorized second-moment estimates, while memory-efficient, may also struggle to capture the complex correlation structure between parameters when dealing with this unusual learning scenario.

So, It’s Trash Then, Right?

Not at all! CAME is a great optimizer for most tasks and I intend to keep using it when testing softball datasets. However, I think it is important to understand that it is not a drop-in replacement for AdamW. It has its own strengths and weaknesses and should be treated as such. I do recommend keeping it in your toolbox for when you need to squeeze out every last bit of memory from your GPU.