1. Problem statement
The explosive adoption of Deep Learning in recent years has led to skyrocketing energy consumption and carbon emissions from fleets of GPUs training Deep Neural Networks (DNNs). Answering the call for sustainability, existing solutions move training jobs to greener geographical locations and/or time frames to reduce their carbon footprint. However, migrating DNN training jobs to other locations is not always possible because of large dataset sizes or data regulations. Deferring training jobs to greener time frames may not be an option either, since DNNs often must be trained on the most up-to-date data and deployed to production quickly to achieve the highest accuracy. Our aim is therefore to design and implement a solution that respects these realistic constraints while still reducing the carbon footprint of DNN training.
2. What our solution does
GPUs, the de facto hardware for training DNNs, allow users to set their power limit at any time via software. For instance, lowering the power limit of a GPU from 300 Watts to 200 Watts caps the GPU's power draw at 200 Watts while making it run slightly slower. Therefore, when the current time frame has high carbon intensity, our solution automatically lowers the GPU power limit to use less electricity during that period. Conversely, when clean electricity is available, our solution raises the power limit and speeds up the GPU to make more progress during that period.
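As a rough illustration of setting a power limit from software, the sketch below uses `pynvml`, NVIDIA's Python bindings for NVML. The helper names `clamp_power_limit_mw` and `set_gpu_power_limit` are hypothetical and are not taken from Zeus. It assumes `pynvml` is installed and that the process has the privileges needed to change the limit, which is usually restricted to administrators.

```python
def clamp_power_limit_mw(requested_w, min_mw, max_mw):
    """Convert a requested limit in Watts to milliwatts, clamped to the
    range the GPU supports (NVML reports and accepts milliwatts)."""
    requested_mw = int(round(requested_w * 1000))
    return max(min_mw, min(max_mw, requested_mw))


def set_gpu_power_limit(gpu_index, requested_w):
    """Apply a power limit to one GPU and return the value applied (mW).

    Hypothetical helper for illustration only.
    """
    import pynvml  # imported lazily so the pure helper above works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        # Supported range for this device, in milliwatts.
        min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
        limit_mw = clamp_power_limit_mw(requested_w, min_mw, max_mw)
        pynvml.nvmlDeviceSetPowerManagementLimit(handle, limit_mw)
        return limit_mw
    finally:
        pynvml.nvmlShutdown()
```

Calling `set_gpu_power_limit(0, 200)` would request a 200 Watt limit on the first GPU. The clamp keeps the request within whatever range the device reports.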
3. How our solution uses the API
We periodically query the API for the forecasted carbon intensity of the current location and plug that number into a mathematical optimization problem to determine the optimal power limit. That way, our solution automatically adapts the GPU power limit to changes in carbon intensity. We also query the API for the actual carbon intensity of the past period in order to account for and report the total carbon footprint generated by the training job.
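The two uses of carbon intensity described above can be sketched as follows. This is one possible formulation under stated assumptions, not necessarily the exact optimization problem our solution solves. The names `PowerProfile`, `choose_power_limit`, and `carbon_footprint_g`, the reference intensity `ref_intensity`, and the knob `eta` are all hypothetical. The sketch assumes each candidate power limit has already been profiled for average power draw and training throughput, and that carbon intensity is given in gCO2 per kWh.

```python
from dataclasses import dataclass

JOULES_PER_KWH = 3.6e6


@dataclass(frozen=True)
class PowerProfile:
    power_limit_w: int   # candidate power limit (W)
    avg_power_w: float   # measured average draw under this limit (W)
    throughput: float    # measured training throughput (samples/s)


def choose_power_limit(profiles, intensity, eta, max_power_w, ref_intensity):
    """Pick the power limit with the lowest cost per training sample.

    Assumed cost: eta weights the carbon emitted per sample at the
    forecasted intensity; (1 - eta) weights the time per sample, scaled by
    max power and a fixed reference intensity so both terms share units.
    """
    def cost(p):
        carbon_term = intensity * p.avg_power_w
        time_term = ref_intensity * max_power_w
        return (eta * carbon_term + (1 - eta) * time_term) / p.throughput / JOULES_PER_KWH

    return min(profiles, key=cost).power_limit_w


def carbon_footprint_g(periods):
    """Total gCO2 over (energy_joules, actual_intensity_g_per_kwh) pairs."""
    return sum(energy_j / JOULES_PER_KWH * ci for energy_j, ci in periods)


# Made-up profile numbers, for illustration only.
profiles = [
    PowerProfile(300, 290.0, 100.0),
    PowerProfile(200, 195.0, 92.0),
    PowerProfile(150, 148.0, 75.0),
]
dirty = choose_power_limit(profiles, 800.0, 0.5, 300, 400.0)
clean = choose_power_limit(profiles, 50.0, 0.5, 300, 400.0)
```

With these made-up numbers, the high-intensity forecast selects the 200 Watt limit and the low-intensity forecast selects the 300 Watt limit, which matches the adaptive behavior described above. Setting `eta` closer to 1 favors carbon savings, and closer to 0 favors training speed.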
4. Impact of our solution in terms of CO2 reductions and user reach
To quantify the gain of our solution, we trained a DNN called ResNet50 on the ImageNet dataset with one NVIDIA A40 GPU. Compared with an ordinary DNN training job with identical settings ('No-Zeus'), our solution ('Carbon-Aware Zeus') reduces the total carbon footprint by 24% while increasing training time by only 3%. The gain comes from using less electricity overall and from dynamically adapting the power limit to chase clean electricity.
Our solution can be applied to any DNN training job with just a few lines of mechanical code changes, allowing users to easily opt into sustainable DNN training. Moreover, our solution incurs almost no increase in training time, allowing even time-critical DNN training jobs to start reducing carbon emissions immediately.
With a large per-job carbon reduction combined with general applicability to any DNN training job, the potential impact of our solution is substantial.
5. Making our solution production-ready
Our solution is an extension of the open-source project Zeus, whose aim is to reduce the energy consumption of DNN training. We plan to polish the carbon-awareness code right after the end of the Hackathon and upstream the feature to Zeus. The code quality of Zeus is already nearly production-grade.
Moreover, to ease the adoption of Zeus in industry, we are currently working on integrating Zeus with Kubeflow, the largest open-source MLOps platform. We expect to upstream this integration by early next year, at which point developers in industry can deploy Zeus directly onto their Kubeflow cluster, or use it as an example to quickly integrate Zeus into their internal MLOps platform.
As of November 2022, Zeus is the first and only work to offer transparent energy/carbon reduction for arbitrary DNNs and GPU types. With its theoretical guarantees backed by a research paper in a top conference (paper link), we believe Zeus is the best choice for anyone interested in sustainable DNN training.
6. Our vision for changing the world
We have one simple overarching philosophy for designing sustainable computing solutions: Any attempt to enhance the sustainability of a system must never neglect existing goals.
Deep Learning today powers numerous intelligent applications, and highly accurate DNNs trained in a timely manner translate directly into enormous business value. We therefore believe that solutions that neglect the existing performance goals of DNN training are not attractive, and their chance of actually being adopted is low. To that end, Carbon-Aware Zeus uses its internal mathematical optimization problem to optimize total carbon emissions and training time together, so that neither is completely sacrificed for the other. Moreover, the user can choose the trade-off between carbon and time using a single configuration knob.
The world changes when the majority adopts sustainable solutions, and we firmly believe that Zeus is an attractive solution for the majority.