
From Chips To Clusters: Scaling AI Efficiently By Design


New foundational technologies that change the world often require massive infrastructure to achieve scale and global adoption. Historically, innovation follows this path: railroads, highways and global communications networks each enabled new modes of commerce and connection, but only after significant investments in infrastructure.

It's no different for artificial intelligence.

This Earth Day, as AI infrastructure rapidly expands, one message is clear: Energy efficiency will largely shape the long-term impacts of this transformational technology. In the years ahead, the bottleneck for AI is becoming less about compute and more about real-world constraints on power, cooling, water and grid capacity. Global data center electricity demand is projected to more than double by 2030, reaching about 945 terawatt-hours per year, roughly the current electricity consumption of Japan.

We are at a pivotal moment for technology companies, data center operators, policymakers and standard-setting bodies to accelerate adoption of open standards, modular designs and system-level efficiency. At AMD, we are committed to these principles and are laser-focused on maximizing compute performance per watt of energy, especially in the data center.

For more than a decade, we have set and achieved bold, public, time-bound goals that scale from chips to accelerated compute nodes to full server racks.[i] Our current goal is to deliver a 20x improvement in rack-scale energy efficiency for AI training and inference between 2024 and 2030.[ii] What does that mean in practice? Training an average AI model in 2025 may require several hundred server racks; by 2030, the same training run could require roughly one rack, using 95% less electricity and producing a fraction of the carbon emissions.[iii]

It's an ambitious goal. And Earth Day is a fitting moment to explain how we plan to get there.

Start with Efficiency at the Core

In digital infrastructure, inefficiency compounds. Energy wasted at the processor ripples outward through the server, cluster, data center and, ultimately, the grid – driving additional demand for cooling, power conversion, redundancy and transmission, magnifying inefficiency across the ecosystem.
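
As a rough illustration of how that compounding works, the sketch below traces a single watt saved at the processor out to the grid. The power-supply efficiency and PUE values are assumptions chosen for illustration, not AMD-published figures:

    # Illustrative sketch only: the efficiency values below are assumptions,
    # not AMD-published figures.
    CHIP_WATTS_SAVED = 1.0   # 1 W saved at the processor
    PSU_EFFICIENCY = 0.92    # assumed server power-supply efficiency
    PUE = 1.4                # assumed facility power usage effectiveness
                             # (total facility power / IT power)

    # Each watt drawn by the chip passes through the server's power supply,
    # and every watt of IT load pulls additional facility power for cooling
    # and distribution (captured by PUE).
    it_watts_saved = CHIP_WATTS_SAVED / PSU_EFFICIENCY
    facility_watts_saved = it_watts_saved * PUE

    print(f"1.00 W saved at the chip -> {facility_watts_saved:.2f} W saved at the grid")
    # ~1.52 W with these assumptions: chip-level gains compound outward.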

When chips and servers deliver more performance per watt, the benefits cascade across the entire system. At a time when demand for AI compute far exceeds supply, maximizing existing data center and grid infrastructure is imperative. This is also an area where AI itself can help.

At AMD, we applied AI-driven automation and analytics to our own internal IT infrastructure, reducing operational and maintenance costs by an average of 20% to 25%.[iv] By using AI to predict demand, optimize utilization in real time and automate issue resolution through intelligent workflows and chatbots, we shifted from reactive infrastructure management to a more adaptive, self-optimizing model. The effort demonstrates how AI can unlock efficiency gains both in products and across digital infrastructure itself.
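
As a hypothetical sketch of that shift, the Python below stands in a deliberately simple moving-average forecast for the AI-driven demand prediction described above; the class, names and margin are illustrative, not AMD's implementation:

    from collections import deque

    # Hypothetical sketch: a trivial moving-average forecast stands in for
    # the AI-driven demand prediction described above.
    class AdaptiveCapacityManager:
        def __init__(self, window: int = 24, headroom: float = 1.2):
            self.history = deque(maxlen=window)  # recent demand samples
            self.headroom = headroom             # margin above the forecast

        def observe(self, demand: float) -> None:
            self.history.append(demand)

        def forecast(self) -> float:
            # Predict next-period demand from the recent average.
            return sum(self.history) / len(self.history) if self.history else 0.0

        def target_capacity(self) -> float:
            # Provision just enough capacity (plus margin) instead of
            # keeping everything powered on reactively.
            return self.forecast() * self.headroom

    manager = AdaptiveCapacityManager()
    for demand in [40.0, 55.0, 62.0, 58.0]:  # e.g. utilization samples
        manager.observe(demand)
    print(f"Provision for ~{manager.target_capacity():.0f} units next period")
    # -> ~64 units with these samples, rather than always-on peak capacity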

Scale Through Modular and Open Design

Unlocking the next step-change in efficiency increasingly depends on deep industry collaboration and transparency across hardware, software and systems integration. Design choices at the chip and rack level directly affect cooling, power distribution, facility design and grid demand. When these elements are optimized in isolation or locked into proprietary systems, the ecosystem bears the cost.

That is why AMD is committed to an open ecosystem. Industry alignment around open standards and interoperable designs allows innovation to scale rapidly, deployments to accelerate and energy efficiency gains to compound.

We are proud to hold leadership roles in the Open Compute Project, where companies share design specifications to enable interoperable systems, modular building blocks and component-level upgrades that extend lifecycles. We also help lead organizations like The Green Grid to advance common definitions and system-level thinking on energy and water efficiency, supporting more consistent design choices, benchmarking and transparency across the data center ecosystem. In software, we support open-source development through the AMD ROCm™ platform. We believe continued innovation at the software and AI-model level will act as a powerful force multiplier, amplifying our 20x energy-efficiency goal by up to fivefold and together enabling a potential 100x increase in the energy efficiency of AI training by 2030.[v]

Manage Resources Across the Life Cycle

Modern AI server racks contain tens of thousands of components, weigh several thousand pounds and can embody tons of carbon emissions before they are ever powered on. Transporting, decommissioning and recycling introduce additional costs and emissions over the hardware life cycle. Servers that operate more efficiently and for longer deliver more useful compute within real-world economic and environmental constraints.

Viewed through a circular economy lens, responsible life cycle management prioritizes extracting the greatest practical value from every component at every stage of its life cycle. It starts with modular, open designs that enable interoperable, repairable and upgradable systems. It includes extending system lifetimes, adapting to changing workloads and minimizing waste. And when systems reach end-of-use, component recovery and high-value recycling can return materials to the supply chain while helping create space for new generations of more energy-efficient servers.

These practices can allow organizations to consolidate equipment and reduce energy use, or to increase compute performance without expanding their physical footprint. This can deliver total cost of ownership benefits, including lower electricity use and carbon emissions (Scope 2), while reducing or deferring upstream and downstream value chain emissions (Scope 3). Managing resources across the life cycle helps defer both financial and environmental costs.

Design for Sustainability

Taken together, efficiency at the compute layer, openness at the system level and responsibility across the full life cycle form a powerful flywheel. Performance-per-watt gains cascade across data centers, value chains and the grid.

As AI continues to scale, it is increasingly evident that innovation and sustainability are most effective when they are intentionally designed to move together.

Footnotes

[i] Statement based on AMD setting its 25x20 goal in 2014 and maintaining public energy efficiency goals through the current period in 2026.

[ii] AMD modeled advanced racks for AI training/inference in each year from 2024 to 2030 based on AMD roadmaps, also examining historical trends in rack design choices and technology improvements to align projected goals with historical trends. The 2024 rack is based on the MI300X node, which is comparable to the Nvidia H100 and reflects common practice in AI deployments in the 2024/2025 timeframe. The 2030 rack is based on AMD system and silicon design expectations for that timeframe. In each case, AMD specified components such as GPUs, CPUs, DRAM, storage, cooling and communications, tracking component and rack-level characteristics for power and performance. Calculations do not include power used for cooling air or water supply outside the racks but do include power for fans and pumps internal to the racks.

Performance metric weights by workload:

                FLOPS     HBM BW    Scale-up BW
    Training    70.0%     10.0%     20.0%
    Inference   45.0%     32.5%     22.5%

Performance and power use per rack together imply trends in performance per watt over time for training and inference; the training and inference progress indices are then weighted 50:50 to produce the final estimate of AMD's projected progress by 2030 (20x). The performance number assumes continued AI model progress in exploiting lower-precision math formats for both training and inference, which results in both an increase in effective FLOPS and a reduction in required bandwidth per FLOP.
We commissioned Dr. Jonathan Koomey to analyze historical industry data and projected AMD data on compute performance and power consumption. We then worked with Dr. Koomey to develop a goal methodology aligned with industry-accepted best practices for efficiency assessments. This methodology allows us to compare our goal to historical industry gains, track our progress against the goal over time, and estimate the environmental benefits of achieving the goal in real-world AI deployments.
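
To make the weighting concrete, here is a minimal sketch of the index calculation in Python, using the published weights above but placeholder (non-roadmap) performance and power figures:

    # Sketch of the goal's weighting scheme using the published weights.
    # The relative gains and power ratio below are placeholders, not AMD
    # roadmap data.
    WEIGHTS = {
        "training":  {"flops": 0.70, "hbm_bw": 0.10,  "scaleup_bw": 0.20},
        "inference": {"flops": 0.45, "hbm_bw": 0.325, "scaleup_bw": 0.225},
    }

    def blended_perf(metrics: dict, weights: dict) -> float:
        # Each metric is a gain relative to the 2024 baseline rack (1.0 = parity).
        return sum(weights[k] * metrics[k] for k in weights)

    def efficiency_index(train_2030: dict, infer_2030: dict, power_ratio: float) -> float:
        # Perf-per-watt gain per workload, then a 50:50 blend of the two.
        train_ppw = blended_perf(train_2030, WEIGHTS["training"]) / power_ratio
        infer_ppw = blended_perf(infer_2030, WEIGHTS["inference"]) / power_ratio
        return 0.5 * train_ppw + 0.5 * infer_ppw

    # Placeholder relative gains (2030 vs. 2024) purely to show the mechanics:
    train = {"flops": 60.0, "hbm_bw": 40.0, "scaleup_bw": 50.0}
    infer = {"flops": 60.0, "hbm_bw": 40.0, "scaleup_bw": 50.0}
    print(f"Projected rack-scale gain: {efficiency_index(train, infer, 2.5):.1f}x")
    # -> roughly 21x with these placeholder numbers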

[iii] AMD estimated the number of racks needed to train a typical notable AI model based on Epoch AI data. For this calculation we assume, based on these data, that a typical model takes 10^25 floating point operations to train (the median of 2025 data) and that training takes place over one month. FLOPS needed = total training FLOPs / (seconds per month) / model FLOPs utilization (MFU) = 10^25 / (2.6298 x 10^6) / 0.6. Racks = FLOPS needed / (FLOPS per rack in 2024 and 2030). The compute performance estimates from the AMD roadmap suggest that approximately 276 racks would be needed in 2025 to train a typical model over one month using the MI300X product (assuming 22.656 PFLOPS/rack at 60% MFU), while less than one fully utilized rack would be needed to train the same model in 2030 using a rack configuration based on an AMD roadmap projection. These calculations imply a more than 276-fold reduction in the number of racks needed to train the same model over this six-year period. Electricity use for an MI300X system to completely train a defined 2025 AI model using a 2024 rack is calculated at ~7 GWh, whereas the future 2030 AMD system could train the same model using ~350 MWh, a 95% reduction. AMD then applied carbon intensities per kWh from the International Energy Agency World Energy Outlook 2024. The IEA's stated policies case gives carbon intensities for 2023 and 2030; we determined the average annual change in intensity from 2023 to 2030 and applied it to the 2023 intensity to derive a 2024 intensity of 434 g CO2/kWh, versus a 2030 intensity of 312 g CO2/kWh. Emissions for the 2024 baseline scenario (7 GWh x 434 g CO2/kWh) equate to approximately 3,000 metric tons of CO2, versus around 100 metric tons of CO2 for the 2030 scenario (350 MWh x 312 g CO2/kWh).
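
For reference, the arithmetic in this footnote can be reproduced directly; the Python sketch below uses only the figures stated above:

    SECONDS_PER_MONTH = 2.6298e6   # as stated above
    MODEL_FLOPS = 1e25             # training compute for a typical 2025 model
    MFU = 0.6                      # model FLOPs utilization
    RACK_2024_FLOPS = 22.656e15    # MI300X rack: 22.656 PFLOPS in FLOPS

    # Sustained compute rate needed to finish training in one month.
    required_flops = MODEL_FLOPS / SECONDS_PER_MONTH / MFU
    racks_2024 = required_flops / RACK_2024_FLOPS
    print(f"2024-era racks needed: ~{racks_2024:.0f}")
    # -> ~280 from these rounded inputs (the footnote states ~276)

    # Energy and emissions, using the stated totals and IEA intensities.
    for year, kwh, g_per_kwh in [(2024, 7e6, 434), (2030, 350e3, 312)]:
        tonnes = kwh * g_per_kwh / 1e6   # grams -> metric tons
        print(f"{year}: {kwh/1e6:.2f} GWh -> ~{tonnes:,.0f} tCO2")
    # -> ~3,038 tCO2 in 2024 versus ~109 tCO2 in 2030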

[iv]

[v] Regression analysis of achieved accuracy per parameter across a selection of model benchmarks, such as MMLU, HellaSwag and ARC Challenge, shows that improving the efficiency of ML model architectures through novel algorithmic techniques, such as Mixture of Experts and State Space Models, can improve their efficiency by roughly 5x during the goal period. Similar numbers are quoted in Patterson, D., J. Gonzalez, U. Hölzle, Q. Le, C. Liang, L. M. Munguia, D. Rothchild, D. R. So, M. Texier, and J. Dean. 2022. "The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink." Computer, vol. 55, no. 7, pp. 18-28. Therefore, assuming innovation continues at the current pace, a 20x hardware and system design gain amplified by 5x software and algorithm advancements can lead to a 100x total gain by 2030.
