Cooling AI Data Centers

How important are AI data centers? In just months, Elon Musk’s xAI team converted a factory outside Memphis into a cutting-edge, 100,000-GPU center for training the Colossus supercomputer—home to the Grok chatbot.

Initially powered by temporary gas turbines (later replaced by grid power), Colossus installed its first 100,000 chips in only 19 days, drawing praise from NVIDIA CEO Jensen Huang. Today, it operates 200,000 GPUs, with plans to reach 1 million GPUs by the end of 2025. [1]

Figure 1 – Elon Musk’s 1 Million Sq Ft xAI Colossus Supercomputer Facility near Memphis, TN. [1]

There are about 12,000 data centers throughout the world, nearly half of them in the United States. Now, more and more of these are being built or retrofitted for AI-specific workloads. Leaders include Musk’s xAI, Microsoft, Meta, Google, Amazon, OpenAI, and others.

Operations at this scale require enormous amounts of power, and, as with computational electronics of all sizes, the resulting heat must be managed.

GenAI

A key driver of data center growth is Generative AI (GenAI)—AI that creates text, images, audio, video, and code using deep learning. Chatbots built on large language models, such as ChatGPT, are examples of GenAI, along with text-to-image models that generate images from written descriptions.

Managing these workloads requires new generations of processors, mainly GPUs. These chips draw more power and generate more heat than their predecessors.

Figure 2 – Advanced AI Processor, the NVIDIA GH200 Grace Hopper Superchip with Integrated CPU to Increase Speed and Performance. [2,3]

AI data centers prioritize HPC hardware: GPUs, FPGAs, ASICs, and ultra-fast networking. Compared to CPUs (150–200 W), today’s AI GPUs often run at more than 1,000 W. To handle massive datasets and complex computations in real time, they need significant power and cooling infrastructure.
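To see why cooling must scale with these chips, a back-of-envelope rack comparison can be sketched. The per-chip wattages come from the figures above; the rack layout (eight servers per rack, eight chips per server) mirrors the Colossus configuration described later, but treat all of it as illustrative rather than vendor data:

```python
# Illustrative rack heat-load comparison: CPU-class vs. AI GPU-class chips.
# Wattages follow the ranges cited in the text; rack layout is an assumption.

CPU_WATTS = 200          # upper end of the 150-200 W CPU range
GPU_WATTS = 1000         # ">1,000 W" AI GPU class
SERVERS_PER_RACK = 8     # hypothetical rack layout
CHIPS_PER_SERVER = 8     # e.g., eight processors per 4U server

cpu_rack_kw = CPU_WATTS * CHIPS_PER_SERVER * SERVERS_PER_RACK / 1000
gpu_rack_kw = GPU_WATTS * CHIPS_PER_SERVER * SERVERS_PER_RACK / 1000

print(f"CPU-class rack: {cpu_rack_kw:.1f} kW of heat")
print(f"GPU-class rack: {gpu_rack_kw:.1f} kW of heat")
print(f"Cooling load ratio: {gpu_rack_kw / cpu_rack_kw:.1f}x")
```

Even with conservative numbers, the same rack footprint produces several times the heat, which is what drives the shift from air to liquid cooling.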

Data Center Cooling Basics

Traditional HVAC was sufficient for older CPU-driven data centers. Today’s AI GPUs demand far more cooling, both at the chip level and facility-wide. This has propelled a need for more efficient thermal management systems at both the micro (server board and chip) and macro (server rack and facility) levels. [4]

Figure 3 – The Colossus AI Supercomputer Now Runs 200,000 GPUs. It Operates at 150 MW Power, Equivalent to 80,000 Households. [5]

At Colossus, Supermicro 4U servers house NVIDIA Hopper GPUs cooled by:

  • Cold plates
  • Coolant distribution manifolds (1U between each server)
  • Coolant distribution units (CDUs) with redundant pumps at each rack base [6]

Each 4U server is equipped with eight NVIDIA H100 Tensor Core GPUs. Each rack contains eight 4U servers, totaling 64 GPUs per rack.

Between every server is a 1U manifold for liquid cooling. These manifolds connect to the CDUs, the heat-exchanging coolant distribution units at the base of each rack, which include redundant pumps. The choice of coolant is determined by a range of hardware and environmental factors.
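The sizing of such a loop follows from the steady-state energy balance Q = ṁ·cp·ΔT. A minimal sketch for one 64-GPU rack, assuming a water-based coolant, a hypothetical per-GPU heat load, and an allowed 10°C coolant temperature rise (the article does not give Colossus’s actual figures):

```python
# Required coolant flow for one rack, from Q = m_dot * cp * dT.
# All inputs are illustrative assumptions, not Colossus specifications.

GPUS_PER_RACK = 64        # eight 4U servers x eight GPUs, as described above
WATTS_PER_GPU = 700.0     # assumed heat captured by the cold plates, per GPU
CP_WATER = 4186.0         # specific heat of water, J/(kg*K)
DELTA_T = 10.0            # allowed coolant temperature rise, K

heat_w = GPUS_PER_RACK * WATTS_PER_GPU            # total rack heat, W
m_dot = heat_w / (CP_WATER * DELTA_T)             # required mass flow, kg/s
liters_per_min = m_dot * 60                       # ~1 kg of water per liter

print(f"Rack heat load: {heat_w / 1000:.1f} kW")
print(f"Required flow:  {m_dot:.2f} kg/s (~{liters_per_min:.0f} L/min)")
```

Roughly a kilogram of coolant per second per rack; this is the kind of number the CDU pumps and manifolds must sustain continuously, with redundancy.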

Figure 4 – Each Colossus Rack Contains Eight 4U Servers, Totaling 64 GPUs Per Rack. Between Each Server is a 1U Manifold for Liquid Cooling. [7]
Figure 5 – The Base of Each Rack Has a 4U CDU Pumping System with Redundant Liquid Cooling. [7]

Role of Cooling Fans

Fans remain essential for DIMMs, power supplies, controllers, and NICs.

Figure 6 – Rear Door Liquid-Cooled Heat Exchangers. [7]

At Colossus, fans in the servers pull cooler air in from the front of the rack and exhaust it at the rear of the server. From there, the air is drawn through rear-door heat exchangers, which pass the warm air across a liquid-cooled, finned radiator, lowering its temperature before it exits the rack.

Direct-to-Chip Cooling

NVIDIA’s DGX H100 and H200 server systems feature eight GPUs and two CPUs that must run between 5°C and 30°C. An AI data center with a high rack density houses thousands of these systems performing HPC tasks at maximum load. Direct liquid cooling solutions are required.

Figure 7 – An NVIDIA DGX H100/H200 System Featuring Eight GPUs [8]
Figure 8 – The NVIDIA H100 SmartPlate Connects to a Liquid Cooling System to Bring Microconvective Chip-Level Cooling That Outperforms Air Cooling by 82%. [9]

Direct liquid cooling (cold plates contacting the GPU die) is the most effective method—outperforming air cooling by 82%. It is preferred for high-density deployments of the H100 or GH200.

Scalable Cooling Modules

Colossus represents the world’s largest liquid-cooled AI cluster, using NVIDIA + Supermicro technology. For smaller AI data centers, Cooling Distribution Modules (CDMs) provide a compact, self-contained solution.

Figure 9 – The iCDM-X Cooling Distribution Module from ATS Includes Pumps, Heat Exchanger and Liquid Coolant for Managing Heat from AI GPUs and Other Components. [10]

Most AI data centers are smaller, with lower but still essential power and cooling needs. Many of their heat issues can be resolved with self-contained Cooling Distribution Modules.

The compact iCDM-X cooling distribution module provides up to 1.6 MW of cooling for a wide range of AI GPUs and other chips. The module measures and logs all important liquid cooling parameters, draws just 3 kW of power, and requires no external coolant.

These modules include:

  • Pumps
  • Heat exchangers
  • Cold plates
  • Digital monitoring (temp, pressure, flow)

Their sole external component is one or more cold plates removing heat from AI chips. ATS provides an industry-leading selection of custom and standard cold plates, including the high-performing ICEcrystal series.

Figure 10 – The ICEcrystal Cold Plates Series from ATS Provide 1.5 kW of Jet Impingement Liquid Cooling Directly onto AI Chip Hotspots.

Cooling Edge AI and Embedded Applications

AI isn’t just for big data centers—edge AI, robotics, and embedded systems (e.g., NVIDIA Jetson Orin, AMD Kria K26) use processors running under 100 W. These are effectively cooled with heat sinks and fan sinks from suppliers like Advanced Thermal Solutions. [11]

Figure 11 – High Performance Heat Sinks for NVIDIA and AMD AI Processors in Embedded and Edge Applications. [11]

NVIDIA also partners with Lenovo, whose 6th-gen Neptune cooling system enables full liquid cooling (fanless) across its ThinkSystem SC777 V4 servers—targeting enterprise deployments with NVIDIA Blackwell + GB200 GPUs. [12]

Figure 12 – Lenovo’s Neptune Direct Water Cooling Removes Heat from Power Supplies, for Completely Fanless Operation. [12]

Benefits gained from the Neptune system include:

  • Full system cooling (GPUs, CPUs, memory, I/O, storage, regulators)
  • Efficient for 10-trillion-parameter models
  • Improved performance, energy efficiency, and reliability

Conclusion

With surging demand, AI data centers are now a major construction focus. Historically, cooling problems are the #2 cause of data center downtime, behind power issues. Given the high power AI computing requires, these builds should be planned around their local communities’ electrical supplies and water resources. [13]

AI workloads are projected to increase U.S. data center power demand by 165% by 2030 (Goldman Sachs), reaching nearly double 2022 levels (IBM/Newmark). Sustainable design and resource-conscious cooling are essential for the next wave of AI infrastructure. [14,15]

References

1. The Guardian, https://www.theguardian.com/technology/2025/apr/24/elon-musk-xai-memphis

2. Fibermall, https://www.fibermall.com/blog/gh200-nvidia.htm

3. NVIDIA, https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-superchip?ncid=no-ncid

4. IDTechEx, https://www.idtechex.com/en/research-report/thermal-management-for-data-centers-2025-2035-technologies-markets-and-opportunities/1036

5. Data Center Frontier, https://www.datacenterfrontier.com/machine-learning/article/55244139/the-colossus-ai-supercomputer-elon-musks-drive-toward-data-center-ai-technology-domination

6. Supermicro, https://learn-more.supermicro.com/data-center-stories/how-supermicro-built-the-xai-colossus-supercomputer

7. Serve The Home, https://www.servethehome.com/inside-100000-nvidia-gpu-xai-colossus-cluster-supermicro-helped-build-for-elon-musk/2/

8. Naddod, https://www.naddod.com/blog/introduction-to-nvidia-dgx-h100-h200-system

9. Flex, https://flex.com/resources/flex-and-jetcool-partner-to-develop-liquid-cooling-ready-servers-for-ai-and-high-density-workloads

10. Advanced Thermal Solutions, https://www.qats.com/Products/Liquid-Cooling/iCDM

11. Advanced Thermal Solutions, https://www.qats.com/Heat-Sinks/Device-Specific-Freescale

12. Lenovo, https://www.lenovo.com/us/en/servers-storage/neptune/?orgRef=https%253A%252F%252Fwww.google.com%252F

13. Deloitte, https://www2.deloitte.com/us/en/insights/industry/technology/technology-media-and-telecom-predictions/2025/genai-power-consumption-creates-need-for-more-sustainable-data-centers.html

14. Goldman Sachs, https://www.goldmansachs.com/insights/articles/ai-to-drive-165-increase-in-data-center-power-demand-by-2030

15. Newmark, https://www.nmrk.com/insights/market-report/2023-u-s-data-center-market-overview-market-clusters

Cooling Hot Electronics with Cold Plates

Cold plates have been used for thermal management since the Apollo moon missions in the 1960s. Today, they serve a critical role in cooling high-performance electronics across many industries.

Power electronic devices generate significant heat, and if their chips exceed safe temperature limits, system reliability and longevity are compromised. Effective thermal management is essential, as lowering a chip’s junction temperature by just 10°C can double its operational life.
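The “10°C doubles life” rule of thumb above is an Arrhenius-style relationship. Taking the doubling-per-10°C rule as given, relative life at any junction temperature can be sketched (the reference temperature here is a hypothetical example, not a datasheet value):

```python
# Relative component life under the rule that each 10 deg C reduction
# in junction temperature roughly doubles operational life.

def relative_life(t_junction_c: float, t_ref_c: float = 100.0) -> float:
    """Life multiplier relative to operation at t_ref_c."""
    return 2.0 ** ((t_ref_c - t_junction_c) / 10.0)

print(relative_life(90.0))   # 10 C cooler -> 2x life
print(relative_life(80.0))   # 20 C cooler -> 4x life
print(relative_life(110.0))  # 10 C hotter -> half the life
```

The exponential form is why even a few degrees of extra margin from a cold plate pays off disproportionately in reliability.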

Cold plates offer highly efficient, localized cooling by transferring heat from hot components—such as power semiconductors—into a liquid coolant flowing through the plate. The heated liquid then moves to a remote heat exchanger, where it cools before recirculating back to the cold plate.

Compared to forced-air cooling, cold plates deliver superior thermal performance. They are typically smaller, quieter, and lighter than fan-based systems, making them an attractive solution in compact or noise-sensitive environments.

Most cold plates consist of thin-walled aluminum or copper blocks with internal channels or tubing for coolant flow. As liquid moves through the plate, it absorbs heat from the attached components and carries it away for external dissipation.

Figure 1 – Cold Plates are Part of a Liquid Cooling Loop that Includes a Pump for Fluid Circulation and a Heat Exchanger to Remove Heat from the Flowing Coolant. [ATS]

More advanced cold plates feature mini-channels in place of tubing. These designs can better match applications and more efficiently transfer heat into the coolant.

Figure 2 – A Cold Plate with Internal Mini-Channels Provides a High Rate of Thermal Transfer to Remove More Component Heat. [ATS]

Tubed Cold Plates Cool Hot Electronics

Embedded tube designs are the simplest cold plates. They feature a stainless steel or copper tube coiled and set into grooves inside a metal base plate. The tubes can be routed in different pathways to optimize thermal transfer performance. The flowing coolant moves heat from the component, away from the cold plate and over to a heat exchanger where it is cooled before being pumped back to the plate.

Figure 3 – A Tubed Cold Plate Can Consist of Copper or Stainless-Steel Tubing Pressed or Embedded in a Metal Plate. [ATS]

These tubed cold plates are cost-effective solutions for low- to moderate-power applications and are ideal for use in automotive, instrumentation, and UPS systems. ATS offers models ranging from 57–914 mm in length and 57–198 mm in width, with push-to-connect fittings for easy installation.

A variation of this design features thermally conductive epoxy completely covering the tubing and flush with the plate’s surface. This not only improves thermal contact but also provides environmental protection by sealing the tube within the plate.

Figure 4 – A Cold Plate’s Tubing Can be Buried and Covered with a Thermally Conductive Epoxy Layer. [1]

Custom Cold Plates Provide Best Cooling Solutions

For more demanding applications—such as cooling BGAs, LEDs, or high-power modules—custom cold plates offer the best performance. These can include embedded tubing or submerged internal fins, which increase surface area and create turbulence to enhance heat transfer.

Figure 5. Custom Liquid Cold Plate with Inlaid Copper Tubing Provides Heat Transfer Away from Hot Electronics [2]

One example uses tightly spaced aluminum pin fins to generate turbulence with minimal pressure drop, achieving high thermal performance while keeping the plate compact. Another design incorporates internal turbulators tailored for IGBT modules, further improving coolant flow and heat dissipation.

ATS designs and manufactures custom cold plates in collaboration with customers or based on in-house thermal analysis. These designs can include complex internal geometries such as microchannels or gyroid lattices, like the 3D-printed cold plate created for race car IGBT cooling—an approach that improves flow guidance while reducing weight.

Figure 6. Close-spaced Pin Fins with Complex Geometry Create Turbulence with Low Flow Rate Values Inside Submerged Fin Cold Plates. [3]
Figure 7. A Custom IGBT-Cooling Cold Plate Features Internal Turbulators to Optimize Coolant Turbulence and Heat Transfer. [4]

ATS coolant-based cold plates can include internal tubing and microchannels in closed-loop systems.

Figure 8. Dual-Sided Cold Plates Cool Components on Both of Their Mounting Sides. [ATS]

Dual-sided high-flow cold plates provide equal cooling performance for components mounted on both sides of the plate, increasing efficiency, space savings, and economy. These cold plates can be used with coolant flow rates up to 4 gal/min and provide thermal resistance as low as 0.0021°C/W. [5]
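A quoted thermal resistance translates directly into a temperature rise per watt of heat, via ΔT = Q × Rθ. A quick check, assuming a hypothetical 1 kW component on the 0.0021°C/W plate above:

```python
# Temperature rise across a cold plate from its thermal resistance.
R_THETA = 0.0021   # deg C per watt, the dual-sided cold plate figure above
Q_WATTS = 1000.0   # hypothetical component heat load

delta_t = Q_WATTS * R_THETA   # dT = Q * R_theta
print(f"Plate-to-coolant temperature rise: {delta_t:.1f} deg C")
```

A rise of only a couple of degrees at kilowatt-level loads is what makes these plates attractive for densely packed power electronics.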

Figure 9 – This 3D-Printed Metal Cold Plate Cools IGBTs on a Race Car. It Features a Gyroid Lattice That Guides Internal Coolant Flow While Reducing the Overall Weight [6]

DIY Cold Plates Optimize Component Cooling

ATS also offers DIY (do it yourself) cold plates with modular dimensions and pre-defined drill zones. Engineers can customize mounting locations to match specific components. Once the ideal configuration is determined, ATS can mass-produce the cold plate to match exact specifications.

Figure 10 – Do It Yourself Cold Plates from ATS Feature Drill Zones for Precision Matching to Heat Sources, and No Drill Zones to Protect Internal Coolant Flow Channels. [ATS]

The Complete Liquid Cooling Loop

Cold plates are just one part of a complete liquid cooling system. As electronics demand more efficient cooling, liquid-based systems are increasingly replacing air-based solutions. A functional loop includes a pump, reservoir, and heat exchanger to remove heat from the circulating fluid.

Figure 11 – A Liquid Cooling Loop Featuring Cold Plates. This is Implemented in Avionics on F-16 Fighting Falcon Jets. [7]

These systems are becoming more cost-effective and safer, making liquid cooling viable for a broader range of applications. Cold plates serve as a critical stage in these loops, offering simplicity, versatility, and high thermal performance.

Plug and Play Liquid Loops

For streamlined implementation, ATS offers the Industrial Cooling Distribution Module™ (iCDM™)—a fully integrated liquid cooling loop in one compact, portable unit. It includes the pump, heat exchanger, reservoir, precision controls, and monitoring displays, eliminating the need to purchase and configure components separately.

Figure 12 – The New Industrial Cooling Distribution Module Contains the Pump, Heat Exchanger, Reservoir and Controls for Managing Coolants in Liquid Cooling Loops. [ATS]

The iCDM connects directly to cold plates or chassis-based cooling systems. It supports models with cooling capacities of 10 kW and 20 kW, each holding up to 2 liters of coolant. The system is compatible with a wide range of wetted materials, allowing flexible deployment across industries. A next-generation iCDM, available soon, is fully automated, with cooling capacities up to 1.4 MW.

Conclusion

Cold plates provide essential electronics cooling thanks to their design versatility and the power of liquid cooling. Cold plates designed specifically for AI chips will soon be available for this growing and demanding arena.

ATS engineers are experts in matching thermal solutions to system needs, offering a wide portfolio that includes cold plates, vapor chambers, coolant chillers, and complete liquid loop systems. Whether liquid cooling is the best solution depends on the specific application, and ATS provides detailed analysis to help customers make informed decisions.

References

  1. Wakefield Thermal, http://www.wakefield-vette.com/products/liquid-cooling/liquid-cold-plates/standard-liquid-cold-plates.aspx
  2. Baknor, https://www.baknorthermal.com/liquid-cold-plates-various-channel-options/
  3. COOLTECH, http://www.cooltech.it/products/liquid-cold-plates/
  4. Boyd Corp., https://www.boydcorp.com/thermal/liquid-cooling-systems/liquid-cold-plates.html
  5. ATS, https://www.qats.com/Products/Liquid-Cooling/Dual-Sided-Cold-Plates
  6. nTop, https://www.ntop.com/resources/case-studies/cold-plate-automotive-power-electronics/
  7. ThermOmegaTech, https://www.tot-ad.com/avionics-cooling/

Cooling Quantum Computer Chips

Near Absolute Zero Temperatures are Essential to Their Proper Function


Quantum computing can solve problems beyond the most powerful classical computers. But it faces many hard engineering challenges as it evolves. Among the toughest is providing an extremely cold environment. [1]

Qubits or Bust

Qubits, or quantum bits, differ significantly from classical computing bits. A qubit can be both 0 and 1 simultaneously, a property known as superposition. Another fundamental trait is entanglement, where the state of one qubit is intrinsically linked to another, regardless of distance.

These properties enable exponential scalability, allowing quantum computers to solve complex problems much faster than classical computers. Years of processing time by conventional computers are replaced by just minutes using quantum computing. [2]

Working with qubits is challenging. Implemented with superconducting circuits, photons, and other approaches, they are extremely susceptible to noise, which here includes background radiation, phone and Wi-Fi signals, and the slightest bits of heat energy. Tiny disturbances can lead to calculation errors.

One way of preventing these effects is to keep qubits near absolute zero temperatures where thermal vibrations are minimized. Here, they can be encoded in two distinct states: grounded and excited. Each qubit can exist in one or the other of these two states, or in a quantum superposition of the two. Putting qubits into superposition states allows a quantum processor to simultaneously examine many potential solutions to a problem, a dramatic improvement in computational power over a classical computer.
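Superposition can be made concrete with a two-amplitude state vector. A minimal pure-Python sketch (not a full simulator): a qubit is a pair of complex amplitudes whose squared magnitudes give the measurement probabilities for the ground and excited states.

```python
# A single qubit as two complex amplitudes: |psi> = a|0> + b|1>.
# Measurement probabilities are |a|^2 and |b|^2 and must sum to 1.
import math

a = complex(1 / math.sqrt(2), 0)   # equal superposition of the two states
b = complex(1 / math.sqrt(2), 0)

p0 = abs(a) ** 2   # probability of measuring the ground state
p1 = abs(b) ** 2   # probability of measuring the excited state

print(f"P(0) = {p0:.2f}, P(1) = {p1:.2f}, total = {p0 + p1:.2f}")
```

Thermal noise that disturbs these delicate amplitudes corrupts the computation, which is why the qubits must be isolated at millikelvin temperatures.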

Figure 1. Google’s 53-Qubit Sycamore Processor can Perform Calculations that Would Take Classical Computers Thousands of Years. [3]
Figure 2. Microsoft’s Majorana Quantum Chip is Designed to Provide More Qubit Stability. It Will Allow the Creation of Quantum Systems that Scale to a Million Qubits on a Single Chip. [4]

How powerful is quantum computing? Google has stated that its state-of-the-art processors can solve a computing problem in under five minutes that would take today’s fastest supercomputers about 10 septillion (10²⁵) years, or “more time than the history of the universe.” [5]

Cooling the Qubits

Liquid and air systems cool conventional electronics, but cooling quantum chips brings thermal engineering into the very cold world of cryogenics. While quantum chip power consumption is very low, they need to be kept at extremely low temperatures. Even tiny temperature increases can make a system unworkable.

Much engineering goes into providing the cryogenic-level low temperatures that support qubits. The chandelier-like structures associated with quantum computers are part of these cooling systems, with the chips installed at the bottom. Nearby are the tanks, electronics and many connections that feed and power these intricate cooling systems.

Figure 3. Today’s Quantum Computers Feature a Maze of Metal Tubes, Wires, and Gleaming Components. [6]

The Very Cold World of Cryogenics

Kelvins are the standard for coldness measurements in cryogenics. Zero Kelvin is equal to absolute zero (-273.15°C or -459.67°F). Theoretically this is the lowest temperature possible. At absolute zero, atoms have no kinetic energy and are at rest. There is no motion or heat. By comparison, the average temperature of outer space is a balmy 2.7 K (-270.45°C or -455°F).
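These scales can be cross-checked with the standard conversions K = °C + 273.15 and °F = °C × 9/5 + 32:

```python
# Kelvin / Celsius / Fahrenheit conversions for the temperatures in the text.

def k_to_c(k: float) -> float:
    """Kelvin to degrees Celsius."""
    return k - 273.15

def c_to_f(c: float) -> float:
    """Degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

for kelvin in (0.0, 2.7):  # absolute zero; average deep-space temperature
    c = k_to_c(kelvin)
    print(f"{kelvin} K = {c:.2f} C = {c_to_f(c):.2f} F")
```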

Dilution refrigerator systems, the most common cryogenic technology for quantum computers, can bring qubits to about 50 millikelvins above absolute zero. A millikelvin (mK), one thousandth of a kelvin, is the unit of choice in cryogenics and quantum physics.

Note that cryogenics deals with temperatures from near absolute zero up to −150°C (123.15 K). The warmer end of this range, from −150°C to −190°C, is where naturally occurring biological processes are halted; many biologicals, e.g., cell and gene therapies, are kept there for long-term storage.

Dilution Refrigerator Systems 

Google, IBM, Amazon, and others have been building quantum computers using large, complex, expensive systems known as dilution refrigerators with multiple stages of cooling to chill circuits to 1 kelvin or below. The complexity of these refrigerators is greatest at the coldest stage, which involves mixing different isotopes of liquid helium.

Figure 4. A Dilution Refrigerator System for Cooling Quantum Chips. The Refrigerator Features Several Stages, Getting Colder as It Moves Down. This Intel Unit Uses a Mixture of Helium Isotopes as Its Refrigerant. It is Housed in a Shielding Metal Container. [7]

Dilution refrigeration leverages properties from mixing helium-3 and helium-4 isotopes. Basically, helium-3 is continuously diluted into the helium-4 in the cooling system, causing the system to extract heat from the surroundings, down to near absolute zero temperatures.
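A commonly quoted textbook approximation for the available cooling power of an ideal dilution refrigerator is Q̇ ≈ 84·ṅ₃·T², where ṅ₃ is the helium-3 circulation rate in mol/s and T is the mixing-chamber temperature in kelvin. Treat the constant and the inputs below as textbook-level assumptions, not any specific machine’s specification:

```python
# Approximate dilution-refrigerator cooling power: Q ~ 84 * n3 * T^2
# (Q in watts, n3 = He-3 circulation rate in mol/s, T in kelvin).
# The 84 W*s/(mol*K^2) constant is the standard ideal-case textbook value.

def cooling_power_w(n3_mol_per_s: float, t_kelvin: float) -> float:
    return 84.0 * n3_mol_per_s * t_kelvin ** 2

n3 = 300e-6                       # assumed circulation rate, mol/s
for t in (0.05, 0.10):            # 50 mK and 100 mK mixing-chamber temps
    q_uw = cooling_power_w(n3, t) * 1e6
    print(f"T = {t * 1000:.0f} mK -> ~{q_uw:.0f} microwatts of cooling")
```

The T² dependence shows why the coldest stage is the hardest: available cooling power collapses as the mixing chamber approaches the qubits’ operating temperature.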

This final stage is preceded by precooling and evaporative-cooling stages in the hanging, chandelier-shaped structure. Each stage downward reduces the temperature until the near-zero temperature is reached at the bottom stage, which connects to the chip. [8]

Figure 5. Temperatures Are Lowered at Each Downward Stage on a Quantum Computer Cryostat Structure. [9]

In Figure 5, A is the highest and warmest stage of the cooling system. This initial pre-cooling may use standard refrigeration techniques. At B, evaporative cooling via liquid helium may be used to cool the environment further. At this stage, the system is cooled to just a few degrees above absolute zero. At C actual dilution refrigeration occurs. Helium-3 and -4 are mixed and circulated to draw away heat from the system. Finally, at D, qubits in the quantum chip are maintained at temperatures near absolute zero. [10]

Figure 6. The Cabling Schematic of a Bluefors XLD Dilution Refrigerator. [11]

The numerous wires and connectors in quantum computers are used for controls and readouts, power, and cryogenics.

Developing Technology

As more resources are provided for quantum computing, all aspects of the technology will improve. Useful quantum applications, including drug design, cyber security, and battery development will further cultivate and commercialize this exciting computing field.

Improved processes are already easing the cryogenic temperature requirements, though only very slightly at this point. Alternative cooling approaches now include magnets, high-pressure waves, and tuned laser light.

Other areas of quantum computing are seeing improvements and innovations. These include advanced error correction techniques and expanded cloud-based quantum computing platforms that will bring quantum computing to more users.

References

  1. https://www.ibm.com/think/topics/quantum-computing
  2. https://www.bluequbit.io/quantum-volume#:~:text=The%20property%20of%20superposition%20grants,problem%2Dsolving%20and%20technological%20innovation.
  3. https://en.wikipedia.org/wiki/Sycamore_processor#:~:text=Sycamore%20is%20a%20transmon%20superconducting,ordered%20state%2C%20with%2031%20qubits
  4. https://azure.microsoft.com/en-us/solutions/quantum-computing/
  5. https://cybernews.com/tech/googles-quantum-chip-willow-achieves-once-elusive-benchmark-researchers-say/
  6. https://exoswan.com/quantum-computer-visual-guide
  7. https://www.datacenterdynamics.com/en/analysis/cooling-quantum-computers/
  8. https://blog.google/technology/research/behind-the-scenes-google-quantum-ai-lab/?utm_source=tw&utm_medium=social&utm_campaign=og&utm_content=&utm_term=
  9. https://www.sharetechnote.com/html/QC/QuantumComputing_HW_Structure.html#Reference
  10. https://idstch.com/technology/quantum/the-cool-path-to-quantum-computing-dilution-refrigeration-technology-and-superconducting-qubits/#:~:text=Dilution%20refrigerators%20are%20the%20unsung,stability%20and%20functionality%20of%20qubits.
  11. https://epjquantumtechnology.springeropen.com/articles/10.1140/epjqt/s40507-019-0072-0/figures/3

IoT Electronics and Thermal Management

The IoT – Internet of Things – includes all devices connected to the Internet, a fast-growing world expanding well past PCs and smartphones. IoT devices are everyday and special-purpose items like appliances, sensors, and motors.

Per one authority, an IoT product combines hardware and software, measures real-world signals, connects to the Internet, transfers data to a centralized location, and provides value to a customer. [1] There are already tens of billions of such connected devices.

Consumer Uses

Among the most common IoT devices are smart speakers, headphones, appliances, and leak detectors. At the enterprise level are smart lighting and security systems in factories, office buildings and public places.

Figure 1. Common IoT Devices. Smart Doorbell Camera Connects Via Mobile App; Wearable Multi-functional Smart Watch; and Smart Water Usage/Leak Detecting Monitor Sends Alerts of Unusual Water Activity. (Kangaroo, StreamLabs, GoodWorkLabs) [2,3,4]

Automotive Use

A secure, strong IoT connection is essential for today’s auto infotainment systems. It enables everything from music platforms to navigation aids to car maintenance and diagnostics. Connected cars carry ever more software-reliant components in their cabins, under their hoods, and just about everywhere else. With IoT connectivity, these components can be updated with OTA (over-the-air) software fixes without a visit to a garage or dealership.

Figure 2. IoT-connected Infotainment System in a Toyota Crown. (PlanetDave) [5]

Industrial Use

The IIoT – Industrial Internet of Things – consists mainly of sensors uploading localized data to monitor and control manufacturing processes. It is active in manufacturing, transportation, and other areas, covering everything from facility energy usage to equipment performance. IIoT devices collect, analyze, and share data with other devices and with everyone who needs to know.

Figure 3. Left: Wireless Weight Sensor for Semi-Trailers Communicates with Smart Phones. Center: Sound Monitor Wirelessly Alerts Excessive Noise Levels in Work and Public Areas. Right: Digital Controller for Environmental Chamber Allows Smartphone Access. (Intellia, Iotsens, Weisstechnik) [6,7,8]

IoT Connections

Devices like those described so far are one end of the IoT technology stack. They are the public-facing “things,” but just one layer of the stack. The data they send or receive travels up and down layers of connection points and software programs. At the stack’s other end are the device master applications and data storage residing in the data-center-hosted cloud. Because IoT things differ so widely by function, many different software platforms are in use. The largest telecom and data companies are all active IoT developers in this continuously evolving arena.

Figure 4. The IoT Technology Stack. At Its Base are Billions of Connected Things Whose Data Travels Along Multiple Connections and is Ultimately Managed by Cloud-based Applications. (IoT Business) [9]

Thermal Management Issues in the IoT

Connected consumer IoT devices, or things, are typically very low power, so no added thermal management is needed. Power levels are more likely to increase with devices serving the Industrial IoT, though passive cooling, e.g., heat sinks, remedies most heat issues.
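Whether a passive heat sink suffices comes down to the series thermal-resistance budget T_junction = T_ambient + P × (Rθjc + Rθcs + Rθsa). A sketch with hypothetical values for a low-power IIoT processor (all resistances and limits below are illustrative assumptions, not datasheet numbers):

```python
# Passive-cooling feasibility check for a hypothetical IIoT processor.
# All resistances and limits below are illustrative assumptions.

P_WATTS = 10.0        # processor heat dissipation
T_AMBIENT = 40.0      # worst-case ambient inside the enclosure, deg C
R_JC = 0.5            # junction-to-case resistance, deg C/W
R_CS = 0.2            # case-to-sink interface (thermal pad), deg C/W
R_SA = 4.0            # sink-to-ambient (natural convection), deg C/W
T_J_MAX = 105.0       # assumed datasheet junction limit, deg C

t_junction = T_AMBIENT + P_WATTS * (R_JC + R_CS + R_SA)
print(f"Predicted junction temperature: {t_junction:.1f} deg C")
print("Passive cooling OK" if t_junction <= T_J_MAX else "Needs a fan or a larger sink")
```

When the predicted junction temperature clears the limit with margin, a simple heat sink is enough; when it does not, a fan sink or liquid solution is the next step.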

Figure 5. Inside an IIoT Air-Cooled Gateway that Provides Long-Range Communications for Monitoring Urban, Farming, and Transportation Systems. (RAKwireless) [10]
Figure 6. A Cold Plate Attached to a Datacenter PCB Directs Jets of Liquid onto Hot Spots on Rack-Stored Processors. (JetCool) [11]

To effectively manage generated heat, cloud-hosting data centers use air-cooling, direct-liquid, and immersion (submerged server) cooling systems. As transistor densities increase on smaller chip packages, the centers must enhance their thermal management capabilities while managing power consumption and operating costs.

Figure 7. Air Cooled Chillers on a Datacenter Rooftop Treat the Water Circulating Through the Center to Cool Its Server Electronics. (Engineered Systems) [12]
Figure 8. Microsoft’s Project Natick Immersed a Datacenter off the Scottish Seashore as a Lower Cost Method for Cooling Its Servers (Microsoft) [13]

IoT – What Comes Next?

These days the IoT is focused on two more letters: AI. The Artificial Intelligence of Things (AIoT) takes the I (Internet) for granted and enables smart devices and systems to analyze data, make decisions, and act on that data without human intervention.

Figure 9. AIoT, Artificial Intelligence of Things, Enables Non-human Data Analytics for Deeper Insights and Higher Performance. (Tealcom) [14]

Most AIoT applications are currently retail-oriented, focused on implementing cognitive computing in consumer products. For instance, computer vision systems can leverage facial recognition to identify customers and compile demographic and preference data about them. Tesla’s Autopilot system uses radar, sonar, GPS, and cameras to glean data about driving conditions; an AI system then makes decisions from the data these IoT devices collect to optimize the car’s piloting. [15]

References

  1. DanielElizalde, https://danielelizalde.com/what-is-the-internet-of-things/
  2. Kangaroo, https://heykangaroo.com/products/doorbell-camera-chime
  3. GoodWorkLabs, https://www.goodworklabs.com/apple-watch-app-is-the-smart-watch-revolution-finally-making-sense/
  4. StreamLabs, https://streamlabswater.com/products/streamlabs-water-monitor
  5. PlanetDave, https://planetdave.com/2023/11/a-delightful-drive-the-2023-toyota-crown-platinum/
  6. Biz4intellia, https://www.biz4intellia.com/iot-sensors/
  7. IoTsens, https://www.iotsens.com/en/product/sound-monitor/
  8. Weisstechnik, https://weiss-na.com/product/webseason-the-controller-designed-by-and-for-end-users/
  9. https://iotbusinessnews.com/2022/07/13/86750-what-is-the-iot-technology-stack/
  10. RAKwireless, https://news.rakwireless.com/wisgate-connect-why-did-we-did-it/
  11. JetCool, https://www.datacenterknowledge.com/power-and-cooling/liquid-cooling-adoption-data-centers-becoming-zero-sum-game
  12. Engineered Systems, https://www.esmagazine.com/articles/100400-air-cooled-chillers-are-back-in-data-centers-and-they-mean-business
  13. Microsoft, https://news.microsoft.com/source/features/sustainability/project-natick-underwater-datacenter/
  14. Tealcom, https://tealcom.io/post/the-intersection-of-ai-iot-and-connectivity/
  15. IndustryWired, https://industrywired.com/how-artificial-intelligence-can-help-manage-flood-of-iot-data/

Thermal Management for AI Chips

The use of artificial intelligence (AI) programs is growing very quickly, despite some concerns and precautions. It’s being spurred by powerful new hardware from companies like Nvidia, and by new lower-cost, open source large language model (LLM) software like those from DeepSeek.

Likewise, AI chip sales are soaring, and more powerful and specialized AI chips are being steadily introduced. At this writing, Nvidia, the leading AI chip provider, is the third-most-valuable company in the world, valued at over $2.2 trillion. Nvidia is both developing new AI chips and acquiring smaller AI companies that design processors and develop AI applications. [1]

Figure 1. The Nvidia A100 AI Chip and DeepSeek’s Free to Low-Cost AI Model are Major Reasons for AI’s Fast-Growing Deployment and Use in Nearly All Industries. (Nvidia, DeepSeek)

Other chip companies are also thriving. Micron Technology is reporting record sales, much of them from supplying memory chips to Nvidia. AMD provides chips that rival Nvidia’s flagship AI machine learning chip. And Intel is getting $8.5 billion from a US federal program (CHIPS) to support its goal to build the largest AI chip manufacturing site in the world. [2,3]

Innovative Chips Bring High Heat

AI chip technology, which evolved from the graphics processing units (GPUs) developed for the data demands of video games, may be the best understood part of the AI world. But the evolution is remarkable. The Nvidia A100 AI chip features 54 billion transistors. By comparison, an AMD Ryzen 7 1700 gaming processor for a contemporary PC has 4.8 billion transistors. [4]

By leveraging parallel processing, AI chips handle large datasets efficiently, executing many tasks simultaneously. They work alongside specialized ASICs, FPGAs, TPUs, and VPUs to perform machine learning and neural network processing. These AI networks solve complex algorithms and teach computers to process data in a way inspired by the human brain.

For example, AI can use inference – reasoning and decision-making based on available information – to apply real-world knowledge to facial recognition, gesture identification, natural language interpretation, image searching and much more. [5]

Figure 2. Intel FPGAs Support Real-Time Deep Learning Inference for Embedded Systems and Data Centers. (Intel/Mirabilis) [6,7]

AI chips demand high power to support increased processing demand, and the resulting waste heat can degrade performance or trigger system failure. AI system designers depend on thermal management solutions to control processor temperatures. Cooling resources at both the chip level and the facility (data center) level are needed to keep AI chips operating at proper temperatures.

Liquid Cooling AI Chips

The heavy lifting in AI processing is done in data centers, which are the focus of most technical developments. Their high concentration of high-power chips presents formidable heat management challenges, especially as GPU thermal design power (TDP) has risen over the past two decades from 150 watts to more than 700 watts.

Now consider the recently unveiled 1,200-watt Nvidia Blackwell B200 tensor core chip – the company’s most powerful single-chip GPU, with 208 billion transistors – which Nvidia says can reduce AI inference operating costs (such as running ChatGPT) and energy consumption. Two B200 chips are combined with an Nvidia Grace CPU to form the newly released, even higher-performing GB200. Its total projected power draw: up to 2,700 watts.

Figure 3. The Nvidia GB200 NVL72 Data Center Computer System Combines 36 GB200s (72 B200 GPUs and 36 Grace CPUs total). (Nvidia) [6]

The GB200 chip is a key part of Nvidia’s new GB200 NVL72, a liquid-cooled data center computer system designed specifically for AI training and inference tasks. Amazon Web Services, Dell Technologies, Google, Meta, Microsoft, OpenAI, Oracle, Tesla, and xAI, are expected to adopt the Blackwell platform. [8]
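
The rack-level heat load implied by these figures can be sketched with simple arithmetic. The per-unit wattage below is the article’s projected maximum; switch trays, fans, and power-conversion losses are ignored, so treat the result as a rough floor, not a vendor specification.

```python
# Rough compute heat load of one NVL72-class rack.
# Assumption: up to 2,700 W per GB200 superchip (figure cited above);
# networking and power-conversion overhead are not included.

GB200_PER_RACK = 36       # 72 B200 GPUs + 36 Grace CPUs in total
WATTS_PER_GB200 = 2_700   # projected maximum draw per GB200

def rack_heat_load_kw(units: int = GB200_PER_RACK,
                      watts_each: float = WATTS_PER_GB200) -> float:
    """Return the compute heat load of one rack in kilowatts."""
    return units * watts_each / 1_000

print(rack_heat_load_kw())  # ~97 kW of heat to remove from a single rack
```

Nearly 100 kW in one rack is far beyond what conventional room-level air conditioning was designed for, which is why liquid cooling is built into the system.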

The ever-increasing number of transistors on data center PCBs translates to higher performance, but also more heat than ever before. Liquid cooling systems, like the one in Nvidia’s data center system, can significantly reduce energy consumption, lowering operating expenses in the long run. Liquid cooling also produces less noise and, for direct on-chip cooling, takes up less space.

Direct-to-chip (or direct-to-node) cooling involves circulating a coolant directly over heat-generating components, including AI chips. This method significantly increases cooling efficiency by removing heat directly at the source. These systems can use a variety of coolants – water, dielectric fluids, or refrigerants – depending on the application’s needs and the desired cooling capacity. [8]

Figure 4. Direct-to-Chip Liquid Cooling. Note the Small Heat Sinks to Air-Cool Other Components. (AnD Cable Products) [9]
Figure 5. Data Center Racks Cooled by Lenovo’s Direct to Node Liquid Cooling System. (ServeTheHome) [10]

Immersion cooling takes liquid cooling a step further by submerging the entire server, or parts of it, in a non-conductive liquid. This technique can be highly efficient as it ensures even and thorough heat absorption from all components. Immersion cooling is particularly beneficial for high-performance computing (HPC) and can dramatically reduce the space and energy required for cooling.

Figure 6. Immersion Cooling Submerges All or Parts of a Server in a Non-Conductive Liquid Coolant. (GIGABYTE) [11]

Air Cooling AI Chips

Nvidia’s Jetson chips bring accelerated AI performance to IoT and edge applications in a power-efficient, compact form factor (smaller than 100 mm x 100 mm). Though they consume less power (up to 75 watts) than data center AI chips, they still require thermal management. Jetson components are typically cooled with heat sinks, which can be configured as active (with an attached fan) or passive (fanless).
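
Choosing between a passive and an active sink comes down to the series thermal-resistance model T_junction = T_ambient + P·(θ_jc + θ_cs + θ_sa). The resistance values below are hypothetical, picked only to illustrate the comparison; real values come from the module and heat sink datasheets.

```python
# First-order heat-sink check using the series thermal-resistance model:
#   T_junction = T_ambient + P * (theta_jc + theta_cs + theta_sa)
# All resistance values (C/W) below are hypothetical examples.

def junction_temp_c(power_w: float, t_ambient_c: float,
                    theta_jc: float, theta_cs: float,
                    theta_sa: float) -> float:
    """Junction temperature (deg C) for a chip dissipating power_w watts."""
    return t_ambient_c + power_w * (theta_jc + theta_cs + theta_sa)

# A 30 W edge module at 45 C ambient, theta_jc = 0.5, theta_cs = 0.2:
passive = junction_temp_c(30, 45, 0.5, 0.2, 1.5)  # about 111 C - too hot
active = junction_temp_c(30, 45, 0.5, 0.2, 0.5)   # about 81 C - acceptable
```

With these assumed numbers the passive sink overshoots a typical 100 C junction limit, while the fan-assisted sink’s lower sink-to-air resistance keeps the same module comfortably cooler – the trade being moving parts, noise, and fan power.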

Figure 7. Passive (Fanless) and Active (Fan-assisted) Heat Sinks Designed to Cool Nvidia Jetson AI Chips in Embedded and Edge Devices.
(Advanced Thermal Solutions, Inc.) [12]
Figure 8. The Edge Ultrastar Transportable Edge Server from Western Digital Features a 70 Watt Nvidia Tesla GPU Cooled by Four Internal 60mm Fans. (Western Digital) [13]

Conclusion

Artificial intelligence is hot in the marketplace, and so are AI chips. Complex as these devices are, simply surpassing a heat threshold can impair their proper function. Their thermal management is essential.

Figure 9. AI Deployment in Industry will Generate Strong Demand for Smart Devices.
One Example is Sensors of All Types. (Fierce Electronics) [14]

The world will experience the impacts of artificial intelligence. Like the Internet and mobile technology, it will become pervasive, reaching far beyond deep fakes and term papers to drive the development of more capable tools for industry, medicine, and more, and for managing our daily lives. With care, this revolution should be benign and greatly improve our lives and our world.
 
References
1. Motley Fool, https://www.fool.com/investing/2024/03/21/nvidia-just-bought-5-ai-stocks-2-stand-out-most/
2. Yahoo Finance, https://finance.yahoo.com/news/amd-dethrone-nvidia-artificial-intelligence-112400772.html
3. Quartz, https://qz.com/intel-ai-chip-factory-world-chips-act-funds-1851358125
4. Nvidia, https://www.nvidia.com/en-us/data-center/a100/
5. OurCrowd, https://www.ourcrowd.com/learn/what-is-an-ai-chip
6. Intel, https://www.intel.com/content/www/us/en/software/programmable/fpga-ai-suite/overview.html
7. Mirabilis Design, https://www.mirabilisdesign.com/intel-fpga-neural-processor-ai/
8. Ars Technica, https://arstechnica.com/information-technology/2024/03/nvidia-unveils-blackwell-b200-the-worlds-most-powerful-chip-designed-for-ai
9. AnD Cable Products, https://andcable.com/data-center-trends/data-center-liquid-cooling/
10. ServeTheHome, https://www.servethehome.com/lenovo-sd650-v2-and-sd650-n-v2-liquid-cooling-intel-xeon-nvidia-a100-neptune/
11. GIGABYTE, https://www.gigabyte.com/Solutions/gigabyte-single-phase
12. Advanced Thermal Solutions, Inc., https://www.qats.com/eShop.aspx?q=Device%20Specific%20-%20NVIDIA
13. Western Digital, https://www.westerndigital.com/en-ap/products/data-center-platforms/ultrastar-transporter?sku=1ES2562
14. Fierce Electronics, https://www.fierceelectronics.com/components/sensors-artificial-intelligence-and-concepts-you-may-want-to-know-i