TensorWave's AMD Adoption

Published 7 months ago

A new class of specialist cloud operators, skilled at running power-hungry GPUs and other AI infrastructure, is emerging. Some, such as CoreWeave, Lambda, and Voltage Park, have built their fleets on tens of thousands of Nvidia GPUs. Others are now choosing AMD over Nvidia, with datacenter startup TensorWave leading the way.

TensorWave’s AMD Adoption

TensorWave began equipping its systems with AMD’s Instinct MI300X earlier this month. The startup plans to lease these chips at a fraction of the cost charged for accessing Nvidia accelerators. Jeff Tatarchuk, TensorWave’s co-founder, is optimistic about AMD’s latest accelerators. TensorWave has secured a large allocation of AMD’s accelerators and plans to deploy 20,000 MI300X accelerators across two facilities by the end of 2024.

AMD’s MI300X: A Worthy Competitor

The MI300X, AMD’s most advanced accelerator, was launched at AMD’s Advancing AI event in December. According to AMD, it outperforms Nvidia’s H100, with 32 percent higher floating point performance and a larger memory capacity.

Cooling Challenges & Solutions

TensorWave is working on bringing additional liquid-cooled systems online next year. The company is targeting four nodes per rack, for a total of around 40kW per rack, cooled using rear door heat exchangers. Supply chain problems, particularly around the rear door heat exchangers themselves, have been a concern. TensorWave COO Piotr Tomasik acknowledged these issues but expressed confidence in the company's ability to deploy the systems.
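To put the rack density in context, the figures above can be turned into a rough power budget. The sketch below is a back-of-the-envelope estimate only. It assumes 8 GPUs per node (typical for MI300X platforms but not stated here) and assumes, purely for illustration, that all 20,000 planned accelerators are deployed at the same four-node, 40kW density.

```python
import math

# Figures taken from the article
RACK_POWER_KW = 40        # approximate target per rack
NODES_PER_RACK = 4
PLANNED_GPUS = 20_000     # planned MI300X deployment by end of 2024

# Assumption (not stated in the article): 8 accelerators per node
GPUS_PER_NODE = 8


def rack_estimate(gpus, gpus_per_node, nodes_per_rack, rack_kw):
    """Estimate node count, rack count, and total power for a GPU fleet."""
    nodes = math.ceil(gpus / gpus_per_node)
    racks = math.ceil(nodes / nodes_per_rack)
    return {
        "kw_per_node": rack_kw / nodes_per_rack,
        "nodes": nodes,
        "racks": racks,
        "total_mw": racks * rack_kw / 1000,
    }


estimate = rack_estimate(PLANNED_GPUS, GPUS_PER_NODE, NODES_PER_RACK, RACK_POWER_KW)
print(estimate)
```

Under these assumptions, each node gets a budget of about 10kW, and the full fleet would need roughly 625 racks drawing about 25MW. Actual numbers will differ if node sizes or rack densities vary between facilities.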

Looking Ahead

TensorWave is also looking at direct-to-chip cooling for the second half of the year. Performance remains an open question, however. Customers are enthusiastic about AMD as an alternative to Nvidia, but they are unsure whether its accelerators will deliver comparable performance. TensorWave aims to launch its MI300X nodes using RDMA over Converged Ethernet (RoCE).

Scaling Up Strategy

TensorWave plans to introduce a cloud-like orchestration layer for resource provisioning and to implement GigaIO’s PCIe 5.0-based FabreX technology. FabreX is expected to allow up to 5,750 GPUs to be connected in a single domain with more than a petabyte of high-bandwidth memory, a move that could reshape the cloud operator landscape.
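The petabyte figure can be sanity-checked with simple arithmetic. The sketch below assumes 192 GB of HBM3 per MI300X, which is AMD's published capacity for the part but is not stated in this article, and uses decimal units (1 PB = 1,000,000 GB).

```python
# Figure taken from the article
MAX_GPUS_PER_DOMAIN = 5_750

# Assumption (not stated in the article): 192 GB of HBM3 per MI300X
HBM_PER_GPU_GB = 192


def pooled_hbm_pb(gpus: int, hbm_per_gpu_gb: int) -> float:
    """Return aggregate HBM in decimal petabytes (1 PB = 1,000,000 GB)."""
    return gpus * hbm_per_gpu_gb / 1_000_000


total_pb = pooled_hbm_pb(MAX_GPUS_PER_DOMAIN, HBM_PER_GPU_GB)
print(f"{total_pb:.3f} PB")  # prints "1.104 PB"
```

At 192 GB per accelerator, 5,750 GPUs come to about 1.1 PB of aggregate high-bandwidth memory, consistent with the "more than a petabyte" claim.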