Wafer-scale Integration Chip
Designing the giant AI chip required specialized manufacturing processes to fabricate its 1.2 trillion transistors, 400,000 processor cores, and 18 gigabytes of on-chip SRAM.

TSMC AI Wafer Chip

Computers built around the Cerebras wafer-scale chip will train AI neural networks in hours instead of weeks.

A start-up AI tech company called Cerebras Systems wants to break past the limited learning speed of conventional AI chips such as Nvidia graphics processors. To deliver that speedup, Andrew Feldman, CEO and co-founder of the AI computer startup, wants to make a silicon monster that is almost 22 centimeters (roughly 9 inches) on each side, making it the largest computer chip ever put into production by TSMC, the Taiwan Semiconductor Manufacturing Co.

It is the largest chip he has ever seen, says Brad Paulsen, a senior vice president at TSMC, which also holds chip-production contracts with Apple and Nvidia. Manufacturing equipment at the chip factory is geared to making postage-stamp-sized chips, so making Cerebras’ giant chip required TSMC to adapt its equipment to produce one continuous design instead of a grid of many separate ones, Paulsen says. Cerebras’ chip is the largest square that can be cut from a 300-millimeter wafer. “I think people are going to see this and say ‘Wow, that’s possible? Maybe we need to explore in that direction,’” he says.
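As a rough geometric check (ignoring the edge-exclusion zone real wafers reserve), the largest square inscribed in a 300-millimeter circle has a side of about 21 centimeters, which lines up with the “almost 22 centimeters” figure above:

```python
import math

# Back-of-the-envelope check: a square inscribed in a circle has a diagonal
# equal to the circle's diameter, so its side is diameter / sqrt(2).
wafer_diameter_mm = 300
side_mm = wafer_diameter_mm / math.sqrt(2)
print(f"Largest inscribed square: {side_mm:.0f} mm (~{side_mm / 10:.0f} cm) per side")
# -> about 212 mm, i.e. roughly 21 cm per side
```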

Big, Weird Chip

Jim McGregor, founder of Tirias Research, expects the largest tech companies that see their destinies riding on competing in AI, firms like Facebook, Amazon, and Baidu, to take a serious look at Cerebras’ big, weird chip. “For them, it could make a lot of sense,” he says.

Cerebras’ chip, left, is many times the size of an Nvidia graphics processor, right, popular with AI researchers. Photo: Cerebras Systems.

A common complaint from customers using today’s AI-enhanced computers and computer clusters is that training runs for big neural networks can take as long as six weeks. At that rate, they are able to train only around six neural networks in a year. “The idea is to test more ideas,” says Feldman. “If you can [train a neural network, not in weeks but] instead in 2 or 3 hours, you can run thousands of ideas.”
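The arithmetic behind that complaint is easy to check; the sketch below simply divides a year by the run lengths quoted above (the 2.5-hour figure splits the difference in Feldman's “2 or 3 hours”):

```python
HOURS_PER_YEAR = 365 * 24

slow_run_hours = 6 * 7 * 24   # one six-week training run
fast_run_hours = 2.5          # midpoint of the "2 or 3 hours" claim

print(f"Six-week runs per year, back to back: {HOURS_PER_YEAR / slow_run_hours:.0f}")
print(f"~2.5-hour runs per year, back to back: {HOURS_PER_YEAR / fast_run_hours:.0f}")
# A handful of runs per year versus a few thousand, which is the whole pitch.
```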

CS-1 Cerebras Inside: The cooling system takes up most of the computer. The WSE chip is in the back left corner. Photo: Cerebras Systems.

CS-1 Computer

Enter the CS-1, the most powerful single-chip AI computer. Built around the Wafer Scale Engine, the CS-1 packs 1.2 trillion transistors, 400,000 processor cores, 18 gigabytes of SRAM, and interconnects capable of moving 100 million billion bits per second. According to the company, a single CS-1 performs more than three times as fast as Google’s second-generation, 10-rack TPU2 cluster, which consumes five times as much power and takes up 30 times as much space. “Being able to quickly train or retrain is really important (even as) the [neural-network] models are becoming more complex,” says Mike Demler, a senior analyst with the Linley Group, in Mountain View, Calif.
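To put that interconnect figure in more familiar units, “100 million billion bits per second” works out to 100 petabits per second; a minimal conversion:

```python
# Unit conversion of the quoted on-chip interconnect bandwidth.
bits_per_second = 100e6 * 1e9                      # "100 million billion" bits/s
petabits_per_second = bits_per_second / 1e15
petabytes_per_second = bits_per_second / 8 / 1e15

print(f"{petabits_per_second:.0f} petabits/s, or {petabytes_per_second:.1f} petabytes/s")
# -> 100 petabits/s, about 12.5 petabytes per second of on-chip traffic
```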

Supercomputing giant Argonne National Laboratory is using multiple CS-1 machines, a deployment that could usher in a sudden upsurge in the number of neural networks put to use.

As Feldman explains, the founders of Cerebras wanted to build an AI computer that could handle the complexities of modern AI workloads. That means the machine must be able to move a lot of data quickly, its memory must be close to the processing cores, and those cores don’t need to work on data that other cores are crunching.

Optimization Software

Hence the CS-1’s software, which allows users to write their machine learning models using standard frameworks such as PyTorch and TensorFlow. It then devotes variously sized portions of Cerebras’s Wafer Scale Engine chip to the layers of the neural network. It does this by solving an optimization problem that ensures the layers all complete their work at roughly the same pace and sit contiguous with their neighbors, so information can flow through the network without any holdups.
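Cerebras has not published the details of that placement algorithm, but the basic idea, giving compute-heavy layers more cores so every layer keeps pace with its neighbors, can be sketched in a few lines. Everything below (the layer names, the cost numbers, and the proportional allocation rule) is an illustrative assumption, not Cerebras’s actual method:

```python
# Illustrative sketch: allocate a budget of cores across neural-network layers
# in proportion to each layer's estimated compute cost, so that all layers
# finish each step at roughly the same time (a balanced pipeline).
TOTAL_CORES = 400_000  # WSE core count quoted above

# Hypothetical per-layer compute costs (relative units, e.g. FLOPs per sample).
layer_costs = {"conv1": 4.0, "conv2": 9.0, "conv3": 9.0, "fc1": 1.5, "fc2": 0.5}

total_cost = sum(layer_costs.values())
allocation = {
    name: max(1, round(TOTAL_CORES * cost / total_cost))
    for name, cost in layer_costs.items()
}

# With cores proportional to cost, the per-step time (cost / cores) comes out
# roughly equal for every layer, so no layer stalls the ones downstream.
for name, cores in allocation.items():
    print(f"{name}: {cores} cores, relative step time {layer_costs[name] / cores:.2e}")
```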

The software can also solve that optimization problem across multiple computers, allowing a cluster of them to act as one big machine. Cerebras has linked as many as 16 CS-1s together to get a roughly 16-fold performance increase, which contrasts with how clusters of graphics processing units scale, says Feldman. With the software handling that core optimization problem, the next issue was fabricating the neural-network cores in the wafer itself.
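The contrast Feldman draws is between that near-linear scaling and the sub-linear scaling typical of GPU clusters, where synchronization and communication eat into each added machine’s contribution. A toy model makes the difference concrete; the 5-percent-per-machine overhead is purely an assumption for illustration:

```python
# Toy scaling model: ideal near-linear speedup vs. speedup eroded by a fixed
# per-machine coordination overhead, as in many GPU training clusters.
def near_linear_speedup(n_machines: int) -> float:
    return float(n_machines)

def overhead_limited_speedup(n_machines: int, overhead: float = 0.05) -> float:
    # Each step pays a coordination cost that grows with cluster size.
    return n_machines / (1.0 + overhead * (n_machines - 1))

for n in (1, 4, 8, 16):
    print(f"{n:2d} machines: near-linear {near_linear_speedup(n):5.1f}x, "
          f"overhead-limited {overhead_limited_speedup(n):5.1f}x")
# At 16 machines the overhead-limited cluster manages ~9x instead of ~16x.
```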

Neural Network Architecture

The company’s veteran computer architects, including Gary Lauterbach, its chief technical officer, proposed several things. First, use thousands and thousands of small cores specialized for the relevant neural-network computations, rather than fewer, more general-purpose cores. Second, link the cores together with an interconnect that allows rapid data communication between cores at low power. Finally, keep all the needed data on the processor chip itself, not in separate memory chips. Having memory this close to the cores makes moving data, between memory and the cores and between the cores themselves, far faster, and it saves power as well.
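One way to see what keeping the data on the chip buys is to divide the quoted 18 gigabytes of SRAM across the 400,000 cores: each core gets its own small, fast local store sitting right next to the logic that uses it. The bytes-per-core figure below is just that division, not a published specification:

```python
# Back-of-the-envelope: average on-chip SRAM per core.
sram_bytes = 18e9        # 18 gigabytes of SRAM quoted above
cores = 400_000          # processor cores quoted above

per_core_kb = sram_bytes / cores / 1e3
print(f"~{per_core_kb:.0f} KB of SRAM per core on average")
# -> roughly 45 KB per core, reachable without going over a slow,
#    power-hungry bus to separate DRAM chips.
```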

“It wasn’t obvious we could do that, that’s for sure,” says Feldman. But “it was fairly obvious that there were big benefits” to making as big a chip as possible.

Wafer Chip Power Distribution

For all the efficiency of such a huge chip, the power delivery system was one of the biggest challenges in the computer’s development. The 1.2 trillion transistors are designed to operate at about 0.8 volts, but running them all takes a total of 20,000 amperes of current. The company’s engineers’ solution was to deliver that current vertically from above: they designed a fiberglass circuit board holding hundreds of special-purpose chips for power control, with a million copper posts bridging the millimeter or so from the fiberglass board to specific power points on the WSE chip.
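Those two figures imply the rough power budget of the wafer; a one-line estimate (this is just power = voltage x current on the numbers quoted above, not an official spec for the finished system):

```python
# Rough power estimate from the quoted supply voltage and total current.
supply_voltage_v = 0.8     # volts
total_current_a = 20_000   # amperes

power_kw = supply_voltage_v * total_current_a / 1e3
print(f"~{power_kw:.0f} kW delivered to the wafer")   # on the order of 16 kW
```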

Heat and Coefficient of Thermal Expansion

The huge power needed to run the titanic chip creates heat in proportion to its size. To carry that heat away, the designers placed a water-cooled cold plate beneath the chip. Still, the residual heat within the chip created another mechanical problem: the coefficients of thermal expansion of the different materials used in the assembly do not match. Copper expands the most, silicon the least, and the fiberglass circuit board falls somewhere in between.
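To see why that mismatch matters at this scale, compare how much a 215-millimeter span of each material grows as it warms. The expansion coefficients below are typical textbook values, and the 50 °C temperature swing is an assumed figure for illustration, not a number from Cerebras:

```python
# Differential thermal expansion across the width of the wafer-scale chip:
# delta_L = alpha * L * delta_T
LENGTH_MM = 215.0          # approximate chip side length
DELTA_T_C = 50.0           # assumed temperature swing, for illustration only

cte_ppm_per_c = {          # typical room-temperature coefficients
    "copper": 17.0,
    "fiberglass (FR-4, in-plane)": 14.0,
    "silicon": 2.6,
}

for material, alpha in cte_ppm_per_c.items():
    growth_um = alpha * 1e-6 * LENGTH_MM * 1e3 * DELTA_T_C
    print(f"{material}: expands ~{growth_um:.0f} um over {LENGTH_MM:.0f} mm")
# Copper grows on the order of 180 um while silicon grows ~30 um, so whatever
# joins them must absorb roughly 0.15 mm of relative movement every thermal cycle.
```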

“The challenge of [coefficient of thermal expansion] mismatch with the motherboard was a brutal problem,” says Lauterbach. No existing connecting material could absorb the mismatch, so in the end the engineers had to invent one themselves, an effort that took a year and a half.

Final Words

Lastly, one benefit of the chip is its low power consumption relative to the work it does, thanks to its efficient on-chip movement of data. Unfortunately, Lauterbach doubts that cutting power consumption will be much of a selling point in data centers. “While a lot of data centers talk about [conserving] power, when it comes down to it…they don’t care,” he says. “They want performance.” And that’s something a processor nearly the size of a dinner plate can certainly provide.

Source: Cerebras’s Giant Chip Will Smash Deep Learning’s Speed Barrier