Microsoft is one of the largest end users of FPGAs for datacenter applications, using them to accelerate a wide range of workloads across its massive computing infrastructure, including Bing and Azure. To demonstrate the resulting capability, Microsoft unveiled Project Brainwave, a scalable acceleration platform for deep learning that can provide real-time responses for cloud-based AI services.
The Microsoft Brainwave mezzanine card extends each server with an Intel Stratix 10 FPGA accelerator, synthesized to act as a “soft DNN processing unit,” or DPU, and a fabric interconnect that enables datacenter-scale persistent neural networks.
Microsoft’s Project Brainwave consists of three components:
1. A high-performance systems architecture that pools accelerators for datacenter-wide services and scale. By linking its accelerators across a high-bandwidth, low-latency fabric, Microsoft can dynamically allocate these resources to maximize their utilization while keeping latencies very low.
2. A “soft” DNN processing unit (DPU) that is programmed, or synthesized, on Intel FPGAs.
3. A compiler and run-time environment to support efficient deployment of trained neural network models using CNTK, Microsoft’s DNN platform.
A fully custom chip, or ASIC, can give companies like Google a very fast machine learning accelerator at lower per-unit cost, but the development process can be cost-prohibitive and lengthy, and it yields a fixed-function chip, impeding the ability to adapt the silicon as algorithms evolve. Microsoft pointed to this tradeoff in its announcement as a primary driver for its FPGA-based strategy. By using an FPGA instead of an ASIC for its “soft” DPU, Microsoft believes it can better optimize its hardware for its software, at lower cost and with greater flexibility over time.
A great example of the advantage of FPGAs in machine learning is the ability to customize the numeric precision used at each layer of a deep neural network. With an FPGA, a neural net designer can model each layer with the optimal number of bits, which can have a significant impact on performance and efficiency.
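To make the per-layer precision idea concrete, here is a minimal sketch of symmetric fixed-point quantization applied with a different bit width per layer. This is illustrative only: Brainwave's actual narrow-precision number formats are custom (block floating-point variants), and the layer names and bit widths below are hypothetical.

```python
import numpy as np

def quantize(weights: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization of a weight tensor to `bits` bits.

    Sketch only: rounds each weight to the nearest point on a uniform
    integer grid of width `bits`, then maps back to real values.
    """
    # Largest representable integer magnitude at this bit width.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(weights)) / qmax
    if scale == 0:
        return weights  # All-zero tensor quantizes to itself.
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale

# Hypothetical network: each layer is assigned its own precision.
rng = np.random.default_rng(0)
layers = {
    "conv1": (rng.standard_normal((3, 3)), 8),  # 8-bit layer
    "fc2":   (rng.standard_normal((4, 4)), 4),  # 4-bit layer
}
for name, (w, bits) in layers.items():
    err = np.abs(w - quantize(w, bits)).max()
    print(f"{name}: {bits}-bit, max quantization error {err:.4f}")
```

The point of the sketch is the tradeoff it exposes: halving a layer's bit width roughly halves the hardware resources (and memory bandwidth) that layer consumes, at the cost of larger rounding error, and an FPGA lets the designer make that choice independently for every layer.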