













I. INTRODUCTION
Just as last year, the pace of new announcements, releases, and deployments of artificial intelligence (AI) and machine learning (ML) accelerators from startups and established technology companies has been modest. This is not unreasonable; many companies that have released an accelerator report having spent three or four years researching, analyzing, designing, verifying, and validating their accelerator design trade-offs and building the software stack to program the accelerator. Those that have released subsequent versions of their accelerator have reported shorter development cycles, though still at least two or three years. The focus of these accelerators continues to be on accelerating deep neural network (DNN) models, and the application space spans from very low power embedded voice recognition and image classification to data center scale training, while the competition for defining markets and application areas continues as part of a much larger industrial and technology shift in modern computing to machine learning solutions.
AI ecosystems bring together components from embedded computing (edge computing), traditional high performance computing (HPC), and high performance data analysis (HPDA) that must work together to effectively provide capabilities for use by decision makers, warfighters, and analysts [1]. Figure 1 captures an architectural overview of such end-to-end AI solutions and their components. On the left side of Figure 1, structured and unstructured data sources provide different views of entities and/or phenomenology.
This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.
Fig. 1: Canonical AI architecture consists of sensors, data conditioning, algorithms, modern computing, robust AI, human-machine teaming, and users (missions). Each step is critical in developing end-to-end AI applications and systems.
These raw data products are fed into a data conditioning step in which they are fused, aggregated, structured, accumulated, and converted into information. The information generated by the data conditioning step feeds into a host of supervised and unsupervised algorithms such as neural networks, which extract patterns, predict new events, fill in missing data, or look for similarities across datasets, thereby converting the input information to actionable knowledge. This actionable knowledge is then passed to human beings for decision-making processes in the human-machine teaming phase. The human-machine teaming phase provides the users with useful and relevant insight, turning knowledge into actionable intelligence.
Underpinning this system are modern computing systems. Moore’s law trends have ended [2], as have a number of related laws and trends including Dennard scaling (power density), clock frequency, core counts, instructions per clock cycle, and instructions per Joule (Koomey’s law) [3]. Taking a page from the system-on-chip (SoC) trends first seen in automotive applications, robotics, and smartphones, advancements and innovations are still progressing by developing and integrating accelerators for often-used operational kernels, methods, or functions. These accelerators are designed with a different balance between performance and functional flexibility. This includes an explosion of innovation in deep machine learning processors and accelerators [4]–[8]. In this series of survey papers, we explore the relative benefits of these technologies since they are of particular importance to applying AI to domains under significant constraints such as size, weight, and power, both in embedded applications and in data centers.
This paper is an update to IEEE-HPEC papers from the past three years [9]–[11]. As in past years, this paper continues with last year’s focus on accelerators and processors that are geared toward deep neural networks (DNNs) and convolutional neural networks (CNNs), as they are quite computationally intense [12]. This survey focuses on accelerators and processors for inference for a variety of reasons, including that defense and national security AI/ML edge applications rely heavily on inference. We consider all of the numerical precision types that an accelerator supports, but for most of them, the best inference performance is in int8 or fp16/bf16 (IEEE 16-bit floating point or Google’s 16-bit brain float).
There are many surveys [13]–[24] and other papers that cover various aspects of AI accelerators. For instance, the first paper in this multi-year survey included the peak performance of FPGAs for certain AI models; however, several of the aforementioned surveys cover FPGAs in depth so they are no longer included in this survey. This multi-year survey effort and this paper focus on gathering a comprehensive list of AI accelerators with their computational capability, power efficiency, and ultimately the computational effectiveness of utilizing accelerators in embedded and data center applications. Along with this focus, this paper mainly compares neural network accelerators that are useful for government and industrial sensor and data processing applications. A few accelerators and processors that were included in previous years’ papers have been left out of this year’s survey. They have been dropped because they have been surpassed by new accelerators from the same company, they are no longer offered, or they are no longer relevant to the topic.
II. SURVEY OF PROCESSORS
Many recent advances in AI can be at least partly credited to advances in computing hardware [6], [7], [25], [26], enabling computationally heavy machine-learning algorithms and in particular DNNs. This survey gathers performance and power information from publicly available materials including research papers, technical trade press, company benchmarks, etc. While there are ways to access information from companies and startups (including those in their silent period), this information is intentionally left out of this survey; such data will be included in this survey when it becomes publicly available. The key metrics of this public data are plotted in
Figure 2, which graphs recent processor capabilities (as of July
dash-dotted box depicts the very dense region that is zoomed
in and plotted in Figure 3.
The x-axis indicates peak power, and the y-axis indicates peak giga-operations per second (GOps/s), both on a logarithmic scale. The computational precision of the processing capability is depicted by the geometric shape used; the computational precision spans from analog and single-bit int1 to four-byte int32 and two-byte fp16 to eight-byte fp64. Precisions that show two types denote the precision of the multiplication operations on the left and the precision of the accumulate/addition operations on the right (for example, fp16.32 corresponds to fp16 for multiplication and fp32 for accumulate/add). The form factor is depicted by color, which shows the package for which peak power is reported. Blue corresponds to a single chip; orange corresponds to a card; and green corresponds to entire systems (single node desktop and server systems). This survey is limited to single motherboard, single memory-space systems. Finally, the hollow geometric objects are peak performance for inference-only accelerators, while the solid geometric figures are performance for accelerators that are designed to perform both training and inference.
The survey begins with the same scatter plot that we have compiled for the past three years. As we did last year, to save space, we have summarized some of the important metadata of the accelerators, cards, and systems in Table I, including the label used in Figure 2 for each of the points on the graph; many of the points were brought forward from last year’s plot, and some details of those entries are in [9]. There are several additions which we will cover below.
In Table I, most of the columns and entries are self-explanatory. However, there are two Technology entries that may not be: dataflow and PIM. Dataflow processors are custom-designed processors for neural network inference and training. Since neural network training and inference computations can be entirely deterministically laid out, they are amenable to dataflow processing in which computations, memory accesses, and inter-ALU communication actions are explicitly/statically programmed or “placed-and-routed” onto the computational hardware. Processor in memory (PIM) accelerators integrate processing elements with memory technology. Among such PIM accelerators are those based on an analog computing technology that augments flash memory circuits with in-place analog multiply-add capabilities.
Please refer to the references for the Mythic and Gyrfalcon accelerators for more details on this innovative technology.
Finally, a reasonable categorization of accelerators follows their intended application, and the five categories are shown as ellipses on the graph, which roughly correspond to performance and power consumption: Very Low Power for speech processing, very small sensors, etc.; Embedded for cameras, small UAVs and robots, etc.; Autonomous for driver assist services, autonomous driving, and autonomous robots; Data Center Chips and Cards; and Data Center Systems.
For most of the accelerators, their descriptions and commentaries have not changed since last year, so please refer to the last two years’ papers for descriptions and commentaries. There are, however, several new releases that were not covered by past papers that are covered here.
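The two-part precision labels and the two plotted metrics described above can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function names `parse_precision` and `gops_per_watt` are hypothetical, the label grammar is inferred from the single fp16.32 example in the text, and the efficiency figure is simply peak GOps/s divided by peak power in watts.

```python
import re
from typing import NamedTuple


class Precision(NamedTuple):
    multiply: str    # precision used for multiplications
    accumulate: str  # precision used for accumulate/add


# Assumed label grammar: "analog", or int/fp followed by a bit width,
# optionally followed by ".<accumulate bit width>" (e.g. "fp16.32").
_LABEL = re.compile(r"^(?:(analog)|(int|fp)(\d+)(?:\.(\d+))?)$")


def parse_precision(label: str) -> Precision:
    """Split a plot label into multiply and accumulate precisions."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognized precision label: {label!r}")
    if match.group(1) is not None:
        return Precision("analog", "analog")
    kind, mult_bits, acc_bits = match.group(2), match.group(3), match.group(4)
    # A label with a single width uses that width for both operations.
    return Precision(f"{kind}{mult_bits}", f"{kind}{acc_bits or mult_bits}")


def gops_per_watt(peak_gops: float, peak_watts: float) -> float:
    """Peak efficiency: the y-axis value divided by the x-axis value."""
    if peak_watts <= 0:
        raise ValueError("peak power must be positive")
    return peak_gops / peak_watts


if __name__ == "__main__":
    print(parse_precision("fp16.32"))
    print(parse_precision("int8"))
```

Because both axes are logarithmic, points with equal efficiency under this definition lie along a straight diagonal line in the scatter plot.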
TABLE I: List of accelerator labels for plots.
Company | Product | Label | Technology | Form Factor | References
Achronix | VectorPath S7t-VG6 | Achronix | dataflow | Card | [27]
Aimotive | aiWare3 | Aimotive | dataflow | Chip | [28]
AIStorm | AIStorm | AIStorm | dataflow | Chip | [29]
Alibaba | Alibaba | Alibaba | dataflow | Card | [30]
AlphaIC | RAP-E | AlphaIC | dataflow | Chip | [31]
Amazon | Inferentia | AWS | dataflow | Card | [32], [33]
ARM | Ethos N77 | Ethos | dataflow | Chip | [34]
Axelera | Axelera Test Core | Axelera | dataflow | Chip | [35]
Baidu | Baidu Kunlun 818-300 | Baidu | dataflow | Chip | [36]–[38]
Bitmain | BM1880 | Bitmain | dataflow | Chip | [39]
Blaize | El Cano | Blaize | dataflow | Card | [40]
Canaan | Kendrite K210 | Kendryte | CPU | Chip | [41]
Cerebras | CS-1 | CS-1 | dataflow | System | [42]
Cerebras | CS-2 | CS-2 | dataflow | System | [43]
Cornami | Cornami | Cornami | dataflow | Chip | [44]
Enflame | Cloudblazer T10 | Enflame | CPU | Card | [45]
Google | TPU Edge | TPUedge | dataflow | System | [46]
Google | TPU1 | TPU1 | dataflow | Chip | [47], [48]
Google | TPU2 | TPU2 | dataflow | Chip | [47], [48]
Google | TPU3 | TPU3 | dataflow | Chip | [47]–[49]
Google | TPU4i | TPU4i | dataflow | Chip | [49]
Google | TPU4 | TPU4 | dataflow | Chip | [50]
GraphCore | C2 | GraphCore | dataflow | Card | [51], [52]
GraphCore | C2 | GraphCoreNode | dataflow | System | [53]
GraphCore | Colossus Mk2 | GraphCore2 | dataflow | Card | [54]
GraphCore | Bow-2000 | GraphCoreBow | dataflow | Card | [55]
GreenWaves | GAP8 | GAP8 | dataflow | Chip | [56], [57]
GreenWaves | GAP9 | GAP9 | dataflow | Chip | [56], [57]
Groq | Groq Node | GroqNode | dataflow | System | [58]
Groq | Tensor Streaming Processor | Groq | dataflow | Card | [51], [59]
Gyrfalcon | Gyrfalcon | Gyrfalcon | PIM | Chip | [60]
Gyrfalcon | Gyrfalcon | GyrfalconServer | PIM | System | [61]
Habana | Gaudi | Gaudi | dataflow | Card | [62], [63]
Habana | Goya HL-1000 | Goya | dataflow | Card | [63], [64]
Hailo | Hailo | Hailo-8 | dataflow | Chip | [65]
Horizon Robotics | Journey2 | Journey2 | dataflow | Chip | [66]
Huawei HiSilicon | Ascend 310 | Ascend-310 | dataflow | Chip | [67]
Huawei HiSilicon | Ascend 910 | Ascend-910 | dataflow | Chip | [68]
Intel | Arria 10 1150 | Arria | FPGA | Chip | [69], [70]
Intel | Mobileye EyeQ5 | EyeQ5 | dataflow | Chip | [40]
Kalray | Coolidge | Kalray | manycore | Chip | [71], [72]
Kneron | KL720 | KL720 | dataflow | Chip | [73]
Maxim | Max 78000 | Maxim | dataflow | Chip | [74]–[76]
Mythic | M1076 | Mythic76 | PIM | Chip | [77]–[79]
Mythic | M1108 | Mythic108 | PIM | Chip | [77]–[79]
NovuMind | NovuTensor | NovuMind | dataflow | Chip | [80], [81]
NVIDIA | Ampere A10 | A10 | GPU | Card | [82]
NVIDIA | Ampere A100 | A100 | GPU | Card | [83]
NVIDIA | Ampere A30 | A30 | GPU | Card | [82]
NVIDIA | Ampere A40 | A40 | GPU | Card | [82]
NVIDIA | DGX Station | DGX-Station | GPU | System | [84]
NVIDIA | DGX-1 | DGX-1 | GPU | System | [84], [85]
NVIDIA | DGX-2 | DGX-2 | GPU | System | [85]
NVIDIA | DGX-A100 | DGX-A100 | GPU | System | [86]
NVIDIA | H100 | H100 | GPU | Card | [87]
NVIDIA | Jetson AGX Xavier | XavierAGX | GPU | System | [88]
NVIDIA | Jetson NX Orin | OrinNX | GPU | System | [89], [90]
NVIDIA | Jetson AGX Orin | OrinAGX | GPU | System | [89], [90]
NVIDIA | Jetson TX1 | Jetson1 | GPU | System | [91]
NVIDIA | Jetson TX2 | Jetson2 | GPU | System | [91]
NVIDIA | Jetson Xavier NX | XavierNX | GPU | System | [88]
NVIDIA | DRIVE AGX L2 | AGX-L2 | GPU | System | [92]
NVIDIA | DRIVE AGX L5 | AGX-L5 | GPU | System | [92]
NVIDIA | Pascal P100 | P100 | GPU | Card | [93], [94]
NVIDIA | T4 | T4 | GPU | Card | [95]
NVIDIA | Volta V100 | V100 | GPU | Card | [94], [96]
Perceive | Ergo | Perceive | dataflow | Chip | [97]
Preferred Networks | MN-3 | Preferred-MN-3 | multicore | Card | [98], [99]
Quadric | q1-64 | Quadric | dataflow | Chip | [100]
Qualcomm | Cloud AI 100 | Qcomm | dataflow | Card | [101], [102]
Rockchip | RK3399Pro | RK3399Pro | dataflow | Chip | [103]
SiMa.ai | SiMa.ai | SiMa.ai | dataflow | Chip | [104]
Syntiant | NDP101 | Syntiant | PIM | Chip | [105], [106]
Tachyum | Prodigy | Tachyum | CPU | Chip | [107]
Tenstorrent | Tenstorrent | Tenstorrent | multicore | Card | [108]
Tesla | Tesla Full Self-Driving Computer | Tesla | dataflow | System | [109], [110]
Texas Instruments | TDA4VM | TexInst | dataflow | Chip | [111]–[113]
Toshiba | 2015 | Toshiba | multicore | System | [114]
Untether | TsunAImi | TsunAImi | PIM | Card | [115]
with a second wafer that greatly improves power and clock distribution throughout the CG200 chip [55]. This translates into 40% better performance and 16% better performance-per-Watt.
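The two percentages quoted above imply a change in power draw, which the following short calculation makes explicit. It is a sketch that assumes both gains are measured against the same baseline part and that performance-per-Watt is performance divided by power; the function name `implied_power_ratio` is hypothetical.

```python
def implied_power_ratio(perf_gain: float, perf_per_watt_gain: float) -> float:
    """Power ratio implied by a performance gain and an efficiency gain.

    Since efficiency = performance / power, it follows that
    power_new / power_old = (1 + perf_gain) / (1 + perf_per_watt_gain).
    """
    return (1.0 + perf_gain) / (1.0 + perf_per_watt_gain)


if __name__ == "__main__":
    ratio = implied_power_ratio(0.40, 0.16)
    print(f"implied power ratio: {ratio:.3f}")  # prints 1.207
```

Under those assumptions, the quoted gains correspond to roughly 21% higher power than the baseline.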
Next, we must mention accelerators that do not appear on Figure 2 yet. Each has been released with some benchmark results but either no peak performance numbers or no peak power numbers.
Finally, there is one departure from the report this year. Last year, Centaur Technology announced an x86 CPU with an integrated AI accelerator, which was realized as a 4,096-byte-wide SIMD unit. The performance estimates were competitive, but VIA Technologies, the parent company of Centaur, sold the USA-based engineering team of the processor to Intel Corp. and seems to have ended the development of the CNS processor [128].
III. OBSERVATIONS AND TRENDS
There are several observations to appreciate in Figure 2.
encryption engines, network interfaces, etc. These additional features of the SoCs do not change the peak performance metric, but they do have a direct impact on the peak power reported for the chip, so please keep this in mind when comparing them.
A. Broader Trends
We also collected release dates, fabrication technology, and peak performance for multiple precisions for a smaller subset of accelerators listed in Table I. We were curious about the trends of peak performance over the past ten years and how numerical precision and fabrication technology influenced it. These data are plotted in Figure 4. Figure 4a plots the release date of a number of accelerators versus their peak performance for one or more precision formats. There are marked gains in peak performance for each of the precision formats, but within each format the maximum gain is 1.5 orders of magnitude over the 10-year period. In Figure 4b, we plot the release date versus the fabrication technology used for the accelerator. The default precision for the peak performance values is int8; however, there are a number of accelerators (e.g., NVIDIA K20, K80 and AMD Mi8) which did not have int8 support. For these accelerators, the peak performance is reported for the lowest precision that the accelerator supported. This plot shows that much performance has been gained over the past ten years by supporting lower precision formats; it is particularly interesting to observe how support for lower precision formats was included in these accelerators as research and industry explore the effectiveness of lower floating point and integer formats in CNN/DNN inference and training.
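To put the figure of 1.5 orders of magnitude over ten years in more familiar terms, the sketch below converts it into an equivalent compound annual growth factor and a doubling time. It assumes smooth exponential growth across the decade, which the plotted data need not follow; the function names are hypothetical.

```python
import math


def annual_growth_factor(orders_of_magnitude: float, years: float) -> float:
    """Constant yearly multiplier that yields the given total gain."""
    return 10.0 ** (orders_of_magnitude / years)


def doubling_time_years(orders_of_magnitude: float, years: float) -> float:
    """Years needed to double at the equivalent constant growth rate."""
    return years * math.log10(2.0) / orders_of_magnitude


if __name__ == "__main__":
    print(f"{annual_growth_factor(1.5, 10.0):.2f}x per year")   # prints 1.41x per year
    print(f"doubling every {doubling_time_years(1.5, 10.0):.1f} years")  # prints doubling every 2.0 years
```

Under this assumption, the maximum within-format gain corresponds to roughly a 1.4x improvement per year, or a doubling about every two years.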
We have several more observations and trends that are not
yet captured in graphs. First, the exploration for the best
[Figure 4(a): Neural Network Peak Performance, Past Decade – Precision Comparison. x-axis: Release Date (2012 to 2022); y-axis: Peak Performance (TOps/sec, log scale); legend: Computation Precision.]
[Figure 4(b): Neural Network Peak Performance, Past Decade – Fab Technology Comparison. x-axis: Release Date (2012 to 2022); y-axis: Peak Performance (TOps/sec, log scale); legends: Precision and Fab Tech (GF-12, GF-28, Samsung-7, Samsung-14, TSMC-6, TSMC-7, TSMC-8, TSMC-12, TSMC-16, TSMC-28). From: Albert Reuther, MIT LL Supercomputing Center.]
Fig. 4: Trends with respect to release date for subset of publicly announced AI accelerators and processors.
numerical formats for inference and training continues. For inference, some discussion continues about whether int4 will be acceptable for embedded inference, and the Maxim MAX 78000 SoC solution supports 1-bit, 2-bit, 4-bit, and 8-bit integer weights [75]. On the training side, it has been announced that NVIDIA Hopper, Intel Gaudi2, and a future GraphCore accelerator will support the lower-precision FP8 numerical format [130]. GraphCore posted an analysis paper on FP8 [131], including trade-off analyses of scaled integer versus floating point representations, different 8-bit floating point representations, and mixed representation DNN model performance.
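As a rough illustration of what low-bit integer weights involve, the sketch below applies generic symmetric per-tensor quantization to a list of weights. This is a textbook scheme chosen for illustration, not the method used by the MAX 78000 or any other accelerator in this survey; the function names are hypothetical, and the 1-bit case is excluded because it needs a different (sign-based) encoding.

```python
from typing import List, Tuple


def quantize_symmetric(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Map float weights to signed integer codes in [-qmax, qmax]."""
    if bits < 2:
        raise ValueError("this sketch supports 2 or more bits only")
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for int4, 127 for int8
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    for bits in (8, 4, 2):
        codes, scale = quantize_symmetric(weights, bits)
        approx = dequantize(codes, scale)
        worst = max(abs(w - a) for w, a in zip(weights, approx))
        print(f"int{bits}: codes={codes} worst error={worst:.4f}")
```

The worst-case rounding error is bounded by half the scale, so each bit removed roughly doubles it; this is the accuracy trade-off behind the int4 discussion above.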
Another trend that has caught our attention is that mathematical kernels other than DNN/CNN models have been implemented on several dataflow accelerators. These dataflow accelerators generally handle each data item independently (i.e., there are no cache lines), and data movement and computational operations are explicitly/statically programmed or
[8] Y. LeCun, “Deep Learning Hardware: Past, Present, and Future,” in 2019 IEEE International Solid-State Circuits Conference (ISSCC), feb 2019, pp. 12–19.
[9] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, “AI Accelerator Survey and Trends,” in 2021 IEEE High Performance Extreme Computing Conference (HPEC), sep 2021, pp. 1–9.
[10] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, “Survey of Machine Learning Accelerators,” in 2020 IEEE High Performance Extreme Computing Conference (HPEC), 2020, pp. 1–12.
[11] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, “Survey and Benchmarking of Machine Learning Accelerators,” in 2019 IEEE High Performance Extreme Computing Conference, HPEC 2019. Institute of Electrical and Electronics Engineers Inc., sep
[27] G. Roos, “FPGA Acceleration Card Delivers on Bandwidth, Speed, and Flexibility,” nov 2019. [Online]. Available: https://www.eetimes.com/fpga-acceleration-card-delivers-on-bandwidth-speed-and-flexibility/
[28] “aiWare3 Hardware IP Helps Drive Autonomous Vehicles To Production,” oct 2018. [Online]. Available: https://aimotive.com/news/ content/
[29] R. Merritt, “Startup Accelerates AI at the Sensor,” feb 2019. [Online]. Available: https://www.eetimes.com/startup-accelerates-ai-at-the-sensor/
[30] T. Peng, “Alibaba’s New AI Chip Can Process Nearly 80K Images Per Second,” 2019. [Online]. Available: https://medium.com/syncedreview/alibabas-new-ai-chip-can-process-nearly-80k-images-per-second-63412dec22a
[31] P. Clarke, “Indo-US Startup Preps Agent-based AI Processor,” aug
[50] O. Peckham, “Google Cloud’s New TPU v4 ML Hub Packs 9 Exaflops of AI,” may 2022. [Online]. Available: https://www.hpcwire.com/2022/05/16/google-clouds-new-tpu-v4-ml-hub-packs-9-exaflops-of-ai/
[51] L. Gwennap, “Groq Rocks Neural Networks,” Microprocessor Report, Tech. Rep., jan 2020. [Online]. Available: http://groq.com/wp-content/uploads/2020/04/Groq-Rocks-NNs-Linley-Group-MPR-2020Jan06.pdf
[52] D. Lacey, “Preliminary IPU Benchmarks,” oct 2017. [Online]. Available: https://www.graphcore.ai/posts/preliminary-ipu-benchmarks-providing-previously-unseen-performance-for-a-range-of-machine-learning-applications
[53] “Dell DSS8440 Graphcore IPU Server,” Graphcore, Tech. Rep., feb 2020. [Online]. Available: https://www.graphcore.ai/hubfs/Leadgenassets/DSS8440IPUServerWhitePaper 2020.pdf
[54] S. Ward-Foxton, “Graphcore Takes on Nvidia with Second-Gen AI Accelerator,” jul 2020. [Online]. Available: https://www.eetimes.com/graphcore-takes-on-nvidia-with-second-gen-ai-accelerator/
[55] M. Tyson, “Graphcore Bow IPU Introduces TSMC 3D Wafer-on-Wafer Processor,” mar 2022. [Online]. Available: https://www.tomshardware.com/news/graphcore-tsmc-bow-ipu-3d-wafer-on-wafer-processor
[56] “GAP Application Processors,” 2020. [Online]. Available: https://greenwaves-technologies.com/gap8 gap9/
[57] J. Turley, “GAP9 for ML at the Edge,” jun 2020. [Online]. Available: https://www.eejournal.com/article/gap9-for-ml-at-the-edge/
[58] N. Hemsoth, “Groq Shares Recipe for TSP Nodes, Systems,” sep
vol. 9, pp. 93 422–93 432, feb 2020. [Online]. Available: https://arxiv.org/abs/2002.03260v
[140] T. Lu, T. Marin, Y. Zhuo, Y. F. Chen, and C. Ma, “Accelerating MRI Reconstruction on TPUs,” 2020 IEEE High Performance Extreme Computing Conference (HPEC 2020), sep 2020. [Online]. Available: https://arxiv.org/abs/2006.14080v
[141] F. Belletti, D. King, K. Yang, R. Nelet, Y. Shafi, Y.-F. Chen, and J. Anderson, “Tensor Processing Units for Financial Monte Carlo,” in Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing. Society for Industrial and Applied Mathematics, jun 2019, pp. 12–23. [Online]. Available: https://arxiv.org/abs/1906.02818v
[142] K. Yang, Y. F. Chen, G. Roumpos, C. Colby, and J. Anderson, “High performance Monte Carlo simulation of ising model on TPU clusters,” in International Conference for High Performance Computing, Networking, Storage and Analysis, SC. IEEE Computer Society, nov
[157] J. Wilson, “The Multiply and Fourier Transform Unit: A Micro-Scale Optical Processor,” Optalysys, Tech. Rep., dec 2020. [Online]. Available: https://optalysys.com/s/Multiply and Fourier Transform white paper 12 12 20.pdf
[158] D. Schneider, “A Neural-Net Based on Light Could Best Digital Computers,” jun 2019. [Online]. Available: https://spectrum.ieee.org/a-neural-net-based-on-light-could-best-digital-computers
[159] C. Q. Choi, “Photonic Chip Performs Image Recognition at the Speed of Light,” jun 2022. [Online]. Available: https://spectrum.ieee.org/photonic-neural-network