AI and ML Accelerator Survey and Trends

Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner

MIT Lincoln Laboratory Supercomputing Center

Lexington, MA, USA

{reuther,pmichaleas,michael.jones,vijayg,sid,kepner}@ll.mit.edu

Abstract—This paper updates the survey of AI accelerators and processors from the past three years. This paper collects and summarizes the current commercial accelerators that have been publicly announced with peak performance and power consumption numbers. The performance and power values are plotted on a scatter graph, and a number of dimensions and observations from the trends on this plot are again discussed and analyzed. Two new trends plots based on accelerator release dates are included in this year’s paper, along with the additional trends of some neuromorphic, photonic, and memristor-based inference accelerators.

Index Terms—Machine learning, GPU, TPU, dataflow, accelerator, embedded inference, computational performance

I. INTRODUCTION

Just as last year, the pace of new announcements, releases, and deployments of artificial intelligence (AI) and machine learning (ML) accelerators from startups and established technology companies has been modest. This is not unreasonable; many companies that have released an accelerator report having spent three or four years researching, analyzing, designing, verifying, and validating their accelerator design trade-offs and building the software stack to program the accelerator. Those who have released subsequent versions of their accelerator have reported shorter development cycles, though these still take at least two or three years. The focus of these accelerators continues to be on accelerating deep neural network (DNN) models, and the application space spans from very low power embedded voice recognition and image classification to data center scale training, while the competition for defining markets and application areas continues as part of a much larger industrial and technology shift in modern computing to machine learning solutions.

AI ecosystems bring together components from embedded computing (edge computing), traditional high performance computing (HPC), and high performance data analysis (HPDA) that must work together to effectively provide capabilities for use by decision makers, warfighters, and analysts [1]. Figure 1 captures an architectural overview of such end-to-end AI solutions and their components. On the left side of Figure 1, structured and unstructured data sources provide different views of entities and/or phenomenology.

This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.

Fig. 1: Canonical AI architecture consists of sensors, data conditioning, algorithms, modern computing, robust AI, human-machine teaming, and users (missions). Each step is critical in developing end-to-end AI applications and systems.

These raw data products are fed into a data conditioning step in which they are fused, aggregated, structured, accumulated, and converted into information. The information generated by the data conditioning step feeds into a host of supervised and unsupervised algorithms such as neural networks, which extract patterns, predict new events, fill in missing data, or look for similarities across datasets, thereby converting the input information to actionable knowledge. This actionable knowledge is then passed to human beings for decision-making processes in the human-machine teaming phase. The human-machine teaming phase provides the users with useful and relevant insight, turning knowledge into actionable intelligence.

Underpinning this system are modern computing systems. Moore’s law trends have ended [2], as have a number of related laws and trends including Dennard scaling (power density), clock frequency, core counts, instructions per clock cycle, and instructions per Joule (Koomey’s law) [3]. Taking a page from the system-on-chip (SoC) trends first seen in automotive applications, robotics, and smartphones, advancements and innovations are still progressing by developing and integrating accelerators for often-used operational kernels, methods, or functions. These accelerators are designed with a different balance between performance and functional flexibility. This includes an explosion of innovation in deep machine learning processors and accelerators [4]–[8]. In this series of survey papers, we explore the relative benefits of these technologies since they are of particular importance to applying AI to domains under significant constraints such as size, weight, and power, both in embedded applications and in data centers.

arXiv:2210.04055v1 [cs.AR] 8 Oct 2022

This paper is an update to IEEE-HPEC papers from the past three years [9]–[11]. As in past years, this paper continues with last year’s focus on accelerators and processors that are geared toward deep neural networks (DNNs) and convolutional neural networks (CNNs), as they are quite computationally intense [12]. This survey focuses on accelerators and processors for inference for a variety of reasons, including that defense and national security AI/ML edge applications rely heavily on inference. We consider all of the numerical precision types that an accelerator supports, but for most of them, their best inference performance is in int8 or fp16/bf16 (IEEE 16-bit floating point or Google’s 16-bit brain float).
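As a rough illustration of the two 16-bit formats named above, the sketch below rounds a value to IEEE fp16 using the half-precision format code in Python's `struct` module, and converts to bf16 by keeping only the upper 16 bits of the fp32 encoding. The helper names `to_fp16` and `to_bf16` are hypothetical, and the bf16 conversion truncates instead of rounding to nearest, which is a simplification rather than how any particular accelerator behaves.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision (1 sign, 5 exponent, 10 fraction bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Truncate x to bfloat16 (1 sign, 8 exponent, 7 fraction bits).

    bfloat16 shares the upper 16 bits of the fp32 encoding; this sketch
    simply zeroes the lower 16 bits (truncation, not round-to-nearest).
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    for value in (0.1, 3.14159265, 70000.0):
        try:
            half = to_fp16(value)
        except OverflowError:
            # struct refuses values beyond the fp16 range (largest finite: 65504)
            half = float("inf")
        print(f"value={value!r} fp16={half!r} bf16={to_bf16(value)!r}")
```

The trade-off this exposes is the usual one: fp16 keeps more fraction bits (finer steps near 1.0), while bf16 keeps the fp32 exponent range and so still represents values such as 70000 that fp16 cannot.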

There are many surveys [13]–[24] and other papers that cover various aspects of AI accelerators. For instance, the first paper in this multi-year survey included the peak performance of FPGAs for certain AI models; however, several of the aforementioned surveys cover FPGAs in depth so they are no longer included in this survey. This multi-year survey effort and this paper focus on gathering a comprehensive list of AI accelerators with their computational capability, power efficiency, and ultimately the computational effectiveness of utilizing accelerators in embedded and data center applications. Along with this focus, this paper mainly compares neural network accelerators that are useful for government and industrial sensor and data processing applications. A few accelerators and processors that were included in previous years’ papers have been left out of this year’s survey. They have been dropped because they have been surpassed by new accelerators from the same company, they are no longer offered, or they are no longer relevant to the topic.

II. SURVEY OF PROCESSORS

Many recent advances in AI can be at least partly credited to advances in computing hardware [6], [7], [25], [26], enabling computationally heavy machine-learning algorithms and in particular DNNs. This survey gathers performance and power information from publicly available materials including research papers, technical trade press, company benchmarks, etc. While there are ways to access information from companies and startups (including those in their silent period), this information is intentionally left out of this survey; such data will be included in this survey when it becomes publicly available. The key metrics of this public data are plotted in Figure 2, which graphs recent processor capabilities (as of July 2022) mapping peak performance vs. power consumption. The dash-dotted box depicts the very dense region that is zoomed in and plotted in Figure 3.

The x-axis indicates peak power, and the y-axis indicates peak giga-operations per second (GOps/s), both on a logarithmic scale. The computational precision of the processing capability is depicted by the geometric shape used; the computational precision spans from analog and single-bit int1 to four-byte int32 and two-byte fp16 to eight-byte fp64. The precisions that show two types denote the precision of the multiplication operations on the left and the precision of the accumulate/addition operations on the right (for example, fp16.32 corresponds to fp16 for multiplication and fp32 for accumulate/add). The form factor is depicted by color, which shows the package for which peak power is reported. Blue corresponds to a single chip; orange corresponds to a card; and green corresponds to entire systems (single node desktop and server systems). This survey is limited to single motherboard, single memory-space systems. Finally, the hollow geometric objects are peak performance for inference-only accelerators, while the solid geometric figures are performance for accelerators that are designed to perform both training and inference.

The survey begins with the same scatter plot that we have compiled for the past three years. As we did last year, to save space, we have summarized some of the important metadata of the accelerators, cards, and systems in Table I, including the label used in Figure 2 for each of the points on the graph; many of the points were brought forward from last year’s plot, and some details of those entries are in [9]. There are several additions which we will cover below.

In Table I, most of the columns and entries are self-explanatory. However, there are two Technology entries that may not be: dataflow and PIM. Dataflow processors are custom-designed processors for neural network inference and training. Since neural network training and inference computations can be entirely deterministically laid out, they are amenable to dataflow processing in which computations, memory accesses, and inter-ALU communications actions are explicitly/statically programmed or “placed-and-routed” onto the computational hardware. Processor in memory (PIM) accelerators integrate processing elements with memory technology. Among such PIM accelerators are those based on an analog computing technology that augments flash memory circuits with in-place analog multiply-add capabilities. Please refer to the references for the Mythic and Gyrfalcon accelerators for more details on this innovative technology.

Finally, a reasonable categorization of accelerators follows their intended application, and the five categories are shown as ellipses on the graph, which roughly correspond to performance and power consumption: Very Low Power for speech processing, very small sensors, etc.; Embedded for cameras, small UAVs and robots, etc.; Autonomous for driver assist services, autonomous driving, and autonomous robots; Data Center Chips and Cards; and Data Center Systems.

For most of the accelerators, their descriptions and commentaries have not changed since last year, so please refer to the last two years’ papers for descriptions and commentaries. There are, however, several new releases that were not covered by past papers that are covered here.

  • Axelera, a Dutch embedded system startup, reported the results of an embedded test chip that they have produced [35]. They claim both digital and analog design capabilities, and this test chip was made to test the extent of the digital design capabilities. They expect to add analog (probably flash) design elements in upcoming efforts.
  • Maxim Integrated has released a system-on-chip (SoC) for ultra low power applications called the MAX78000 [74]–[76], which includes an ARM CPU core, a RISC-V CPU core and an AI accelerator. The

TABLE I: List of accelerator labels for plots.

| Company | Product | Label | Technology | Form Factor | References |
|---|---|---|---|---|---|
| Achronix | VectorPath S7t-VG6 | Achronix | dataflow | Card | [27] |
| Aimotive | aiWare3 | Aimotive | dataflow | Chip | [28] |
| AIStorm | AIStorm | AIStorm | dataflow | Chip | [29] |
| Alibaba | Alibaba | Alibaba | dataflow | Card | [30] |
| AlphaIC | RAP-E | AlphaIC | dataflow | Chip | [31] |
| Amazon | Inferentia | AWS | dataflow | Card | [32], [33] |
| ARM | Ethos N77 | Ethos | dataflow | Chip | [34] |
| Axelera | Axelera Test Core | Axelera | dataflow | Chip | [35] |
| Baidu | Baidu Kunlun 818-300 | Baidu | dataflow | Chip | [36]–[38] |
| Bitmain | BM1880 | Bitmain | dataflow | Chip | [39] |
| Blaize | El Cano | Blaize | dataflow | Card | [40] |
| Canaan | Kendrite K210 | Kendryte | CPU | Chip | [41] |
| Cerebras | CS-1 | CS-1 | dataflow | System | [42] |
| Cerebras | CS-2 | CS-2 | dataflow | System | [43] |
| Cornami | Cornami | Cornami | dataflow | Chip | [44] |
| Enflame | Cloudblazer T10 | Enflame | CPU | Card | [45] |
| Google | TPU Edge | TPUedge | dataflow | System | [46] |
| Google | TPU1 | TPU1 | dataflow | Chip | [47], [48] |
| Google | TPU2 | TPU2 | dataflow | Chip | [47], [48] |
| Google | TPU3 | TPU3 | dataflow | Chip | [47]–[49] |
| Google | TPU4i | TPU4i | dataflow | Chip | [49] |
| Google | TPU4 | TPU4 | dataflow | Chip | [50] |
| GraphCore | C2 | GraphCore | dataflow | Card | [51], [52] |
| GraphCore | C2 | GraphCoreNode | dataflow | System | [53] |
| GraphCore | Colossus Mk2 | GraphCore2 | dataflow | Card | [54] |
| GraphCore | Bow-2000 | GraphCoreBow | dataflow | Card | [55] |
| GreenWaves | GAP8 | GAP8 | dataflow | Chip | [56], [57] |
| GreenWaves | GAP9 | GAP9 | dataflow | Chip | [56], [57] |
| Groq | Groq Node | GroqNode | dataflow | System | [58] |
| Groq | Tensor Streaming Processor | Groq | dataflow | Card | [51], [59] |
| Gyrfalcon | Gyrfalcon | Gyrfalcon | PIM | Chip | [60] |
| Gyrfalcon | Gyrfalcon | GyrfalconServer | PIM | System | [61] |
| Habana | Gaudi | Gaudi | dataflow | Card | [62], [63] |
| Habana | Goya HL-1000 | Goya | dataflow | Card | [63], [64] |
| Hailo | Hailo | Hailo-8 | dataflow | Chip | [65] |
| Horizon Robotics | Journey2 | Journey2 | dataflow | Chip | [66] |
| Huawei HiSilicon | Ascend 310 | Ascend-310 | dataflow | Chip | [67] |
| Huawei HiSilicon | Ascend 910 | Ascend-910 | dataflow | Chip | [68] |
| Intel | Arria 10 1150 | Arria | FPGA | Chip | [69], [70] |
| Intel | Mobileye EyeQ5 | EyeQ5 | dataflow | Chip | [40] |
| Kalray | Coolidge | Kalray | manycore | Chip | [71], [72] |
| Kneron | KL720 | KL720 | dataflow | Chip | [73] |
| Maxim | Max 78000 | Maxim | dataflow | Chip | [74]–[76] |
| Mythic | M1076 | Mythic76 | PIM | Chip | [77]–[79] |
| Mythic | M1108 | Mythic108 | PIM | Chip | [77]–[79] |
| NovuMind | NovuTensor | NovuMind | dataflow | Chip | [80], [81] |
| NVIDIA | Ampere A10 | A10 | GPU | Card | [82] |
| NVIDIA | Ampere A100 | A100 | GPU | Card | [83] |
| NVIDIA | Ampere A30 | A30 | GPU | Card | [82] |
| NVIDIA | Ampere A40 | A40 | GPU | Card | [82] |
| NVIDIA | DGX Station | DGX-Station | GPU | System | [84] |
| NVIDIA | DGX-1 | DGX-1 | GPU | System | [84], [85] |
| NVIDIA | DGX-2 | DGX-2 | GPU | System | [85] |
| NVIDIA | DGX-A100 | DGX-A100 | GPU | System | [86] |
| NVIDIA | H100 | H100 | GPU | Card | [87] |
| NVIDIA | Jetson AGX Xavier | XavierAGX | GPU | System | [88] |
| NVIDIA | Jetson NX Orin | OrinNX | GPU | System | [89], [90] |
| NVIDIA | Jetson AGX Orin | OrinAGX | GPU | System | [89], [90] |
| NVIDIA | Jetson TX1 | Jetson1 | GPU | System | [91] |
| NVIDIA | Jetson TX2 | Jetson2 | GPU | System | [91] |
| NVIDIA | Jetson Xavier NX | XavierNX | GPU | System | [88] |
| NVIDIA | DRIVE AGX L2 | AGX-L2 | GPU | System | [92] |
| NVIDIA | DRIVE AGX L5 | AGX-L5 | GPU | System | [92] |
| NVIDIA | Pascal P100 | P100 | GPU | Card | [93], [94] |
| NVIDIA | T4 | T4 | GPU | Card | [95] |
| NVIDIA | Volta V100 | V100 | GPU | Card | [94], [96] |
| Perceive | Ergo | Perceive | dataflow | Chip | [97] |
| Preferred Networks | MN-3 | Preferred-MN-3 | multicore | Card | [98], [99] |
| Quadric | q1-64 | Quadric | dataflow | Chip | [100] |
| Qualcomm | Cloud AI 100 | Qcomm | dataflow | Card | [101], [102] |
| Rockchip | RK3399Pro | RK3399Pro | dataflow | Chip | [103] |
| SiMa.ai | SiMa.ai | SiMa.ai | dataflow | Chip | [104] |
| Syntiant | NDP101 | Syntiant | PIM | Chip | [105], [106] |
| Tachyum | Prodigy | Tachyum | CPU | Chip | [107] |
| Tenstorrent | Tenstorrent | Tenstorrent | multicore | Card | [108] |
| Tesla | Tesla Full Self-Driving Computer | Tesla | dataflow | System | [109], [110] |
| Texas Instruments | TDA4VM | TexInst | dataflow | Chip | [111]–[113] |
| Toshiba | 2015 | Toshiba | multicore | System | [114] |
| Untether | TsunAImi | TsunAImi | PIM | Card | [115] |

with a second wafer that greatly improves power and clock distribution throughout the GC200 chip [55]. This translates into 40% better performance and 16% better performance-per-Watt.

  • Almost a year after Google announced details of their fourth generation inference-only TPU4i accelerator in June 2021 [49], Google shared details about their fourth generation training accelerator, TPUv4. Very few details were announced, but they did share peak power and performance numbers [50]. As with previous TPU variants, TPU4 is available through the Google Compute Cloud and for internal operations.
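The two GraphCore Bow figures quoted in the list above (40% better performance and 16% better performance-per-Watt) together imply a change in power draw. The short sketch below works that out, assuming both percentages compare the same pair of parts under the same workload, which the text does not state; the function name is hypothetical.

```python
def implied_power_ratio(perf_gain: float, perf_per_watt_gain: float) -> float:
    """Return new_power / old_power implied by two fractional gains.

    perf_gain and perf_per_watt_gain are fractions, e.g. 0.40 for "40% better".
    Since perf_per_watt = perf / power, it follows that
    power = perf / perf_per_watt, and the ratios divide the same way.
    """
    return (1.0 + perf_gain) / (1.0 + perf_per_watt_gain)


if __name__ == "__main__":
    ratio = implied_power_ratio(0.40, 0.16)
    print(f"implied power ratio: {ratio:.3f}")
```

Under that assumption the ratio comes out near 1.21, i.e. roughly a fifth more power for the stated performance gain.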

Next, we must mention accelerators that do not appear on Figure 2 yet. Each has been released with some benchmark results but either no peak performance numbers or no peak power numbers.

  • After releasing some impressive benchmark results for their reconfigurable AI accelerator technology last year [119], and publishing two deeper technology reveals [120], [121] and an applications paper with Argonne National Laboratory [122] this year, SambaNova still has not provided any details from which we can estimate peak performance or power consumption of their solutions.
  • In May 2022, Intel’s Habana Labs announced the second generations of the Goya inference accelerator and Gaudi training accelerator, named Greco and Gaudi2, respectively [123], [124]. Both promised multiple times better performance than their predecessors. Greco will be a single-width PCIe card drawing 75W, while Gaudi2 will continue to be a double-width PCIe card drawing 650W (likely on a PCIe 5.0 slot). Habana released some benchmarking comparisons to Nvidia A100 GPUs for the Gaudi2, but peak performance numbers were not disclosed for either of these accelerators.
  • Esperanto has produced a few demo chips for evaluation by Samsung and other partners [125]. The chip is reported to be a 1,000-core RISC-V processor with each core having an AI tensor accelerator. Esperanto has published a few relative performance metrics [126], but they have not disclosed any peak power or peak performance values.
  • During the Tesla AI Day event, Tesla gave some details of their custom-built Dojo accelerator and system. They did provide a peak performance of 22.6 TF FP32 per chip, but they did not report peak power draw per chip. Perhaps these details will come later [127].

Finally, there is one departure to report this year. Last year, Centaur Technology announced an x86 CPU with an integrated AI accelerator, which was realized as a 4,096 byte-wide SIMD unit. The performance estimates were competitive, but VIA Technologies, the parent company of Centaur, sold off the USA-based engineering team of the processor to Intel Corp. and seems to have ended the development of the CNS processor [128].

III. OBSERVATIONS AND TRENDS

There are several observations to appreciate in Figure 2.

  • Int8 continues to be the default numerical precision for embedded, autonomous and data center inference applications. This precision is adequate for most AI/ML applications with a reasonable number of classes. However, some accelerators also use fp16 and/or bf16 for inference. For training, has become integer representations
  • Among the very low power chips, what is not captured is the other features beyond the machine learning accelerator on the chip. It is very common in this category and the Embedded category to release system-on-chip (SoC) solutions, which often include low-power CPU cores, audio and video analog-to-digital converters (ADCs), encryption engines, network interfaces, etc. These additional features of the SoCs do not change the peak performance metric, but they do have a direct impact on the peak power reported for the chip, so please keep this in mind when comparing them.
  • Not much has changed in the Embedded segment, which probably means that the computational performance and peak power are adequate for the types of applications in this area.
  • The Autonomous and Data Center Chips and Cards segments have become very crowded, which required the zoomed-in Figure 3. Over the past few years, several established embedded computing microelectronics companies including Texas Instruments have released AI accelerators, while NVIDIA has released and announced several more powerful automotive and robotics application systems as mentioned above. Among the Data Center Cards, the PCIe v5 specification is highly anticipated so as to break through the 300W power limit of PCIe v4.
  • Finally, the high-end training systems are not only posting very impressive performance numbers, but those companies have also been announcing highly scalable inter-networking technologies to network thousands of cards together. This is particularly important for dataflow accelerators like Cerebras, GraphCore, Groq, Tesla Dojo, and SambaNova, which are explicitly/statically programmed or “placed-and-routed” onto the computational hardware. It enables these accelerators to accommodate extremely large models like transformers [129].
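To make the phrase "explicitly/statically programmed" a little more concrete, here is a toy sketch of compile-time scheduling: every operation in a small dependency graph is assigned a time step and a compute unit before any data arrives, and that table never changes at run time. This is only an illustration using Python's standard `graphlib`; the graph, the operation names, and the round-robin placement policy are all hypothetical, and real place-and-route tool chains for these accelerators are far more involved.

```python
from graphlib import TopologicalSorter


def static_schedule(graph: dict[str, set[str]], num_units: int) -> list[tuple[int, str, int]]:
    """Return (step, op, unit) triples fixed ahead of execution.

    graph maps each op to the set of ops it depends on. Ops that become
    ready together are placed on distinct compute units; if more ops are
    ready than there are units, the overflow spills into later steps.
    """
    if num_units < 1:
        raise ValueError("num_units must be at least 1")
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    schedule: list[tuple[int, str, int]] = []
    step = 0
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a deterministic layout
        for start in range(0, len(ready), num_units):
            for unit, op in enumerate(ready[start:start + num_units]):
                schedule.append((step, op, unit))
            step += 1
        sorter.done(*ready)
    return schedule


# Hypothetical five-op network: one input layer, two parallel branches, a merge, an output.
TOY_GRAPH = {
    "conv1": set(),
    "conv2a": {"conv1"},
    "conv2b": {"conv1"},
    "add": {"conv2a", "conv2b"},
    "fc": {"add"},
}

if __name__ == "__main__":
    for step, op, unit in static_schedule(TOY_GRAPH, num_units=2):
        print(f"step {step}: {op} on unit {unit}")
```

Because the whole table is known in advance, no run-time arbitration or caching decisions are needed in this toy model, which is the property the text attributes to dataflow designs.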

A. Broader Trends

We also collected release dates, fabrication technology, and peak performance for multiple precisions for a smaller subset of accelerators listed in Table I. We were curious about the trends of peak performance over the past ten years and how numerical precision and fabrication technology influenced it. These data are plotted in Figure 4. Figure 4a plots the release date of a number of accelerators versus their peak performance for one or more precision formats. There are marked gains in peak performance for each of the precision formats, but within each format the maximum gain is 1.5 orders of magnitude over the 10-year period. In Figure 4b, we plot the release date versus the fabrication technology used for the accelerator. The default precision for the peak performance values is int8; however, there are a number of accelerators (e.g., NVIDIA K20, K80 and AMD Mi8) which did not have int8 support. For these accelerators, the peak performance is reported for the lowest precision that the accelerator supported. This plot shows that much performance has been gained over the past ten years by supporting lower precision formats; it is particularly interesting to observe how support for lower precision formats was included in these accelerators as research and industry explore the effectiveness of lower floating point and integer formats in CNN/DNN inference and training.
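The "1.5 orders of magnitude over the 10-year period" figure can be restated as a yearly rate. The sketch below does the arithmetic, assuming steady compound growth over the decade, which is a simplification and not something the plotted data necessarily follows; the function names are hypothetical.

```python
import math


def annual_growth_factor(orders_of_magnitude: float, years: float) -> float:
    """Yearly multiplier g such that g ** years == 10 ** orders_of_magnitude."""
    return 10.0 ** (orders_of_magnitude / years)


def doubling_time_years(orders_of_magnitude: float, years: float) -> float:
    """Years needed to double at the constant rate implied above."""
    return years * math.log10(2.0) / orders_of_magnitude


if __name__ == "__main__":
    factor = annual_growth_factor(1.5, 10.0)
    doubling = doubling_time_years(1.5, 10.0)
    print(f"annual factor: {factor:.2f}x, doubling time: {doubling:.1f} years")
```

Under that assumption, the gain within a single precision format works out to roughly 1.4x per year, or a doubling about every two years.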

Fig. 4: Trends with respect to release date for a subset of publicly announced AI accelerators and processors. (a) Peak performance for various precisions vs. release date. (b) Peak performance and fabrication technology vs. release date. Both panels plot peak performance (TOps/sec, log scale) against release date, 2012 to 2022.

We have several more observations and trends that are not yet captured in graphs. First, the exploration for the best numerical formats for inference and training continues. For inference, some discussion continues whether int4 will be acceptable for embedded inference, and the Maxim MAX78000 SoC solution supports 1-bit, 2-bit, 4-bit, and 8-bit integer weights [75]. On the training side, it has been announced that NVIDIA Hopper, Intel Gaudi2 and a future GraphCore accelerator will support the lower precision FP8 numerical format [130]. GraphCore posted an analysis paper on FP8 [131], including trade-off analyses of scaled integer versus floating point representations, different 8-bit floating point representations, and mixed representation DNN model performance.
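To give a feel for what 1-bit, 2-bit, 4-bit, and 8-bit integer weights mean in practice, the following minimal sketch applies a generic symmetric uniform quantizer at each width. It is not the scheme used by the MAX78000 or any other accelerator named here; the function name, the example weights, and the choice of mean absolute value as the 1-bit scale are all assumptions made for illustration.

```python
def quantize_weights(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Symmetric uniform quantization to signed integer codes.

    Returns (codes, scale) such that each weight is approximated by
    code * scale. For bits == 1 the codes are just the signs (-1 or +1)
    and the scale is the mean absolute weight.
    """
    if bits < 1:
        raise ValueError("bits must be at least 1")
    if not weights:
        return [], 1.0
    if bits == 1:
        scale = sum(abs(w) for w in weights) / len(weights)
        return [1 if w >= 0 else -1 for w in weights], scale
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


if __name__ == "__main__":
    example = [1.0, -0.4, 0.25, 0.0]    # hypothetical weights
    for width in (8, 4, 2, 1):
        codes, scale = quantize_weights(example, width)
        print(f"{width}-bit codes: {codes} (scale {scale:.4f})")
```

With these example weights, the small values collapse to zero at 2 bits and only the sign survives at 1 bit, which is the accuracy-versus-storage trade-off behind the int4 discussion above.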

Another trend that has caught our attention is that mathematical kernels other than DNN/CNN models have been implemented on several dataflow accelerators. These dataflow accelerators generally handle each data item independently (i.e., there are no cache lines), and data movement and computational operations are explicitly/statically programmed or

[8] Y. LeCun, “Deep Learning Hardware: Past, Present, and Future,” in 2019 IEEE International Solid-State Circuits Conference - (ISSCC), feb 2019, pp. 12–19.
[9] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, “AI Accelerator Survey and Trends,” in 2021 IEEE High Performance Extreme Computing Conference (HPEC), sep 2021, pp. 1–9.
[10] ——, “Survey of Machine Learning Accelerators,” in 2020 IEEE High Performance Extreme Computing Conference (HPEC), 2020, pp. 1–12.
[11] ——, “Survey and Benchmarking of Machine Learning Accelerators,” in 2019 IEEE High Performance Extreme Computing Conference, HPEC 2019. Institute of Electrical and Electronics Engineers Inc., sep 2019. [Online]. Available: https://doi.org/10.1109/HPEC.2019.
[12] A. Canziani, A. Paszke, and E. Culurciello, “An Analysis of Deep Neural Network Models for Practical Applications,” arXiv preprint arXiv:1605.07678, 2016. [Online]. Available: http://arxiv.org/abs/1605.07678
[13] C. S. Lindsey and T. Lindblad, “Survey of Neural Network Hardware,” in SPIE 2492, Applications and Science of Artificial Neural Networks, S. K. Rogers and D. W. Ruck, Eds., vol. 2492. International Society for Optics and Photonics, apr 1995, pp. 1194–1205. [Online]. Available: http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=
[14] Y. Liao, “Neural Networks in Hardware: A Survey,” Department of Computer Science, University of California, Tech. Rep., 2001. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.460.
[15] J. Misra and I. Saha, “Artificial Neural Networks in Hardware: A Survey of Two Decades of Progress,” Neurocomputing, vol. 74, no. 1-3, pp. 239–255, dec 2010. [Online]. Available: https://doi.org/10.1016/j.neucom.2010.03.
[16] V. Sze, Y. Chen, T. Yang, and J. S. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, dec 2017. [Online]. Available: https://doi.org/10.1109/JPROC.2017.
[17] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, Efficient Processing of Deep Neural Networks. Morgan and Claypool Publishers, 2020. [Online]. Available: https://doi.org/10.2200/S01004ED1V01Y202004CAC
[18] H. F. Langroudi, T. Pandit, M. Indovina, and D. Kudithipudi, “Digital Neuromorphic Chips for Deep Learning Inference: A Comprehensive Study,” in Applications of Machine Learning, M. E. Zelinski, T. M. Taha, J. Howe, A. A. Awwal, and K. M. Iftekharuddin, Eds. SPIE, sep 2019, p. 9. [Online]. Available: https://doi.org/10.1117/12.
[19] Y. Chen, Y. Xie, L. Song, F. Chen, and T. Tang, “A Survey of Accelerator Architectures for Deep Neural Networks,” Engineering, vol. 6, no. 3, pp. 264–274, mar 2020. [Online]. Available: https://doi.org/10.1016/j.eng.2020.01.
[20] E. Wang, J. J. Davis, R. Zhao, H.-C. C. Ng, X. Niu, W. Luk, P. Y. K. Cheung, and G. A. Constantinides, “Deep Neural Network Approximation for Custom Hardware,” ACM Computing Surveys, vol. 52, no. 2, pp. 1–39, may 2019. [Online]. Available: https://dl.acm.org/doi/10.1145/
[21] S. Khan and A. Mann, “AI Chips: What They Are and Why They Matter,” Georgetown Center for Security and Emerging Technology, Tech. Rep., apr 2020. [Online]. Available: https://cset.georgetown.edu/research/ai-chips-what-they-are-and-why-they-matter/
[22] U. Rueckert, “Digital Neural Network Accelerators,” in NANO-CHIPS 2030: On-Chip AI for an Efficient Data-Driven World, B. Murmann and B. Hoefflinger, Eds. Springer, Cham, 2020, ch. 12, pp. 181–202. [Online]. Available: https://link.springer.com/chapter/10.1007%2F978-3-030-18338-7_12
[23] T. Rogers and M. Khairy, “An Academic’s Attempt to Clear the Fog of the Machine Learning Accelerator War — SIGARCH,” aug 2021. [Online]. Available: https://www.sigarch.org/an-academics-attempt-to-clear-the-fog-of-the-machine-learning-accelerator-war/
[24] F. P. Sunny, E. Taheri, M. Nikdast, and S. Pasricha, “A Survey on Silicon Photonics for Deep Learning,” ACM Journal on Emerging Technologies in Computing Systems, vol. 17, no. 4, oct 2021. [Online]. Available: https://dl.acm.org/doi/10.1145/
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Neural Information Processing Systems, vol. 25, 2012.
[26] N. P. Jouppi, C. Young, N. Patil, and D. Patterson, “A Domain-Specific Architecture for Deep Neural Networks,” Communications of the ACM, vol. 61, no. 9, pp. 50–59, aug 2018. [Online]. Available: http://doi.acm.org/10.1145/
[27] G. Roos, “FPGA Acceleration Card Delivers on Bandwidth, Speed, and Flexibility,” nov 2019. [Online]. Available: https://www.eetimes.com/fpga-acceleration-card-delivers-on-bandwidth-speed-and-flexibility/
[28] “aiWare3 Hardware IP Helps Drive Autonomous Vehicles To Production,” oct 2018. [Online]. Available: https://aimotive.com/news/content/
[29] R. Merritt, “Startup Accelerates AI at the Sensor,” feb 2019. [Online]. Available: https://www.eetimes.com/startup-accelerates-ai-at-the-sensor/
[30] T. Peng, “Alibaba’s New AI Chip Can Process Nearly 80K Images Per Second,” 2019. [Online]. Available: https://medium.com/syncedreview/alibabas-new-ai-chip-can-process-nearly-80k-images-per-second-63412dec22a
[31] P. Clarke, “Indo-US Startup Preps Agent-based AI Processor,” aug

  1. [Online]. Available: https://www.eenewsanalog.com/news/indo- us-startup-preps-agent-based-ai-processor/page/0/ [32] J. Hamilton, “AWS Inferentia Machine Learning Processor,” nov
  2. [Online]. Available: https://perspectives.mvdirona.com/2018/11/ aws-inferentia-machine-learning-processor/ [33] C. Evangelist, “Deep dive into Amazon Inferentia: A Custom- Built Chip to Enhance ML and AI,” jan 2020. [On- line]. Available: https://www.cloudmanagementinsider.com/amazon- inferentia-for-machine-learning-and-artificial-intelligence/ [34] D. Schor, “Arm Ethos is for Ubiquitous AI At the Edge,” feb 2020. [Online]. Available: https://fuse.wikichip.org/news/3282/arm-ethos-is- for-ubiquitous-ai-at-the-edge/ [35] S. Ward-Foxton, “Axelera Demos AI Test Chip After Taping Out in Four Months,” may 2022. [Online]. Available: https://www.eetimes. com/axelera-demos-ai-test-chip-after-taping-out-in-four-months/ [36] J. Ouyang, X. Du, Y. Ma, and J. Liu, “Kunlun: A 14nm High- Performance AI Processor for Diversified Workloads,” in 2021 IEEE International Solid- State Circuits Conference (ISSCC), vol. 64, feb 2021, pp. 50–51. [37] R. Merritt, “Baidu Accelerator Rises in AI,” jul 2018. [Online]. Available: https://www.eetimes.com/baidu-accelerator-rises-in-ai/ [38] C. Duckett, “Baidu Creates Kunlun Silicon for AI,” jul 2018. [Online]. Available: https://www.zdnet.com/article/baidu-creates-kunlun-silicon- for-ai/ [39] B. Wheeler, “Bitmain SoC Brings AI to the Edge,” feb 2019. [Online]. Available: https://www.linleygroup.com/newsletters/newsletter detail. php%3Fnum=5975%26year=2019%26tag= [40] M. Demler, “Blaize Ignites Edge-AI Performance,” The Linley Group, Tech. Rep., sep 2020. [Online]. Available: https://www.blaize.com/ wp-content/uploads/2020/09/Blaize-Ignites-Edge-AI-Performance.pdf [41] L. Gwennap, “Kendryte Embeds AI for Surveillance,” mar
  3. [Online]. Available: https://www.linleygroup.com/newsletters/ newsletter detail.php?num= [42] A. Hock, “Introducing the Cerebras CS-1, the Industry’s Fastest Artificial Intelligence Computer,” nov 2019. [Online]. Available: https://www.cerebras.net/introducing-the-cerebras-cs-1-the- industrys-fastest-artificial-intelligence-computer/ [43] T. Trader, “Cerebras Doubles AI Performance with Second- Gen 7nm Wafer Scale Engine,” apr 2021. [Online]. Available: https://www.hpcwire.com/2021/04/20/cerebras-doubles-ai- performance-with-second-gen-7nm-wafer-scale-engine/ [44] “Cornami Achieves Unprecedented Performance at Lowest Power Dissipation for Deep Neural Networks,” oct 2019. [Online]. Available: https://cornami.com/1416-2/ [45] P. Clarke, “GlobalFoundries Aids Launch of Chinese AI Startup,” dec 2019. [Online]. Available: https://www.eenewsanalog.com/news/ globalfoundries-aids-launch-chinese-ai-startup [46] “Edge TPU,” 2019. [Online]. Available: https://cloud.google.com/ edge-tpu/ [47] N. P. Jouppi, D. H. Yoon, G. Kurian, S. Li, N. Patil, J. Laudon, C. Young, and D. Patterson, “A Domain-Specific Supercomputer for Training Deep Neural Networks,” Commun. ACM, vol. 63, no. 7, pp. 67–78, jun 2020. [Online]. Available: https://doi.org/10.1145/ [48] P. Teich, “Tearing Apart Google’s TPU 3.0 AI Coprocessor,” may
  4. [Online]. Available: https://www.nextplatform.com/2018/05/10/ tearing-apart-googles-tpu-3-0-ai-coprocessor/ [49] N. P. Jouppi, D. H. Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin, G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma, T. Norrie, N. Patil, S. Prasad, C. Young, Z. Zhou, D. Patterson, and G. Llc, “Ten Lessons From Three Generations Shaped Google’s TPUv4i,” in Proc. of 2021 ACM/IEEE 48th Annual International Symposium on Computer Archi- tecture (ISCA). IEEE Computer Society, jun 2021, pp. 1–14.

[50] O. Peckham, “Google Cloud’s New TPU v4 ML Hub Packs 9 Exaflops of AI,” may 2022. [Online]. Available: https://www.hpcwire.com/2022/ 05/16/google-clouds-new-tpu-v4-ml-hub-packs-9-exaflops-of-ai/ [51] L. Gwennap, “Groq Rocks Neural Networks,” Micropro- cessor Report, Tech. Rep., jan 2020. [Online]. Avail- able: http://groq.com/wp-content/uploads/2020/04/Groq-Rocks-NNs- Linley-Group-MPR-2020Jan06.pdf [52] D. Lacey, “Preliminary IPU Benchmarks,” oct 2017. [Online]. Available: https://www.graphcore.ai/posts/preliminary- ipu-benchmarks-providing-previously-unseen-performance-for-a- range-of-machine-learning-applications [53] “Dell DSS8440 Graphcore IPU Server,” Graphcore, Tech. Rep., feb 2020. [Online]. Available: https://www.graphcore.ai/hubfs/ Leadgenassets/DSS8440IPUServerWhitePaper 2020.pdf [54] S. Ward-Foxton, “Graphcore Takes on Nvidia with Second-Gen AI Accelerator,” jul 2020. [Online]. Available: https://www.eetimes.com/ graphcore-takes-on-nvidia-with-second-gen-ai-accelerator/ [55] M. Tyson, “Graphcore Bow IPU Introduces TSMC 3D Wafer-on-Wafer Processor,” mar 2022. [Online]. Available: https://www.tomshardware. com/news/graphcore-tsmc-bow-ipu-3d-wafer-on-wafer-processor [56] “GAP Application Processors,” 2020. [Online]. Available: https: //greenwaves-technologies.com/gap8 gap9/ [57] J. Turley, “GAP9 for ML at the Edge,” jun 2020. [Online]. Available: https://www.eejournal.com/article/gap9-for-ml-at-the-edge/ [58] N. Hemsoth, “Groq Shares Recipe for TSP Nodes, Systems,” sep

  1. [Online]. Available: https://www.nextplatform.com/2020/09/29/ groq-shares-recipe-for-tsp-nodes-systems/ [59] D. Abts, J. Ross, J. Sparling, M. Wong-VanHaren, M. Baker, T. Hawkins, A. Bell, J. Thompson, T. Kahsai, G. Kimmell, J. Hwang, R. Leslie-Hurd, M. Bye, E. R. Creswick, M. Boyd, M. Venigalla, E. Laforge, J. Purdy, P. Kamath, D. Maheshwari, M. Beidler, G. Rosseel, O. Ahmad, G. Gagarin, R. Czekalski, A. Rane, S. Parmar, J. Werner, J. Sproch, A. Macias, and B. Kurtz, “Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), may 2020, pp. 145–158. [Online]. Available: https://doi.org/10.1109/ISCA45697.2020. [60] S. Ward-Foxton, “Gyrfalcon Unveils Fourth AI Accelerator Chip — EE Times,” nov 2019. [Online]. Available: https://www.eetimes.com/ gyrfalcon-unveils-fourth-ai-accelerator-chip/ [61] “SolidRun, Gyrfalcon Develop Arm-based Edge Op- timized AI Inference Server,” feb 2020. [Online]. Available: https://www.hpcwire.com/off-the-wire/solidrun-gyrfalcon- develop-edge-optimized-ai-inference-server/ [62] L. Gwennap, “Habana Offers Gaudi for AI Training,” Microprocessor Report, Tech. Rep., jun 2019. [Online]. Available: https://habana.ai/wp- content/uploads/2019/06/Habana-Offers-Gaudi-for-AI-Training.pdf [63] E. Medina and E. Dagan, “Habana Labs Purpose-Built AI Inference and Training Processor Architectures: Scaling AI Training Systems Using Standard Ethernet With Gaudi Processor,” IEEE Micro, vol. 40, no. 2, pp. 17–24, mar 2020. [Online]. Available: https://doi.org/10.1109/MM.2020. [64] L. Gwennap, “Habana Wins Cigar for AI Inference,” feb 2019. [Online]. Available: https://www.linleygroup.com/mpr/article.php?id= 12103 [65] S. Ward-Foxton, “Details of Hailo AI Edge Accelerator Emerge,” aug
  2. [Online]. Available: https://www.eetimes.com/details-of-hailo- ai-edge-accelerator-emerge/ [66] “Horizon Robotics Journey2 Automotive AI Processor Series,” 2020. [Online]. Available: https://en.horizon.ai/product/journey [67] Huawei, “Ascend 310 AI Processor,” 2020. [Online]. Available: https: //e.huawei.com/us/products/cloud-computing-dc/atlas/ascend- [68] ——, “Ascend 910 AI Processor,” 2020. [Online]. Available: https: //e.huawei.com/us/products/cloud-computing-dc/atlas/ascend- [69] M. S. Abdelfattah, D. Han, A. Bitar, R. DiCecco, S. O’Connell, N. Shanker, J. Chu, I. Prins, J. Fender, A. C. Ling, and G. R. Chiu, “DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration,” in 2018 28th International Conference on Field Programmable Logic and Applications (FPL), aug 2018, pp. 411–
  3. [Online]. Available: https://doi.org/10.1109/FPL.2018. [70] N. Hemsoth, “Intel FPGA Architecture Focuses on Deep Learning Inference,” jul 2018. [On- line]. Available: https://www.nextplatform.com/2018/07/31/intel-fpga- architecture-focuses-on-deep-learning-inference/ [71] B. Dupont de Dinechin, “Kalray’s MPPA® Manycore Processor: At the Heart of Intelligent Systems,” in 17th IEEE International New Circuits and Systems Conference (NEWCAS). Munich: IEEE, jun
  4. [Online]. Available: https://www.european-processor-initiative. eu/dissemination-material/1259/ [72] P. Clarke, “NXP, Kalray Demo Coolidge Parallel Processor in ’BlueBox’,” jan 2020. [Online]. Available: https://www.eenewsanalog. com/news/nxp-kalray-demo-coolidge-parallel-processor-bluebox [73] S. Ward-Foxton, “Kneron Attracts Strategic Investors,” jan
  5. [Online]. Available: https://www.eetimes.com/kneron-attracts- strategic-investors/ [74] ——, “Maxim Debuts Homegrown AI Accelerator in Latest ULP SoC,” nov 2020. [Online]. Available: https://www.eetimes.com/maxim- debuts-homegrown-ai-accelerator-in-latest-ulp-soc/ [75] A. Jani, “Maxim Showcases Efficient Custom AI,” feb 2021. [Online]. Available: https://www.linleygroup.com/newsletters/newsletter detail. php?num=6274&year=2021&tag= [76] M. Clay, C. Grecos, M. Shirvaikar, and B. Richey, “Benchmarking the MAX78000 Artificial Intelligence Microcontroller for Deep Learning Applications,” in Real-Time Image Processing and Deep Learning 2022, N. Kehtarnavaz and M. F. Carlsohn, Eds., vol. 12102, International Society for Optics and Photonics. SPIE, 2022, pp. 47–52. [Online]. Available: https://doi.org/10.1117/12. [77] S. Ward-Foxton, “Mythic Resizes its AI Chip,” jun 2021. [Online]. Available: https://www.eetimes.com/mythic-resizes-its-analog-ai-chip/ [78] N. Hemsoth, “A Mythic Approach to Deep Learning Inference,” aug
  6. [Online]. Available: https://www.nextplatform.com/2018/08/23/ a-mythic-approach-to-deep-learning-inference/ [79] D. Fick, “Mythic @ Hot Chips 2018,” aug 2018. [Online]. Available: https://medium.com/mythic-ai/mythic-hot-chips-2018-637dfb9e38b [80] K. Freund, “NovuMind: An Early Entrant in AI Silicon,” Moor Insights & Strategy, Tech. Rep., may 2019. [Online]. Available: https: //moorinsightsstrategy.com/wp-content/uploads/2019/05/NovuMind- An-Early-Entrant-in-AI-Silicon-By-Moor-Insights-And-Strategy.pdf [81] J. Yoshida, “NovuMind’s AI Chip Sparks Controversy,” oct
  7. [Online]. Available: https://www.eetimes.com/novuminds-ai- chip-sparks-controversy/ [82] T. P. Morgan, “Nvidia Rounds Out ”Ampere” Lineup With Two New Accelerators,” apr 2021. [Online]. Available: https://www.nextplatform.com/2021/04/15/nvidia-rounds- out-ampere-lineup-with-two-new-accelerators/ [83] R. Krashinsky, O. Giroux, S. Jones, N. Stam, and S. Ramaswamy, “NVIDIA Ampere Architecture In-Depth,” may 2020. [Online]. Available: https://devblogs.nvidia.com/nvidia-ampere-architecture-in- depth/ [84] P. Alcorn, “Nvidia Infuses DGX-1 with Volta, Eight V100s in a Single Chassis,” may 2017. [Online]. Available: https://www.tomshardware. com/news/nvidia-volta-v100-dgx-1-hgx-1,34380.html [85] I. Cutress, “NVIDIA’s DGX-2: Sixteen Tesla V100s, 30TB of NVMe, Only $400K,” mar 2018. [Online]. Avail- able: https://www.anandtech.com/show/12587/nvidias-dgx2-sixteen- v100-gpus-30-tb-of-nvme-only-400k [86] C. Campa, C. Kawalek, H. Vo, and J. Bessoudo, “Defining AI Innovation with NVIDIA DGX A100,” may 2020. [Online]. Available: https://devblogs.nvidia.com/defining-ai-innovation-with-dgx-a100/ [87] R. Smith, “NVIDIA Hopper GPU Architecture and H Accelerator Announced: Working Smarter and Harder,” mar 2022. [Online]. Available: https://www.anandtech.com/show/17327/nvidia- hopper-gpu-architecture-and-h100-accelerator-announced [88] ——, “NVIDIA Gives Jetson AGX Xavier a Trim, Announces Nano-Sized Jetson Xavier NX,” nov 2019. [Online]. Available: https://www.anandtech.com/show/15070/nvidia- gives-jetson-xavier-a-trim-announces-nanosized-jetson-xavier-nx [89] B. Funk, “NVIDIA Jetson AGX Orin: The Next-Gen Platform That Will Power Our AI Robot Overlords Unveiled,” mar 2022. [Online]. Available: https://hothardware.com/news/nvidia-jetson-agx-orin [90] “Jetson AGX Orin for Next-Gen Robotics,” 2022. [Online]. Avail- able: https://www.nvidia.com/en-us/autonomous-machines/embedded- systems/jetson-orin/ [91] D. 
Franklin, “NVIDIA Jetson TX2 Delivers Twice the Intelligence to the Edge,” mar 2017. [Online]. Available: https://developer.nvidia. com/blog/jetson-tx2-delivers-twice-intelligence-edge/ [92] B. Hill, “NVIDIA Unveils Ampere-Infused DRIVE AGX For Autonomous Cars, Isaac Robotics Platform With BMW Partnership,” may 2022. [Online]. Available: https://hothardware.com/news/nvidia- drive-agx-pegasus-orin-ampere-next-gen-autonomous-cars [93] “NVIDIA Tesla P100.” [Online]. Available: https://www.nvidia.com/ en-us/data-center/tesla-p100/

vol. 9, pp. 93 422–93 432, feb 2020. [Online]. Available: https: //arxiv.org/abs/2002.03260v [140] T. Lu, T. Marin, Y. Zhuo, Y. F. Chen, and C. Ma, “Accelerating MRI Reconstruction on TPUs,” 2020 IEEE High Performance Extreme Computing Conference (HPEC 2020), sep 2020. [Online]. Available: https://arxiv.org/abs/2006.14080v [141] F. Belletti, D. King, K. Yang, R. Nelet, Y. Shafi, Y.-F. Chen, and J. Anderson, “Tensor Processing Units for Financial Monte Carlo,” in Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing. Society for Industrial and Applied Mathematics, jun 2019, pp. 12–23. [Online]. Available: https://arxiv.org/abs/1906.02818v [142] K. Yang, Y. F. Chen, G. Roumpos, C. Colby, and J. Anderson, “High performance Monte Carlo simulation of ising model on TPU clusters,” in International Conference for High Performance Computing, Networking, Storage and Analysis, SC. IEEE Computer Society, nov

  1. [Online]. Available: https://arxiv.org/abs/1903.11714v [143] C. D. Schuman, T. E. Potok, R. M. Patton, J. D. Birdwell, M. E. Dean, G. S. Rose, and J. S. Plank, “A Survey of Neuromorphic Computing and Neural Networks in Hardware,” arXiv preprint arXiv:1705.06963, may 2017. [Online]. Available: http://arxiv.org/abs/1705. [144] C. D. James, J. B. Aimone, N. E. Miner, C. M. Vineyard, F. H. Rothganger, K. D. Carlson, S. A. Mulder, T. J. Draelos, A. Faust, M. J. Marinella, J. H. Naegle, and S. J. Plimpton, “A Historical Survey of Algorithms and Hardware Architectures for Neural-inspired and Neuromorphic Computing Applications,” Biologically Inspired Cognitive Architectures, vol. 19, pp. 49–64, jan 2017. [Online]. Available: https://www.sciencedirect.com/science/ article/abs/pii/S2212683X [145] R. F. Service, “Microchips That Mimic the Human Brain Could Make AI Far More Energy Efficient,” may 2022. [Online]. Available: https://www.science.org/content/article/microchips-mimic- human-brain-could-make-ai-far-more-energy-efficient [146] G. Orchard, E. P. Frady, D. B. D. Rubin, S. Sanborn, S. B. Shrestha, F. T. Sommer, and M. Davies, “Efficient Neuromorphic Signal Pro- cessing with Loihi 2,” in 2021 IEEE Workshop on Signal Processing Systems (SiPS), oct 2021, pp. 254–259. [147] M. Davies, A. Wild, G. Orchard, Y. Sandamirskaya, G. A. F. Guerra, P. Joshi, P. Plank, and S. R. Risbud, “Advancing Neuromorphic Computing With Loihi: A Survey of Results and Outlook,” Proceedings of the IEEE, vol. 109, no. 5, pp. 911–934, may 2021. [148] M. Barnell, C. Raymond, M. Wilson, D. Isereau, and C. Cicotta, “Target Classification in Synthetic Aperture Radar and Optical Imagery Using Loihi Neuromorphic Hardware,” in 2020 IEEE High Perfor- mance Extreme Computing Conference (HPEC), 2020, pp. 1–6. [149] A. Viale, A. Marchisio, M. Martina, G. Masera, and M. 
Shafique, “CarSNN: An Efficient Spiking Neural Network for Event-Based Autonomous Cars on the Loihi Neuromorphic Research Processor,” in 2021 International Joint Conference on Neural Networks (IJCNN), jul 2021, pp. 1–10. [150] S. Ward-Foxton, “Innatera Unveils Neuromorphic AI Chip to Accelerate Spiking Networks,” jul 2021. [Online]. Available: https://www.eetimes.com/innatera-unveils-neuromorphic-ai- chip-to-accelerate-spiking-networks/ [151] M. Levy, “Innatera’s Spiking Neural Processor,” apr 2021. [Online]. Available: https://www.linleygroup.com/newsletters/newsletter detail. php?num=6302&year=2021&tag= [152] V. Ostrovskii, P. Fedoseev, Y. Bobrova, and D. Butusov, “Structural and Parametric Identification of Knowm Memristors,” Nanomaterials, vol. 12, no. 1, jan 2022. [Online]. Avail- able: /pmc/articles/PMC8746671//pmc/articles/PMC8746671/?report= abstracthttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC8746671/ [153] S. Ward-Foxton, “Optical Compute Promises Game- Changing AI Performance,” aug 2020. [Online]. Available: https://www.eetimes.com/optical-compute-promises-game-changing- ai-performance/?utm source=eetimes&utm medium=networksearch [154] ——, “Optical Chip Solves Hardest Math Problems Faster than GPUs,” dec 2021. [Online]. Available: https://www.eetimes.com/optical- computing-chip-runs-hardest-math-problems-100x-faster-than-gpus/ [155] J. Launay, I. Poli, K. M¨uller, I. Carron, L. Daudet, F. Krzakala, and S. Gigan, “Light-in-the-Loop: Using a Photonics Co-Processor for Scalable Training of Neural Networks,” arXiv preprint, jun 2020. [Online]. Available: https://arxiv.org/abs/2006.01475v [156] E. Cottle, F. Michel, J. Wilson, N. New, and I. Kundu, “Optical Convolutional Neural Networks – Combining Silicon Photonics and Fourier Optics for Computer Vision,” arXiv preprint, dec 2020. [Online]. Available: https://arxiv.org/abs/2103.09044v

[157] J. Wilson, “The Multiply and Fourier Transform Unit: A Micro-Scale Optical Processor,” Optalysys, Tech. Rep., dec 2020. [Online]. Available: https://optalysys.com/s/Multiply and Fourier Transform white paper 12 12 20.pdf [158] D. Schneider, “A Neural-Net Based on Light Could Best Digital Computers,” jun 2019. [Online]. Available: https://spectrum.ieee.org/ a-neural-net-based-on-light-could-best-digital-computers [159] C. Q. Choi, “Photonic Chip Performs Image Recognition at the Speed of Light,” jun 2022. [Online]. Available: https://spectrum.ieee.org/ photonic-neural-network