A domain-specific architecture (DSA) is a programmable computer architecture specifically tailored to operate very efficiently within the confines of a given application domain. The term is often used in contrast to general-purpose architectures, such as CPUs, that are designed to operate on any computer program.

History

In conjunction with the semiconductor boom that started in the 1960s, computer architects were tasked with finding new ways to exploit the increasingly large number of transistors available. Moore's Law and Dennard Scaling enabled architects to focus on improving the performance of general-purpose microprocessors on general-purpose programs.

These efforts yielded several technological innovations, such as multi-level caches, out-of-order execution, deep instruction pipelines, multithreading, and multiprocessing. The impact of these innovations was measured on generalist benchmarks such as SPEC, and architects were not concerned with the internal structure or specific characteristics of the programs themselves.

The end of Dennard Scaling pushed computer architects to switch from a single, very fast processor to several processor cores: performance improvement could no longer be achieved by simply increasing the operating frequency of a single core. The end of Moore's Law then shifted the focus away from general-purpose architectures towards more specialized hardware. Although general-purpose CPUs will likely have a place in any computer system, heterogeneous systems composed of general-purpose and domain-specific components are the most recent trend for achieving high performance.

While hardware accelerators and ASICs have been used in very specialized application domains since the inception of the semiconductor industry, they generally implement a specific function with very limited flexibility. In contrast, the shift towards domain-specific architectures aims to achieve a better balance of flexibility and specialization.

A notable early example of a domain-specific programmable architecture is the GPU. This specialized hardware was developed to operate within the domain of image processing and computer graphics, and it found widespread adoption in both gaming consoles and personal computers. With the improvement of the hardware/software stack for both NVIDIA and AMD GPUs, these architectures are being used more and more for the acceleration of massively and embarrassingly parallel tasks, even outside of the domain of image processing.

Since the renaissance of machine-learning-based artificial intelligence in the 2010s, several domain-specific architectures have been developed to accelerate inference for different forms of artificial neural networks. Some examples are Google's TPU, NVIDIA's NVDLA and ARM's MLP.
Guidelines for DSA design

John Hennessy and David Patterson outlined five principles for DSA design that lead to better area efficiency and energy savings. The objective in these types of architecture is often also to reduce the Non-Recurring Engineering (NRE) costs, so that the investment in a specialized solution can be more easily amortized.

Minimize the distance over which data is moved: moving data in general-purpose memory hierarchies requires a remarkable amount of energy in the attempt to minimize the latency of accessing data. In Domain-Specific Architectures, it is expected that the hardware and compiler designers' understanding of the application domain allows for simpler and specialized memory hierarchies, where data movement is largely handled in software, with tailor-made memories for specific functions within the domain.

Invest saved resources into arithmetic units or bigger memories: a remarkable amount of hardware resources can be saved by dropping general-purpose architectural optimizations such as out-of-order execution, prefetching, address coalescing, and hardware speculation. The resources saved should be re-invested to maximally exploit the available parallelism, for example by adding more arithmetic units, or to solve memory-bandwidth issues by adding bigger memories.

Use the easiest form of parallelism that matches the domain: since the target application domains almost always present an inherent form of parallelism, it is important to decide how to take advantage of this parallelism and expose it to the software. If, for example, a SIMD architecture can work in the domain, it would be easier for the programmer to use than a MIMD architecture.
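The contrast between the SIMD and MIMD programming models can be sketched in software. The example below uses NumPy array expressions as a stand-in for SIMD lanes; the function names are purely illustrative and do not come from any particular DSA toolchain.

```python
import numpy as np

def scale_and_shift_scalar(x, a, b):
    # Explicit per-element loop: the programmer manages indexing,
    # much as they would coordinate independent MIMD tasks.
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + b
    return out

def scale_and_shift_simd(x, a, b):
    # SIMD-style view: one data-parallel expression applied to every
    # element at once; the "lanes" live below the programming model.
    return a * x + b

x = np.linspace(0.0, 1.0, 8, dtype=np.float32)
assert np.allclose(scale_and_shift_scalar(x, 2.0, 1.0),
                   scale_and_shift_simd(x, 2.0, 1.0))
```

Both formulations compute the same result; the data-parallel one is simply easier to write and to map onto wide arithmetic units, which is the point of the principle.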
Reduce data size and type to the simplest needed for the domain: whenever possible, using narrower and simpler data types yields several advantages. For example, it reduces the cost of moving data for memory-bound applications, and it can also reduce the amount of resources required to implement the respective arithmetic units.
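As a minimal illustration of this principle, the sketch below quantizes hypothetical 32-bit floating-point activations to 8-bit integers, a common choice for inference accelerators: the payload shrinks four-fold and the multipliers become much cheaper, at the cost of a bounded quantization error. The symmetric linear scheme shown is one simple option, not the method of any specific chip.

```python
import numpy as np

# Hypothetical float32 activations (illustrative values only).
acts = np.array([0.12, -0.5, 0.73, 0.0], dtype=np.float32)

# Symmetric linear quantization to int8: map the largest magnitude
# to 127 and round everything else onto that grid.
scale = np.max(np.abs(acts)) / 127.0
q = np.clip(np.round(acts / scale), -127, 127).astype(np.int8)

# Approximate reconstruction; the error is at most half a step.
deq = q.astype(np.float32) * scale

assert q.itemsize * q.size == acts.nbytes // 4   # 4x smaller payload
assert np.max(np.abs(deq - acts)) < scale        # bounded error
```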
Use a domain-specific programming language to port code to the DSA: one of the challenges for DSAs is ease of use, and more specifically, being able to effectively program the architecture and run applications on it. Whenever possible, it is advised to use existing Domain-Specific Languages (DSLs) such as Halide and TensorFlow to more easily program a DSA. Re-use of existing compiler toolchains and software frameworks makes using a new DSA significantly more accessible.

See also: Domain-specific language
DSA for deep neural networks

One of the application domains where DSAs have found the greatest success is that of artificial intelligence. In particular, several architectures have been developed for the acceleration of Deep Neural Networks (DNNs). The following sections report some examples.

Tensor Processing Unit

See also: Tensor Processing Unit

[Infobox: Tensor Processing Unit 3.0. Designer: Google. Introduced: May 2016. Type: Neural network, Machine learning]

Google's Tensor Processing Unit (TPU) was developed in 2015 to accelerate DNN inference, since the company projected that the use of voice search would require doubling the computational resources allocated at the time for neural network inference.

The TPU was designed to be a co-processor communicating via a PCIe bus, so that it could be easily incorporated into existing servers. It is primarily a matrix-multiplication engine following a CISC (Complex Instruction Set Computer) ISA. The multiplication engine uses systolic execution to save energy, reducing the number of writes to SRAM.

The TPU was fabricated with a 28-nm process and clocked at 700 MHz. The portion of the application that runs on the TPU is implemented in TensorFlow. The TPU computes primarily on reduced-precision integers, which further contributes to energy savings and increased performance.
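Systolic execution can be sketched in software. In the cycle-level simulation below, operands stream across a 2D grid of processing elements, each of which performs one multiply-accumulate per cycle, so each value fetched from memory is reused as it flows from neighbor to neighbor. This is a minimal output-stationary model for illustration only; it does not mirror the TPU's actual weight-stationary datapath.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary systolic array.

    A is (m, k) and B is (k, n). Processing element (i, j)
    accumulates C[i, j] as operands stream past it."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    a_reg = np.zeros((m, n))   # operand registers flowing left-to-right
    b_reg = np.zeros((m, n))   # operand registers flowing top-to-bottom
    for t in range(m + n + k - 2):
        # Advance the pipeline one cycle: shift operands one PE over.
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Feed skewed inputs at the edges: row i (column j) is delayed
        # by i (j) cycles so matching operands meet at PE (i, j).
        for i in range(m):
            s = t - i
            a_reg[i, 0] = A[i, s] if 0 <= s < k else 0.0
        for j in range(n):
            s = t - j
            b_reg[0, j] = B[s, j] if 0 <= s < k else 0.0
        C += a_reg * b_reg     # every PE does one multiply-accumulate
    return C
```

Compared with a naive triple loop, nothing about the arithmetic changes; what the systolic arrangement buys in hardware is locality, since each operand is read from memory once and then handed between neighboring units.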
Microsoft Catapult

Microsoft's Project Catapult put an FPGA connected through a PCIe bus into data center servers, with the idea of using the FPGA to accelerate various applications running on the server, leveraging the reconfiguration capabilities of FPGAs to accelerate many different workloads.

Differently from Google's TPU, the Catapult FPGA needed to be programmed via hardware-description languages such as Verilog and VHDL. For this reason, a major concern for the authors of the framework was the limited programmability.

Microsoft also designed a CNN accelerator for the Catapult framework, primarily intended to accelerate the ranking function in the Bing search engine. The proposed architecture provided a runtime-reconfigurable design based on a two-dimensional systolic array.

NVDLA

NVDLA is NVIDIA's deep-learning inference accelerator. It is an open-source hardware design available in a number of highly parametrizable configurations. The small-NVDLA model is designed to be deployed in resource-constrained scenarios such as IoT, where cost, area and power are the main concerns. Conversely, the large-NVDLA model is more suitable for HPC scenarios. NVDLA provides its own dedicated training infrastructure, compilation tools and runtime software stack.
DSA for other domains

Aside from applications in artificial intelligence, DSAs are being adopted in many domains within scientific computing, image processing, and networking.

Pixel Visual Core

See also: Pixel Visual Core

The Pixel Visual Core (PVC) is a series of ARM-based image processors designed by Google. The PVC is a fully programmable image, vision and AI multi-core domain-specific architecture (DSA), for mobile devices and, in the future, for IoT. It first appeared in the Google Pixel 2 and 2 XL, which were introduced on October 19, 2017. It also appeared in the Google Pixel 3 and 3 XL. Starting with the Pixel 4, this chip was replaced with the Pixel Neural Core.

Anton3

[Figure: The architecture of the Anton3 specialized cores. Geometry Cores carry out general-purpose computation, while specialized hardware accelerates force-field computation.]

Anton3 is a DSA designed to efficiently compute molecular-dynamics simulations. It uses a specialized 3D torus topology interconnection network to connect several computing nodes. Each computing node contains a set of 64 cores interconnected through a mesh. The cores implement a specialized deep pipeline to efficiently compute the force fields between molecules. This heterogeneous system combines general-purpose hardware and domain-specific components to achieve record-breaking simulation speed.
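The kind of computation such pipelines accelerate can be sketched in a few lines. As a generic illustration (not Anton3's actual force field or numerics), the code below evaluates Lennard-Jones pairwise forces with a naive O(N²) loop; a dedicated deep pipeline turns the body of this inner loop into fixed-function hardware evaluated at a very high rate.

```python
import numpy as np

def lj_forces(pos, epsilon=1.0, sigma=1.0):
    """Naive O(N^2) Lennard-Jones pairwise forces.

    A generic stand-in for the pairwise multiply-accumulate kernel
    that molecular-dynamics machines cast into dedicated pipelines."""
    n = pos.shape[0]
    forces = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            r = pos[i] - pos[j]
            d2 = float(np.dot(r, r))
            inv6 = (sigma * sigma / d2) ** 3
            # Gradient of the Lennard-Jones potential, recast as a
            # vector force along r: F = 24*eps*(2*inv^12 - inv^6)/d2 * r
            f = 24.0 * epsilon * (2.0 * inv6 * inv6 - inv6) / d2 * r
            forces[i] += f      # Newton's third law: each pair
            forces[j] -= f      # contributes equal and opposite forces
    return forces
```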
References

1. Hennessy, John L.; Patterson, David A. (2019). Computer architecture: a quantitative approach. Krste Asanović (6th ed.). Cambridge, Mass.: Morgan Kaufmann Publishers, an imprint of Elsevier. p. 540. ISBN 978-0-12-811905-1.
2. Moore, G.E. (January 1998). "Cramming More Components Onto Integrated Circuits". Proceedings of the IEEE. 86 (1): 82–85. doi:10.1109/jproc.1998.658762. ISSN 0018-9219.
3. Dennard, R.H.; Gaensslen, F.H.; Yu, Hwa-Nien; Rideout, V.L.; Bassous, E.; LeBlanc, A.R. (October 1974). "Design of ion-implanted MOSFET's with very small physical dimensions". IEEE Journal of Solid-State Circuits. 9 (5): 256–268. Bibcode:1974IJSSC...9..256D. doi:10.1109/jssc.1974.1050511. ISSN 0018-9200. S2CID 283984.
4. Schauer, Bryan. "Multicore Processors – A Necessity" (PDF). Archived from the original (PDF) on 2011-11-25. Retrieved 2023-07-06.
5. "What is a GPU?". Virtual Desktop. Retrieved 2023-07-07.
6. Barr, Keith Elliott (2007). ASIC design in the silicon sandbox: a complete guide to building mixed-signal integrated circuits. New York: McGraw-Hill. ISBN 978-0-07-148161-8.
7. "NVIDIA Accelerated Applications". NVIDIA. Retrieved 2023-07-06.
8. Hennessy, John L.; Patterson, David A. (2019). Computer architecture: a quantitative approach. Krste Asanović (6th ed.). Cambridge, Mass.: Morgan Kaufmann Publishers, an imprint of Elsevier. p. 557. ISBN 978-0-12-811905-1.
9. "NVDLA - Microarchitectures - Nvidia - WikiChip". en.wikichip.org. Retrieved 2023-07-06.
10. "Machine Learning Processor (MLP) - Microarchitectures - ARM - WikiChip". en.wikichip.org. Retrieved 2023-07-06.
11. Hennessy, John L.; Patterson, David A. (2019). Computer architecture: a quantitative approach. Krste Asanović (6th ed.). Cambridge, Mass.: Morgan Kaufmann Publishers, an imprint of Elsevier. p. 560. ISBN 978-0-12-811905-1.
12. Ragan-Kelley, Jonathan. "Halide". halide-lang.org. Retrieved 2023-07-06.
13. "TensorFlow". TensorFlow. Retrieved 2023-07-06.
14. Ghayoumi, Mehdi (2021-10-12), "Deep Neural Networks (DNNs) Fundamentals and Architectures", Deep Learning in Practice, Boca Raton: Chapman and Hall/CRC, pp. 77–107, doi:10.1201/9781003025818-5, ISBN 9781003025818, S2CID 241427658, retrieved 2023-07-06.
15. Hennessy, John L.; Patterson, David A. (2019). Computer architecture: a quantitative approach. Krste Asanović (6th ed.). Cambridge, Mass.: Morgan Kaufmann Publishers, an imprint of Elsevier. p. 573. ISBN 978-0-12-811905-1.
16. Putnam, Andrew; Caulfield, Adrian M.; Chung, Eric S.; Chiou, Derek; Constantinides, Kypros; Demme, John; Esmaeilzadeh, Hadi; Fowers, Jeremy; Gopal, Gopi Prashanth; Gray, Jan; Haselman, Michael; Hauck, Scott; Heil, Stephen; Hormati, Amir; Kim, Joo-Young (2016-10-28). "A reconfigurable fabric for accelerating large-scale datacenter services". Communications of the ACM. 59 (11): 114–122. doi:10.1145/2996868. ISSN 0001-0782. S2CID 3826382.
17. "Project Catapult". Microsoft Research. Retrieved 2023-07-06.
18. "A peck between penguins". Bing. Retrieved 2023-07-06.
19. "NVDLA Primer — NVDLA Documentation". nvdla.org. Retrieved 2023-07-06.
20. "NVIDIA BlueField Data Processing Units (DPUs)". NVIDIA. Retrieved 2023-07-06.
21. Cutress, Ian. "Hot Chips 2018: The Google Pixel Visual Core Live Blog (10am PT, 5pm UTC)". www.anandtech.com. Retrieved 2023-07-07.
22. Shaw, David E.; Adams, Peter J.; Azaria, Asaph; Bank, Joseph A.; Batson, Brannon; Bell, Alistair; Bergdorf, Michael; Bhatt, Jhanvi; Butts, J. Adam; Correia, Timothy; Dirks, Robert M.; Dror, Ron O.; Eastwood, Michael P.; Edwards, Bruce; Even, Amos (2021-11-14). "Anton 3: Twenty microseconds of molecular dynamics simulation before lunch". Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM. pp. 1–11. doi:10.1145/3458817.3487397. ISBN 978-1-4503-8442-1. S2CID 239036976.
23. Russell, John (2021-09-02). "Anton 3 Is a 'Fire-Breathing' Molecular Simulation Beast". HPCwire. Retrieved 2023-07-06.

Further reading

Hennessy, John L.; Patterson, David A. Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann.

See also

Hardware Accelerator
AI Accelerator
ASIC
FPGA

Category: Computer architecture