A domain-specific architecture (DSA) is a programmable computer architecture specifically tailored to operate very efficiently within the confines of a given application domain. The term is often used in contrast to general-purpose architectures, such as CPUs, that are designed to operate on any computer program.

History

In conjunction with the semiconductor boom that started in the 1960s, computer architects were tasked with finding new ways to exploit the increasingly large number of transistors available. Moore's Law and Dennard Scaling enabled architects to focus on improving the performance of general-purpose microprocessors on general-purpose programs.

These efforts yielded several technological innovations, such as multi-level caches, out-of-order execution, deep instruction pipelines, multithreading, and multiprocessing. The impact of these innovations was measured on generalist benchmarks such as SPEC, and architects were not concerned with the internal structure or specific characteristics of the programs themselves.

The end of Dennard Scaling pushed computer architects to switch from a single, very fast processor to several processor cores: performance improvement could no longer be achieved by simply increasing the operating frequency of a single core. The end of Moore's Law then shifted the focus away from general-purpose architectures towards more specialized hardware. Although general-purpose CPUs will likely have a place in any computer system, heterogeneous systems composed of general-purpose and domain-specific components are the most recent trend for achieving high performance.

While hardware accelerators and ASICs have been used in very specialized application domains since the inception of the semiconductor industry, they generally implement a specific function with very limited flexibility. In contrast, the shift towards domain-specific architectures aims to achieve a better balance of flexibility and specialization.

A notable early example of a domain-specific programmable architecture is the GPU. This specialized hardware was developed to operate within the domain of image processing and computer graphics, and it found widespread adoption in both gaming consoles and personal computers. With the improvement of the hardware/software stack for both NVIDIA and AMD GPUs, these architectures are being used more and more for the acceleration of massively and embarrassingly parallel tasks, even outside of the domain of image processing.

Since the renaissance of machine-learning-based artificial intelligence in the 2010s, several domain-specific architectures have been developed to accelerate inference for different forms of artificial neural networks. Some examples are Google's TPU, NVIDIA's NVDLA and ARM's MLP.
Guidelines for DSA design

John Hennessy and David Patterson outlined five principles for DSA design that lead to better area efficiency and energy savings. The objective in these types of architecture is often also to reduce the Non-Recurring Engineering (NRE) costs, so that the investment in a specialized solution can be more easily amortized.

Minimize the distance over which data is moved: moving data in general-purpose memory hierarchies requires a remarkable amount of energy in the attempt to minimize the latency of accessing data. In Domain-Specific Architectures, it is expected that the hardware and compiler designers' understanding of the application domain allows for simpler and specialized memory hierarchies, where data movement is largely handled in software, with tailor-made memories for specific functions within the domain.

Invest saved resources into arithmetic units or bigger memories: a remarkable amount of hardware resources can be saved by dropping general-purpose architectural optimizations such as out-of-order execution, prefetching, address coalescing, and hardware speculation. The resources saved should be re-invested to maximally exploit the available parallelism, for example by adding more arithmetic units, or to solve memory-bandwidth issues by adding bigger memories.

Use the easiest form of parallelism that matches the domain: since the target application domains almost always present an inherent form of parallelism, it is important to decide how to take advantage of this parallelism and expose it to the software. If, for example, a SIMD architecture can work in the domain, it would be easier for the programmer to use than a MIMD architecture.
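The contrast between the SIMD and MIMD programming models can be sketched in software. The example below uses NumPy array expressions as a stand-in for SIMD lanes; the function names are purely illustrative and do not come from any particular DSA toolchain.

```python
import numpy as np

def scale_and_shift_scalar(x, a, b):
    # Explicit per-element loop: the programmer manages indexing,
    # much as they would coordinate independent MIMD tasks.
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + b
    return out

def scale_and_shift_simd(x, a, b):
    # SIMD-style view: one data-parallel expression applied to every
    # element at once; the "lanes" live below the programming model.
    return a * x + b

x = np.linspace(0.0, 1.0, 8, dtype=np.float32)
assert np.allclose(scale_and_shift_scalar(x, 2.0, 1.0),
                   scale_and_shift_simd(x, 2.0, 1.0))
```

Both formulations compute the same result; the data-parallel one is simply easier to write and to map onto wide arithmetic units, which is the point of the principle.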
Reduce data size and type to the simplest needed for the domain: whenever possible, using narrower and simpler data types yields several advantages. For example, it reduces the cost of moving data for memory-bound applications, and it can also reduce the amount of resources required to implement the respective arithmetic units.
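As a minimal illustration of this principle, the sketch below quantizes hypothetical 32-bit floating-point activations to 8-bit integers, a common choice for inference accelerators: the payload shrinks four-fold and the multipliers become much cheaper, at the cost of a bounded quantization error. The symmetric linear scheme shown is one simple option, not the method of any specific chip.

```python
import numpy as np

# Hypothetical float32 activations (illustrative values only).
acts = np.array([0.12, -0.5, 0.73, 0.0], dtype=np.float32)

# Symmetric linear quantization to int8: map the largest magnitude
# to 127 and round everything else onto that grid.
scale = np.max(np.abs(acts)) / 127.0
q = np.clip(np.round(acts / scale), -127, 127).astype(np.int8)

# Approximate reconstruction; the error is at most half a step.
deq = q.astype(np.float32) * scale

assert q.itemsize * q.size == acts.nbytes // 4   # 4x smaller payload
assert np.max(np.abs(deq - acts)) < scale        # bounded error
```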
Use a domain-specific programming language to port code to the DSA: one of the challenges for DSAs is ease of use, and more specifically, being able to effectively program the architecture and run applications on it. Whenever possible, it is advised to use existing Domain-Specific Languages (DSLs) such as Halide and TensorFlow to more easily program a DSA. Re-use of existing compiler toolchains and software frameworks makes using a new DSA significantly more accessible.

See also: Domain-specific language
DSA for deep neural networks

One of the application domains where DSAs have found the greatest success is that of artificial intelligence. In particular, several architectures have been developed for the acceleration of Deep Neural Networks (DNNs). The following sections report some examples.

Tensor Processing Unit

See also: Tensor Processing Unit

[Infobox: Tensor Processing Unit 3.0. Designer: Google. Introduced: May 2016. Type: Neural network, Machine learning]

Google's Tensor Processing Unit (TPU) was developed in 2015 to accelerate DNN inference, since the company projected that the use of voice search would require doubling the computational resources allocated at the time for neural network inference.

The TPU was designed to be a co-processor communicating via a PCIe bus, so that it could be easily incorporated into existing servers. It is primarily a matrix-multiplication engine following a CISC (Complex Instruction Set Computer) ISA. The multiplication engine uses systolic execution to save energy, reducing the number of writes to SRAM.

The TPU was fabricated with a 28-nm process and clocked at 700 MHz. The portion of the application that runs on the TPU is implemented in TensorFlow. The TPU computes primarily on reduced-precision integers, which further contributes to energy savings and increased performance.
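Systolic execution can be sketched in software. In the cycle-level simulation below, operands stream across a 2D grid of processing elements, each of which performs one multiply-accumulate per cycle, so each value fetched from memory is reused as it flows from neighbor to neighbor. This is a minimal output-stationary model for illustration only; it does not mirror the TPU's actual weight-stationary datapath.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary systolic array.

    A is (m, k) and B is (k, n). Processing element (i, j)
    accumulates C[i, j] as operands stream past it."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    a_reg = np.zeros((m, n))   # operand registers flowing left-to-right
    b_reg = np.zeros((m, n))   # operand registers flowing top-to-bottom
    for t in range(m + n + k - 2):
        # Advance the pipeline one cycle: shift operands one PE over.
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Feed skewed inputs at the edges: row i (column j) is delayed
        # by i (j) cycles so matching operands meet at PE (i, j).
        for i in range(m):
            s = t - i
            a_reg[i, 0] = A[i, s] if 0 <= s < k else 0.0
        for j in range(n):
            s = t - j
            b_reg[0, j] = B[s, j] if 0 <= s < k else 0.0
        C += a_reg * b_reg     # every PE does one multiply-accumulate
    return C
```

Compared with a naive triple loop, nothing about the arithmetic changes; what the systolic arrangement buys in hardware is locality, since each operand is read from memory once and then handed between neighboring units.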
Microsoft Catapult

Microsoft's Project Catapult put an FPGA connected through a PCIe bus into data center servers, with the idea of using the FPGA to accelerate various applications running on the server, leveraging the reconfiguration capabilities of FPGAs to accelerate many different workloads.

Differently from Google's TPU, the Catapult FPGA needed to be programmed via hardware-description languages such as Verilog and VHDL. For this reason, a major concern for the authors of the framework was the limited programmability.

Microsoft also designed a CNN accelerator for the Catapult framework, primarily intended to accelerate the ranking function in the Bing search engine. The proposed architecture provided a runtime-reconfigurable design based on a two-dimensional systolic array.

NVDLA

NVDLA is NVIDIA's deep-learning inference accelerator. It is an open-source hardware design available in a number of highly parametrizable configurations. The small-NVDLA model is designed to be deployed in resource-constrained scenarios such as IoT, where cost, area and power are the main concerns. Conversely, the large-NVDLA model is more suitable for HPC scenarios. NVDLA provides its own dedicated training infrastructure, compilation tools and runtime software stack.
DSA for other domains

Aside from applications in artificial intelligence, DSAs are being adopted in many domains within scientific computing, image processing, and networking.

Pixel Visual Core

See also: Pixel Visual Core

The Pixel Visual Core (PVC) is a series of ARM-based image processors designed by Google. The PVC is a fully programmable image, vision and AI multi-core domain-specific architecture (DSA), for mobile devices and, in the future, for IoT. It first appeared in the Google Pixel 2 and 2 XL, which were introduced on October 19, 2017. It also appeared in the Google Pixel 3 and 3 XL. Starting with the Pixel 4, this chip was replaced with the Pixel Neural Core.

Anton3

[Figure: The architecture of the Anton3 specialized cores. Geometry Cores carry out general-purpose computation, while specialized hardware accelerates force-field computation.]

Anton3 is a DSA designed to efficiently compute molecular-dynamics simulations. It uses a specialized 3D torus topology interconnection network to connect several computing nodes. Each computing node contains a set of 64 cores interconnected through a mesh. The cores implement a specialized deep pipeline to efficiently compute the force fields between molecules. This heterogeneous system combines general-purpose hardware and domain-specific components to achieve record-breaking simulation speed.
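The kind of computation such pipelines accelerate can be sketched in a few lines. As a generic illustration (not Anton3's actual force field or numerics), the code below evaluates Lennard-Jones pairwise forces with a naive O(N²) loop; a dedicated deep pipeline turns the body of this inner loop into fixed-function hardware evaluated at a very high rate.

```python
import numpy as np

def lj_forces(pos, epsilon=1.0, sigma=1.0):
    """Naive O(N^2) Lennard-Jones pairwise forces.

    A generic stand-in for the pairwise multiply-accumulate kernel
    that molecular-dynamics machines cast into dedicated pipelines."""
    n = pos.shape[0]
    forces = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            r = pos[i] - pos[j]
            d2 = float(np.dot(r, r))
            inv6 = (sigma * sigma / d2) ** 3
            # Gradient of the Lennard-Jones potential, recast as a
            # vector force along r: F = 24*eps*(2*inv^12 - inv^6)/d2 * r
            f = 24.0 * epsilon * (2.0 * inv6 * inv6 - inv6) / d2 * r
            forces[i] += f      # Newton's third law: each pair
            forces[j] -= f      # contributes equal and opposite forces
    return forces
```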
References

1. Hennessy, John L.; Patterson, David A. (2019). Computer architecture: a quantitative approach. Krste Asanović (6th ed.). Cambridge, Mass.: Morgan Kaufmann Publishers, an imprint of Elsevier. p. 540. ISBN 978-0-12-811905-1.
2. Moore, G.E. (January 1998). "Cramming More Components Onto Integrated Circuits". Proceedings of the IEEE. 86 (1): 82–85. doi:10.1109/jproc.1998.658762. ISSN 0018-9219.
3. Dennard, R.H.; Gaensslen, F.H.; Yu, Hwa-Nien; Rideout, V.L.; Bassous, E.; LeBlanc, A.R. (October 1974). "Design of ion-implanted MOSFET's with very small physical dimensions". IEEE Journal of Solid-State Circuits. 9 (5): 256–268. Bibcode:1974IJSSC...9..256D. doi:10.1109/jssc.1974.1050511. ISSN 0018-9200. S2CID 283984.
4. Schauer, Bryan. "Multicore Processors – A Necessity" (PDF). Archived from the original (PDF) on 2011-11-25. Retrieved 2023-07-06.
5. "What is a GPU?". Virtual Desktop. Retrieved 2023-07-07.
6. Barr, Keith Elliott (2007). ASIC design in the silicon sandbox: a complete guide to building mixed-signal integrated circuits. New York: McGraw-Hill. ISBN 978-0-07-148161-8.
7. "NVIDIA Accelerated Applications". NVIDIA. Retrieved 2023-07-06.
8. Hennessy, John L.; Patterson, David A. (2019). Computer architecture: a quantitative approach. Krste Asanović (6th ed.). Cambridge, Mass.: Morgan Kaufmann Publishers, an imprint of Elsevier. p. 557. ISBN 978-0-12-811905-1.
9. "NVDLA - Microarchitectures - Nvidia - WikiChip". en.wikichip.org. Retrieved 2023-07-06.
10. "Machine Learning Processor (MLP) - Microarchitectures - ARM - WikiChip". en.wikichip.org. Retrieved 2023-07-06.
11. Hennessy, John L.; Patterson, David A. (2019). Computer architecture: a quantitative approach. Krste Asanović (6th ed.). Cambridge, Mass.: Morgan Kaufmann Publishers, an imprint of Elsevier. p. 560. ISBN 978-0-12-811905-1.
12. Ragan-Kelley, Jonathan. "Halide". halide-lang.org. Retrieved 2023-07-06.
13. "TensorFlow". TensorFlow. Retrieved 2023-07-06.
14. Ghayoumi, Mehdi (2021-10-12), "Deep Neural Networks (DNNs) Fundamentals and Architectures", Deep Learning in Practice, Boca Raton: Chapman and Hall/CRC, pp. 77–107, doi:10.1201/9781003025818-5, ISBN 9781003025818, S2CID 241427658, retrieved 2023-07-06.
15. Hennessy, John L.; Patterson, David A. (2019). Computer architecture: a quantitative approach. Krste Asanović (6th ed.). Cambridge, Mass.: Morgan Kaufmann Publishers, an imprint of Elsevier. p. 573. ISBN 978-0-12-811905-1.
16. Putnam, Andrew; Caulfield, Adrian M.; Chung, Eric S.; Chiou, Derek; Constantinides, Kypros; Demme, John; Esmaeilzadeh, Hadi; Fowers, Jeremy; Gopal, Gopi Prashanth; Gray, Jan; Haselman, Michael; Hauck, Scott; Heil, Stephen; Hormati, Amir; Kim, Joo-Young (2016-10-28). "A reconfigurable fabric for accelerating large-scale datacenter services". Communications of the ACM. 59 (11): 114–122. doi:10.1145/2996868. ISSN 0001-0782. S2CID 3826382.
17. "Project Catapult". Microsoft Research. Retrieved 2023-07-06.
18. "A peck between penguins". Bing. Retrieved 2023-07-06.
19. "NVDLA Primer — NVDLA Documentation". nvdla.org. Retrieved 2023-07-06.
20. "NVIDIA BlueField Data Processing Units (DPUs)". NVIDIA. Retrieved 2023-07-06.
21. Cutress, Ian. "Hot Chips 2018: The Google Pixel Visual Core Live Blog (10am PT, 5pm UTC)". www.anandtech.com. Retrieved 2023-07-07.
22. Shaw, David E.; Adams, Peter J.; Azaria, Asaph; Bank, Joseph A.; Batson, Brannon; Bell, Alistair; Bergdorf, Michael; Bhatt, Jhanvi; Butts, J. Adam; Correia, Timothy; Dirks, Robert M.; Dror, Ron O.; Eastwood, Michael P.; Edwards, Bruce; Even, Amos (2021-11-14). "Anton 3: Twenty microseconds of molecular dynamics simulation before lunch". Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM. pp. 1–11. doi:10.1145/3458817.3487397. ISBN 978-1-4503-8442-1. S2CID 239036976.
23. Russell, John (2021-09-02). "Anton 3 Is a 'Fire-Breathing' Molecular Simulation Beast". HPCwire. Retrieved 2023-07-06.

Further reading

Hennessy, John L.; Patterson, David A. Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann.

See also

Hardware Accelerator
AI Accelerator
ASIC
FPGA

Category: Computer architecture