
Single instruction, multiple threads


Single instruction, multiple threads (SIMT) is an execution model used in parallel computing in which single instruction, multiple data (SIMD) is combined with multithreading. It differs from SPMD in that all instructions in all "threads" are executed in lock-step. The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU); for example, some supercomputers combine CPUs with GPUs.

The processors, say a number p of them, seem to execute many more than p tasks. This is achieved by each processor having multiple "threads" (or "work-items", or "sequences of SIMD lane operations"), which execute in lock-step and are analogous to SIMD lanes.

The simplest way to understand SIMT is to imagine a multi-core system in which each core has its own register file, its own ALUs (both SIMD and scalar) and its own data cache. Unlike a standard multi-core system, which has multiple independent instruction caches and decoders as well as multiple independent program counter registers, the instructions are synchronously broadcast to all SIMT cores from a single unit with a single instruction cache and a single instruction decoder, which reads instructions using a single program counter.

The key difference between SIMT and SIMD lanes is that each SIMT core may have a completely different stack pointer (and thus perform computations on completely different data sets), whereas SIMD lanes are simply part of an ALU that knows nothing about memory per se.

History

SIMT was introduced by Nvidia in the Tesla GPU microarchitecture with the G80 chip. ATI Technologies (now AMD) released a competing product slightly later, on May 14, 2007: the TeraScale 1-based "R600" GPU chip.

Description

Because the access time of all the widespread RAM types (DDR SDRAM, GDDR SDRAM, XDR DRAM, etc.) is still relatively high, engineers came up with the idea of hiding the latency that inevitably comes with each memory access. Strictly speaking, latency hiding is a feature of the zero-overhead scheduling implemented by modern GPUs; it may or may not be considered a property of SIMT itself.

SIMT is intended to limit instruction-fetching overhead, i.e. the latency that comes with memory access, and is used in modern GPUs (such as those of Nvidia and AMD) in combination with latency hiding to enable high-performance execution despite considerable latency in memory-access operations: the processor is oversubscribed with computation tasks and can quickly switch between tasks whenever it would otherwise have to wait on memory. This strategy is comparable to multithreading in CPUs (not to be confused with multi-core). As with SIMD, another major benefit is the sharing of control logic by many data lanes, which increases computational density: one block of control logic can manage N data lanes instead of being replicated N times.

A downside of SIMT execution is that thread-specific control flow is implemented using "masking", which leads to poor utilization when a processor's threads follow different control-flow paths. For instance, to handle an if/else block in which various threads of a processor take different paths, all threads must actually process both paths (since all threads of a processor always execute in lock-step), and masking is used to disable and enable the threads as appropriate. Masking is avoided when control flow is coherent for the threads of a processor, i.e. when they all follow the same path of execution. The masking strategy is what distinguishes SIMT from ordinary SIMD, and it has the benefit of inexpensive synchronization between the threads of a processor.
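The masking behaviour described above can be sketched with a small simulation. This is an illustrative model only, not any vendor's API: the function `simt_if_else`, the lane count, and the `paths_issued` counter are invented for the example. It shows that with divergent control flow both paths are issued, while with coherent control flow only one is.

```python
# Minimal sketch of SIMT masking: one "processor" executes an if/else over
# N lanes in lock-step.  A mask disables lanes on the inactive path, so
# divergent lanes force both paths to be issued.  (Hypothetical names; this
# simulates the idea, it is not a real GPU interface.)

N_LANES = 8

def simt_if_else(data, cond, then_fn, else_fn):
    """Run then_fn/else_fn over all lanes in lock-step with masking.

    Each lane evaluates `cond` to build the mask.  A path is skipped
    entirely only when no lane needs it (coherent control flow).
    """
    mask = [cond(x) for x in data]
    paths_issued = 0

    # "then" path: every active lane executes it; masked-off lanes keep
    # their old value.
    if any(mask):
        paths_issued += 1
        data = [then_fn(x) if m else x for x, m in zip(data, mask)]

    # "else" path: the mask is inverted, disabling the lanes that already
    # took the "then" path.
    if not all(mask):
        paths_issued += 1
        data = [x if m else else_fn(x) for x, m in zip(data, mask)]

    return data, paths_issued

# Divergent control flow: even lanes take one path, odd lanes the other,
# so both paths must be issued.
out, issued = simt_if_else(list(range(N_LANES)),
                           cond=lambda x: x % 2 == 0,
                           then_fn=lambda x: x * 10,
                           else_fn=lambda x: x + 1)
print(out, issued)        # [0, 2, 20, 4, 40, 6, 60, 8] 2

# Coherent control flow: all lanes agree, so only one path is issued and
# no masking overhead is paid.
out2, issued2 = simt_if_else(list(range(N_LANES)),
                             cond=lambda x: True,
                             then_fn=lambda x: x * 10,
                             else_fn=lambda x: x + 1)
print(out2, issued2)      # [0, 10, 20, 30, 40, 50, 60, 70] 1
```

In the divergent case every lane spends time in both paths even though it only contributes a result to one of them, which is exactly the utilization loss the text describes.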
SIMT terminology

Nvidia CUDA | OpenCL    | Hennessy & Patterson
Thread      | Work-item | Sequence of SIMD lane operations
Warp        | Wavefront | Thread of SIMD instructions
Block       | Workgroup | Body of vectorized loop
Grid        | NDRange   | Vectorized loop

See also

- General-purpose computing on graphics processing units (GPGPU)

References

- Lindholm, Erik; Nickolls, John; Oberman, Stuart; Montrym, John (2008). "NVIDIA Tesla: A Unified Graphics and Computing Architecture". IEEE Micro. 28 (2): 6. doi:10.1109/MM.2008.31. S2CID 2793450. (Subscription required.)
- Rul, Sean; Vandierendonck, Hans; D'Haene, Joris; De Bosschere, Koen (2010). An experimental study on performance portability of OpenCL kernels. Symp. Application Accelerators in High Performance Computing (SAAHPC). hdl:1854/LU-1016024.
- "Advanced Topics in CUDA" (PDF). cc.gatech.edu. 2011. Retrieved 2014-08-28.
- Michael McCool; James Reinders; Arch Robison (2013). Structured Parallel Programming: Patterns for Efficient Computation. Elsevier. pp. 52, 209 ff.
- John L. Hennessy; David A. Patterson (1990). Computer Architecture: A Quantitative Approach (6 ed.). Morgan Kaufmann. pp. 314 ff. ISBN 9781558600690.
- "Nvidia Fermi Compute Architecture Whitepaper" (PDF). www.nvidia.com. NVIDIA Corporation. 2009. Retrieved 2014-07-17.
