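Single instruction, multiple threads

Single instruction, multiple threads (SIMT) is an execution model used in parallel computing where single instruction, multiple data (SIMD) is combined with multithreading. It is different from SPMD in that all instructions in all "threads" are executed in lock-step. The SIMT execution model has been implemented on several GPUs and is relevant for general-purpose computing on graphics processing units (GPGPU), e.g. some supercomputers combine CPUs with GPUs.

The processors, say a number p of them, seem to execute many more than p tasks. This is achieved by each processor having multiple "threads" (or "work-items" or "Sequence of SIMD Lane operations"), which execute in lock-step, and are analogous to SIMD lanes.

The simplest way to understand SIMT is to imagine a multi-core system, where each core has its own register file, its own ALUs (both SIMD and Scalar) and its own data cache. Unlike a standard multi-core system, which has multiple independent instruction caches and decoders as well as multiple independent Program Counter registers, the instructions are synchronously broadcast to all SIMT cores from a single unit with a single instruction cache and a single instruction decoder which reads instructions using a single Program Counter.

The key difference between SIMT and SIMD lanes is that each of the SIMT cores may have a completely different Stack Pointer (and thus perform computations on completely different data sets), whereas SIMD lanes are simply part of an ALU that knows nothing about memory per se.

History

SIMT was introduced by Nvidia in the Tesla GPU microarchitecture with the G80 chip. ATI Technologies, now AMD, released a competing product slightly later on May 14, 2007, the TeraScale 1-based "R600" GPU chip.

Description

Because the access time of all widespread RAM types (e.g. DDR SDRAM, GDDR SDRAM, XDR DRAM, etc.) is still relatively high, engineers came up with the idea of hiding the latency that inevitably comes with each memory access. Strictly speaking, latency hiding is a feature of the zero-overhead scheduling implemented by modern GPUs. This might or might not be considered to be a property of 'SIMT' itself.

SIMT is intended to limit instruction fetching overhead, i.e. the latency that comes with memory access, and is used in modern GPUs (such as those of Nvidia and AMD) in combination with 'latency hiding' to enable high-performance execution despite considerable latency in memory-access operations. This is where the processor is oversubscribed with computation tasks, and is able to quickly switch between tasks when it would otherwise have to wait on memory. This strategy is comparable to multithreading in CPUs (not to be confused with multi-core). As with SIMD, another major benefit is the sharing of the control logic by many data lanes, leading to an increase in computational density. One block of control logic can manage N data lanes, instead of replicating the control logic N times.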
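
The effect of oversubscribing a processor to hide memory latency can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not a model of any real GPU scheduler: the function name `simulate` and its parameters are hypothetical, every instruction is assumed to take one cycle and then leave its task waiting on memory for a fixed number of cycles, and switching between tasks is assumed to cost nothing.

```python
def simulate(num_tasks, mem_latency, instructions_per_task):
    """Toy zero-overhead scheduler: returns (total_cycles, utilization).

    Assumption: every instruction occupies the processor for one cycle and
    then leaves its task waiting on memory for `mem_latency` cycles.
    """
    ready_at = [0] * num_tasks                      # cycle at which each task may issue again
    remaining = [instructions_per_task] * num_tasks  # instructions left per task
    cycle = 0
    busy_cycles = 0
    while any(remaining):
        # Switch to any task that is not waiting on memory; switching is free.
        for task in range(num_tasks):
            if remaining[task] and ready_at[task] <= cycle:
                remaining[task] -= 1
                ready_at[task] = cycle + 1 + mem_latency
                busy_cycles += 1
                break
        cycle += 1
    utilization = busy_cycles / cycle if cycle else 0.0
    return cycle, utilization


if __name__ == "__main__":
    for tasks in (1, 2, 4, 8):
        cycles, utilization = simulate(tasks, mem_latency=3, instructions_per_task=4)
        print(f"{tasks} task(s): {cycles} cycles, utilization {utilization:.2f}")
```

Under these assumptions, a single task with a memory latency of 3 cycles keeps the processor busy for only 4 of 13 cycles, while four resident tasks keep it busy on every cycle.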
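
A downside of SIMT execution is the fact that thread-specific control-flow is performed using "masking", leading to poor utilization where a processor's threads follow different control-flow paths. For instance, to handle an IF-ELSE block where various threads of a processor execute different paths, all threads must actually process both paths (as all threads of a processor always execute in lock-step), but masking is used to disable and enable the various threads as appropriate. Masking is avoided when control flow is coherent for the threads of a processor, i.e. they all follow the same path of execution. The masking strategy is what distinguishes SIMT from ordinary SIMD, and has the benefit of inexpensive synchronization between the threads of a processor.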
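
The IF-ELSE masking described here can be sketched in plain Python. This is an illustrative model only, not actual GPU code: the function `run_if_else_lockstep`, the per-lane arithmetic, and the utilization measure are assumptions chosen for the example. Each path of the branch is executed as one lock-step pass over all lanes, with a mask deciding which lanes are enabled, and a path is skipped entirely when no lane needs it.

```python
def run_if_else_lockstep(values):
    """Run `v // 2 if v is even else 3 * v + 1` on all lanes in lock-step.

    Returns (results, utilization), where utilization is the fraction of
    issued lane-slots that did useful work.
    """
    lanes = len(values)
    results = list(values)
    issued_slots = 0   # lane-slots spent, whether the lane was enabled or masked off
    useful_slots = 0   # lane-slots in which an enabled lane actually computed

    # All lanes evaluate the branch condition together.
    takes_if = [v % 2 == 0 for v in values]
    takes_else = [not flag for flag in takes_if]

    passes = (
        (takes_if, lambda v: v // 2),       # IF path
        (takes_else, lambda v: 3 * v + 1),  # ELSE path
    )
    for mask, body in passes:
        if not any(mask):
            continue  # coherent control flow: no lane needs this path, so skip it
        issued_slots += lanes
        for lane in range(lanes):
            if mask[lane]:  # masked-off lanes idle during this pass
                results[lane] = body(values[lane])
                useful_slots += 1

    utilization = useful_slots / issued_slots if issued_slots else 1.0
    return results, utilization
```

For example, `run_if_else_lockstep([1, 2, 3, 4])` has to execute both paths and reports a utilization of 0.5, whereas the coherent input `[2, 4, 6, 8]` executes only the IF path and reports 1.0.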

SIMT Terminology
Nvidia CUDA | OpenCL | Hennessy & Patterson
Thread | Work-item | Sequence of SIMD Lane operations
Warp | Wavefront | Thread of SIMD Instructions
Block | Workgroup | Body of vectorized loop
Grid | NDRange | Vectorized loop

See also

General-purpose computing on graphics processing units (GPGPU)

References

Michael McCool; James Reinders; Arch Robison (2013). Structured Parallel Programming: Patterns for Efficient Computation. Elsevier. p. 52.
"Nvidia Fermi Compute Architecture Whitepaper" (PDF). www.nvidia.com/. NVIDIA Corporation. 2009. Retrieved 2014-07-17.
Lindholm, Erik; Nickolls, John; Oberman, Stuart; Montrym, John (2008). "NVIDIA Tesla: A Unified Graphics and Computing Architecture". IEEE Micro. 28 (2): 6. (Subscription required.) doi:10.1109/MM.2008.31. S2CID 2793450.
Rul, Sean; Vandierendonck, Hans; D’Haene, Joris; De Bosschere, Koen (2010). An experimental study on performance portability of OpenCL kernels. Symp. Application Accelerators in High Performance Computing (SAAHPC). hdl:1854/LU-1016024.
"Advanced Topics in CUDA" (PDF). cc.gatech.edu. 2011. Retrieved 2014-08-28.
Michael McCool; James Reinders; Arch Robison (2013). Structured Parallel Programming: Patterns for Efficient Computation. Elsevier. pp. 209 ff.
John L. Hennessy; David A. Patterson (1990). Computer Architecture: A Quantitative Approach (6 ed.). Morgan Kaufmann. pp. 314 ff. ISBN 9781558600690.