, which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. Whereas the STAR-100 would apply a single operation across a long vector in memory and then move on to the next operation, the Cray design would load a smaller section of the vector into registers and then apply as many operations as it could to that data, thereby avoiding many of the much slower memory access operations.
This way, significantly more work can be done in each batch; the instruction encoding is much more elegant and compact as well. The only drawback is that in order to take full advantage of this extra batch processing capacity, the memory load and store speed correspondingly has to increase as well. This is sometimes claimed to be a disadvantage of Cray-style vector processors: in reality it is part of achieving high performance throughput, as seen in
, the consequences are that the operations now take longer to complete. If multi-issue is not possible, then the operations take even longer because the LD may not be issued (started) at the same time as the first ADDs, and so on. If there are only 4-wide 64-bit SIMD ALUs, the completion time is even worse: only when all four LOADs have completed may the SIMD operations start, and only when all ALU operations have completed may the STOREs begin.
– either by way of algorithmically loading data from memory, or reordering (remapping) the normally linear access to vector elements, or providing "Accumulators", arbitrary-sized matrices may be efficiently processed. IBM POWER10 provides MMA instructions, although for arbitrary matrix widths that do not fit the exact SIMD size, data-repetition techniques are needed, which wastes register file resources. Nvidia provides a high-level Matrix
– Vector architectures with a register-to-register design (analogous to load–store architectures for scalar processors) have instructions for transferring multiple elements between the memory and the vector registers. Typically, multiple addressing modes are supported. The unit-stride addressing mode is essential; modern vector architectures typically also support arbitrary constant strides, as well as the scatter/gather (also called
implementation things are rarely that simple. The data is rarely sent in raw form, and is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time, during which the CPU traditionally would sit idle waiting for the requested data to show up. As CPU speeds have increased, this
) combine both, by issuing multiple data to multiple internal pipelined SIMD ALUs, the number issued being dynamically chosen by the vector program at runtime. Masks can be used to selectively load and store data in memory locations, and use those same masks to selectively disable processing elements of SIMD ALUs. Some processors with SIMD (
, SIMD by definition avoids inter-lane operations entirely (element 0 can only be added to another element 0), vector processors tackle this head-on. What programmers are forced to do in software (using shuffle and other tricks, to swap data into the right "lane") vector processors must do in hardware, automatically.
hard-coded constant 16, n is decremented by a hard-coded 4, so initially it is hard to appreciate the significance. The difference comes in the realisation that the vector hardware could be capable of doing 4 simultaneous operations, or 64, or 10,000; it would be the exact same vector assembler for all of them
, which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture.
The above SIMD example could potentially fault and fail at the end of memory, due to attempts to read too many values: it could also cause significant numbers of page or misaligned faults by similarly crossing over boundaries. In contrast, by allowing the vector architecture the freedom to decide how
Contrast this situation with SIMD, which has a fixed (inflexible) load width and fixed data processing width, unable to cope with loads that cross page boundaries; and even if it could cope, it is unable to adapt to what actually succeeded. Yet, paradoxically, if the SIMD program were to even attempt to
For Cray-style vector ISAs such as RVV, an instruction called "setvl" (set vector length) is used. The hardware first defines how many data values it can process in one "vector": this could be either actual registers or it could be an internal loop (the hybrid approach, mentioned above). This maximum
to hold vector data in batches. The batch lengths (vector length, VL) could be dynamically set with a special instruction, the significance compared to
Videocore IV (and, crucially as will be shown below, SIMD as well) being that the repeat length does not have to be part of the instruction encoding.
Having to perform 4-wide simultaneous 64-bit LOADs and 64-bit STOREs is very costly in hardware (256-bit data paths to memory). Having 4× 64-bit ALUs, especially MULTIPLY, is likewise costly. To avoid these high costs, a SIMD processor would have to have 1-wide 64-bit LOAD, 1-wide 64-bit STORE, and only 2-wide
As of 2016, most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing because they operate on multiple (vectorized, explicit length) data sets, and borrow features from vector processors. However, by
– elements may typically contain two, three or four sub-elements (vec2, vec3, vec4) where any given bit of a predicate mask applies to the whole vec2/3/4, not the elements in the sub-vector. Sub-vectors are also introduced in RISC-V RVV (termed "LMUL"). Subvectors are a critical integral part of the
the key distinguishing factor of SIMT-based GPUs is that they have a single instruction decoder-broadcaster but that the cores receiving and executing that same instruction are otherwise reasonably normal: their own ALUs, their own register files, their own Load/Store units and their own independent L1
Moreira, José E.; Barton, Kit; Battle, Steven; Bergner, Peter; Bertran, Ramon; Bhat, Puneeth; Caldeira, Pedro; Edelsohn, David; Fossum, Gordon; Frey, Brad; Ivanovic, Nemanja; Kerchner, Chip; Lim, Vincent; Kapoor, Shakti;
Machado Filho, Tulio; Mueller, Silvia Melitta; Olsson, Brett; Sadasivam, Satish;
for example, things go rapidly downhill just as they did with the general case of using SIMD for general-purpose IAXPY loops. To sum the four partial results, two-wide SIMD can be used, followed by a single scalar add, to finally produce the answer, but, frequently, the data must be transferred out
in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has
In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be—in theory at least—encoded directly into the instruction. However, in efficient
Consider both a SIMD processor and a vector processor working on 4 64-bit elements, doing a LOAD, ADD, MULTIPLY and STORE sequence. If the SIMD width is 4, then the SIMD processor must LOAD four elements entirely before it can move on to the ADDs, must complete all the ADDs before it can move on to
Whereas pure (fixed-width, no predication) SIMD is often mistakenly claimed to be "vector" (because SIMD processes data which happens to be vectors), through close analysis and comparison of historic and modern ISAs, actual vector ISAs may be observed to have the following features that no SIMD ISA
be the vectorization ratio. If the time taken for the vector unit to add an array of 64 numbers is 10 times faster than its equivalent scalar counterpart, r = 10. Also, if the total number of operations in a program is 100, out of which only 10 are scalar (after vectorization), then f = 0.9, i.e.,
This begins to hint at the reason why ffirst is so innovative, and is best illustrated by memcpy or strcpy when implemented with standard 128-bit non-predicated non-ffirst SIMD. For IBM POWER9 the number of hand-optimised instructions to implement strncpy is in excess of 240. By contrast, the same
From the IAXPY example, it can be seen that unlike SIMD processors, which can simplify their internal hardware by avoiding dealing with misaligned memory access, a vector processor cannot get away with such simplification: algorithms are written which inherently rely on Vector Load and Store being
Implementations in hardware may, if they are certain that the right answer will be produced, perform the reduction in parallel. Some vector ISAs offer a parallel reduction mode as an explicit option, for when the programmer knows that any potential rounding errors do not matter, and low latency is
Also note that, just like the predicated SIMD variant, the pointers to x and y are advanced by t0 times four because they both point to 32-bit data, but that n is decremented by straight t0. Compared to the fixed-size SIMD assembler there is very little apparent difference: x and y are advanced by
Realistically, for general-purpose loops such as in portable libraries, where n cannot be limited in this way, the overhead of setup and cleanup for SIMD in order to cope with non-multiples of the SIMD width can far exceed the instruction count inside the loop itself. Assuming worst-case that the
here (only start on a multiple of 16) and that n is a multiple of 4, as otherwise some setup code would be needed to calculate a mask or to run a scalar version. It can also be assumed, for simplicity, that the SIMD instructions have an option to automatically repeat scalar operands, like ARM NEON
This example starts with an algorithm ("IAXPY"), first showing it in scalar instructions, then SIMD, then predicated SIMD, and finally vector instructions. This incremental progression helps illustrate the difference between a traditional vector processor and a modern SIMD one. The example starts with a 32-bit
The vector pseudocode example above comes with a big assumption that the vector computer can process more than ten numbers in one batch. For a greater quantity of numbers in the vector register, it becomes infeasible for the computer to have a register that large. As a result, the vector processor
to implement vector instructions rather than multiple ALUs. In addition, the design had completely separate pipelines for different instructions, for example, addition/subtraction was implemented in different hardware than multiplication. This allowed a batch of vector instructions to be pipelined
workloads. Of interest, however, is that speed is far more important than accuracy in 3D for GPUs, where computation of pixel coordinates simply does not require high precision. The Vulkan specification recognises this and sets surprisingly low accuracy requirements, so that GPU Hardware can reduce
first have to have a preparatory section which works on the beginning unaligned data, up to the first point where SIMD memory-aligned operations can take over. This will either involve (slower) scalar-only operations or smaller-sized packed SIMD operations. Each copy implements the full algorithm
To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language one would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To the CPU, this would look something like this:
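A sketch of that loop in Python (the function name is illustrative) makes the shape clear: one pair of numbers is fetched and added per iteration.

```python
def add_pairs(a, b):
    # Pairwise addition of two groups of numbers, one pair per loop
    # iteration, as a scalar CPU would perform it.
    result = []
    for i in range(len(a)):  # one fetch-decode-execute per element
        result.append(a[i] + b[i])
    return result
```

Each trip around the loop requires the CPU to fetch and decode an instruction and compute the addresses of the operands, even though the operation itself is identical every time.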
, but at data-related tasks they could keep up while being much smaller and less expensive. However, the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up.
Here it can be seen that the code is much cleaner but a little more complex: at least, however, there is no setup or cleanup: on the last iteration of the loop, the predicate mask will be set to either 0b0000, 0b0001, 0b0011, 0b0111 or 0b1111, resulting in between 0 and 4 SIMD element operations being
Vector processors take this concept one step further. Instead of pipelining just the instructions, they also pipeline the data itself. The processor is fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there".
The basic ASC (i.e., "one pipe") ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a
adding those numbers in parallel. The checking of dependencies between those numbers is not required as a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can further improve performance by avoiding stalls. The math operations thus
– a less restrictive, more generic variation of the compress/expand theme which instead takes one vector to specify the indices to use to "reorder" another vector. Gather/scatter is more complex to implement than compress/expand, and, being inherently non-sequential, can interfere with
This is essentially not very different from the SIMD version (processes 4 data elements per loop), or from the initial Scalar version (processes just the one). n still contains the number of data elements remaining to be processed, but t0 contains the copy of VL – the number that is
– usually using a bit-mask, data is linearly compressed or expanded (redistributed) based on whether bits in the mask are set or clear, whilst always preserving the sequential order and never duplicating values (unlike Gather-Scatter aka permute). These instructions feature in
The STAR-like code remains concise, but because the STAR-100's vectorisation was by design based around memory accesses, an extra slot of memory is now required to process the information. Twice the latency is also incurred due to the extra memory access requirement.
(SIMT). SIMT units run from a shared single broadcast synchronised Instruction Unit. The "vector registers" are very wide and the pipelines tend to be long. The "threading" part of SIMT involves the way data is handled independently on each of the compute units.
, almost qualifies as a vector processor. Predicated SIMD uses fixed-width SIMD ALUs but allows locally controlled (predicated) activation of units to provide the appearance of variable length vectors. Examples below help explain these categorical distinctions.
Eight-wide SIMD requires repeating the inner loop algorithm first with four-wide SIMD elements, then two-wide SIMD, then one (scalar), with a test and branch in between each one, in order to cover the first and last remaining SIMD elements (0 <= n <= 7).
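This narrowing cascade can be sketched as follows (a hypothetical illustration of which widths an 8/4/2/1-capable SIMD ISA must fall through to cover n elements; the function name is invented for the example):

```python
def simd_widths(n):
    # Walk down through the available SIMD widths, using each as many
    # times as the remaining element count allows; the tail of any n
    # needs at most one 4-wide, one 2-wide and one scalar pass.
    used = []
    for width in (8, 4, 2, 1):
        while n >= width:
            used.append(width)
            n -= width
    return used
```

In real SIMD assembler, each step down the cascade is a separately written copy of the inner loop, with a test and branch between each, which is where the code-size explosion comes from.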
Even with a general loop (n not fixed), the only way to use 4-wide SIMD is to assume four separate "streams", each offset by four elements. Finally, the four partial results have to be summed. Other techniques involve shuffle: examples online can be found for
Aside from the size of the program and the complexity, an additional potential problem arises if floating-point computation is involved: the fact that the values are not being summed in strict order (four partial results) could result in rounding errors.
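This order sensitivity is easy to demonstrate with IEEE-754 doubles: regrouping the same three values, as partial sums would, changes the rounded result.

```python
# Floating-point addition is not associative: the same three values
# summed in a different grouping give a different rounded answer.
left = (0.1 + 0.2) + 0.3   # strict left-to-right order
right = 0.1 + (0.2 + 0.3)  # regrouped, as partial results would be
```

The discrepancy is tiny (on the order of one unit in the last place) but it means a parallel reduction is not bit-for-bit reproducible against a sequential one.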
iterations of the loop the batches of vectorised memory reads are optimally aligned with the underlying caches and virtual memory arrangements. Additionally, the hardware may choose to use the opportunity to end any given loop iteration's memory reads
– aka "Lane Shuffling" which allows sub-vector inter-element computations without needing extra (costly, wasteful) instructions to move the sub-elements into the correct SIMD "lanes" and also saves predicate mask bits. Effectively this is an in-flight
on a page boundary (avoiding a costly second TLB lookup), with speculative execution preparing the next virtual memory page whilst data is still being processed in the current loop. All of this is determined by the hardware, not the program itself.
It is clear how predicated SIMD at least merits the term "vector capable", because it can cope with variable-length vectors by using predicate masks. The final evolving step to a "true" vector ISA, however, is to not have any evidence in the ISA
prefix. However, only very simple calculations can be done effectively in hardware this way without a very large cost increase. Since all operands have to be in memory for the STAR-100 architecture, the latency caused by access became huge too.
number that can be processed by the hardware in subsequent vector instructions, and sets the internal special register, "VL", to that same amount. ARM refers to this technique as "vector length agnostic" programming in its tutorials on SVE2.
machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as
Not only is it a much more compact program (saving on L1 Cache size), but as previously mentioned, the vector version can issue far more data processing to the ALUs, again saving power because
Instruction Decode and Issue can sit idle.
Compared to any SIMD processor claiming to be a vector processor, the order of magnitude reduction in program size is almost shocking. However, this level of elegance at the ISA level has quite a high price tag at the hardware level:
SIMD instruction sets lack crucial features when compared to vector instruction sets. The most important of these is that vector processors, inherently by definition and design, have always been variable-length since their inception.
Where with predicated SIMD the mask bitlength is limited to that which may be held in a scalar (or special mask) register, vector ISAs' mask registers have no such limitation. Cray-1 vectors could be just over 1,000 elements (in
Vector processors on the other hand are designed to issue computations of variable length for an arbitrary count, n, and thus require very little setup, and no cleanup. Even compared to those SIMD ISAs which have masks (but no
, as the supercomputers themselves were, in general, found in places such as weather prediction centers and physics labs, where huge amounts of data are "crunched". However, as shown above and demonstrated by RISC-V RVV the
2 module within a card that physically resembles a graphics coprocessor, but instead of serving as a co-processor, it is the main computer with the PC-compatible computer into which it is plugged serving support functions.
The simplicity of the algorithm is stark in comparison to SIMD. Again, just as with the IAXPY example, the algorithm is length-agnostic (even on
Embedded implementations where maximum vector length could be only one).
Instead of constantly having to decode instructions and then fetch the data needed to complete them, the processor reads a single instruction from memory, and it is simply implied in the definition of the instruction
, the ability to run MULTIPLY simultaneously with ADD), may complete the four operations faster than a SIMD processor with 1-wide LOAD, 1-wide STORE, and 2-wide SIMD. This more efficient resource utilization, due to
(DAP) design, categorising the ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray 1.
IV is also capable of this hybrid approach: nominally stating that its SIMD QPU Engine supports 16-long FP array operations in its instructions, it actually does them 4 at a time, as (another) form of "threads".
This example again highlights a fundamental difference between true vector processors and those SIMD processors, including most commercial GPUs, which are inspired by features of vector processors.
Introduced in ARM SVE2 and RISC-V RVV is the concept of speculative sequential Vector Loads. ARM SVE2 has a special register named the "First Fault Register", whereas RVV modifies (truncates) the Vector Length (VL).
Modern SIMD computers claim to improve on early Cray by directly using multiple ALUs, for a higher degree of parallelism compared to only using the normal scalar pipeline. Modern vector processors (such as the
data caches. Thus although all cores simultaneously execute the exact same instruction in lock-step with each other they do so with completely different data from completely different memory locations. This is
) processing, and it is these which somewhat deserve the nomenclature "vector processor" or at least deserve the claim of being capable of "vector processing". SIMD processors without per-element predication (
This is where the problems start. SIMD by design is incapable of doing arithmetic operations "inter-element". Element 0 of one SIMD register may be added to
Element 0 of another register, but Element 0 may
may use fewer vector units than the width implies: instead of having 64 units for a 64-number-wide register, the hardware might instead do a pipelined loop over 16 units for a hybrid approach. The
Broadcom
find out in advance (in each inner loop, every time) what might optimally succeed, those instructions only serve to hinder performance because they would, by necessity, be part of the critical inner loop.
This example starts with an algorithm which involves reduction. Just as with the previous example, it will be first shown in scalar instructions, then SIMD, and finally vector instructions, starting in
, the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category,
Without predication, the wider the SIMD width the worse the problems get, leading to massive opcode proliferation, degraded performance, extra power consumption and unnecessary software complexity.
adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down the decoding of the more common instructions such as normal adding. (
Saleil, Baptiste; Schmidt, Bill; Srinivasaraghavan, Rajalakshmi; Srivatsan, Shricharan; Thompto, Brian; Wagner, Andreas; Wu, Nelson (2021). "A matrix math facility for Power ISA(TM) processors".
field, but unlike the STAR-100 which uses memory for its repeats, the
Videocore IV repeats are on all operations including arithmetic vector operations. The repeat length can be a small range of
The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS and averaged around 150 – far faster than any machine of the era.
operations as well as short vectors for common operations (RGB, ARGB, XYZ, XYZW) support for the following is typically present in modern GPUs, in addition to those found in vector processors:
On calling setvl with the number of outstanding data elements to be processed, "setvl" is permitted (essentially required) to limit that to the
Maximum Vector Length (MVL) and thus returns the
not make the mistake of assuming a fixed vector width: consequently MVL is not a quantity that the programmer needs to know. This can be a little disconcerting after years of SIMD mindset).
This is very straightforward. "y" starts at zero, 32-bit integers are loaded one at a time into r1, added to y, and the address of the array "x" is advanced to the next element in the array.
Additionally, the number of elements going into the function can start at zero. This sets the vector length to zero, which effectively disables all vector instructions, turning them into
Below is the Cray-style vector assembler for the same SIMD style loop, above. Note that t0 (which, containing a convenient copy of VL, can vary) is used instead of hard-coded constants:
amount loaded to either the amount that would succeed without raising a memory fault or simply to an amount (greater than zero) that is most convenient. The important factor is that
Over time as the ISA evolves to keep increasing performance, it results in ISA Architects adding 2-wide SIMD, then 4-wide SIMD, then 8-wide and upwards. It can therefore be seen why
Assuming a hypothetical predicated (mask capable) SIMD ISA, and again assuming that the SIMD instructions can cope with misaligned data, the instruction loop would look like this:
Additionally, vector processors can be more resource-efficient, using slower hardware and saving power while still achieving throughput and lower latency than SIMD, through
Vector processors were traditionally designed to work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in
instruction), Vector processors produce much more compact code because they do not need to perform explicit mask calculation to cover the last few elements (illustrated below).
Not all problems can be attacked with this sort of solution. Including these types of instructions necessarily adds complexity to the core CPU. That complexity typically makes
amount (the number of hardware "lanes") is termed "MVL" (Maximum Vector Length). Note that, as seen in SX-Aurora and
Videocore IV, MVL may be an actual hardware lane quantity
performed, respectively. One additional potential complication: some RISC ISAs do not have a "min" instruction, needing instead to use a branch or scalar predicated compare.
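Those final-iteration mask values can be computed as a sketch (the helper name is invented; as noted, the "min" may need a branch or predicated compare on some RISC ISAs):

```python
def last_mask(n, width=4):
    # Predicate mask for the remaining n elements: min(n, width) low
    # bits set, so n=0 -> 0b0000, n=3 -> 0b0111, n>=4 -> 0b1111.
    active = min(n, width)
    return (1 << active) - 1
```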
instructions are notified or may determine exactly how many Loads actually succeeded, using that quantity to only carry out work on the data that has actually been loaded.
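The idea can be emulated in software as a sketch (everything here is illustrative: the "fault" is simulated by running off the end of a mapped list, and the function names are invented):

```python
def ffirst_load(mem, addr, vl):
    # Attempt to load vl elements starting at addr; instead of trapping,
    # truncate at the first element that would fault (here: out of range).
    avail = max(0, min(vl, len(mem) - addr))
    return mem[addr:addr + avail]

def strlen_ffirst(mem, addr, vl=4):
    # strlen-style scan: only the elements that actually loaded are
    # examined, so scanning near an "unmapped" region is safe.
    n = 0
    while True:
        batch = ffirst_load(mem, addr + n, vl)
        if not batch:
            return n            # nothing more could be loaded
        for byte in batch:
            if byte == 0:
                return n        # terminator found within the loaded batch
            n += 1
```

The crucial property is that the loop consumes exactly as many elements as the hardware reports were loaded, rather than assuming a full vector arrived.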
Unfortunately for SIMD, the clue was in the assumption above, "that n is a multiple of 4" as well as "aligned access", which, clearly, is a limited specialist use-case.
that the instruction will operate again on another item of data, at an address one increment larger than the last. This allows for significant savings in decoding time.
The self-repeating instructions are found in early vector computers like the STAR-100, where the above action would be described in a single instruction (somewhat like
With the length (equivalent to SIMD width) not being hard-coded into the instruction, not only is the encoding more compact, it's also "future-proof" and allows even
Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their
in the SIMD width (load32x4 etc.) the vector ISA equivalents have no such limit. This makes vector programs portable, vendor-independent, and future-proof.
either gains the ability to perform loops itself, or exposes some sort of vector control (status) register to the programmer, usually known as a vector Length.
Note that both x and y pointers are incremented by 16, because that is how long (in bytes) four 32-bit integers are. The decision was made that the algorithm
API although the internal details are not available. The most resource-efficient technique is in-place reordering of access to otherwise linear vector data.
machine, but it sold poorly and they took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies (
) that were specifically designed from the ground up to handle large vectors (arrays). For SIMD instructions present in some general-purpose computers, see
containing multiple members. The members are extracted from the data structure (element), and each extracted member is placed into a different vector register.
In each iteration, every element of y has an element of x multiplied by a and added to it. The program is expressed in scalar linear form for readability.
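A minimal Python sketch of this scalar form (with a, x and y as in the conventional y = a·x + y definition) would be:

```python
def iaxpy(a, x, y):
    # Scalar linear form: in each iteration, one element of y has one
    # element of x multiplied by a and added to it.
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y
```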
– a very simple and strategically useful instruction which drops sequentially-incrementing immediates into successive elements. Usually starts from zero.
The difference is illustrated below with examples, showing and comparing the three categories: Pure SIMD, Predicated SIMD, and Pure Vector Processing.
Additionally, in more modern vector processor ISAs, "Fail on First" or "Fault First" has been introduced (see below) which brings even more advantages.
many elements to load, the first part of a strncpy, if beginning initially on a sub-optimal memory boundary, may return just enough loads such that on
, but the CPU can process an entire batch of operations, in an overlapping fashion, much faster and more efficiently than if it did so one at a time.
The code itself is also smaller, which can lead to more efficient memory use, a reduction in L1 instruction cache size, and a reduction in power consumption.
, is a key advantage and difference compared to SIMD. SIMD, by design and definition, cannot perform chaining except to the entire group of results.
– useful for interaction between scalar and vector, these broadcast a single value across a vector, or extract one item from a vector, respectively.
The code when n is larger than the maximum vector length is not that much more complex, and is a similar pattern to the first example ("IAXPY").
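That pattern can be sketched in Python (MVL = 4 is an assumed hardware limit; each pass of the loop stands in for one vector reduce followed by a scalar accumulate):

```python
MVL = 4  # assumed Maximum Vector Length of the hypothetical hardware

def vector_sum(x):
    # Strip-mined reduction: each iteration reduces up to MVL elements
    # in one "vector" operation, then folds the batch into the total.
    total, i, n = 0, 0, len(x)
    while n > 0:
        vl = min(n, MVL)           # setvl: clamp to what hardware can do
        total += sum(x[i:i + vl])  # one batch's vector reduction
        i += vl
        n -= vl
    return total
```

As with IAXPY, the same code handles any n, including n smaller than MVL and n = 0, with no separate cleanup path.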
than another Element 0. This places some severe limitations on potential implementations. For simplicity it can be assumed that n is exactly 8:
of the sub-vector, heavily features in 3D Shader binaries, and is sufficiently important as to be part of the Vulkan SPIR-V spec. The Broadcom
– including vectorised versions of bit-level permutation operations, bitfield insert and extract, centrifuge operations, population count, and
or decimal fixed-point, and support for much larger (arbitrary precision) arithmetic operations by supporting parallel carry-in and carry-out
perform the aligned SIMD loop at the maximum SIMD width up until the last few elements (those remaining that do not fit the fixed SIMD width)
processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed
to be processed in each iteration. t0 is subtracted from n after each iteration, and if n is zero then all elements have been processed.
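The strip-mining loop just described can be sketched in Python (MVL = 4 is an assumed hardware limit; in reality this would be vector assembler, and the helper names are illustrative):

```python
MVL = 4  # assumed Maximum Vector Length of the hypothetical hardware

def setvl(n):
    # "setvl": the hardware clamps the request to MVL and returns VL.
    return min(n, MVL)

def vector_iaxpy(a, x, y):
    i, n = 0, len(x)
    while n > 0:
        t0 = setvl(n)               # t0 holds this iteration's copy of VL
        for j in range(i, i + t0):  # stands in for one vector instruction
            y[j] = a * x[j] + y[j]
        i += t0                     # x and y pointers advance by t0 elements
        n -= t0                     # n is decremented by straight t0
    return y
```

Note that if n starts at zero the loop body never executes, which mirrors the absence of any cleanup code in the vector version.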
Throughout, Cray continued to be the performance leader, continually beating the competition with a series of machines that led to the
allow parallel if/then/else constructs without resorting to branches. This allows code with conditional statements to be vectorized.
the MULTIPLYs, and likewise must complete all of the MULTIPLYs before it can start the STOREs. This is by definition and by design.
6138:
4310:
682:) which is comprehensive individual element-level predicate masks on every vector instruction as is now available in ARM SVE2. And
145:(ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single
3453:
The basic principle of ffirst is to attempt a large sequential Vector Load, but to allow the hardware to arbitrarily truncate the
2560:, at runtime. Thus, unlike non-predicated SIMD, even when there are no elements to process there is still no wasted cleanup code.
1746:), can do most of the operation in batches. The code is mostly similar to the scalar version. It is assumed that both x and y are
Other CPU designs include multiple instructions for vector processing on multiple (vectorized) data sets, typically known as
, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional
is crucial to the performance. This ratio depends on the efficiency of the compilation, such as the adjacency of the elements in memory.
only three address translations are needed. Depending on the architecture, this can represent a significant savings by itself.
on a vector (for example, find the maximum value of an entire vector, or sum all elements). Iteration is of the form
to the ISA. If it is assumed that n is less than or equal to the maximum vector length, only three instructions are required:
(NEC) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller.
. Even compared to the predicate-capable SIMD, it is still more compact, clearer, more elegant and uses fewer resources.
While many SIMD ISAs borrow from or are inspired by the list below, the typical features of a vector processor are:
follows similar principles as the early vector processors, and is being implemented in commercial products such as the
The scalar version of this would load one of each of x and y, process one calculation, store one result, and loop:
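Expressed in Python rather than assembly, the scalar pattern is (a sketch of the pattern, not the article's listing):

```python
def iaxpy_scalar(a, x, y):
    # One element per iteration: load x[i] and y[i], one multiply-add,
    # one store, then loop until every element has been processed.
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y
```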
variants of the standard vector load and stores. Segment loads read a vector from memory, where each element is a
Another saving is fetching and decoding the instruction itself, which has to be done only once instead of ten times.
Cray-style vector ISAs take this a step further and provide a global "count" register, called vector length (VL):
in Flynn's Taxonomy. Common examples using SIMD with features inspired by vector processors include: Intel x86's
is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the
designs to consider using vectors purely to gain all the other advantages, rather than go for high performance.
Memory Load/Store modes, Gather/scatter vector operations act on the vector registers, and are often termed a
, and can be considered vector processors (using a similar strategy for hiding memory latencies). As shown in
can. If it does not, a "splat" (broadcast) must be used, to copy the scalar argument across a SIMD register:
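A splat simply replicates one scalar into every lane; modelled in Python (illustrative, not a real intrinsic):

```python
def splat(scalar, lanes):
    # Broadcast the scalar argument across all lanes of a SIMD register.
    return [scalar] * lanes

reg = splat(3.0, 4)   # a 4-wide register, every lane holding 3.0
```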
corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes.
completed far faster overall, the limiting factor being the time required to fetch the data from memory.
In order to reduce the amount of time consumed by these steps, most modern CPUs use a technique known as
(SWAR) Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably
An algorithm of hardware unit generation for processor core synthesis with packed SIMD type instructions
project. Solomon's goal was to dramatically increase math performance by using a large number of simple
increase in instruction count! This can easily be demonstrated by compiling the iaxpy example for
Modern GPUs, which have many small compute units each with their own independent SIMD ALUs, use
power usage. The concept of reducing accuracy where it is simply not needed is explored in the
have a cleanup phase which, like the preparatory section, is just as large and just as complex.
. Instead of leaving the data in memory like the STAR-100 and ASC, the Cray design had eight
which has performed 10 sequential operations: effectively the loop count is on an explicit
and uses no SIMD ALUs, only having 1-wide 64-bit LOAD, 1-wide 64-bit STORE (and, as in the
for use in supercomputers coupling several scalar processors to act as a vector processor.
These stark differences are what distinguish a vector processor from one that has SIMD.
computing. Around this time Flynn categorized this type of processing as an early form of
A number of things to note, when comparing against the Predicated SIMD assembly variant:
principles: RVV only adds around 190 vector instructions even with the advanced features.
to cope with iteration and reduction. This is illustrated further with examples, below.
of vector ISAs brings other benefits which are compelling even for Embedded use-cases.
Thus it can be seen, very clearly, how vector ISAs reduce the number of instructions.
definition, the addition of SIMD cannot, by itself, qualify a processor as an actual
90% of the work is done by the vector unit. It follows the achievable speed up of:
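With f the fraction of the work that is vectorised and r the vector unit's speed ratio over scalar code, the standard Amdahl-style formula (consistent with the limit 1/(1-f) used elsewhere in the article) is:

```latex
\text{speedup} = \frac{1}{(1 - f) + \dfrac{f}{r}}
```

For f = 0.9, even with r → ∞ the speedup cannot exceed 1/(1 − 0.9) = 10.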
IV uses the terminology "Lane rotate" where the rest of the industry uses the term
Where the SIMD variant hard-coded both the width (4) into the creation of the mask
only cope with 4-wide SIMD, therefore the constant is hard-coded into the program.
. Asia-Pacific Conference on Circuits and Systems. Vol. 1. pp. 171–176.
of dedicated SIMD registers before the last scalar computation can be performed.
hardware cannot do misaligned SIMD memory accesses, a real-world algorithm will:
In 1962, Westinghouse cancelled the project, but the effort was restarted by the
"A Modular Massively Parallel Processor for Volumetric Visualisation Processing"
Interestingly, though, Broadcom included space in all vector operations of the
Flynn's taxonomy § Single instruction stream, multiple data streams (SIMD)
- Fourth SIMD ADD: element 3 of first group added to element 3 of second group
- Second SIMD ADD: element 1 of first group added to element 1 of second group
Several modern CPU architectures are being designed as vector processors. The
sought to avoid many of the difficulties with the ILLIAC concept with its own
"Assembly - Fastest way to do horizontal SSE vector sum (Or other reduction)"
- Third SIMD ADD: element 2 of first group added to element 2 of second group
- First SIMD ADD: element 0 of first group added to element 0 of second group
Computer Organization and Design: the Hardware/Software Interface, pages 751–752
strncpy routine in hand-optimised RVV assembler is a mere 22 instructions.
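The fault-first idea can be sketched in Python (the helper names are illustrative): a vector load that would fault instead truncates the effective VL, which is what makes a vectorised strlen/strncpy safe near inaccessible memory:

```python
def ffirst_load(mem, addr, vl):
    # Try to load vl elements from addr; stop early (truncate VL)
    # instead of faulting when an element is inaccessible.
    out = []
    for i in range(vl):
        if addr + i >= len(mem):   # stand-in for hitting an unmapped page
            break
        out.append(mem[addr + i])
    return out, len(out)

def strlen_ffirst(mem, addr, mvl=8):
    n = 0
    while True:
        chunk, vl = ffirst_load(mem, addr + n, mvl)
        if 0 in chunk:             # NUL terminator found in this chunk
            return n + chunk.index(0)
        if vl == 0:                # ran out of accessible memory
            return n
        n += vl
```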
Here, an accumulator (y) is used to sum up all the values in the array, x.
581:. Two notable examples which have per-element (lane-based) predication are
But more than that, a high performance vector processor may have multiple
The time taken would be basically the same as a vector implementation of
Note the complete lack of looping in the instructions, because it is the
The STAR-100 was otherwise slower than CDC's own supercomputers like the
The scalar version of this would load each of x, add it to y, and loop:
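In Python terms, the scalar reduction loop is (a sketch of the pattern, not the article's assembly):

```python
def sum_scalar(x):
    # Load each element of x, add it into the accumulator y, loop.
    y = 0
    for v in x:
        y += v
    return y
```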
With the program size reduced, branch prediction has an easier job.
Miyaoka, Y.; Choi, J.; Togawa, N.; Yanagisawa, M.; Ohtsuki, T. (2002).
obviously feature far more prominently in 3D than in many demanding
) addressing mode. Advanced architectures may also include support for
793:; assume a, b, and c are memory locations in their respective registers
4171:"IBM's POWER10 Processor - William Starke & Brian W. Thompto, IBM"
The vector technique was first fully exploited in 1976 by the famous
designs led to a decline in vector supercomputers during the 1990s.
But to a vector processor, this task looks considerably different:
places the processor and either 24 or 48 gigabytes of memory on an
Computer processor which works on arrays of several numbers at once
High Performance Computing for Computer Graphics and Visualisation
Whilst from the reduction example it can be seen that, aside from
A modern packed SIMD architecture, known by many names (listed in
the size of the code; in fact, in extreme cases it results in an
. Since then, the supercomputer market has focused much more on
and similar tasks. Vector processing techniques also operate in
successful, regardless of alignment of the start of the vector.
has historically become a large impediment to performance; see
The history of computer technology in their faces (in Russian)
Vector processing development began in the early 1960s at the
So, even if the performance of the vector unit is very high (
Vector instruction sets have arithmetic reduction operations
(ASC), which were introduced in 1974 and 1972, respectively.
(CPU). The CPU fed a single common instruction to all of the
design through the 1970s into the 1990s, notably the various
are designed to operate efficiently and effectively on large
simplified software and complex hardware (vector processors)
This can be somewhat mitigated by keeping the entire ISA to
796:; add 10 numbers in a to 10 numbers in b, store results in c
of a SIMD width, leaving that entirely up to the hardware.
instruction in NEC SX, without restricting the length to a
(Note: As mentioned in the ARM SVE2 Tutorial, programmers
vector processor architectures being developed, including
Vector machines appeared in the early 1970s and dominated
"Array processor" redirects here. Not to be confused with
4107:"Riscv-v-spec/V-spec.adoc at master · riscv/Riscv-v-spec"
4001:"Riscv-v-spec/V-spec.adoc at master · riscv/Riscv-v-spec"
3863:"Riscv-v-spec/V-spec.adoc at master · riscv/Riscv-v-spec"
3691:, an open ISA standard with an associated variable width
626:(Multiple Instruction, Multiple Data) and realized with
164:. Their version of the design originally called for a 1
SIMD, because it uses fixed-width batch processing, is
into each of the ALU subunits, a technique they called
1309:) are capable of this kind of selective, per-element (
64-bit ALUs. As shown in the diagram, which assumes a
4079:"Sse - 1-to-4 broadcast and 4-to-1 reduce in AVX-512"
There are several savings inherent in this approach.
tried to re-enter the high-end market again with its
processor module with four scalar/vector processors
3883:"Vector Engine Assembly Language Reference Manual"
638:VLIW/vector processor combines both technologies.
1339:IV and other external vector processors like the
634:(Explicitly Parallel Instruction Computing). The
4043:"Coding for Neon - Part 3 Matrix Multiplication"
203:was presented and developed by Kartsev in 1967.
3810:(Press release). GlobeNewswire. 7 December 2022
3216:complex software and simplified hardware (SIMD)
722:A vector processor, by contrast, even if it is
4322:PATCH to libc6 to add optimised POWER9 strncpy
3212:Overall then there is a choice to either have
2934:At this point four adds have been performed:
2547:and there would still be no SIMD cleanup code
1277:or sourced from one of the scalar registers.
1204:instructions run slower—i.e., whenever it is
1079:; again assume we have vector registers v1-v3
642:Difference between SIMD and vector processors
2533:that is automatically applied to the vectors
1358:integer variant of the "DAXPY" function, in
655:a way to set the vector length, such as the
26:This article is about Processors (including
158:University of Illinois at Urbana–Champaign
3975:(2nd ed.). Morgan Kaufmann. p.
4159:RVV register gather-scatter instructions
3927:Vector and SIMD processors, slides 12-13
2096:# now do the operation, masked by m bits
2030:# prepare mask. few ISAs have min though
695:
369:(FPS) built add-on array processors for
329:
277:The first vector supercomputers are the
1335:In addition, GPUs such as the Broadcom
965:; assume we have vector registers v1-v3
749:Random-access memory § Memory wall
667:or to a multiple of a fixed data width.
4278:Abandoned US patent US20110227920-0096
4031:Videocore IV QPU analysis by Jeff Bush
3938:Array vs Vector Processing, slides 5-7
3684:Computer for operations with functions
3278:Register Gather, Scatter (aka permute)
1082:; with size larger than or equal to 10
670:Iteration and reduction over elements
201:computer for operations with functions
195:Computer for operations with functions
2963:but with 4-wide SIMD being incapable
2515:instruction has embedded within it a
1293:, which face exactly the same issue.
137:under the control of a single master
6115:Floating-point operations per second
487:Comparison with modern architectures
426:Single instruction, multiple threads
179:single instruction, multiple threads
1330:Single Instruction Multiple Threads
1241:Single Instruction Multiple Threads
968:; with size equal or larger than 10
759:left the CPU, in the fashion of an
3733:Parkinson, Dennis (17 June 1976).
1738:Pure (non-predicated, packed) SIMD
430:Modern graphics processing units (
2207:# update x, y and n for next loop
952:; loop back if count is not yet 0
659:instruction in RISCV RVV, or the
630:(Very Long Instruction Word) and
407:of computers. Most recently, the
127:Westinghouse Electric Corporation
104:platforms. The rapid fall in the
75:single instruction, multiple data
7041:Semiconductor device fabrication
4183:from the original on 2021-12-11.
4020:Videocore IV Programmer's Manual
3949:SIMD vs Vector GPU, slides 22-24
3632:, which suggests that the ratio
471:AX45MPV. There are also several
7016:History of general-purpose CPUs
5243:Nondeterministic Turing machine
4578:Analysis of parallel algorithms
3900:"Documentation – Arm Developer"
3592:) there is a speedup less than
3321:where Reduction is of the form
2531:creates a hidden predicate mask
511:- also known as "Packed SIMD",
451:more complex and involved than
185:International Computers Limited
153:, fed in the form of an array.
5196:Deterministic finite automaton
3655:on pipelined vector processors
3497:be the vector speed ratio and
3365:GPU vector processing features
2980:of how to do "Horizontal Sum"
1253:). They are also found in the
121:Early research and development
3785:MIAOW Vertical Research Group
1675:; Assume tmp is pre-allocated
1284:introduced the idea of using
614:. Although memory-based, the
598:- these include the original
4344:ARM SVE2 paper by N. Stevens
4254:"CUDA C++ Programming Guide"
4216:Krikelis, Anargyros (1996).
3679:Chaining (vector processing)
3344:arithmetic, but can include
618:was also a vector processor.
289:Advanced Scientific Computer
171:computational fluid dynamics
4230:10.1007/978-1-4471-1011-8_8
3839:10.1109/APCCAS.2002.1114930
3735:"Computers by the thousand"
3641:
790:; Hypothetical RISC machine
717:multi-issue execution model
567:collaborated to create the
398:Virtual Vector Architecture
373:, later building their own
359:Nippon Electric Corporation
189:Distributed Array Processor
4420:High-performance computing
3714:Supercomputer architecture
3313:– operations that perform
3284:. Not to be confused with
1990:"-O3 -march=knl"
1353:Vector instruction example
106:price-to-performance ratio
7021:Microprocessor chronology
3709:History of supercomputing
3585:{\displaystyle r=\infty }
3227:Vector processor features
1091:# Set vector length VL=10
678:Predicated SIMD (part of
341:Other examples followed.
69:. This is in contrast to
53:(CPU) that implements an
7036:Hardware security module
6379:Digital signal processor
6356:Graphics processing unit
6168:Graphics processing unit
4300:Introduction to ARM SVE2
3760:B.N. Malinovsky (1995).
3487:Performance and speed up
2564:Vector reduction example
1325:) categorically do not.
343:Control Data Corporation
279:Control Data Corporation
4056:SIMD considered harmful
3674:Automatic vectorization
3625:{\displaystyle 1/(1-f)}
3328:Matrix Multiply support
3167:# repeat if n != 0
2821:; loop back if n > 0
2725:; y initialised to zero
2529:Setting VL effectively
2494:# repeat if n != 0
1658:; loop back if n > 0
465:RISC-V vector extension
438:which may be driven by
139:Central processing unit
51:central processing unit
3704:Tensor Processing Unit
3626:
3586:
3558:
3191:Insights from examples
2309:Pure (true) vector ISA
756:instruction pipelining
700:
579:associative processing
513:SIMD within a register
434:) include an array of
419:
367:Floating Point Systems
338:
143:arithmetic logic units
79:SIMD within a register
63:one-dimensional arrays
4311:RVV fault-first loads
3914:"Vector Architecture"
3627:
3587:
3559:
3445:Fault (or Fail) First
3373:applications needing
3338:Advanced Math formats
3246:load and stores, and
3236:Vector Load and Store
2851:be added to anything
2743:; load one 32bit data
2285:; go back if n > 0
2093:; m = (1<<t0)-1
1935:; go back if n > 0
1517:; load one 32bit data
1341:NEC SX-Aurora TSUBASA
699:
571:, which is also SIMD.
333:
316:The Cray design used
91:graphics accelerators
4224:. pp. 101–124.
4045:. 11 September 2013.
3596:
3570:
3510:
3346:binary-coded decimal
3206:permute instructions
2988:Vector ISA reduction
2228:; x := x + t0*4
2186:; v3 := v1 + v2
1988:, using the options
1848:; v3 := v1 + v2
1571:; r3 := r1 + r2
1257:architecture as the
859:; r3 := r1 + r2
594:- as categorised in
531:instructions, AMD's
318:pipeline parallelism
83:numerical simulation
7031:Digital electronics
6684:Instruction decoder
6636:Floating-point unit
6290:Soft microprocessor
6237:System in a package
5812:Reservation station
5342:Transport-triggered
5003:Parallel Extensions
4808:Pipelined processor
4333:RVV strncpy example
4115:. 19 November 2022.
4009:. 19 November 2022.
3961:Patterson, David A.
3419:operations such as
3290:permute instruction
3268:Compress and Expand
3137:# advance x by VL*4
3080:# VL=t0=min(MVL, n)
3043:# reduce-add into y
3013:# VL=t0=min(MVL, n)
2833:; returns result, y
2464:# advance x by VL*4
2443:# advance y by VL*4
2356:# VL=t0=min(MVL, n)
2159:; v1 := v1 * a
1827:; v1 := v1 * a
1550:; r1 := r1 * a
1286:processor registers
1231:Vector instructions
517:Pipelined Processor
4267:LMUL > 1 in RVV
3741:. pp. 626–627
3622:
3582:
3557:{\displaystyle r/}
3554:
3395:Sub-vector Swizzle
3116:# add all x into y
2764:; y := y + r1
2267:; n := n - t0
1982:order of magnitude
1881:; x := x + 16
1251:vadd c, a, b, $ 10
1183:embedded processor
1157:# 10 stores into c
811:; count := 10
701:
500:, and vectors are
496:, because SIMD is
459:Recent development
444:Flynn's 1972 paper
394:massively parallel
375:minisupercomputers
339:
175:massively parallel
87:video-game console
4239:978-3-540-76016-0
4148:SX-Arora Overview
4067:ARM SVE2 tutorial
3965:Hennessy, John L.
3669:Stream processing
3653:Duncan's taxonomy
3340:– often includes
3296:Splat and Extract
3258:Masked Operations
2806:; n := n - 1
2785:; x := x + 4
2072:; m = 1<<t0
1920:; n := n - 4
1777:described above.
1643:; n := n - 1
1604:; x := x + 4
1299:SX-Aurora TSUBASA
1121:# 10 loads from b
1106:# 10 loads from a
596:Duncan's taxonomy
509:Pure (fixed) SIMD
409:SX-Aurora TSUBASA
286:Texas Instruments
71:scalar processors
4289:Videocore IV QPU
3916:. 27 April 2020.
3699:Barrel processor
3693:vector extension
3353:Bit manipulation
2878:; for 2nd 4 of x
2704:Scalar assembler
2316:or a virtual one
2051:; t0 = min(n, 4)
1748:properly aligned
1744:Flynn's taxonomy
1493:Scalar assembler
1272:
1260:
1252:
1194:functional units
1161:
1158:
1155:
1152:
1149:
1146:
1143:
1140:
1137:
1134:
1131:
1128:
1125:
1122:
1119:
1116:
1113:
1110:
1107:
1104:
1101:
1098:
1095:
1092:
1089:
1086:
1083:
1080:
1059:
1056:
1053:
1050:
1047:
1044:
1041:
1038:
1035:
1032:
1029:
1026:
1023:
1020:
1017:
1014:
1011:
1008:
1005:
1002:
999:
996:
993:
990:
987:
984:
981:
978:
975:
972:
969:
966:
956:
953:
950:
947:
944:
941:
938:
935:
932:
929:
926:
923:
920:
917:
914:
911:
908:
905:
902:
899:
896:
893:
890:
887:
884:
881:
878:
875:
872:
869:
866:
863:
860:
857:
854:
851:
848:
845:
842:
839:
836:
833:
830:
827:
824:
821:
818:
815:
812:
809:
806:
803:
800:
797:
794:
791:
691:unable by design
680:Flynn's taxonomy
662:
658:
577:- also known as
494:vector processor
469:Andes Technology
436:shader pipelines
311:vector registers
270:
263:
259:
256:
250:
219:
211:
108:of conventional
89:hardware and in
43:vector processor
21:array processing
7107:
7106:
7102:
7101:
7100:
7098:
7097:
7096:
7067:
7066:
7065:
7060:
7046:Tick–tock model
7004:
6960:
6949:
6889:
6873:Address decoder
6827:
6781:
6777:Program counter
6752:Status register
6733:
6688:
6648:Load–store unit
6615:
6608:
6535:
6504:
6405:
6362:Image processor
6337:
6330:
6300:
6294:
6270:Microcontroller
6260:Embedded system
6248:
6148:
6081:
6070:
6008:
5958:
5856:
5833:
5817:Re-order buffer
5788:
5769:Data dependency
5755:
5714:
5544:
5538:
5437:
5436:Instruction set
5430:
5416:Multiprocessing
5384:Cache hierarchy
5377:Register/memory
5301:
5201:Queue automaton
5157:
5152:
5122:
5117:
5098:
5042:
4948:Coarray Fortran
4904:
4888:Beowulf cluster
4744:
4694:
4685:Synchronization
4670:Cache coherence
4660:Multiprocessing
4648:
4612:
4593:Cost efficiency
4588:Gustafson's law
4556:
4500:
4449:
4425:Multiprocessing
4415:Cloud computing
4388:
4383:
4352:
4350:
4349:
4342:
4338:
4331:
4327:
4320:
4316:
4309:
4305:
4298:
4294:
4287:
4283:
4276:
4272:
4265:
4261:
4252:
4251:
4247:
4240:
4214:
4210:
4192:
4188:
4169:
4168:
4164:
4157:
4153:
4146:
4142:
4135:
4131:
4124:
4120:
4105:
4104:
4100:
4091:
4090:
4086:
4077:
4076:
4072:
4065:
4061:
4054:
4050:
4041:
4040:
4036:
4029:
4025:
4018:
4014:
3999:
3998:
3994:
3987:
3958:
3954:
3947:
3943:
3936:
3932:
3925:
3921:
3912:
3911:
3907:
3898:
3897:
3893:
3888:. 16 June 2023.
3885:
3881:
3880:
3876:
3871:. 16 June 2023.
3861:
3860:
3856:
3827:
3823:
3813:
3811:
3806:
3805:
3801:
3794:
3790:
3783:
3779:
3772:
3758:
3754:
3744:
3742:
3731:
3727:
3722:
3648:SX architecture
3644:
3602:
3597:
3594:
3593:
3571:
3568:
3567:
3516:
3511:
3508:
3507:
3489:
3447:
3413:Transcendentals
3367:
3322:
3318:
3282:vector chaining
3262:predicate masks
3229:
3193:
3176:
3175:
3172:
3169:
3166:
3163:
3160:
3157:
3154:
3151:
3148:
3145:
3142:
3139:
3136:
3133:
3130:
3127:
3124:
3121:
3118:
3115:
3112:
3109:
3106:
3103:
3100:
3097:
3095:# load vector x
3094:
3091:
3088:
3085:
3082:
3079:
3076:
3073:
3070:
3067:
3064:
3061:
3058:
3055:
3052:
3046:
3045:
3042:
3039:
3036:
3033:
3030:
3028:# load vector x
3027:
3024:
3021:
3018:
3015:
3012:
3009:
3006:
3003:
3000:
2990:
2932:
2931:
2928:
2925:
2922:
2919:
2916:
2913:
2910:
2907:
2904:
2901:
2898:
2895:
2892:
2889:
2886:
2883:
2880:
2877:
2874:
2871:
2868:
2865:
2862:
2859:
2844:
2836:
2835:
2832:
2829:
2826:
2823:
2820:
2817:
2814:
2811:
2808:
2805:
2802:
2799:
2796:
2793:
2790:
2787:
2784:
2781:
2778:
2775:
2772:
2769:
2766:
2763:
2760:
2757:
2754:
2751:
2748:
2745:
2742:
2739:
2736:
2733:
2730:
2727:
2724:
2721:
2718:
2715:
2712:
2706:
2698:
2697:
2694:
2691:
2688:
2685:
2682:
2679:
2676:
2673:
2670:
2667:
2664:
2661:
2658:
2655:
2652:
2649:
2646:
2643:
2640:
2637:
2634:
2631:
2628:
2625:
2622:
2619:
2616:
2613:
2610:
2607:
2604:
2601:
2598:
2595:
2592:
2589:
2586:
2583:
2580:
2577:
2566:
2497:
2496:
2493:
2490:
2487:
2484:
2481:
2478:
2475:
2472:
2469:
2466:
2463:
2460:
2457:
2454:
2451:
2448:
2445:
2442:
2439:
2436:
2433:
2430:
2427:
2424:
2421:
2418:
2415:
2412:
2409:
2406:
2403:
2400:
2397:
2394:
2391:
2388:
2386:# load vector y
2385:
2382:
2379:
2376:
2373:
2371:# load vector x
2370:
2367:
2364:
2361:
2358:
2355:
2352:
2349:
2346:
2343:
2340:
2311:
2294:
2293:
2290:
2287:
2284:
2281:
2278:
2275:
2272:
2269:
2266:
2263:
2260:
2257:
2254:
2251:
2248:
2245:
2242:
2239:
2236:
2233:
2230:
2227:
2224:
2221:
2218:
2215:
2212:
2209:
2206:
2203:
2200:
2197:
2194:
2191:
2188:
2185:
2182:
2179:
2176:
2173:
2170:
2167:
2164:
2161:
2158:
2155:
2152:
2149:
2146:
2143:
2140:
2137:
2134:
2131:
2128:
2125:
2122:
2119:
2116:
2113:
2110:
2107:
2104:
2101:
2098:
2095:
2092:
2089:
2086:
2083:
2080:
2077:
2074:
2071:
2068:
2065:
2062:
2059:
2056:
2053:
2050:
2047:
2044:
2041:
2038:
2035:
2032:
2029:
2026:
2020:
2018:Predicated SIMD
2003:exists in x86.
1976:This more than
1944:
1943:
1940:
1937:
1934:
1931:
1928:
1925:
1922:
1919:
1916:
1913:
1910:
1907:
1904:
1901:
1898:
1895:
1892:
1889:
1886:
1883:
1880:
1877:
1874:
1871:
1868:
1865:
1862:
1859:
1856:
1853:
1850:
1847:
1844:
1841:
1838:
1835:
1832:
1829:
1826:
1823:
1820:
1817:
1814:
1811:
1808:
1805:
1802:
1799:
1796:
1793:
1790:
1787:
1784:
1781:
1771:
1770:
1767:
1764:
1761:
1758:
1755:
1740:
1735:
1734:
1731:
1728:
1725:
1722:
1719:
1716:
1713:
1710:
1707:
1704:
1701:
1698:
1695:
1692:
1689:
1686:
1683:
1680:
1677:
1674:
1667:
1666:
1663:
1660:
1657:
1654:
1651:
1648:
1645:
1642:
1639:
1636:
1633:
1630:
1627:
1624:
1621:
1618:
1615:
1612:
1609:
1606:
1603:
1600:
1597:
1594:
1591:
1588:
1585:
1582:
1579:
1576:
1573:
1570:
1567:
1564:
1561:
1558:
1555:
1552:
1549:
1546:
1543:
1540:
1537:
1534:
1531:
1528:
1525:
1522:
1519:
1516:
1513:
1510:
1507:
1504:
1501:
1495:
1487:
1486:
1483:
1480:
1477:
1474:
1471:
1468:
1465:
1462:
1459:
1456:
1453:
1450:
1447:
1444:
1441:
1438:
1435:
1432:
1429:
1426:
1423:
1420:
1417:
1414:
1411:
1408:
1405:
1402:
1399:
1396:
1393:
1390:
1387:
1384:
1381:
1378:
1375:
1372:
1369:
1366:
1355:
1243:
1233:
1163:
1162:
1159:
1156:
1153:
1150:
1147:
1144:
1141:
1138:
1135:
1132:
1129:
1126:
1123:
1120:
1117:
1114:
1111:
1108:
1105:
1102:
1099:
1096:
1093:
1090:
1087:
1084:
1081:
1078:
1069:per-instruction
1061:
1060:
1057:
1054:
1051:
1048:
1045:
1042:
1039:
1036:
1033:
1030:
1027:
1024:
1021:
1018:
1015:
1012:
1009:
1006:
1003:
1000:
997:
994:
991:
988:
985:
982:
979:
976:
973:
970:
967:
964:
958:
957:
954:
951:
948:
945:
942:
939:
936:
933:
930:
927:
924:
921:
918:
915:
912:
909:
906:
903:
900:
897:
894:
891:
888:
885:
882:
879:
876:
873:
870:
867:
864:
861:
858:
855:
852:
849:
846:
843:
840:
837:
834:
831:
828:
825:
822:
819:
816:
813:
810:
807:
804:
801:
798:
795:
792:
789:
765:address decoder
740:
732:vector chaining
705:vector chaining
644:
604:Convex C-Series
575:Predicated SIMD
502:variable-length
489:
461:
440:compute kernels
428:
422:
324:vector chaining
271:
260:
254:
251:
236:
220:
209:
197:
123:
118:
65:of data called
55:instruction set
47:array processor
35:
24:
17:
12:
11:
5:
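The setvl idiom can be sketched in portable C as "strip-mining": the loop body advances by however many elements fit in one hardware pass. This is a minimal illustrative sketch, not any particular ISA's intrinsics; the constant MVL is a hypothetical stand-in for the hardware's maximum vector length.

```c
#include <stddef.h>

#define MVL 8  /* hypothetical maximum vector length of the hardware */

/* Strip-mined iaxpy: each outer pass handles vl = min(MVL, remaining)
   elements, mirroring what a setvl instruction would report. */
void iaxpy_strip(size_t n, int a, const int x[], int y[])
{
    size_t i = 0;
    while (i < n) {
        /* setvl t0, n : how many elements this pass will process */
        size_t vl = (n - i < MVL) ? n - i : MVL;
        for (size_t j = 0; j < vl; j++)   /* one "vector" operation */
            y[i + j] += a * x[i + j];
        i += vl;                          /* advance by vl elements */
    }
}
```

Note that the final pass needs no special-case code: vl simply comes out smaller than MVL, exactly as in the assembly loop above.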
Vector reduction example

Reductions present a harder case. Consider summing a vector, y = x[0] + x[1] … + x[n-1]; the inner loop in C is:

    int y = 0;
    for (size_t i = 0; i < n; i++)
        y += x[i];
    return y;

The scalar assembler is straightforward:

      set     y, 0      ; y initialised to zero
    loop:
      load32  r1, x     ; load one 32-bit item of x
      add32   y, y, r1  ; y := y + r1
      addl    x, x, $4  ; x := x + 4 (next element)
      subl    n, n, $1  ; n := n - 1
      jgz     n, loop   ; loop back if n > 0
    out:
      ret     y         ; returns result, y

Packed SIMD is, by design, unable to accumulate across lanes with its ordinary arithmetic instructions. The groups must instead be loaded into separate registers and added group-wise:

      addl      r3, x, $16 ; first 4 of x
      load32x4  v1, x      ; load first 4 of x
      load32x4  v2, r3     ; 2nd 4 of x
      add32x4   v1, v2, v1 ; add 2 groups

after which the four partial sums left in v1 still have to be folded horizontally, an operation that packed SIMD ISAs support awkwardly if at all. A vector ISA instead provides a dedicated reduction instruction:

      set       y, 0      # y initialised to zero
    vloop:
      setvl     t0, n     # vl = t0 = min(MVL, n)
      vld32     v0, x     # load vector x
      vredadd32 y, y, v0  # reduce-add v0 into y
      add       x, t0*4   # advance x by vl*4
      sub       n, t0     # n -= VL (t0)
      bnez      n, vloop  # repeat if n != 0
      ret       y         # returns result, y

Whether the hardware performs the reduction serially or as a parallel tree is an implementation choice hidden behind the instruction, which also makes the same idiom suitable for mapreduce-style recurrences such as x = y + x.

Further features

Beyond the basics above, vector ISAs commonly add gather-scatter and fail-first loads, predicate masks, sub-vector handling, iota (index-generation), and reduction and iteration instructions, and, with many 3D shader workloads in mind, "swizzle" (mini-permute) operations and hardware-assisted transcendentals such as sine, cosine and logarithm, the trigonometric functions being important to both GPUs and HPC.

Performance

Let r be the vector speed ratio and f the vectorisation ratio, the fraction of the work done by the vector unit. The achievable speedup is then

    r / [(1 − f)r + f]

so even with an infinitely fast vector unit (r = ∞) the speedup cannot exceed 1 / (1 − f). The non-vectorised fraction dominates, which is why the features above matter: each one raises f by allowing more kinds of loops to be vectorised at all.
2141:,
2132:m
2129:,
2126:y
2123:,
2114:m
2111:,
2108:x
2105:,
2087:,
2084:m
2081:,
2078:m
2066:,
2060:,
2057:m
2045:,
2042:n
2039:,
1929:,
1926:n
1914:,
1911:n
1908:,
1905:n
1896:,
1893:y
1890:,
1887:y
1875:,
1872:x
1869:,
1866:x
1860:y
1857:,
1842:,
1836:,
1821:,
1818:a
1815:,
1806:y
1803:,
1794:x
1791:,
1765:a
1762:,
1726:n
1723:,
1717:,
1714:y
1711:,
1708:y
1699:n
1696:,
1693:x
1690:,
1687:a
1684:,
1652:,
1649:n
1637:,
1634:n
1631:,
1628:n
1619:,
1616:y
1613:,
1610:y
1598:,
1595:x
1592:,
1589:x
1583:y
1580:,
1565:,
1559:,
1544:,
1541:a
1538:,
1529:y
1526:,
1514:x
1511:,
1484:}
1481:;
1478:y
1475:+
1472:x
1469:*
1466:a
1463:=
1460:y
1457:)
1451:i
1448:;
1445:n
1439:i
1436:;
1433:0
1430:=
1427:i
1421:(
1415:{
1412:)
1409:y
1403:,
1400:x
1391:,
1388:a
1382:,
1379:n
1373:(
1360:C
1154:c
1151:,
1136:,
1130:,
1118:b
1115:,
1103:a
1100:,
1052:,
1049:c
1046:,
1034:,
1028:,
1016:,
1013:b
1010:,
998:,
995:a
992:,
977:,
946:,
925:,
922:c
919:,
916:c
907:,
904:b
901:,
898:b
886:,
883:a
880:,
877:a
871:c
868:,
853:,
847:,
838:b
835:,
826:a
823:,
805:,
268:)
262:(
257:)
253:(
249:.
235:.
34:.
23:.
Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.