CS 61C: Great Ideas in Computer Architecture (Machine Structures) Lecture 19 - Data Parallelism II Flynn Taxonomy, SIMD > Instructors: Michael Franklin Dan Garcia http://inst.eecs.Berkeley.edu/~cs61c/Fa11 Fall 2011 - Lecture #19 ### Review - · Parallelism is one of the great ideas - Request Level Parallelism - · Data Level Parallelism - At the disk/server level scale out to solve bigger problems on more data - Map Reduce/Hadoop - At the memory level today's topic Fall 2011 -- Lecture #19 # Agenda - · Flynn Taxonomy - DLP and SIMD - · Intel SSE (part 1) Fall 2011 -- Lecture #19 # Alternative Kinds of Parallelism: Hardware vs. Software | | | Software | | |----------|----------|-------------------------------------------------------------------------------------|----------------------------------------------------------------------------------| | | | Sequential | Concurrent | | Hardware | Serial | Matrix Multiply written in MatLab<br>running on an Intel Pentium 4 | Windows Vista Operating System running on an Intel Pentium 4 | | | Parallel | Matrix Multiply written in MATLAB<br>running on an Intel Xeon e5345<br>(Clovertown) | Windows Vista Operating System<br>running on an Intel Xeon e5345<br>(Clovertown) | - Concurrent software can also run on serial hardware - · Sequential software can also run on parallel hardware - Our focus today is on sequential or concurrent software running on parallel hardware Fall 2011 -- Lecture #19 # Flynn Taxonomy | | | Data Streams | | |-------------|----------|-------------------------|-------------------------------------| | | | Single | Multiple | | Instruction | Single | SISD: Intel Pentium 4 | SIMD: SSE instructions of x86 | | Streams | Multiple | MISD: No examples today | MIMD: Intel Xeon e5345 (Clovertown) | | | | | | - In 2011, SIMD and MIMD most common parallel computers Most common parallel processing programming style: Single Program Multiple Data ("SPMD") - Single program that runs on all processors of an MIMD - Cross-processor execution coordination through conditional expressions (thread parallelism next lecture ) - SIMD (aka hw-level data parallelism): specialized function units, for handling lock-step calculations involving arrays - Scientific computing, signal processing, multimedia (audio/video) Fall 2011 -- Lecture #19 ### Flynn Taxonomy: SISD Single Instruction/Single Data Stream Sequential computer that Instruction Pool SISD exploits no parallelism in either the instruction or data streams. Data Pool PU + · Examples of SISD architecture are Processing Unit traditional uniprocessor machines Fall 2011 -- Lecture #19 # SIMD Architectures Data parallelism: executing one operation on multiple data streams Example to provide context: Multiplying a coefficient vector by a data vector (e.g., in filtering) y[i] := c[i] × x[i], 0 ≤ i < n</li> Sources of performance improvement: One instruction is fetched & decoded for entire operation Multiplications are known to be independent Pipelining/concurrency in memory access as well # **Example: SIMD Array Processing** ``` for each f in array f = sqrt(f) for each f in array load f to the floating-point register calculate the square root write the result from the register to memory for each 4 members in array load 4 members to the SSE register calculate 4 square roots in one operation write the result from the register to memory Fall 2011 -- Lecture #19 ``` # **SSE Instruction Categories** for Multimedia Support | Instruction category | Operands | | |-------------------------|----------------------------|--| | Unsigned add/subtract | Eight 8-bit or Four 16-bit | | | Saturating add/subtract | Eight 8-bit or Four 16-bit | | | Max/min/minimum | Eight 8-bit or Four 16-bit | | | Average | Eight 8-bit or Four 16-bit | | | Shift right/left | Eight 8-bit or Four 16-bit | | SSE2+ supports wider data types to allow 16 x 8-bit and 8 x 16-bit operands Fall 2011 -- Lecture #19 # Intel Architecture SSE2+ 128-Bit SIMD Data Types Fundamental 128-Bit Packed SIMD Data Types Packed Bytes 127 122 121 96 95 80 79 64 63 48 47 32 31 16 15 Packed Words 8 / 128 bits 127 122 121 96 95 80 79 64 63 48 47 32 31 16 15 0 Packed Doublewords 64 63 127 96 95 32 31 4 / 128 bits Packed Quadwords 64 63 - Note: in Intel Architecture (unlike MIPS) a word is 16 bits - Single precision FP: Double word (32 bits) - Double precision FP: Quad word (64 bits) ### **XMM** Registers | | 0 | |------|----------------------------------------------| | XMM7 | | | XMM6 | | | XMM5 | | | XMM4 | | | XMM3 | | | XMM2 | | | XMM1 | | | XMM0 | | | | XMM6<br>XMM5<br>XMM4<br>XMM3<br>XMM2<br>XMM1 | - Architecture extended with eight 128-bit data registers: XMM registers - A 64-bit address architecture: available as 16 64-bit registers (XMM8 XMM15) E.g., 128-bit packed single-precision floating-point data type (doublewords), allows four single-precision operations to be performed simultaneously Fall 2011 -- Lecture #19 ### SSE/SSE2 Floating Point Instructions | Data transfer | Arithmetic | Compare | |----------------------------------------|----------------------------------|----------------------| | MOV(A/U)(SS/PS/SD/<br>PD) xmm, mem/xmm | ADD(SS/PS/SD/PD) xmm,<br>mem/xmm | CMP(SS/PS/SD/<br>PD) | | | SUB(SS/PS/SD/PD) xmm,<br>mem/xmm | | | MOV (H/L) (PS/PD)<br>xmm, mem/xmm | MUL(SS/PS/SD/PD) xmm,<br>mem/xmm | | | | DIV(SS/PS/SD/PD) xmm,<br>mem/xmm | | | | SQRT(SS/PS/SD/PD) mem/xmm | | | | MAX (SS/PS/SD/PD) mem/xmm | | | | MIN(SS/PS/SD/PD) mem/xmm | | | MINISS/PS/SD/PDI | mem/xmm | xmm: one operand is a 128-bit SSZ register | mem/xmm other operand is in memory or an SSZ register | mem/xmm: other operand is in memory or an SSZ register | (PS) Scalar Single precision FP: one 32-bit operand in a 128-bit register | (PS) Packed Single precision FP: one 64-bit operand in a 128-bit register | (SD) Scalar Double precision FP: one 64-bit operand in a 128-bit register | (SD) Packed Double precision FP: one 64-bit operand in a 128-bit register | (A) 128-bit operand is a ligned in memory | (A) 128-bit operand | signed in memory | (H) means move the high half of the 128-bit operand | (L) means move the low half of the 128-bit operand | Fall 2011 - Lecture #19 # **Example: Add Two Single Precision FP Vectors** ### Computation to be performed: vec\_res.x = v1.x + v2.x; vec\_res.y = v1.y + v2.y; vec\_res.z = v1.z + v2.z; vec\_res.w = v1.w + v2.w; SSE Instruction Sequence: mov a ps: move from mem to XMM register, memory aligned, $\mathbf{p}$ acked $\mathbf{s}$ ingle precision add ps: add from mem to XMM register, packed single precision mov a ps: move from XMM register to mem, memory aligned, packed single precision movaps address-of-v1, %x // v1.w | v1.z | address-of-v2, %xmm0 v1.y | v1.x -> xmm0 Fall 2011 -- Lecture #19 # **Displays and Pixels** · Each coordinate in frame buffer on left determines shade of corresponding coordinate for the raster scan CRT display on right. Pixel (X0, Y0) contains bit pattern 0011, a lighter shade on the screen than the bit pattern 1101 in pixel (X1, Y1) ### **Example: Image Converter** - Converts BMP (bitmap) image to a YUV (color space) image format: - Read individual pixels from the BMP image, convert pixels into YUV format - Can pack the pixels and operate on a set of pixels with a single instruction - E.g., bitmap image consists of 8 bit monochrome - Pack these pixel values in a 128 bit register (8 bit \* 16 pixels), can operate on 16 values at a time - Significant performance boost Fall 2011 -- Lecture #19 # Example: Image Converter - FMADDPS Multiply and add packed single precision floating point instruction - · One of the typical operations computed in transformations (e.g., DFT or FFT) $$P = \sum_{n=1}^{N} f(n) \times x(n)$$ Fall 2011 -- Lecture #19 # Example: Image Converter Floating point numbers f(n) and x(n) in src1 and src2; p in dest; C implementation for N = 4 (128 bits): for (int i =0; i< 4; i++) p = p + src1[i] \* src2[i]; Regular x86 instructions for the inner loop: //src1 is on the top of the stack; src1 \* src2 -> src1 fmul DWORD PTR \_src2\$[%esp+148] //p = ST(1), src1 = ST(0); ST(0)+ST(1) -> ST(1); ST-Stack Top faddp %ST(0), %ST(1) (Note: Destination on the right in x86 assembly) Number regular x86 Fl. Pt. instructions executed: 4 \* 2 = 8 # Example: Image Converter Floating point numbers f(n) and x(n) in src1 and src2; p in dest; C implementation for N = 4 (128 bits): ``` for (int i =0; i< 4; i++) p = p + src1[i] * src2[i]; · SSE2 instructions for the inner loop: ``` //xmm0 = p, xmm1 = src1[i], xmm2 = src2[i] mulps %xmm1, %xmm2 // xmm2 \* xmm1 -> xmm2 addps %xmm2, %xmm0 //xmm0 + xmm2 -> xmm0 - . Number regular instructions executed: 2 SSE2 instructions vs. 8 x86 - SSE5 instruction accomplishes same in one instruction: fmaddps %xmm0, %xmm1, %xmm2, %xmm0 // xmm2 \* xmm1 + xmm0 -> xmm0 // multiply xmm1 x xmm2 paired single, // then add product paired single to sum in xmm0 Number regular instructions executed: 1 SSE5 instruction vs. 8 x86 Fall 2011 -- Lecture #19 # So, in conclusion... - Flynn Taxonomy of Parallel Architectures - SIMD: Single Instruction Multiple Data - MIMD: Multiple Instruction Multiple Data - SISD: Single Instruction Single Data (unused) - MISD: Multiple Instruction Single Data - Intel SSE SIMD Instructions - One instruction fetch that operates on multiple operands simultaneously - 128/64 bit XMM registers Fall 2011 -- Lecture #19