

### **Neuromorphic Chips**

Researchers at IBM and HRL Laboratories are looking into building computer chips that attempt to mimic natural thought patterns in order to solve problems in AI more effectively.



http://www.technologyreview.com/featuredstory/522476/thinking-in-silicon/ 3/08/2014 Spring 2014 – Lecture #19

## **Review of Last Lecture**

- Amdahl's Law limits benefits of parallelization
- Request Level Parallelism
  - Handle multiple requests in parallel (e.g. web search)
- MapReduce Data Level Parallelism
  - Framework to divide up data to be processed in parallel
  - Mapper outputs intermediate key-value pairs
  - Reducer "combines" intermediate values with same key
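The first review point can be made concrete with a small calculation. The sketch below uses the standard textbook form of Amdahl's Law, which the slide itself does not spell out; the function name `amdahl_speedup` and the parameter names `f` and `n` are illustrative assumptions.

```c
#include <assert.h>

/* Standard form of Amdahl's Law (assumed here, not written on the slide):
 *   speedup = 1 / ((1 - f) + f / n)
 * f: fraction of execution time that can be parallelized, 0.0 to 1.0
 * n: number of processors working on the parallel part */
double amdahl_speedup(double f, int n)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(n >= 1);
    return 1.0 / ((1.0 - f) + f / n);
}
```

Under this form, a program that is 90% parallelizable stays below a speedup of 10 no matter how many processors are added, because the serial 10% bounds the result at 1 / (1 - f).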







|                     |          | Data Streams            |                                     |
|---------------------|----------|-------------------------|-------------------------------------|
|                     |          | Single                  | Multiple                            |
| Instruction Streams | Single   | SISD: Intel Pentium 4   | SIMD: SSE instructions of x86       |
|                     | Multiple | MISD: No examples today | MIMD: Intel Xeon e5345 (Clovertown) |

- Parallel processing: Single Program Multiple Data ("SPMD")
  - Single program that runs on all processors of an MIMD
  - Cross-processor execution coordination through conditional … (will see later in Thread L…)
























## **SSE Instruction Categories for Multimedia Support**

| Instruction category    | Operands                   |
|-------------------------|----------------------------|
| Unsigned add/subtract   | Eight 8-bit or Four 16-bit |
| Saturating add/subtract | Eight 8-bit or Four 16-bit |
| Maximum/minimum         | Eight 8-bit or Four 16-bit |
| Average                 | Eight 8-bit or Four 16-bit |
| Shift right/left        | Eight 8-bit or Four 16-bit |

- Intel processors are CISC (complicated instructions)
- SSE-2+ supports wider data types to allow 16 × 8-bit and 8 × 16-bit operands
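To illustrate how the "saturating add" category differs from a plain unsigned add, here is a minimal scalar sketch over eight 8-bit lanes. It only emulates the per-lane arithmetic in portable C and does not use actual SSE instructions; the function names `add_wrap_u8` and `add_sat_u8` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* eight 8-bit operands, matching the table above */

/* Plain unsigned add: each lane wraps around modulo 256. */
void add_wrap_u8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (int i = 0; i < LANES; i++)
        out[i] = (uint8_t)(a[i] + b[i]);
}

/* Saturating unsigned add: each lane clamps at 255 instead of wrapping. */
void add_sat_u8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (int i = 0; i < LANES; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

For a lane holding 200 and 100, the wrapping version stores 44 while the saturating version stores 255, which is usually the preferred behavior for pixel or audio data.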




## **XMM Registers**

- Architecture extended with eight 128-bit data registers (XMM0 – XMM7)
  - 64-bit address architecture: available as 16 128-bit registers (XMM8 – XMM15 added)
  - Packed single-precision floating-point data type allows four single-precision operations to be performed simultaneously
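As a rough illustration of four single-precision operations happening at once, the sketch below adds two groups of four floats with one packed add. It assumes a compiler that provides the SSE intrinsics header `<xmmintrin.h>` and falls back to a scalar loop otherwise; the helper name `add4_ps` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Adds a[0..3] and b[0..3] element-wise into out[0..3]. */
void add4_ps(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);             /* load four floats into one 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* one packed add handles all four lanes */
#else
    for (int i = 0; i < 4; i++)              /* scalar fallback when SSE is unavailable */
        out[i] = a[i] + b[i];
#endif
}
```

The unaligned load and store variants are used so the sketch works for any `float` array; aligned variants exist but require 16-byte aligned addresses.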













Significant performance boost










# **BONUS SLIDES**

You are responsible for the material contained on the following slides, though we may not have enough time to get to them in lecture.

They have been prepared in a way that should be easily readable and the material will be touched upon in the following lecture.




## Data Level Parallelism and SIMD

- SIMD wants adjacent values in memory that can be operated on in parallel
- Usually specified in programs as loops: `for (i=0; i<1000; i++) x[i] = x[i] + s;`
- How can we reveal more data level parallelism than is available in a single iteration of a loop?
  - Unroll the loop and adjust iteration rate
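One way the loop above might map onto SIMD hardware is sketched here: four adjacent elements are loaded, added to a broadcast copy of `s`, and stored back per iteration. The slide does not give the element type, so `float` is an assumption, as is the function name `add_scalar_simd`; without SSE support the code degrades to the plain scalar loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* _mm_set1_ps, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Computes x[i] = x[i] + s for i in [0, n). */
void add_scalar_simd(float *x, size_t n, float s)
{
    assert(x != NULL || n == 0);
    size_t i = 0;
#if defined(__SSE__)
    __m128 vs = _mm_set1_ps(s);               /* s copied into all four lanes */
    for (; i + 4 <= n; i += 4) {              /* iteration rate adjusted to 4 */
        __m128 vx = _mm_loadu_ps(&x[i]);      /* four adjacent values from memory */
        _mm_storeu_ps(&x[i], _mm_add_ps(vx, vs));
    }
#endif
    for (; i < n; i++)                        /* leftover elements, one at a time */
        x[i] = x[i] + s;
}
```

The adjacency requirement shows up directly in the load: one instruction fetches `x[i]` through `x[i+3]` only because they sit next to each other in memory.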






## Looping in MIPS

Assumptions:

- \$s0 → initial address (beginning of array)
- \$s1 → scalar value s
- \$s2 → termination address (end of array)

| Label | Instruction | Operands        | Comment                   |
|-------|-------------|-----------------|---------------------------|
| Loop: | lw          | \$t0,0(\$s0)    | # load array element      |
|       | addu        | \$t0,\$t0,\$s1  | # add s to array element  |
|       | sw          | \$t0,0(\$s0)    | # store result            |
|       | addiu       | \$s0,\$s0,4     | # move to next element    |
|       | bne         | \$s0,\$s2,Loop  | # repeat Loop if not done |
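For readers more comfortable with C, the MIPS loop can be mirrored with pointers, one statement per instruction group. This is only a sketch of the correspondence: the function name `add_scalar_ptr` is hypothetical, and 32-bit unsigned elements are assumed so that the addition wraps the way `addu` does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* begin plays the role of $s0, s of $s1, and end of $s2. */
void add_scalar_ptr(uint32_t *begin, uint32_t *end, uint32_t s)
{
    /* Like the MIPS loop, the body runs once before the exit test,
     * so the array must hold at least one element. */
    assert(begin != NULL && begin != end);
    uint32_t *p = begin;
    do {
        *p = *p + s;     /* lw, addu, sw */
        p++;             /* addiu $s0,$s0,4: advance by one 4-byte word */
    } while (p != end);  /* bne $s0,$s2,Loop */
}
```

The `do`/`while` shape is deliberate: the branch sits at the bottom of the MIPS loop, so a `for` loop with a test at the top would not be an exact match.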

## Loop Unrolling in C

Instead of the compiler doing loop unrolling, you could do it yourself in C:

```
for(i=0; i<1000; i++)
  x[i] = x[i] + s;
```

Loop unroll:

```
for(i=0; i<1000; i=i+4) {
  x[i]   = x[i]   + s;
  x[i+1] = x[i+1] + s;
  x[i+2] = x[i+2] + s;
  x[i+3] = x[i+3] + s;
}
```

What is the downside of doing this in C?
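One practical concern with hand unrolling, whatever answer the slide has in mind, is that the unrolled loop silently assumes the trip count is a multiple of four. A minimal sketch of a version that handles any length follows; the function name `add_scalar_unrolled` and the `int` element type are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Computes x[i] = x[i] + s for i in [0, n), unrolled by four. */
void add_scalar_unrolled(int *x, size_t n, int s)
{
    assert(x != NULL || n == 0);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {  /* main unrolled body */
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
    for (; i < n; i++)            /* cleanup loop for the last n % 4 elements */
        x[i] = x[i] + s;
}
```

The extra cleanup loop is the price of correctness for lengths such as 10 or 3, and it also shows how manual unrolling makes the source longer and harder to maintain.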

