### **Distributed Data Processing Environments**

José Orlando Pereira

Departamento de Informática Universidade do Minho



#### "Fast"

- <u>Latency</u>: time to complete a task
- Throughput: tasks completed in a unit of time

 Hard / expensive to achieve both at the same time



### Latency vs. throughput

- Latency vs. bandwidth trade-off changes with <u>load</u>
- When approaching system capacity, latency increases with queuing
- Optimization means pushing the curve right/down



### A model of computing



### A model of computing



- Challenges for data-intensive programs:
  - RAM memory is not big enough
  - RAM memory is not fast enough

# Memory hierarchy







| execute typical instruction         | 1/1,000,000,000  sec = 1  nanosec      |
|-------------------------------------|----------------------------------------|
| fetch from L1 cache memory          | 0.5 nanosec                            |
| branch misprediction                | 5 nanosec                              |
| fetch from L2 cache memory          | 7 nanosec                              |
| Mutex lock/unlock                   | 25 nanosec                             |
| fetch from main memory              | 100 nanosec                            |
| send 2K bytes over 1Gbps network    | 20,000 nanosec                         |
| read 1MB sequentially from memory   | 250,000 nanosec                        |
| fetch from new disk location (seek) | 8,000,000 nanosec                      |
| read 1MB sequentially from disk     | 20,000,000 nanosec                     |
| send packet US to Europe and back   | 150 milliseconds = 150,000,000 nanosec |

Source: http://norvig.com/21-days.html#answers

#### **Key Issue:**

How much data has to be moved for each operation



capacity

latency

### Memory hierarchy

- Minimize data movement to optimize performance
- General strategies:
  - Improve <u>locality</u> → Do more with data that is already loaded up in the memory hierarchy
  - Be <u>thrifty</u> → Avoid loading data that is not strictly necessary

### A model of computing



- Challenge for data-intensive programs:
  - Computation is not fast enough

### Moore's Law

#### 50 Years of Microprocessor Trend Data



Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten New plot and data collected for 2010-2021 by K. Rupp

Source https://github.com/karlrupp/microprocessor-trend-data

### **Pipelining**

#### instruction latency = 5 cycles



throughput = 1 instruction / cycle

## **Pipelining**



- Data dependency:
  - Trying to load a value that has not yet been computed

## **Pipelining**

| Fetch             | Decode         | Load           | Execute        | Store          |
|-------------------|----------------|----------------|----------------|----------------|
| jns <u>.L2</u>    |                |                |                |                |
| ???               | jns <u>.L2</u> |                |                |                |
| stall             | stall          | jns <u>.L2</u> |                |                |
| stall             | stall          | stall          | jns <u>.L2</u> |                |
| stall             | stall          | stall          | stall          | jns <u>.L2</u> |
| mov edx, DWORD PT |                | stall          | stall          | stall          |

- Control flow dependency:
  - Cannot predict the next instruction

#### Vectorization



- Use wide registers that can fit vectors instead of scalars:
  - Example: Intel AVX512 → 512 bits
    - 64 byte vector
    - 32 shorts
    - 16 ints
    - ..

Load, execute, and store full vectors, or slices of vectors, in a single instruction

Key technique in GPUs

#### Multi-core



### **Distributed**



### **Coordination overhead**

- Splitting a task incurs in coordination overhead
- Consider two versions of a chunked vector operation:
  - Get chunk of size 1, execute
  - Get chunk of size 2, execute one and the other



### **Coordination overhead**



Eventually, at least one core is blocked waiting for coordination

#### **Coordination overhead**



 Reducing the <u>contention</u> on coordination improves performance, even if doing the same work!

### Amdahl's Law



#### Fault tolerance



### Hardware abstraction and protection?

- How to run programs on computers with different configurations?
  - Memory capacity
  - # of CPU cores
- How to preven the running program from acessing all resources?
  - Stored data



### **Operating system**



# Hypervisor



### **Cloud computing**

- Hypervisors allow resources to be pooled and sliced
  - Elasticity
  - Computing as an utility
- Available in Infrastructure as a Service (laaS) from <u>cloud</u> <u>providers</u>
  - Cost effective for data storage and processing

#### **Key Issue:**

**Exploiting cloud computing** 

### Summary

- Key issues for distributed data processing:
  - Data movement
  - Parallelism
  - Coordination
  - Financial cost
  - Fault tolerance
- We will often justify design and implementation decisions with these issues!