Parallel & Multi-core Computing

Complete Study Guide — All 5 Units with Examples & Diagrams

UNIT I

Introduction to Multi-core Architecture

Motivation for Concurrency & Parallel Computing

Concurrency means multiple tasks making progress in overlapping time periods; parallelism means they literally execute at the same instant on separate hardware. As single-core CPU clock speeds hit physical limits (~3–4 GHz), the industry shifted to multiple cores to keep performance scaling.

Amdahl's Law — the theoretical speedup of a program using multiple processors is limited by the fraction that must run serially.
Speedup = 1 / [ (1 - P) + P/N ]
P = parallel fraction, N = number of processors
Example
If 80% of your program can be parallelized (P=0.8) and you use 4 cores (N=4):
Speedup = 1 / [(0.2) + (0.8/4)] = 1 / 0.4 = 2.5× — not 4×!
[Figure: Amdahl's Law — speedup (Y) vs. number of cores (X, 1–32) for parallel fractions P = 50%, 80%, 95%; speedup flattens as the serial fraction dominates]
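A quick way to build intuition is to compute the bound directly. A minimal sketch (the function name amdahl_speedup is ours):

```
#include <stdio.h>

// Amdahl's Law: speedup = 1 / ((1 - P) + P/N)
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    // Reproduces the example above: P = 0.8, N = 4 -> 2.50x
    printf("%.2fx\n", amdahl_speedup(0.8, 4));
    // Even with a huge core count, P = 0.8 caps speedup near 1/0.2 = 5x
    printf("%.2fx\n", amdahl_speedup(0.8, 1000000));
    return 0;
}
```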
🖥️ Parallel Computing Platforms & Hyper-Threading
Shared Memory (SMP)

All CPUs share one memory space. Easier to program, but scalability is limited by contention on the shared memory bus. Example: modern multi-core laptops/desktops.

Distributed Memory

Each CPU has its own memory. Communicates via message passing (MPI). Example: supercomputer clusters.

Hyper-Threading (Intel SMT)

One physical core appears as 2 logical cores to the OS. The core has duplicate registers/state for 2 threads, sharing execution units. When one thread stalls (e.g., cache miss), the other runs.

[Figure: Hyper-Threading — one physical core with shared execution units (ALU, FPU) runs Thread 0 and Thread 1; the OS sees 2 logical cores, and the two hardware threads share cache and the memory bus]
🧵 System Overview of Threading

A thread is the smallest unit of CPU execution. Multiple threads within one process share the same code, heap, and global data — but each has its own stack and registers.

[Figure: Process vs. threads memory model — one process address space with a shared code segment and shared heap/global variables; Thread 1, 2, 3 each have their OWN stack and registers]
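A tiny demonstration of this sharing (a sketch; the variable names are ours): every thread reads the same global, while the argument copied onto each thread's stack stays private.

```
#include <pthread.h>
#include <stdio.h>

int shared_global = 42;          // heap/globals: visible to every thread

void* worker(void* arg) {
    int id = (int)(size_t)arg;   // lives on THIS thread's private stack
    printf("thread %d sees shared_global=%d\n", id, shared_global);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void*)(size_t)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```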

Thread Lifecycle (State Machine)

New → Ready → Running → Waiting/Blocked → Terminated

VMs & Virtualization: Virtual machines provide isolation at the OS level. Each VM has its own virtual CPUs, but the underlying hardware threads are shared among VMs and scheduled by the hypervisor.
UNIT II

Fundamental Concepts of Parallel Programming

🔧 Task Decomposition & Data Decomposition

Breaking a problem into parallel work is the first step. There are two main strategies:

Task Decomposition

Divide the work/functions among threads. Each thread does a different operation. Example: one thread reads data, another processes, another writes output (pipeline).

Data Decomposition

Same operation on different chunks of data. Example: sorting a 1M array — divide into 4 parts, each thread sorts its part (like parallel merge sort).

[Figure: Data decomposition — parallel array sum; a 16-element array split into four chunks: T0 sums [0..3]=16, T1 sums [4..7]=20, T2 sums [8..11]=18, T3 sums [12..15]=22; final sum = 76]
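A minimal pthreads sketch of the figure above (thread count and array values match the diagram; all names are ours). Each thread writes only its own slot of partial[], so no synchronization is needed until the serial combine:

```
#include <pthread.h>
#include <stdio.h>

#define N 16
#define T 4

int data[N] = {1,3,5,7, 2,4,6,8, 9,1,3,5, 10,2,4,6};
long partial[T];                       // one slot per thread -> no shared writes

void* sum_chunk(void* arg) {
    int id = (int)(size_t)arg;
    long s = 0;
    for (int i = id * (N / T); i < (id + 1) * (N / T); i++)
        s += data[i];                  // each thread reads its own quarter
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[T];
    for (int i = 0; i < T; i++)
        pthread_create(&t[i], NULL, sum_chunk, (void*)(size_t)i);
    long total = 0;
    for (int i = 0; i < T; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];           // serial combine of the 4 partial sums
    }
    printf("Final Sum = %ld\n", total);   // prints 76
    return 0;
}
```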

Challenges You'll Face

  • Load Imbalance — some threads finish much sooner than others, wasting CPU time
  • Synchronization Overhead — barriers, locks add latency
  • False Sharing — two threads modify different variables that happen to share a cache line
  • Race Conditions — non-deterministic output due to unsynchronized shared writes
💢 Race Conditions & Error Diffusion Algorithm

A race condition occurs when the program output depends on the relative timing of threads. Classic example: two threads incrementing a shared counter.

Race Condition — Interleaved Increment (pseudocode trace)
```
int counter = 0;            // shared variable

// Thread A                    // Thread B
read counter  → 5              read counter  → 5
add 1         → 6              add 1         → 6
write counter ← 6              write counter ← 6

// Both threads incremented, but counter went 5→6, NOT 5→7!
```
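The standard fix is to make the read-modify-write atomic, for example with a mutex. A minimal POSIX-threads sketch:

```
#include <pthread.h>

int counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void increment(void) {
    pthread_mutex_lock(&lock);    // only one thread at a time past this point
    counter++;                    // the read-modify-write is now atomic
    pthread_mutex_unlock(&lock);
}
```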

Error Diffusion Algorithm (Parallel)

Used in image processing (e.g., dithering). Quantization error of one pixel is diffused to neighbouring pixels. In parallel, threads handling adjacent rows can conflict — a classic parallel programming challenge.

Solution: Process odd and even rows in alternating phases so adjacent rows are never written simultaneously (checkerboard decomposition).

[Figure: Parallel error diffusion — checkerboard pattern. Phase 1: even rows processed in parallel while odd rows idle; Phase 2: roles reversed — odd rows run, even rows idle]
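A minimal sketch of the phased approach (diffuse_row is a hypothetical per-row kernel; the OpenMP loop is our own illustration). Rows of the same parity never run at the same time, so threads working on neighbouring rows cannot write to each other's pixels:

```
void diffuse_row(int r);   // hypothetical per-row error-diffusion kernel

void diffuse_image(int rows) {
    for (int phase = 0; phase < 2; phase++) {     // 0: even rows, 1: odd rows
        #pragma omp parallel for
        for (int r = phase; r < rows; r += 2)
            diffuse_row(r);
        // implicit barrier at the end of the parallel for separates the phases
    }
}
```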
UNIT III

Threading & Parallel Programming Constructs

🔒 Synchronization, Deadlocks & Semaphores

When threads share resources, they must coordinate access to prevent corruption. The key constructs are:

| Construct | Purpose | Example Use |
|---|---|---|
| Mutex / Lock | Only one thread enters critical section | Updating a shared counter |
| Semaphore | Controls access to N units of a resource | Connection pool of 5 DB connections |
| Condition Variable | Thread waits until a condition is true | Producer-consumer queue |
| Barrier | All threads wait until everyone reaches this point | Parallel phases (render, then composite) |
| Monitor | OOP-level sync — object with built-in lock | Java synchronized methods |
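As an illustration of the semaphore row, a minimal POSIX sketch gating access to 5 connections (all names are ours):

```
#include <semaphore.h>

sem_t pool;                                   // counts free "connections"

void init(void) { sem_init(&pool, 0, 5); }    // 5 units available

void use_connection(void) {
    sem_wait(&pool);             // blocks whenever all 5 are in use
    /* ... use one DB connection ... */
    sem_post(&pool);             // return the unit to the pool
}
```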

Deadlock — Coffman's 4 Conditions

A deadlock occurs when ALL four conditions hold simultaneously: Mutual Exclusion (a resource is held exclusively), Hold & Wait (a thread holds one resource while waiting for another), No Preemption (resources cannot be forcibly taken away), and Circular Wait (e.g., T1 holds R1 and waits for R2 while T2 holds R2 and waits for R1).
[Figure: Circular wait — Thread A holds Mutex 1 and wants Mutex 2; Thread B holds Mutex 2 and wants Mutex 1. ⚠ DEADLOCK — neither can proceed]

Deadlock Prevention

  • Lock Ordering: Always acquire locks in the same global order (e.g., always Mutex1 before Mutex2) — see the sketch after this list
  • Try-Lock with Timeout: If a lock can't be acquired within N ms, release the locks already held and retry
  • Banker's Algorithm: The OS checks whether granting a request would leave the system in a safe state before allocating the resource
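A minimal sketch of lock ordering (function and lock names are ours). Because every thread takes m1 before m2, a circular wait can never form:

```
#include <pthread.h>

pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

// Every thread that needs both locks acquires them in the SAME global order
void transfer(void) {
    pthread_mutex_lock(&m1);
    pthread_mutex_lock(&m2);
    /* ... critical section touching both resources ... */
    pthread_mutex_unlock(&m2);   // release in reverse order
    pthread_mutex_unlock(&m1);
}
```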
💬 Message Passing & Threading APIs

Instead of sharing memory, threads or processes can communicate by sending messages. This eliminates data races on shared state, but requires explicit send/receive calls.
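For processes, the canonical message-passing API is MPI (mentioned in Unit I). A minimal sketch of a two-rank exchange:

```
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?

    int msg = 0;
    if (rank == 0) {
        msg = 99;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // explicit send
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                         // explicit receive
        printf("rank 1 received %d\n", msg);
    }
    MPI_Finalize();
    return 0;
}
```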

PThread — POSIX Threads (C)
```
#include <pthread.h>
#include <stdio.h>

void* worker(void* arg) {
    int id = *(int*)arg;
    printf("Thread %d running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int a = 1, b = 2;
    pthread_create(&t1, NULL, worker, &a);  // create thread
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, NULL);                 // wait for t1 to finish
    pthread_join(t2, NULL);
    return 0;
}
```
Win32 Thread (Windows API)
```
#include <windows.h>

DWORD WINAPI ThreadFunc(LPVOID arg);   // thread entry point (defined elsewhere)

int main(void) {
    int args = 1;
    DWORD threadId;
    HANDLE h = CreateThread(
        NULL,           // default security attributes
        0,              // default stack size
        ThreadFunc,     // function pointer
        (LPVOID)&args,  // argument passed to ThreadFunc
        0,              // 0 = run immediately
        &threadId);     // receives the new thread's ID
    WaitForSingleObject(h, INFINITE);  // wait for the thread to finish
    CloseHandle(h);
    return 0;
}
```
Fence Instruction: A hardware memory barrier. Ensures all memory operations before the fence are visible to all processors before any operation after the fence. Critical in lock-free programming.
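In portable C11, the same idea is exposed as an explicit fence. A minimal sketch (our own producer/consumer flag example): the release fence makes the payload visible before the flag, and the acquire fence orders the flag read before the payload read.

```
#include <stdatomic.h>

int payload;              // ordinary data
atomic_int ready = 0;     // flag protecting it

void producer(void) {
    payload = 42;
    atomic_thread_fence(memory_order_release);  // payload visible before flag
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

void consumer(void) {
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0) { /* spin */ }
    atomic_thread_fence(memory_order_acquire);  // flag read ordered before payload read
    // payload is guaranteed to be 42 here
}
```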
UNIT IV

OpenMP — A Portable Solution for Threading

🚀 OpenMP Fundamentals & Parallel Loops

OpenMP is a shared-memory parallelism API using compiler directives (#pragma). It's portable — same code runs on any OpenMP-supporting compiler (GCC, Intel, MSVC).

[Figure: OpenMP fork-join execution model (C/C++ and Fortran, pragma-based) — a serial main thread FORKs a team of threads (Thread 0, 1, 2, …) at a parallel region, then JOINs back to serial execution]
OpenMP Parallel For Loop
```
#include <omp.h>
#include <stdio.h>

#pragma omp parallel for
for (int i = 0; i < N; i++) {
    result[i] = a[i] + b[i];  // each iteration → a different thread
}

// Reduction — safely sum across threads
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
    sum += data[i];
}

// Control the number of threads
omp_set_num_threads(4);
printf("Using up to %d threads\n", omp_get_max_threads());
// note: omp_get_num_threads() returns 1 when called outside a parallel region
```

Work-sharing Constructs

| Directive | What it does |
|---|---|
| #pragma omp parallel for | Splits loop iterations across threads |
| #pragma omp sections | Different code blocks run in parallel |
| #pragma omp single | Only one thread executes this block |
| #pragma omp barrier | All threads synchronize at this point |
| #pragma omp critical | Only one thread executes at a time (mutex) |
| #pragma omp atomic | Atomic read-modify-write on a single variable |
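A small sketch combining several of these directives (a toy example of our own, not from the source):

```
#include <omp.h>
#include <stdio.h>

void demo(void) {
    int hits = 0;
    #pragma omp parallel
    {
        #pragma omp sections        // the two sections run on different threads
        {
            #pragma omp section
            printf("stage A\n");
            #pragma omp section
            printf("stage B\n");
        }                           // implicit barrier at the end of sections

        #pragma omp critical        // one thread at a time updates the counter
        hits++;

        #pragma omp barrier         // make sure every increment is done

        #pragma omp single          // exactly one thread reports the result
        printf("team of %d threads, hits=%d\n", omp_get_num_threads(), hits);
    }
}
```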

Avoiding False Sharing

False Sharing: Threads writing to different variables that happen to sit on the same cache line (typically 64 bytes) cause constant cache-line invalidation traffic. Fix: pad per-thread data out to separate cache lines, or accumulate into thread-local variables and combine them into the shared result via reduction.
False Sharing Fix — Padding
```
// BAD: the four partial sums share one cache line
double partial_sum[4];
// thread i writes partial_sum[i] — every write invalidates the others' copy

// GOOD: pad each slot to a full cache line (64 bytes = 8 doubles)
struct { double val; double pad[7]; } partial_sum_padded[4];
// thread i writes partial_sum_padded[i].val — no cache line is shared
```
🏎️ Intel Task Queuing & OpenMP Environment

The #pragma omp task directive creates a task — a unit of work that can be executed by any thread in the team. Useful for irregular parallelism (e.g., tree traversal).

OpenMP Tasks — Fibonacci (Recursive Parallelism)
```
#include <omp.h>

int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait   // wait for both child tasks
    return x + y;
}

// Call site: tasks need an enclosing parallel region; 'single' ensures
// only one thread seeds the recursion while the whole team executes tasks.
int fib_parallel(int n) {
    int result;
    #pragma omp parallel
    #pragma omp single
    result = fib(n);
    return result;
}
```

OpenMP Environment Variables

| Variable | Purpose | Example |
|---|---|---|
| OMP_NUM_THREADS | Set thread count | export OMP_NUM_THREADS=8 |
| OMP_SCHEDULE | Loop scheduling type | OMP_SCHEDULE="dynamic,4" |
| OMP_NESTED | Enable nested parallelism | OMP_NESTED=TRUE |
| OMP_STACKSIZE | Per-thread stack size | OMP_STACKSIZE=64M |

Compilation & Debugging

Compile & Run
```
# GCC
gcc -fopenmp program.c -o program

# Intel ICC
icc -qopenmp program.c -o program

# Run with Valgrind DRD for race detection
valgrind --tool=drd ./program

# Intel Inspector for race conditions
inspxe-cl -collect ti3 -- ./program
```
UNIT V

Solutions to Common Parallel Programming Problems

🧩 Too Many Threads, Data Races, Live Locks
🔴 Too Many Threads

Context-switching overhead dominates. Rule of thumb: N_threads = N_cores × (1 + wait_time/compute_time). Example: on 8 cores where threads wait about as long as they compute, use roughly 8 × (1 + 1) = 16 threads. For purely CPU-bound work: threads ≈ cores.

🔴 Data Race

Two threads access same memory, at least one writes, no synchronization. Detection tools: ThreadSanitizer (TSan), Helgrind, Intel Inspector.

🔴 Live Lock

Like deadlock, but the threads keep changing state (they are "active") yet make no progress. Example: two people politely stepping aside for each other in a hallway — both keep stepping the same way and never get past each other.

🔴 Priority Inversion

Low-priority thread holds a lock needed by high-priority thread. High-priority thread blocked. Solution: Priority Inheritance — temporarily boost low-priority thread's priority.


Solutions for Heavily Contended Locks

  • Lock Striping: Instead of one lock for a HashMap, use 16 locks — each guards 1/16 of the buckets (Java ConcurrentHashMap uses this)
  • Non-blocking Algorithms (CAS): Use atomic Compare-And-Swap to update without locks
  • Reader-Writer Locks: Multiple readers allowed simultaneously; writers get exclusive access
C — Compare and Swap (CAS) — Lock-free Counter
```
#include <stdatomic.h>

atomic_int counter = 0;

void increment(void) {
    // No mutex needed — the hardware guarantees the add is atomic
    atomic_fetch_add(&counter, 1);
}

// Manual CAS loop: retry until no other thread raced with us
void increment_cas(void) {
    int old_val = atomic_load(&counter);
    int new_val;
    do {
        new_val = old_val + 1;
        // on failure, atomic_compare_exchange_weak reloads old_val
        // with the counter's current value, so the loop retries correctly
    } while (!atomic_compare_exchange_weak(&counter, &old_val, new_val));
}
```
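And a sketch of the reader-writer lock bullet above, using POSIX rwlocks (names are ours):

```
#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int shared_value = 0;

int read_value(void) {
    pthread_rwlock_rdlock(&rw);    // many readers may hold this simultaneously
    int v = shared_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

void write_value(int v) {
    pthread_rwlock_wrlock(&rw);    // a writer gets exclusive access
    shared_value = v;
    pthread_rwlock_unlock(&rw);
}
```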
💾 Memory & Cache Optimization

Memory Consistency Models

Different architectures allow CPUs to reorder memory operations for performance. Programmers need to insert barriers to enforce ordering.
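The classic store-buffering litmus test illustrates the problem (a sketch of our own; the two functions are assumed to run on different cores):

```
int x = 0, y = 0;
int r1, r2;

void cpu0(void) { x = 1; r1 = y; }   // the CPU may satisfy the load of y
void cpu1(void) { y = 1; r2 = x; }   // before its store to x becomes visible

// Without barriers, the outcome r1 == 0 && r2 == 0 is allowed on x86 (TSO)
// and weaker models. Inserting a full fence between the store and the load
// on each side — e.g. atomic_thread_fence(memory_order_seq_cst) — forbids it.
```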

[Figure: Memory hierarchy latencies (log scale) — L1 ~1 ns, L2 ~5 ns, L3 ~30 ns, RAM ~100 ns, SSD ~50 µs, HDD ~5 ms; closer to the CPU = faster = smaller]

Cache-friendly Programming Tips

  • Row-major access: In C, 2D arrays are stored row-by-row. Always iterate rows in outer loop.
  • Struct of Arrays vs Array of Structs: For parallel access, SoA is more cache-friendly (SIMD-friendly).
  • Prefetching: Use __builtin_prefetch(addr, rw, locality) in GCC to hint CPU to load data before needed.
Cache-friendly Matrix Access (C)
```
// BAD — column-major: cache thrashes, jumps N*4 bytes each step!
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        sum += matrix[i][j];

// GOOD — row-major: sequential cache-line access ✓
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        sum += matrix[i][j];
```

Non-blocking Algorithms & ABA Problem

ABA Problem: Thread reads value A, gets preempted. Another thread changes A→B→A. First thread resumes, CAS succeeds (sees A) but the data structure is now in a different state! Fix: use tagged pointers — store a version counter alongside the pointer.
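A minimal sketch of the tagged-pointer fix in C11 (our own names; assumes the pointer + counter pair is CAS-able on the target, e.g. via a double-width compare-exchange):

```
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { void* ptr; uint64_t ver; } tagged_ptr;  // pointer + version tag

_Atomic tagged_ptr top;   // e.g., the head of a lock-free stack

// The CAS succeeds only if BOTH the pointer AND the version match,
// so an A -> B -> A change is caught by the bumped version counter.
bool cas_tagged(void* expect_ptr, uint64_t expect_ver, void* new_ptr) {
    tagged_ptr expected = { expect_ptr, expect_ver };
    tagged_ptr desired  = { new_ptr, expect_ver + 1 };  // bump version every update
    return atomic_compare_exchange_weak(&top, &expected, desired);
}
```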
🔬 High-Performance Architecture Concepts

x86-64 Architecture Essentials

| Concept | Description | Impact |
|---|---|---|
| Pipeline Stages | Fetch → Decode → Execute → Write-back | Stalls hurt throughput |
| Branch Prediction | CPU guesses the if-else outcome ahead of time | Misprediction ≈ 15-cycle penalty |
| Out-of-Order Exec | CPU reorders independent instructions | Better ILP (Instruction Level Parallelism) |
| SIMD (SSE/AVX) | Single instruction, multiple data | Process 8 floats simultaneously with AVX |
| NUMA | Non-Uniform Memory Access — each socket has local RAM | Remote RAM access 2–4× slower |
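To make the SIMD row concrete, a hedged sketch using AVX intrinsics (compile with -mavx; the function name is ours):

```
#include <immintrin.h>

// Adds two float arrays 8 elements at a time using 256-bit AVX registers.
// Assumes n is a multiple of 8; a real version would handle the remainder.
void add_avx(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);                // load 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));  // 8 adds, 1 instruction
    }
}
```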

Avoiding Pipeline Stalls

Branch Avoidance — Branchless Code
```
// Branchy — hard to predict
int max_branchy(int a, int b) {
    if (a > b) return a;
    return b;
}

// Branchless bit trick — no pipeline flush:
// if a < b, -(a < b) is all 1s and the XOR selects b; otherwise it selects a
int max_bits(int a, int b) {
    return a ^ ((a ^ b) & -(a < b));
}

// Or simply — most compilers emit CMOV (conditional move), no branch:
int max_ternary(int a, int b) {
    return (a > b) ? a : b;
}
```

Thread-safe Functions & Bandwidth

  • Thread-safe functions can be called concurrently without data corruption. They either use no shared state (pure functions) or protect shared state with locks.
  • Memory Bandwidth is often the bottleneck for parallel programs (not CPU speed). Use the roofline model to determine whether your code is compute-bound or memory-bound.
  • Working with Cache: Keep hot data under ~32 KB (typical L1 data cache per core). Use tiling/blocking to reuse cached data — see the sketch after this list.
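A minimal sketch of tiling/blocking (our own example, a blocked matrix transpose): each B×B tile is read and written while it still sits in cache, instead of striding across all of memory.

```
#define B 32   // tile size chosen so one tile fits comfortably in L1

void transpose_blocked(int n, const double* src, double* dst) {
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            // work on one cache-resident B x B tile at a time
            for (int i = ii; i < ii + B && i < n; i++)
                for (int j = jj; j < jj + B && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```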
Key Takeaway: Write parallel programs that minimize synchronization, maximize cache reuse, ensure load balance across threads, and use the right granularity of parallelism (not too fine-grained or too coarse).