Parallel & Multi-core Computing

Complete Study Guide — All 5 Units with Examples & Diagrams

UNIT I

Introduction to Multi-core Architecture

Motivation for Concurrency & Parallel Computing

Concurrency means multiple tasks making progress in overlapping time periods; parallelism means they literally execute at the same instant on separate hardware. As single-core CPU clock speeds hit physical limits (~3–4 GHz), the industry shifted to multiple cores to keep performance scaling.

Amdahl's Law — the theoretical speedup of a program using multiple processors is limited by the fraction that must run serially.
Speedup = 1 / [ (1 - P) + P/N ]
P = parallel fraction, N = number of processors
Example
If 80% of your program can be parallelized (P=0.8) and you use 4 cores (N=4):
Speedup = 1 / [(0.2) + (0.8/4)] = 1 / 0.4 = 2.5× — not 4×!
[Figure: Amdahl's Law — speedup (Y) vs. number of cores (X, 1–32) for parallel fractions P = 50%, 80%, 95%; speedup flattens as the serial fraction dominates]
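A quick way to build intuition is to compute the bound directly. A minimal sketch (the function name amdahl_speedup is ours):

```
#include <stdio.h>

// Amdahl's Law: speedup = 1 / ((1 - P) + P/N)
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    // Reproduces the example above: P = 0.8, N = 4 -> 2.50x
    printf("%.2fx\n", amdahl_speedup(0.8, 4));
    // Even with a huge core count, P = 0.8 caps speedup near 1/0.2 = 5x
    printf("%.2fx\n", amdahl_speedup(0.8, 1000000));
    return 0;
}
```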
🖥️ Parallel Computing Platforms & Hyper-Threading
Shared Memory (SMP)

All CPUs share one memory space. Easier to program, but scalability is limited by contention on the shared memory bus. Example: modern multi-core laptops/desktops.

Distributed Memory

Each CPU has its own memory. Communicates via message passing (MPI). Example: supercomputer clusters.

Hyper-Threading (Intel SMT)

One physical core appears as 2 logical cores to the OS. The core has duplicate registers/state for 2 threads, sharing execution units. When one thread stalls (e.g., cache miss), the other runs.

[Figure: Hyper-Threading — one physical core with shared execution units (ALU, FPU) runs Thread 0 and Thread 1; the OS sees 2 logical cores, and the two hardware threads share cache and the memory bus]
🧵 System Overview of Threading

A thread is the smallest unit of CPU execution. Multiple threads within one process share the same code, heap, and global data — but each has its own stack and registers.

[Figure: Process vs. threads memory model — one process address space with a shared code segment and shared heap/global variables; Thread 1, 2, 3 each have their OWN stack and registers]
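A tiny demonstration of this sharing (a sketch; the variable names are ours): every thread reads the same global, while the argument copied onto each thread's stack stays private.

```
#include <pthread.h>
#include <stdio.h>

int shared_global = 42;          // heap/globals: visible to every thread

void* worker(void* arg) {
    int id = (int)(size_t)arg;   // lives on THIS thread's private stack
    printf("thread %d sees shared_global=%d\n", id, shared_global);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void*)(size_t)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```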

Thread Lifecycle (State Machine)

New → Ready → Running → Waiting/Blocked → Terminated

VMs & Virtualization: Virtual machines provide isolation at the OS level. Each VM has its own virtual CPUs, but the underlying hardware threads are shared among VMs and scheduled by the hypervisor.
UNIT II

Fundamental Concepts of Parallel Programming

🔧 Task Decomposition & Data Decomposition

Breaking a problem into parallel work is the first step. There are two main strategies:

Task Decomposition

Divide the work/functions among threads. Each thread does a different operation. Example: one thread reads data, another processes, another writes output (pipeline).

Data Decomposition

Same operation on different chunks of data. Example: sorting a 1M array — divide into 4 parts, each thread sorts its part (like parallel merge sort).

[Figure: Data decomposition — parallel array sum; a 16-element array split into four chunks: T0 sums [0..3]=16, T1 sums [4..7]=20, T2 sums [8..11]=18, T3 sums [12..15]=22; final sum = 76]
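A minimal pthreads sketch of the figure above (thread count and array values match the diagram; all names are ours). Each thread writes only its own slot of partial[], so no synchronization is needed until the serial combine:

```
#include <pthread.h>
#include <stdio.h>

#define N 16
#define T 4

int data[N] = {1,3,5,7, 2,4,6,8, 9,1,3,5, 10,2,4,6};
long partial[T];                       // one slot per thread -> no shared writes

void* sum_chunk(void* arg) {
    int id = (int)(size_t)arg;
    long s = 0;
    for (int i = id * (N / T); i < (id + 1) * (N / T); i++)
        s += data[i];                  // each thread reads its own quarter
    partial[id] = s;
    return NULL;
}

int main(void) {
    pthread_t t[T];
    for (int i = 0; i < T; i++)
        pthread_create(&t[i], NULL, sum_chunk, (void*)(size_t)i);
    long total = 0;
    for (int i = 0; i < T; i++) {
        pthread_join(t[i], NULL);
        total += partial[i];           // serial combine of the 4 partial sums
    }
    printf("Final Sum = %ld\n", total);   // prints 76
    return 0;
}
```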

Challenges You'll Face

  • Load Imbalance — some threads finish much sooner than others, wasting CPU time
  • Synchronization Overhead — barriers, locks add latency
  • False Sharing — two threads modify different variables that happen to share a cache line
  • Race Conditions — non-deterministic output due to unsynchronized shared writes
💢 Race Conditions & Error Diffusion Algorithm

A race condition occurs when the program output depends on the relative timing of threads. Classic example: two threads incrementing a shared counter.

Race Condition — Interleaved Increment (pseudocode trace)
```
int counter = 0;            // shared variable

// Thread A                    // Thread B
read counter  → 5              read counter  → 5
add 1         → 6              add 1         → 6
write counter ← 6              write counter ← 6

// Both threads incremented, but counter went 5→6, NOT 5→7!
```
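The standard fix is to make the read-modify-write atomic, for example with a mutex. A minimal POSIX-threads sketch:

```
#include <pthread.h>

int counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void increment(void) {
    pthread_mutex_lock(&lock);    // only one thread at a time past this point
    counter++;                    // the read-modify-write is now atomic
    pthread_mutex_unlock(&lock);
}
```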

Error Diffusion Algorithm (Parallel)

Used in image processing (e.g., dithering). Quantization error of one pixel is diffused to neighbouring pixels. In parallel, threads handling adjacent rows can conflict — a classic parallel programming challenge.

Solution: Process odd and even rows in alternating phases so adjacent rows are never written simultaneously (checkerboard decomposition).

[Figure: Parallel error diffusion — checkerboard pattern. Phase 1: even rows processed in parallel while odd rows idle; Phase 2: roles reversed — odd rows run, even rows idle]
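A minimal sketch of the phased approach (diffuse_row is a hypothetical per-row kernel; the OpenMP loop is our own illustration). Rows of the same parity never run at the same time, so threads working on neighbouring rows cannot write to each other's pixels:

```
void diffuse_row(int r);   // hypothetical per-row error-diffusion kernel

void diffuse_image(int rows) {
    for (int phase = 0; phase < 2; phase++) {     // 0: even rows, 1: odd rows
        #pragma omp parallel for
        for (int r = phase; r < rows; r += 2)
            diffuse_row(r);
        // implicit barrier at the end of the parallel for separates the phases
    }
}
```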
UNIT III

Threading & Parallel Programming Constructs

🔒 Synchronization, Deadlocks & Semaphores

When threads share resources, they must coordinate access to prevent corruption. The key constructs are:

| Construct | Purpose | Example Use |
|---|---|---|
| Mutex / Lock | Only one thread enters critical section | Updating a shared counter |
| Semaphore | Controls access to N units of a resource | Connection pool of 5 DB connections |
| Condition Variable | Thread waits until a condition is true | Producer-consumer queue |
| Barrier | All threads wait until everyone reaches this point | Parallel phases (render, then composite) |
| Monitor | OOP-level sync — object with built-in lock | Java synchronized methods |
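As an illustration of the semaphore row, a minimal POSIX sketch gating access to 5 connections (all names are ours):

```
#include <semaphore.h>

sem_t pool;                                   // counts free "connections"

void init(void) { sem_init(&pool, 0, 5); }    // 5 units available

void use_connection(void) {
    sem_wait(&pool);             // blocks whenever all 5 are in use
    /* ... use one DB connection ... */
    sem_post(&pool);             // return the unit to the pool
}
```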

Deadlock — Coffman's 4 Conditions

A deadlock occurs when ALL four conditions hold simultaneously: Mutual Exclusion (a resource is held exclusively), Hold & Wait (a thread holds one resource while waiting for another), No Preemption (resources cannot be forcibly taken away), and Circular Wait (e.g., T1 holds R1 and waits for R2 while T2 holds R2 and waits for R1).
[Figure: Circular wait — Thread A holds Mutex 1 and wants Mutex 2; Thread B holds Mutex 2 and wants Mutex 1. ⚠ DEADLOCK — neither can proceed]

Deadlock Prevention

  • Lock Ordering: Always acquire locks in the same global order (e.g., always Mutex1 before Mutex2) — see the sketch after this list
  • Try-Lock with Timeout: If a lock can't be acquired within N ms, release the locks already held and retry
  • Banker's Algorithm: The OS checks whether granting a request would leave the system in a safe state before allocating the resource
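A minimal sketch of lock ordering (function and lock names are ours). Because every thread takes m1 before m2, a circular wait can never form:

```
#include <pthread.h>

pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

// Every thread that needs both locks acquires them in the SAME global order
void transfer(void) {
    pthread_mutex_lock(&m1);
    pthread_mutex_lock(&m2);
    /* ... critical section touching both resources ... */
    pthread_mutex_unlock(&m2);   // release in reverse order
    pthread_mutex_unlock(&m1);
}
```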
💬 Message Passing & Threading APIs

Instead of sharing memory, threads or processes can communicate by sending messages. This eliminates data races on shared state, but requires explicit send/receive calls.
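For processes, the canonical message-passing API is MPI (mentioned in Unit I). A minimal sketch of a two-rank exchange:

```
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?

    int msg = 0;
    if (rank == 0) {
        msg = 99;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // explicit send
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                         // explicit receive
        printf("rank 1 received %d\n", msg);
    }
    MPI_Finalize();
    return 0;
}
```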

PThread — POSIX Threads (C)
```
#include <pthread.h>
#include <stdio.h>

void* worker(void* arg) {
    int id = *(int*)arg;
    printf("Thread %d running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    int a = 1, b = 2;
    pthread_create(&t1, NULL, worker, &a);  // create thread
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, NULL);                 // wait for t1 to finish
    pthread_join(t2, NULL);
    return 0;
}
```
Win32 Thread (Windows API)
```
#include <windows.h>

DWORD WINAPI ThreadFunc(LPVOID arg);   // thread entry point (defined elsewhere)

int main(void) {
    int args = 1;
    DWORD threadId;
    HANDLE h = CreateThread(
        NULL,           // default security attributes
        0,              // default stack size
        ThreadFunc,     // function pointer
        (LPVOID)&args,  // argument passed to ThreadFunc
        0,              // 0 = run immediately
        &threadId);     // receives the new thread's ID
    WaitForSingleObject(h, INFINITE);  // wait for the thread to finish
    CloseHandle(h);
    return 0;
}
```
Fence Instruction: A hardware memory barrier. Ensures all memory operations before the fence are visible to all processors before any operation after the fence. Critical in lock-free programming.
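In portable C11, the same idea is exposed as an explicit fence. A minimal sketch (our own producer/consumer flag example): the release fence makes the payload visible before the flag, and the acquire fence orders the flag read before the payload read.

```
#include <stdatomic.h>

int payload;              // ordinary data
atomic_int ready = 0;     // flag protecting it

void producer(void) {
    payload = 42;
    atomic_thread_fence(memory_order_release);  // payload visible before flag
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

void consumer(void) {
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0) { /* spin */ }
    atomic_thread_fence(memory_order_acquire);  // flag read ordered before payload read
    // payload is guaranteed to be 42 here
}
```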
UNIT IV

OpenMP — A Portable Solution for Threading

🚀 OpenMP Fundamentals & Parallel Loops

OpenMP is a shared-memory parallelism API using compiler directives (#pragma). It's portable — same code runs on any OpenMP-supporting compiler (GCC, Intel, MSVC).

[Figure: OpenMP fork-join execution model (C/C++ and Fortran, pragma-based) — a serial main thread FORKs a team of threads (Thread 0, 1, 2, …) at a parallel region, then JOINs back to serial execution]
OpenMP Parallel For Loop
```
#include <omp.h>
#include <stdio.h>

#pragma omp parallel for
for (int i = 0; i < N; i++) {
    result[i] = a[i] + b[i];  // each iteration → a different thread
}

// Reduction — safely sum across threads
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
    sum += data[i];
}

// Control the number of threads
omp_set_num_threads(4);
printf("Using up to %d threads\n", omp_get_max_threads());
// note: omp_get_num_threads() returns 1 when called outside a parallel region
```

Work-sharing Constructs

| Directive | What it does |
|---|---|
| #pragma omp parallel for | Splits loop iterations across threads |
| #pragma omp sections | Different code blocks run in parallel |
| #pragma omp single | Only one thread executes this block |
| #pragma omp barrier | All threads synchronize at this point |
| #pragma omp critical | Only one thread executes at a time (mutex) |
| #pragma omp atomic | Atomic read-modify-write on a single variable |
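A small sketch combining several of these directives (a toy example of our own, not from the source):

```
#include <omp.h>
#include <stdio.h>

void demo(void) {
    int hits = 0;
    #pragma omp parallel
    {
        #pragma omp sections        // the two sections run on different threads
        {
            #pragma omp section
            printf("stage A\n");
            #pragma omp section
            printf("stage B\n");
        }                           // implicit barrier at the end of sections

        #pragma omp critical        // one thread at a time updates the counter
        hits++;

        #pragma omp barrier         // make sure every increment is done

        #pragma omp single          // exactly one thread reports the result
        printf("team of %d threads, hits=%d\n", omp_get_num_threads(), hits);
    }
}
```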

Avoiding False Sharing

False Sharing: Threads writing to different variables that happen to sit on the same cache line (typically 64 bytes) cause constant cache-line invalidation traffic. Fix: pad per-thread data out to separate cache lines, or accumulate into thread-local variables and combine them into the shared result via reduction.
False Sharing Fix — Padding
```
// BAD: the four partial sums share one cache line
double partial_sum[4];
// thread i writes partial_sum[i] — every write invalidates the others' copy

// GOOD: pad each slot to a full cache line (64 bytes = 8 doubles)
struct { double val; double pad[7]; } partial_sum_padded[4];
// thread i writes partial_sum_padded[i].val — no cache line is shared
```
🏎️ Intel Task Queuing & OpenMP Environment

The #pragma omp task directive creates a task — a unit of work that can be executed by any thread in the team. Useful for irregular parallelism (e.g., tree traversal).

OpenMP Tasks — Fibonacci (Recursive Parallelism)
```
#include <omp.h>

int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait   // wait for both child tasks
    return x + y;
}

// Call site: tasks need an enclosing parallel region; 'single' ensures
// only one thread seeds the recursion while the whole team executes tasks.
int fib_parallel(int n) {
    int result;
    #pragma omp parallel
    #pragma omp single
    result = fib(n);
    return result;
}
```

OpenMP Environment Variables

| Variable | Purpose | Example |
|---|---|---|
| OMP_NUM_THREADS | Set thread count | export OMP_NUM_THREADS=8 |
| OMP_SCHEDULE | Loop scheduling type | OMP_SCHEDULE="dynamic,4" |
| OMP_NESTED | Enable nested parallelism | OMP_NESTED=TRUE |
| OMP_STACKSIZE | Per-thread stack size | OMP_STACKSIZE=64M |

Compilation & Debugging

Compile & Run
```
# GCC
gcc -fopenmp program.c -o program

# Intel ICC
icc -qopenmp program.c -o program

# Run with Valgrind DRD for race detection
valgrind --tool=drd ./program

# Intel Inspector for race conditions
inspxe-cl -collect ti3 -- ./program
```
UNIT V

Solutions to Common Parallel Programming Problems

🧩 Too Many Threads, Data Races, Live Locks
🔴 Too Many Threads

Context-switching overhead dominates. Rule of thumb: N_threads = N_cores × (1 + wait_time/compute_time). Example: on 8 cores where threads wait about as long as they compute, use roughly 8 × (1 + 1) = 16 threads. For purely CPU-bound work: threads ≈ cores.

🔴 Data Race

Two threads access same memory, at least one writes, no synchronization. Detection tools: ThreadSanitizer (TSan), Helgrind, Intel Inspector.

🔴 Live Lock

Like deadlock, but the threads keep changing state (they are "active") yet make no progress. Example: two people politely stepping aside for each other in a hallway — both keep stepping the same way and never get past each other.

🔴 Priority Inversion

Low-priority thread holds a lock needed by high-priority thread. High-priority thread blocked. Solution: Priority Inheritance — temporarily boost low-priority thread's priority.


Solutions for Heavily Contended Locks

  • Lock Striping: Instead of one lock for a HashMap, use 16 locks — each guards 1/16 of the buckets (Java ConcurrentHashMap uses this)
  • Non-blocking Algorithms (CAS): Use atomic Compare-And-Swap to update without locks
  • Reader-Writer Locks: Multiple readers allowed simultaneously; writers get exclusive access
C — Compare and Swap (CAS) — Lock-free Counter
```
#include <stdatomic.h>

atomic_int counter = 0;

void increment(void) {
    // No mutex needed — the hardware guarantees the add is atomic
    atomic_fetch_add(&counter, 1);
}

// Manual CAS loop: retry until no other thread raced with us
void increment_cas(void) {
    int old_val = atomic_load(&counter);
    int new_val;
    do {
        new_val = old_val + 1;
        // on failure, atomic_compare_exchange_weak reloads old_val
        // with the counter's current value, so the loop retries correctly
    } while (!atomic_compare_exchange_weak(&counter, &old_val, new_val));
}
```
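And a sketch of the reader-writer lock bullet above, using POSIX rwlocks (names are ours):

```
#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int shared_value = 0;

int read_value(void) {
    pthread_rwlock_rdlock(&rw);    // many readers may hold this simultaneously
    int v = shared_value;
    pthread_rwlock_unlock(&rw);
    return v;
}

void write_value(int v) {
    pthread_rwlock_wrlock(&rw);    // a writer gets exclusive access
    shared_value = v;
    pthread_rwlock_unlock(&rw);
}
```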
💾 Memory & Cache Optimization

Memory Consistency Models

Different architectures allow CPUs to reorder memory operations for performance. Programmers need to insert barriers to enforce ordering.
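The classic store-buffering litmus test illustrates the problem (a sketch of our own; the two functions are assumed to run on different cores):

```
int x = 0, y = 0;
int r1, r2;

void cpu0(void) { x = 1; r1 = y; }   // the CPU may satisfy the load of y
void cpu1(void) { y = 1; r2 = x; }   // before its store to x becomes visible

// Without barriers, the outcome r1 == 0 && r2 == 0 is allowed on x86 (TSO)
// and weaker models. Inserting a full fence between the store and the load
// on each side — e.g. atomic_thread_fence(memory_order_seq_cst) — forbids it.
```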

[Figure: Memory hierarchy latencies (log scale) — L1 ~1 ns, L2 ~5 ns, L3 ~30 ns, RAM ~100 ns, SSD ~50 µs, HDD ~5 ms; closer to the CPU = faster = smaller]

Cache-friendly Programming Tips

  • Row-major access: In C, 2D arrays are stored row-by-row. Always iterate rows in outer loop.
  • Struct of Arrays vs Array of Structs: For parallel access, SoA is more cache-friendly (SIMD-friendly).
  • Prefetching: Use __builtin_prefetch(addr, rw, locality) in GCC to hint CPU to load data before needed.
Cache-friendly Matrix Access (C)
```
// BAD — column-major: cache thrashes, jumps N*4 bytes each step!
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        sum += matrix[i][j];

// GOOD — row-major: sequential cache-line access ✓
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        sum += matrix[i][j];
```

Non-blocking Algorithms & ABA Problem

ABA Problem: Thread reads value A, gets preempted. Another thread changes A→B→A. First thread resumes, CAS succeeds (sees A) but the data structure is now in a different state! Fix: use tagged pointers — store a version counter alongside the pointer.
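A minimal sketch of the tagged-pointer fix in C11 (our own names; assumes the pointer + counter pair is CAS-able on the target, e.g. via a double-width compare-exchange):

```
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { void* ptr; uint64_t ver; } tagged_ptr;  // pointer + version tag

_Atomic tagged_ptr top;   // e.g., the head of a lock-free stack

// The CAS succeeds only if BOTH the pointer AND the version match,
// so an A -> B -> A change is caught by the bumped version counter.
bool cas_tagged(void* expect_ptr, uint64_t expect_ver, void* new_ptr) {
    tagged_ptr expected = { expect_ptr, expect_ver };
    tagged_ptr desired  = { new_ptr, expect_ver + 1 };  // bump version every update
    return atomic_compare_exchange_weak(&top, &expected, desired);
}
```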
🔬 High-Performance Architecture Concepts

x86-64 Architecture Essentials

| Concept | Description | Impact |
|---|---|---|
| Pipeline Stages | Fetch → Decode → Execute → Write-back | Stalls hurt throughput |
| Branch Prediction | CPU guesses the if-else outcome ahead of time | Misprediction ≈ 15-cycle penalty |
| Out-of-Order Exec | CPU reorders independent instructions | Better ILP (Instruction Level Parallelism) |
| SIMD (SSE/AVX) | Single instruction, multiple data | Process 8 floats simultaneously with AVX |
| NUMA | Non-Uniform Memory Access — each socket has local RAM | Remote RAM access 2–4× slower |
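To make the SIMD row concrete, a hedged sketch using AVX intrinsics (compile with -mavx; the function name is ours):

```
#include <immintrin.h>

// Adds two float arrays 8 elements at a time using 256-bit AVX registers.
// Assumes n is a multiple of 8; a real version would handle the remainder.
void add_avx(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);                // load 8 floats
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));  // 8 adds, 1 instruction
    }
}
```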

Avoiding Pipeline Stalls

Branch Avoidance — Branchless Code
```
// Branchy — hard to predict
int max_branchy(int a, int b) {
    if (a > b) return a;
    return b;
}

// Branchless bit trick — no pipeline flush:
// if a < b, -(a < b) is all 1s and the XOR selects b; otherwise it selects a
int max_bits(int a, int b) {
    return a ^ ((a ^ b) & -(a < b));
}

// Or simply — most compilers emit CMOV (conditional move), no branch:
int max_ternary(int a, int b) {
    return (a > b) ? a : b;
}
```

Thread-safe Functions & Bandwidth

  • Thread-safe functions can be called concurrently without data corruption. They either use no shared state (pure functions) or protect shared state with locks.
  • Memory Bandwidth is often the bottleneck for parallel programs (not CPU speed). Use the roofline model to determine whether your code is compute-bound or memory-bound.
  • Working with Cache: Keep hot data under ~32 KB (typical L1 data cache per core). Use tiling/blocking to reuse cached data — see the sketch after this list.
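A minimal sketch of tiling/blocking (our own example, a blocked matrix transpose): each B×B tile is read and written while it still sits in cache, instead of striding across all of memory.

```
#define B 32   // tile size chosen so one tile fits comfortably in L1

void transpose_blocked(int n, const double* src, double* dst) {
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            // work on one cache-resident B x B tile at a time
            for (int i = ii; i < ii + B && i < n; i++)
                for (int j = jj; j < jj + B && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```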
Key Takeaway: Write parallel programs that minimize synchronization, maximize cache reuse, ensure load balance across threads, and use the right granularity of parallelism (not too fine-grained or too coarse).