Introduction to Multi-core Architecture
Concurrency means multiple tasks making progress during overlapping time periods; on a multi-core CPU they can execute truly in parallel. As single-core clock speeds hit physical limits (~3–4 GHz, bounded by power and heat dissipation), the industry shifted to multiple cores to keep performance scaling.
Amdahl's Law bounds the achievable speedup: Speedup = 1 / ((1 − P) + P/N), where P = parallel fraction, N = number of processors.
Example with P = 0.8, N = 4: Speedup = 1 / [(0.2) + (0.8/4)] = 1 / 0.4 = 2.5× — not 4×! The serial 20% caps the gain no matter how many cores you add.
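A quick way to build intuition is to evaluate the formula directly; a minimal sketch (the function name amdahl is ours):
```
#include <stdio.h>

// Amdahl's Law: speedup = 1 / ((1 - P) + P / N)
double amdahl(double P, int N) {
    return 1.0 / ((1.0 - P) + P / (double)N);
}

int main(void) {
    printf("%.2fx\n", amdahl(0.8, 4));    // 2.50x, the example above
    printf("%.2fx\n", amdahl(0.8, 1000)); // ~4.98x: the serial 20% caps it near 5x
    return 0;
}
```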
Shared Memory (SMP): all CPUs share one memory space. Easy to program but limited scalability. Example: modern multi-core laptops/desktops.
Distributed Memory: each CPU has its own memory and communicates via message passing (e.g., MPI). Example: supercomputer clusters.
Hyper-Threading (Intel SMT)
One physical core appears as 2 logical cores to the OS. The core has duplicate registers/state for 2 threads, sharing execution units. When one thread stalls (e.g., cache miss), the other runs.
A thread is the smallest unit of CPU execution. Multiple threads within one process share the same code, heap, and global data — but each has its own stack and registers.
Thread Lifecycle (State Machine)
A thread moves through the states New → Ready → Running → Blocked/Waiting (on I/O or a lock) → back to Ready → Terminated.
Fundamental Concepts of Parallel Programming
Breaking a problem into parallel work is the first step. There are two main strategies:
Task (Functional) Decomposition: divide the work/functions among threads, with each thread performing a different operation. Example: one thread reads data, another processes it, another writes output (a pipeline).
Data Decomposition: the same operation on different chunks of data. Example: sorting a 1M-element array — divide it into 4 parts and let each thread sort its own part (as in parallel merge sort); see the sketch below.
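A minimal sketch of the index arithmetic behind data decomposition, assuming the caller supplies the array, its length, and this thread's id (all names here are ours):
```
#include <stdlib.h>

// Each thread applies the same operation (sort) to its own slice;
// the last thread absorbs the remainder when n % num_threads != 0.
void sort_my_chunk(int *arr, int n, int thread_id, int num_threads,
                   int (*cmp)(const void *, const void *)) {
    int chunk = n / num_threads;
    int lo = thread_id * chunk;
    int hi = (thread_id == num_threads - 1) ? n : lo + chunk;
    qsort(&arr[lo], hi - lo, sizeof(int), cmp); // same code, different data
}
```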
Challenges You'll Face
- Load Imbalance — some threads finish much sooner than others, wasting CPU time
- Synchronization Overhead — barriers, locks add latency
- False Sharing — two threads modify different variables that happen to share a cache line
- Race Conditions — non-deterministic output due to unsynchronized shared writes
A race condition occurs when the program output depends on the relative timing of threads. Classic example: two threads incrementing a shared counter.
```
int counter = 0;          // shared variable

// Thread A               // Thread B
read counter → 5          read counter → 5
add 1 → 6                 add 1 → 6
write counter ← 6         write counter ← 6

// Both incremented, but counter went 5→6, NOT 5→7!
```
Error Diffusion Algorithm (Parallel)
Used in image processing (e.g., dithering). Quantization error of one pixel is diffused to neighbouring pixels. In parallel, threads handling adjacent rows can conflict — a classic parallel programming challenge.
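For concreteness, a minimal sequential sketch using the standard Floyd–Steinberg weights (the source does not name a specific kernel, so this is illustrative), assuming a greyscale image normalized to [0,1]:
```
// Quantization error of pixel (x,y) is diffused to its right and
// lower neighbours with weights 7/16, 3/16, 5/16, 1/16 (Floyd–Steinberg).
void dither(float *img, int W, int H) {
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            float old_val = img[y*W + x];
            float q = old_val < 0.5f ? 0.0f : 1.0f; // quantize to 1 bit
            img[y*W + x] = q;
            float err = old_val - q;
            if (x + 1 < W)     img[y*W + (x+1)]     += err * 7/16.0f;
            if (y + 1 < H) {
                if (x > 0)     img[(y+1)*W + (x-1)] += err * 3/16.0f;
                               img[(y+1)*W + x]     += err * 5/16.0f;
                if (x + 1 < W) img[(y+1)*W + (x+1)] += err * 1/16.0f;
            }
        }
    }
}
```
Row y+1 reads values that row y is still writing, so a thread working on row y+1 must trail the thread on row y by at least a couple of columns; that dependency is exactly what makes naive row-parallelism race.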
Parallel Error Diffusion — Checkerboard Pattern
Threading & Parallel Programming Constructs
When threads share resources, they must coordinate access to prevent corruption. The key constructs are:
| Construct | Purpose | Example Use |
|---|---|---|
| Mutex / Lock | Only one thread enters critical section | Updating a shared counter |
| Semaphore | Controls access to N units of a resource | Connection pool of 5 DB connections |
| Condition Variable | Thread waits until a condition is true | Producer-consumer queue |
| Barrier | All threads wait until everyone reaches this point | Parallel phases (render, then composite) |
| Monitor | OOP-level sync — object with built-in lock | Java synchronized methods |
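To make the condition-variable row concrete, a minimal producer-consumer sketch with pthreads (queue_len stands in for a real queue):
```
#include <pthread.h>

pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
int queue_len = 0;                  // stand-in for a real queue

void producer_put(void) {
    pthread_mutex_lock(&m);
    queue_len++;                    // add an item
    pthread_cond_signal(&cv);       // wake one waiting consumer
    pthread_mutex_unlock(&m);
}

void consumer_get(void) {
    pthread_mutex_lock(&m);
    while (queue_len == 0)          // loop guards against spurious wakeups
        pthread_cond_wait(&cv, &m); // atomically releases m and sleeps
    queue_len--;                    // take an item
    pthread_mutex_unlock(&m);
}
```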
Deadlock — Coffman's 4 Conditions
All four must hold simultaneously for deadlock to occur:
- Mutual Exclusion: at least one resource is held in a non-shareable mode
- Hold and Wait: a thread holds one resource while waiting for another
- No Preemption: a resource cannot be forcibly taken from the thread holding it
- Circular Wait: a cycle of threads exists, each waiting for a resource the next one holds
Deadlock Prevention
- Lock Ordering: Always acquire locks in the same global order (e.g., always Mutex1 before Mutex2); see the sketch after this list
- Try-Lock with Timeout: If a lock can't be acquired within N ms, release the locks already held and retry
- Banker's Algorithm: The OS checks whether granting a request leaves the system in a safe state before granting it (deadlock avoidance)
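A minimal sketch of the lock-ordering rule with two pthread mutexes (m1, m2 and the function name are ours):
```
#include <pthread.h>

pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

// Every thread acquires m1 before m2, so a cycle of waiters
// (Coffman's circular wait) can never form.
void both_resources(void) {
    pthread_mutex_lock(&m1);
    pthread_mutex_lock(&m2);
    // ... critical section touching both resources ...
    pthread_mutex_unlock(&m2);
    pthread_mutex_unlock(&m1);
}
```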
Instead of shared memory, threads/processes can communicate by sending messages. This eliminates race conditions but requires explicit send/receive calls.
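A minimal MPI sketch of explicit send/receive (rank 0 sends one int to rank 1); this assumes an MPI installation and mpicc:
```
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, value;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD); // explicit send
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                        // explicit receive
        printf("Rank 1 received %d\n", value);              // no memory was shared
    }
    MPI_Finalize();
    return 0;
}
```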
POSIX Threads (pthreads):
```
#include <pthread.h>
#include <stdio.h>

void* worker(void* arg) {
    int id = *(int*)arg;
    printf("Thread %d running\n", id);
    return NULL;
}

int main() {
    pthread_t t1, t2;
    int a = 1, b = 2;
    pthread_create(&t1, NULL, worker, &a); // create thread
    pthread_create(&t2, NULL, worker, &b);
    pthread_join(t1, NULL);                // wait for t1 to finish
    pthread_join(t2, NULL);
    return 0;
}
```
The Win32 equivalent:
```
HANDLE h = CreateThread(
    NULL,           // default security
    0,              // default stack size
    ThreadFunc,     // function pointer
    (LPVOID)&args,  // argument
    0,              // run immediately
    &threadId       // receives thread ID
);
WaitForSingleObject(h, INFINITE); // wait for the thread to finish
CloseHandle(h);
```
OpenMP — A Portable Solution for Threading
OpenMP is a shared-memory parallelism API using compiler directives (#pragma). It's portable — same code runs on any OpenMP-supporting compiler (GCC, Intel, MSVC).
```
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    result[i] = a[i] + b[i]; // each iteration → different thread
}

// Reduction — safely sum across threads
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
    sum += data[i];
}

// Control the number of threads (needs <omp.h>)
omp_set_num_threads(4);
// Note: omp_get_num_threads() returns 1 in serial code, so query
// the limit with omp_get_max_threads() instead:
printf("Using up to %d threads\n", omp_get_max_threads());
```
Work-sharing Constructs
| Directive | What it does |
|---|---|
| #pragma omp parallel for | Splits loop iterations across threads |
| #pragma omp sections | Different code blocks run in parallel |
| #pragma omp single | Only one thread executes this block |
| #pragma omp barrier | All threads synchronize at this point |
| #pragma omp critical | Only one thread executes at a time (mutex) |
| #pragma omp atomic | Atomic read-modify-write on a single variable |
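A small sketch exercising three of these directives together (stage_a/stage_b are placeholder work functions):
```
#include <stdio.h>

void stage_a(void) { puts("loading input");  } // placeholder work
void stage_b(void) { puts("building index"); } // placeholder work

int main(void) {
    int log_count = 0;
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            stage_a();               // one thread runs this block
            #pragma omp section
            stage_b();               // another thread runs this one
        }                            // implicit barrier after sections

        #pragma omp single
        printf("setup done once\n"); // exactly one thread executes this

        #pragma omp critical
        log_count++;                 // every thread runs this, one at a time
    }
    printf("log_count = %d\n", log_count); // equals the number of threads
    return 0;
}
```
Compile with gcc -fopenmp; without the flag the pragmas are ignored and the code runs serially.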
Avoiding False Sharing
```
// BAD: partial sums share a cache line
double partial_sum[4];   // thread i writes partial_sum[i]

// GOOD: pad to separate cache lines (64 bytes = 8 doubles)
struct { double val; double pad[7]; } partial_sum[4];
```
The #pragma omp task directive creates a task — a unit of work that can be executed by any thread in the team. Useful for irregular parallelism (e.g., tree traversal).
```
int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait   // wait for both child tasks
    return x + y;
}

// Tasks need an enclosing parallel region; typically one thread
// creates the root task and the whole team executes the children:
//     #pragma omp parallel
//     #pragma omp single
//     result = fib(30);
```
OpenMP Environment Variables
| Variable | Purpose | Example |
|---|---|---|
| OMP_NUM_THREADS | Set thread count | export OMP_NUM_THREADS=8 |
| OMP_SCHEDULE | Loop scheduling type | OMP_SCHEDULE="dynamic,4" |
| OMP_NESTED | Enable nested parallelism | OMP_NESTED=TRUE |
| OMP_STACKSIZE | Per-thread stack size | OMP_STACKSIZE=64M |
Compilation & Debugging
```
# GCC
gcc -fopenmp program.c -o program

# Intel ICC
icc -qopenmp program.c -o program

# Run with Valgrind DRD for race detection
valgrind --tool=drd ./program

# Intel Inspector for race conditions
inspxe-cl -collect=ti3 -- ./program
```
Solutions to Common Parallel Programming Problems
Too Many Threads: When threads far outnumber cores, context-switching overhead dominates. Rule of thumb: N_threads = N_cores × (1 + wait_time/compute_time); for CPU-bound work, threads ≈ cores.
Data Races: Two threads access the same memory, at least one writes, and there is no synchronization. Detection tools: ThreadSanitizer (TSan), Helgrind, Intel Inspector.
Livelock: Like deadlock, but the threads keep changing state (they are "active") yet make no progress. Example: two people politely stepping aside for each other in a hallway, each mirroring the other, so neither ever gets past.
Priority Inversion: A low-priority thread holds a lock needed by a high-priority thread, which is therefore blocked. Solution: Priority Inheritance — temporarily boost the low-priority thread's priority.
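POSIX exposes priority inheritance as a mutex protocol; a minimal sketch (availability depends on the platform's realtime-threads support):
```
#include <pthread.h>

pthread_mutex_t lock;

// A mutex whose low-priority holder is temporarily boosted to the
// priority of the highest-priority thread waiting on it.
void init_pi_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&lock, &attr);
    pthread_mutexattr_destroy(&attr);
}
```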
Solutions for Heavily Contended Locks
- Lock Striping: Instead of one lock for a HashMap, use 16 locks — each guards 1/16 of the buckets (Java ConcurrentHashMap uses this)
- Non-blocking Algorithms (CAS): Use atomic Compare-And-Swap to update without locks
- Reader-Writer Locks: Multiple readers allowed simultaneously; writers get exclusive access
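A minimal reader-writer sketch with pthreads (shared_config and the function names are ours):
```
#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int shared_config;                // the contended data

int read_config(void) {
    pthread_rwlock_rdlock(&rw);   // many readers may hold this at once
    int v = shared_config;
    pthread_rwlock_unlock(&rw);
    return v;
}

void write_config(int v) {
    pthread_rwlock_wrlock(&rw);   // a writer waits for exclusive access
    shared_config = v;
    pthread_rwlock_unlock(&rw);
}
```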
```
#include <stdatomic.h>

atomic_int counter = 0;

void increment(void) {
    // No mutex needed! The hardware guarantees atomicity
    atomic_fetch_add(&counter, 1);
}

// Manual CAS pattern
void increment_cas(void) {
    int old_val = atomic_load(&counter);
    int new_val;
    do {
        new_val = old_val + 1;
        // On failure, atomic_compare_exchange_weak reloads old_val for us
    } while (!atomic_compare_exchange_weak(&counter, &old_val, new_val));
}
```
Memory Consistency Models
Different architectures allow CPUs to reorder memory operations for performance. Programmers must insert memory barriers (fences) or use acquire/release atomics to enforce ordering where it matters, as sketched below.
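A minimal C11 sketch of the classic message-passing idiom (payload and ready are illustrative names):
```
#include <stdatomic.h>

int payload;                 // ordinary, non-atomic data
atomic_int ready = 0;        // synchronization flag

void produce(void) {
    payload = 42;
    // Release: writes above may not be reordered past this store
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consume(void) {
    // Acquire: reads below may not be reordered before this load
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                    // spin until the flag is set
    return payload;          // guaranteed to observe 42
}
```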
Cache-friendly Programming Tips
- Row-major access: In C, 2D arrays are stored row-by-row. Always iterate rows in outer loop.
- Struct of Arrays vs Array of Structs: For parallel access, SoA is more cache-friendly (SIMD-friendly); see the second sketch below.
- Prefetching: Use __builtin_prefetch(addr, rw, locality) in GCC to hint the CPU to load data before it is needed.
```
// BAD — column-major: cache thrashes each row
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        sum += matrix[i][j]; // jumps N*4 bytes each step!

// GOOD — row-major: sequential cache line access
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        sum += matrix[i][j]; // sequential 4-byte steps ✓
```
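And the SoA-vs-AoS layout from the second bullet above (sizes are illustrative):
```
// AoS: x, y, z interleaved; summing all x strides through 12-byte records
struct PointAoS { float x, y, z; };
struct PointAoS aos[1024];

// SoA: each field contiguous; sequential, SIMD-friendly loads
struct PointsSoA { float x[1024], y[1024], z[1024]; };
struct PointsSoA soa;
```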
Non-blocking Algorithms & ABA Problem
In a lock-free algorithm, a CAS can succeed even though the structure changed, because the value went A → B → A between the read and the CAS (e.g., a node freed and reallocated at the same address in a lock-free stack). Common remedies: version counters (tagged pointers) or hazard pointers.
x86-64 Architecture Essentials
| Concept | Description | Impact |
|---|---|---|
| Pipeline Stages | Fetch → Decode → Execute → Write-back | Stalls hurt throughput |
| Branch Prediction | CPU guesses if-else outcome ahead of time | Misprediction = ~15 cycle penalty |
| Out-of-Order Exec | CPU reorders independent instructions | Better ILP (Instruction Level Parallelism) |
| SIMD (SSE/AVX) | Single instruction, multiple data | Process 8 floats simultaneously with AVX |
| NUMA | Non-Uniform Memory Access — each socket has local RAM | Remote RAM access 2–4× slower |
Avoiding Pipeline Stalls
```
// Branchy — hard to predict
int max_val;
if (a > b) max_val = a;
else       max_val = b;

// Branchless — no pipeline flush
int max_val = a ^ ((a ^ b) & -(a < b));

// or simply:
int max_val = (a > b) ? a : b;
// Most compilers emit CMOV (conditional move) — no branch!
```
Thread-safe Functions & Bandwidth
- Thread-safe functions can be called concurrently without data corruption. They either use no shared state (pure functions) or protect shared state with locks.
- Memory Bandwidth is often the bottleneck for parallel programs (not CPU speed). Use roofline model to determine if your code is compute-bound or memory-bound.
- Working with Cache: Keep hot data under ~32 KB (typical L1d per core). Use tiling/blocking to reuse cached data, as sketched below.
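A minimal tiling sketch for matrix multiply (the tile size B = 32 is an assumption; pick it so the working set fits in L1; C must be zeroed by the caller):
```
#define B 32  // tile size: 3 tiles of 32x32 doubles ≈ 24 KB, fits a 32 KB L1

// C += A * Bm, all n x n, row-major. Each B x B tile is reused
// many times while it is still resident in cache.
void matmul_tiled(int n, const double *A, const double *Bm, double *C) {
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; i++)
                    for (int k = kk; k < kk + B && k < n; k++) {
                        double a = A[i*n + k];
                        for (int j = jj; j < jj + B && j < n; j++)
                            C[i*n + j] += a * Bm[k*n + j];
                    }
}
```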