Fortran CUDA Parallel Programming

Notes on problems encountered in Fortran CUDA programming and how they were implemented

Question

Compute Unified Device Architecture
GPU
- SM: Streaming Multiprocessor
- SM = Compute Units + Registers + L1 + SHM
  - Note: L1 and SHM (shared memory) share the same on-chip cache
CUDA
- Kernel 1 (host) => Grid 1 (Device)
- Grid 1 => multiple blocks => multiple threads
- gridDim blockId threadId
- 4096 16 256
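
The grid => block => thread hierarchy above can be sketched with a small host-side Python model (no GPU involved). It assumes the figures 4096, 16 and 256 mean 4096 threads in total, split into 16 blocks of 256 threads each; that reading, and the helper name `global_index`, are assumptions. The indices are 1-based, as CUDA Fortran exposes them through `blockidx%x`, `blockdim%x` and `threadidx%x`.

```python
# Host-side model of a 1-D CUDA launch; nothing here runs on a GPU.
BLOCKS = 16              # assumed number of blocks in the grid
THREADS_PER_BLOCK = 256  # assumed number of threads per block


def global_index(block_idx, thread_idx, block_dim=THREADS_PER_BLOCK):
    """1-based global index, mirroring the CUDA Fortran idiom
    i = (blockidx%x - 1) * blockdim%x + threadidx%x."""
    return (block_idx - 1) * block_dim + thread_idx


# Enumerate every (block, thread) pair the launch would create.
indices = [
    global_index(b, t)
    for b in range(1, BLOCKS + 1)
    for t in range(1, THREADS_PER_BLOCK + 1)
]

print(len(indices))             # 4096
print(indices[0], indices[-1])  # 1 4096
```

Each (block, thread) pair maps to exactly one array element, which is why a kernel usually computes this index first and then guards it against the array length.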
GPU
- Device => SM => Core
- Memory => L1/SHM => Register
Note: multiple blocks can be mapped to one SM, and different threads are likewise mapped to a core; scheduling is handled automatically at run time
SIMT: Single Instruction Multiple Threads
- Run-time divergence: threads in the same warp that take different branches are executed one branch at a time
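
A minimal Python sketch of why divergence costs time under SIMT. It models one warp of 32 threads executing an `if/else`; the function name `run_warp` and the branch bodies are hypothetical, chosen only for illustration. The warp makes one pass per branch that at least one lane takes, with the other lanes masked off.

```python
WARP_SIZE = 32  # threads that share one instruction stream on NVIDIA GPUs


def run_warp(values):
    """Model `if x is even: y = x / 2, else: y = 3 * x + 1` for one warp.

    Returns the per-lane results and the number of branch passes executed.
    """
    take_then = [v % 2 == 0 for v in values]  # per-lane predicate mask
    results = [None] * len(values)
    passes = 0

    if any(take_then):          # pass 1: only the 'then' lanes are active
        passes += 1
        for lane, v in enumerate(values):
            if take_then[lane]:
                results[lane] = v // 2

    if not all(take_then):      # pass 2: only the 'else' lanes are active
        passes += 1
        for lane, v in enumerate(values):
            if not take_then[lane]:
                results[lane] = 3 * v + 1

    return results, passes


_, uniform = run_warp([2 * i for i in range(WARP_SIZE)])  # all lanes even
_, diverged = run_warp(list(range(WARP_SIZE)))            # even and odd mixed
print(uniform, diverged)  # 1 2
```

When every lane agrees on the branch, the warp pays for one pass; when lanes disagree, it pays for both.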
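Cache management
- memory coalescing: threads should access contiguous memory addresses
- bank conflicts: different threads accessing different addresses in the same shared memory bank conflict and are serialized, whereas threads reading the same address are served by a broadcast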
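
Both effects can be estimated on the host with a rough Python model. The constants below (32 banks of 4-byte words, 128-byte global memory segments) are assumptions that match common NVIDIA hardware but should be checked against the target device; the function names are hypothetical.

```python
from collections import defaultdict

WARP_SIZE = 32       # threads per warp
NUM_BANKS = 32       # assumed number of shared memory banks
SEGMENT_BYTES = 128  # assumed size of one global memory transaction


def global_transactions(stride, elem_bytes=4):
    """Count the distinct segments touched when lane i reads element i * stride."""
    segments = {
        (lane * stride * elem_bytes) // SEGMENT_BYTES
        for lane in range(WARP_SIZE)
    }
    return len(segments)


def bank_conflict_degree(stride):
    """Largest number of distinct 4-byte words that fall into a single bank
    when lane i reads shared memory word i * stride."""
    words_per_bank = defaultdict(set)
    for lane in range(WARP_SIZE):
        word = lane * stride
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


print(global_transactions(1), global_transactions(32))  # 1 32
print(bank_conflict_degree(1), bank_conflict_degree(2),
      bank_conflict_degree(32))                          # 1 2 32
print(bank_conflict_degree(0))   # 1: every lane reads the same word (broadcast)
print(bank_conflict_degree(33))  # 1: padding a row of 32 words to 33 avoids conflicts
```

A stride of 1 gives one global transaction and no bank conflict. A stride of 32 is the worst case for both in this model, which is why shared memory tiles are often padded by one element per row.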

CPU
- Core - L1 Cache/Control - L2 Cache - L3 Cache - DRAM
GPU
- Multi-Core - L1 Cache/Control - L2 Cache - DRAM
- Use parallelization to hide latency
Heterogeneous Interaction
- Blocking