The components of high-performance systems continue to become more complex on the road to exascale. This complexity is exposed at the level of: multi/many-core CPUs, accelerators (GPUs), interconnects (horizontal communication), and memory hierarchies (vertical communication). A crucial task is designing an algorithm and a programming model that scale to the same order of the HPC system size at multiple levels. This trend in HPC architecture more critically affects memory-intensive appli- cations than compute-bound applications. Accomplishing this task involves adopting less synchronous forms of the mathematical algorithm, reducing synchronization in the computational implementation, introducing more SIMT-style concurrency at the finest level of system hierarchy, and increasing arithmetic intensity as the bottleneck shifts from number of floating-point operations to number of memory accesses.
This dissertation addresses these challenges in scientific simulation focusing in the dominant kernels of a memory-bound application: sparse solvers in implicit model- ing, and I/O in explicit reverse time migration in seismic imaging. We introduce asynchronous task-based parallelism into iterative algebraic preconditioners. We also introduce a task-based framework that hides the latency of I/O with computation. This dissertation targets two main applications in the oil and gas industry: reservoir simulation and seismic imaging simulation. It presents results on multi- and many- core systems and GPUs on four Top500 supercomputers: Summit, TSUBAME 3.0, Shaheen II, and Makman-2. We introduce an asynchronous implementation of four major memory-bound kernels: Algebraic multigrid (MPI+OmpSs), tridiagonal solve
(MPI+OpenMP), Additive Schwarz Preconditioned Inexact Newton (MPI+MPI), and Reverse Time Migration (StarPU/StarPU+MPI and CUDA).
|Date of Award||Aug 26 2019|
|Original language||English (US)|
- Computer, Electrical and Mathematical Science and Engineering
|Supervisor||David Keyes (Supervisor)|
- Asynchronous Algorithms
- Task-based runtimes
- MPI+X approach
- Task-based RTM
- Asynchronous AMG