High-end multicore systems with GPU-based accelerators are now ubiquitous in the hardware landscape. Besides dealing with the nontrivial heterogeneous environ- ment, end users should often take into consideration the underlying memory architec- ture to decrease the overhead of data motion, especially when running on non-uniform memory access (NUMA) platforms. We propose the OmpSs parallel programming model approach using its Nanos++ dynamic runtime system to solve the two challeng- ing problems aforementioned, through 1) an innovative NUMA node-aware scheduling policy to reduce data movement between NUMA nodes and 2) a nested parallelism feature to concurrently exploit the resources available from the GPU devices as well as the CPU host, without compromising the overall performance. Our approach fea- tures separation of concerns by abstracting the complexity of the hardware from the end users so that high productivity can be achieved. The Cholesky factorization is used as a benchmark representative of dense numerical linear algebra algorithms. Superior performance is also demonstrated on the symmetric matrix inversion based on Cholesky factorization, commonly used in co-variance computations in statistics. Performance on a NUMA system with Kepler-based GPUs exceeds that of existing implementations, while the OmpSs-enabled code remains very similar to its original sequential version.
|Date of Award||Jul 2013|
|Original language||English (US)|
- Computer, Electrical and Mathematical Science and Engineering
|Supervisor||David Keyes (Supervisor)|
- HIgh Productivity
- Dense Linear Algebra