Tile Low-Rank GEMM Using Batched Operations on GPUs

Ali Charara*, David Keyes, Hatem Ltaief

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

Dense General Matrix-Matrix (GEMM) multiplication is a core operation of the Basic Linear Algebra Subprograms (BLAS) library and therefore often resides at the bottom of the traditional software stack for many scientific applications. In fact, chip manufacturers give special attention to the GEMM kernel implementation, since this is exactly where most high-performance software libraries extract hardware performance. With the emergence of big data applications involving large data-sparse, hierarchically low-rank matrices, the off-diagonal tiles can be compressed to reduce the algorithmic complexity and the memory footprint. The resulting tile low-rank (TLR) data format is composed of small data structures, which retain the most significant information for each tile. However, to operate on low-rank tiles, a new GEMM operation and its corresponding API have to be designed on GPUs so that the data sparsity structure of the matrix can be exploited while leveraging the underlying TLR compression format. The main idea consists of aggregating all operations into a single kernel launch to compensate for their low arithmetic intensities and to mitigate the data transfer overhead on GPUs. The new TLR-GEMM kernel outperforms the cuBLAS dense batched GEMM by more than an order of magnitude and creates new opportunities for advanced TLR algorithms.
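To make the idea of operating on compressed tiles concrete, here is a minimal sketch in plain Python. It is not the KBLAS implementation or its API. It assumes each tile is stored as two thin factors, A = Ua Va^T and B = Ub Vb^T, and every function name is hypothetical. The product of two such tiles can be kept in factored form by multiplying only the small inner core Va^T Ub, and a list of tile pairs stands in for the batch that a GPU would process in a single launch.

```python
# Illustrative sketch only: names and data layout are assumptions,
# not the KBLAS or cuBLAS API. Matrices are lists of rows.

def matmul(X, Y):
    """Dense product of X (m x k) and Y (k x n)."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]

def transpose(X):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

def lowrank_tile_product(Ua, Va, Ub, Vb):
    """Multiply A = Ua Va^T by B = Ub Vb^T without forming A or B.

    Ua, Va are nb x ka and Ub, Vb are nb x kb.
    Returns factors (Uc, Vc) such that A B = Uc Vc^T.
    """
    core = matmul(transpose(Va), Ub)   # small ka x kb matrix
    Uc = matmul(Ua, core)              # nb x kb
    return Uc, Vb

def batched_lowrank_gemm(batch):
    """Apply the tile product to every (Ua, Va, Ub, Vb) tuple in the batch.

    The Python loop stands in for one batched call that would handle
    all tiles together on a GPU.
    """
    return [lowrank_tile_product(*tile) for tile in batch]

def dense_flops(nb):
    """Flop count of a dense nb x nb GEMM."""
    return 2 * nb ** 3

def lowrank_flops(nb, ka, kb):
    """Flop count of the two small products in lowrank_tile_product."""
    return 2 * ka * nb * kb + 2 * nb * ka * kb

# Example: two rank-1 tiles of size 4 x 4.
Ua = [[1], [2], [3], [4]]
Va = [[1], [0], [1], [0]]
Ub = [[2], [1], [0], [1]]
Vb = [[1], [1], [1], [1]]
(Uc, Vc), = batched_lowrank_gemm([(Ua, Va, Ub, Vb)])
```

Under these assumptions the factored product costs 4·nb·ka·kb flops against 2·nb³ for the dense tile, so each individual operation is small. That is consistent with the abstract's motivation for aggregating many of them into a single launch.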

Original language: English (US)
Title of host publication: Euro-Par 2018
Subtitle of host publication: Parallel Processing - 24th International Conference on Parallel and Distributed Computing, Proceedings
Editors: Massimo Torquati, Marco Aldinucci, Luca Padovani
Publisher: Springer-Verlag
Pages: 811-825
Number of pages: 15
ISBN (Print): 9783319969824
State: Published - Jan 1 2018
Event: 24th International European Conference on Parallel and Distributed Computing, Euro-Par 2018 - Turin, Italy
Duration: Aug 27 2018 to Aug 31 2018

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 11014 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 24th International European Conference on Parallel and Distributed Computing, Euro-Par 2018
Country: Italy
City: Turin
Period: 08/27/18 to 08/31/18

Keywords

  • GPU Computing
  • Hierarchical low-rank matrix computations
  • High performance computing
  • KBLAS
  • Matrix multiplication - GEMM

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
