Performance scaling variability and energy analysis for a resilient ULFM-based PDE solver

K. Morris, F. Rizzi, B. Cook, P. Mycek, O. LeMaitre, Omar Knio, K. Sargsyan, K. Dahlgren, B. J. Debusschere

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

Abstract

We present a resilient task-based domain-decomposition preconditioner for partial differential equations (PDEs) built on top of User Level Fault Mitigation Message Passing Interface (ULFM-MPI). The algorithm reformulates the PDE as a sampling problem, followed by a robust regression-based solution update that is resilient to silent data corruptions (SDCs). We adopt a server-client model where all state information is held by the servers, while clients only serve as computational units. The task-based nature of the algorithm and the capabilities of ULFM complement each other to support missing tasks, making the application resilient to clients failing. We present weak and strong scaling results on Edison, National Energy Research Scientific Computing Center (NERSC), for a nominal and a fault-injected case, showing that even in the presence of faults, scalability tested up to 50k cores is within 90%. We then quantify the variability of weak and strong scaling due to the presence of faults. Finally, we discuss the performance of our application with respect to subdomain size, server/client configuration, and the interplay between energy and resilience.

Original language: English (US)
Title of host publication: Proceedings of ScalA 2016
Subtitle of host publication: 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 41-48
Number of pages: 8
ISBN (Electronic): 9781509052226
State: Published - Jan 30 2017
Event: 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2016 - Salt Lake City, United States
Duration: Nov 13 2016 to Nov 18 2016

Publication series

Name: Proceedings of ScalA 2016: 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems - Held in conjunction with SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis

Other

Other: 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA 2016
Country: United States
City: Salt Lake City
Period: 11/13/16 to 11/18/16

Keywords

  • Client-server systems
  • Dynamic voltage scaling
  • Fault tolerance
  • Partial differential equations
  • Resilience

ASJC Scopus subject areas

  • Computer Science Applications
  • Numerical Analysis
  • Software
  • Computational Mathematics
