Partial differential equations preconditioner resilient to soft and hard faults

Francesco Rizzi, Karla Morris, Khachik Sargsyan, Paul Mycek, Cosmin Safta, Olivier Le Maitre, Omar Knio, Bert Debusschere

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

We present a domain-decomposition-based preconditioner for the solution of partial differential equations (PDEs) that is resilient to both soft and hard faults. The algorithm proceeds in four steps: first, the computational domain is split into overlapping subdomains; second, the target PDE is solved on each subdomain for sampled values of the current local boundary conditions; third, the subdomain solution samples are collected and fed into a regression step to build maps between the subdomains' boundary conditions; finally, the intersection of these maps yields the updated state at the subdomain boundaries. This reformulation allows us to recast the problem as a set of independent tasks. The implementation relies on an asynchronous server-client framework, in which one or more reliable servers hold the data while the clients request tasks and execute them. This framework provides resiliency to hard faults: if a client crashes, it stops asking for work, and the servers simply distribute the work among the clients that are still alive. Erroneous subdomain solves (e.g., due to soft faults) appear as corrupted data, which is either rejected if it causes a task to fail or seamlessly filtered out during the regression stage through a suitable noise model. Three types of faults are modeled: hard faults, in which nodes (or clients) crash; soft faults occurring during the communication of tasks between servers and clients; and soft faults occurring during task execution. We demonstrate the resiliency of the approach for a 2D elliptic PDE and explore the effect of faults at various failure rates.
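The four steps in the abstract can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: it uses a 1D problem (-u'' = 1 on [0, 1] with zero Dirichlet conditions) instead of the 2D elliptic PDE, two overlapping subdomains, a sequential loop instead of the asynchronous server-client framework, and a Theil-Sen (median of pairwise slopes) fit standing in for the paper's regression with a noise model. All function names, grid sizes, and the injected fault pattern are hypothetical.

```python
import numpy as np

def solve_subdomain(n_pts, h, left, right):
    """Finite-difference solve of -u'' = 1 on n_pts nodes with Dirichlet values."""
    m = n_pts - 2  # number of interior nodes
    A = (2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / h**2
    rhs = np.ones(m)
    rhs[0] += left / h**2    # fold the boundary values into the right-hand side
    rhs[-1] += right / h**2
    return np.concatenate(([left], np.linalg.solve(A, rhs), [right]))

def theil_sen(x, y):
    """Robust affine fit y ~ a + b*x (assumed stand-in for the paper's noise model)."""
    i, j = np.triu_indices(len(x), k=1)          # all sample pairs i < j
    slope = np.median((y[j] - y[i]) / (x[j] - x[i]))
    intercept = np.median(y - slope * x)
    return intercept, slope

def boundary_maps(n=41, lo=16, hi=24, n_samples=20, faulty=(3, 11, 17)):
    """Steps 1-3: split, sample subdomain solves, regress boundary-to-boundary maps."""
    h = 1.0 / (n - 1)
    g = np.linspace(-1.0, 1.0, n_samples)        # sampled boundary values
    # Subdomain 1 = nodes 0..hi: sample its right boundary value, record u at node lo.
    y1 = np.array([solve_subdomain(hi + 1, h, 0.0, s)[lo] for s in g])
    # Subdomain 2 = nodes lo..n-1: sample its left boundary value, record u at node hi.
    y2 = np.array([solve_subdomain(n - lo, h, s, 0.0)[hi - lo] for s in g])
    # Hypothetical soft faults: a few subdomain solves return corrupted data.
    y1[list(faulty)] += 100.0
    y2[list(faulty)] -= 100.0
    return theil_sen(g, y1), theil_sen(g, y2)

def intersect(map1, map2):
    """Step 4: solve u_lo = a1 + b1*u_hi and u_hi = a2 + b2*u_lo simultaneously."""
    (a1, b1), (a2, b2) = map1, map2
    M = np.array([[1.0, -b1], [-b2, 1.0]])
    return np.linalg.solve(M, np.array([a1, a2]))

u_lo, u_hi = intersect(*boundary_maps())
print(u_lo, u_hi)
```

For this toy problem the exact solution is u(x) = x(1 - x)/2, so both interface values (at x = 0.4 and x = 0.6) should come out close to 0.12 even though three of the twenty samples per map are corrupted. Because the PDE is linear, the boundary maps are affine and a single intersection recovers the interface state; a nonlinear problem would presumably need richer maps and repeated updates.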

Original language: English (US)
Title of host publication: Proceedings - 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 552-562
Number of pages: 11
Volume: 2015-October
ISBN (Electronic): 9781467365987
State: Published - Oct 26 2015
Externally published: Yes
Event: IEEE International Conference on Cluster Computing, CLUSTER 2015 - Chicago, United States
Duration: Sep 8 2015 - Sep 11 2015

Other

Other: IEEE International Conference on Cluster Computing, CLUSTER 2015
Country: United States
City: Chicago
Period: 09/8/15 - 09/11/15

Keywords

  • Client-server systems
  • Distributed computing
  • Fault tolerance
  • Fault tolerant systems
  • High performance computing
  • Message passing
  • Parallel algorithms
  • Parallel programming
  • Partial differential equations
  • Resilience
  • Scientific computing
  • Software engineering
  • Supercomputers

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing
