Scalable Discovery and Analytics on Web Linked Data

  • Ibrahim Abdelaziz

Student thesis: Doctoral Thesis

Abstract

The Resource Description Framework (RDF) provides a simple way to express facts across the web, leading to Web linked data. Several distributed and federated RDF systems have emerged to handle the massive amounts of RDF data now available. Distributed systems are optimized to query massive datasets that appear as a single graph, while federated systems are designed to query hundreds of decentralized and interlinked graphs. This thesis starts with a comprehensive experimental study of state-of-the-art RDF systems. The study identifies a set of research problems for improving the state of the art: supporting the emerging RDF analytics required by many modern applications, querying linked data at scale, and enabling discovery on linked data. Addressing these problems is the focus of this thesis.
First, we propose Spartex, a versatile framework for complex RDF analytics. Spartex extends SPARQL to seamlessly combine generic graph algorithms with SPARQL queries. It implements a generic SPARQL operator as a vertex-centric program that interprets SPARQL queries and executes them efficiently using a built-in optimizer. We demonstrate that Spartex scales to datasets with billions of edges and is at least as fast as state-of-the-art specialized RDF engines. For analytical tasks, Spartex is an order of magnitude faster than existing alternatives.
To address the scalability limitation of federated RDF engines, we propose Lusail, a scalable system for querying geo-distributed RDF graphs. Lusail follows a two-tier strategy: (i) locality-aware decomposition of the query into subqueries to maximize the computation at the endpoints and minimize intermediate results, and (ii) selectivity-aware execution to reduce network latency and increase parallelism. Our experiments on billions of triples show that Lusail outperforms existing systems by orders of magnitude in scalability and response time.
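
The two-tier idea can be illustrated with a deliberately simplified sketch. This is not Lusail's actual algorithm: the endpoint names, the predicate-to-endpoint map, and the row estimates below are hypothetical, and the sketch assumes each predicate is served by exactly one endpoint. It only shows the general shape of grouping triple patterns into per-endpoint subqueries and then ordering those subqueries by estimated selectivity.

```python
from collections import defaultdict

# Hypothetical map: which endpoint can answer which predicate.
ENDPOINT_PREDICATES = {
    "endpoint_a": {"ex:advisor", "ex:worksAt"},
    "endpoint_b": {"ex:locatedIn"},
}


def decompose(triple_patterns, endpoint_predicates):
    """Group triple patterns into one subquery per endpoint.

    Simplifying assumption: every predicate is served by exactly one
    endpoint, so patterns sharing an endpoint can be evaluated there
    together instead of being joined at the federation layer.
    """
    subqueries = defaultdict(list)
    for s, p, o in triple_patterns:
        owners = [e for e, preds in endpoint_predicates.items() if p in preds]
        if len(owners) != 1:
            raise ValueError(f"predicate {p!r} has {len(owners)} owners")
        subqueries[owners[0]].append((s, p, o))
    return dict(subqueries)


def order_by_selectivity(subqueries, estimated_rows):
    """Return endpoint names, most selective (fewest estimated rows) first."""
    return sorted(subqueries, key=lambda endpoint: estimated_rows[endpoint])


query = [
    ("?student", "ex:advisor", "?prof"),
    ("?prof", "ex:worksAt", "?univ"),
    ("?univ", "ex:locatedIn", "?city"),
]
plan = decompose(query, ENDPOINT_PREDICATES)
# Hypothetical cardinality estimates per subquery.
order = order_by_selectivity(plan, {"endpoint_a": 5000, "endpoint_b": 40})
```

In this toy plan, the two patterns answerable by `endpoint_a` are shipped as a single subquery, and the smaller `endpoint_b` subquery is scheduled first so that its bindings could be used to narrow the larger one.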
Finally, enabling discovery on linked data is challenging due to the prior knowledge required to formulate SPARQL queries. To address this challenge, we develop novel techniques to (i) predict semantically equivalent SPARQL queries from a set of keywords by leveraging word embeddings, and (ii) generate fine-grained, non-blocking query plans that return early results quickly.
Date of Award: Jul 2018
Original language: English (US)
Awarding Institution
  • Computer, Electrical and Mathematical Science and Engineering
Supervisor: Panos Kalnis

Keywords

  • Graph Analytics
  • Linked Data
  • RDF
  • SPARQL
  • Data Discovery
