Fast crawling methods of exploring content distributed over large graphs

Pinghui Wang, Junzhou Zhao, John C. S. Lui, Don Towsley, Xiaohong Guan

Research output: Contribution to journalArticlepeer-review

1 Scopus citations

Abstract

Despite recent effort to estimate topology characteristics of large graphs (e.g., online social networks and peer-to-peer networks), little attention has been given to develop a formal crawling methodology to characterize the vast amount of content distributed over these networks. Due to the large-scale nature of these networks and a limited query rate imposed by network service providers, exhaustively crawling and enumerating content maintained by each vertex is computationally prohibitive. In this paper, we show how one can obtain content properties by crawling only a small fraction of vertices and collecting their content. We first show that when sampling is naively applied, this can produce a huge bias in content statistics (i.e., average number of content replicas). To remove this bias, one may use maximum likelihood estimation to estimate content characteristics. However, our experimental results show that this straightforward method requires to sample most vertices to obtain accurate estimates. To address this challenge, we propose two efficient estimators: special copy estimator (SCE) and weighted copy estimator (WCE) to estimate content characteristics using available information in sampled content. SCE uses the special content copy indicator to compute the estimate, while WCE derives the estimate based on meta-information in sampled vertices. We conduct experiments on a variety of real-word and synthetic datasets, and the results show that WCE and SCE are cost effective and also “asymptotically unbiased”. Our methodology provides a new tool for researchers to efficiently query content distributed in large-scale networks.
Original languageEnglish (US)
Pages (from-to)67-92
Number of pages26
JournalKnowledge and Information Systems
Volume59
Issue number1
DOIs
StatePublished - Mar 15 2018

Fingerprint Dive into the research topics of 'Fast crawling methods of exploring content distributed over large graphs'. Together they form a unique fingerprint.

Cite this