No free lunch

Ferhan Ture, Tamer Elsayed, Jimmy Lin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

30 Scopus citations

Abstract

This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multilingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as "no free lunch": there is no single optimal solution. Instead, we characterize effectiveness-efficiency tradeoffs in the solution space, which can guide the developer to locate a desirable operating point based on application- and resource-specific constraints.
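The sort-based sliding window idea mentioned in the abstract can be illustrated with a small single-machine sketch. This is not the authors' MapReduce implementation. It assumes the document vectors have already been projected into a common feature space (the CLIR step), uses random-hyperplane bit signatures as one common LSH scheme, and every function name and parameter (`make_hyperplanes`, `window`, `max_distance`, the `(language, number)` document ids) is hypothetical, chosen only for illustration.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions at which two bit signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def sliding_window_pairs(signatures, num_permutations, window, max_distance, seed=0):
    """Find cross-lingual candidate pairs with a sort-based sliding window.

    signatures maps a document id of the form (language, number) to a bit tuple.
    For each random bit permutation, the permuted signatures are sorted and each
    entry is compared with the next `window` entries in sorted order. A pair is
    kept when its Hamming distance on the full signature is <= max_distance.
    """
    rng = random.Random(seed)
    num_bits = len(next(iter(signatures.values())))
    found = {}
    for _ in range(num_permutations):
        order = list(range(num_bits))
        rng.shuffle(order)
        permuted = sorted(
            (tuple(sig[i] for i in order), doc_id)
            for doc_id, sig in signatures.items()
        )
        for i, (_, id_a) in enumerate(permuted):
            for _, id_b in permuted[i + 1 : i + 1 + window]:
                if id_a[0] == id_b[0]:
                    continue  # same language: not a cross-lingual pair
                distance = hamming(signatures[id_a], signatures[id_b])
                if distance <= max_distance:
                    found[tuple(sorted((id_a, id_b)))] = distance
    return found
```

More permutations and a wider window raise the chance of finding a truly similar pair, at the cost of more comparisons. A brute-force baseline would instead compare every cross-lingual pair directly, which is the kind of effectiveness-efficiency tradeoff the abstract describes.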
Original language: English (US)
Title of host publication: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information - SIGIR '11
Publisher: Association for Computing Machinery (ACM)
Pages: 943-952
Number of pages: 10
ISBN (Print): 9781450309349
State: Published - 2011
