A Comparison of Approaches to Large-Scale Data Analysis
Abstract
There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system’s performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.