| |
Abstract
In the origin detection problem, an algorithm is given a set S of documents, ordered by creation time, and a query document D. It needs to output, for every consecutive sequence of k alphanumeric terms in D, the earliest document in S in which the sequence appeared (if such a document exists). Algorithms for the origin detection problem can, for example, be used to detect the "origin" of text segments in D and thus to detect novel content in D. They ...
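The problem statement in this abstract can be illustrated with a small in-memory sketch. It is not the paper's algorithm, only a naive baseline under stated assumptions: documents are plain strings listed in creation order, terms are lowercased runs of letters and digits, and the function names (`k_grams`, `build_index`, `detect_origins`) are hypothetical.

```python
import re


def k_grams(text, k):
    """Return every consecutive sequence of k alphanumeric terms in text."""
    # Assumption: terms are case-insensitive runs of ASCII letters and digits.
    terms = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [tuple(terms[i:i + k]) for i in range(len(terms) - k + 1)]


def build_index(docs, k):
    """Map each k-term sequence to the position of the earliest document containing it."""
    index = {}
    # docs is assumed to be ordered by creation time, oldest first.
    for doc_id, text in enumerate(docs):
        for gram in k_grams(text, k):
            # setdefault keeps the first (earliest) document seen for this sequence.
            index.setdefault(gram, doc_id)
    return index


def detect_origins(docs, query, k):
    """For each k-term sequence in query, report its earliest document, or None."""
    index = build_index(docs, k)
    return [(gram, index.get(gram)) for gram in k_grams(query, k)]


if __name__ == "__main__":
    S = ["the quick brown fox", "quick brown fox jumps", "a lazy dog"]
    D = "The quick brown fox jumps high"
    for gram, origin in detect_origins(S, D, k=3):
        print(" ".join(gram), "->", origin)
```

Sequences that map to `None` appear in no document of S, which is one way to flag novel content in D. This dictionary-based index holds every sequence in memory, so it only conveys the input and output of the problem, not a scalable solution.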
|
| |
In The 36th International Conference on Very Large Data Bases, Vol. 3 (September 2010)
|
| |
Abstract
This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with ...
|
| |
SIGOPS Oper. Syst. Rev. In Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles, Vol. 41, No. 6. (2007), pp. 205-220, doi:10.1145/1294261.1294281
Abstract
Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way ...
|
| |
Sci. Program., Vol. 13, No. 4. (October 2005), pp. 277-298
Abstract
Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and ...
|
| |
SIGOPS Oper. Syst. Rev. In Proceedings of the nineteenth ACM symposium on Operating systems principles, Vol. 37, No. 5. (October 2003), pp. 29-43, doi:10.1145/945445.945450
|
| |
Abstract
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, ...
|
| |
In Proceedings of the 7th symposium on Operating systems design and implementation (2006), pp. 335-350
Abstract
We describe our experiences with the Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for a loosely-coupled distributed system. Chubby provides an interface much like a distributed file system with advisory locks, but the design emphasis is on availability and reliability, as opposed to high performance. Many instances of the service have been used for over a year, with several of them each handling a few tens of thousands of clients concurrently. ...
|
| |
In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (2008), pp. 1099-1110, doi:10.1145/1376616.1376726
Abstract
There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of data collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative, SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its ...
|
| |
In Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6 (2004), pp. 10-10
Abstract
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. ...
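The map and reduce roles described in this abstract can be sketched in a few lines. This is a single-process illustration of the programming model only, not the distributed implementation the paper describes; the driver `map_reduce` and the word-count functions are hypothetical names chosen for the example.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map_fn over (key, value) records, group by intermediate key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        # The map function emits zero or more intermediate key/value pairs.
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    # The reduce function merges all values that share an intermediate key.
    return {k: reduce_fn(k, values) for k, values in groups.items()}


def word_count_map(doc_name, text):
    """Emit (word, 1) for every whitespace-separated word in the document."""
    for word in text.split():
        yield word, 1


def word_count_reduce(word, counts):
    """Sum the partial counts collected for one word."""
    return sum(counts)


if __name__ == "__main__":
    docs = [("a.txt", "x y x"), ("b.txt", "y z")]
    print(map_reduce(docs, word_count_map, word_count_reduce))
```

In a real deployment the grouping step is a distributed shuffle across machines; here it is an ordinary dictionary, which is enough to show how a user supplies only the two functions.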
|
| |
In NIPS (2006), pp. 281-288
|
| |
Proc. VLDB Endow., Vol. 2, No. 2. (August 2009), pp. 1626-1629
Abstract
The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive. Hadoop [3] is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware. However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse. ...
|