

Showing posts from May, 2009

Why you won't be building your killer app on a distributed hash table

I ran across "A case study in building layered DHT applications" while doing some research on implementing load balancing in Cassandra. The question addressed is, "Are DHTs a general-purpose tool that you can build more sophisticated services on?" Short version: no. A few specialized applications can be and have been built on a plain DHT, but most applications built on DHTs have ended up having to customize the DHT's internals to achieve their functional or performance goals. This paper describes the results of attempting to build a relatively complex data structure (prefix hash trees, for range queries) on top of OpenDHT. The result was mostly failure: "A simple put-get interface was not quite enough. In particular, OpenDHT relies on timeouts to invalidate entries and has no support for atomicity primitives... In return for ease of implementation and deployment, we sacrificed performance. With the OpenDHT implementation, a PHT query operation took a median of 2–4 seconds…"
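To make the round-trip cost concrete, here is a minimal sketch of the prefix-hash-tree idea over a bare put/get interface. It is a toy, in-memory illustration, not OpenDHT's actual API: the `dht_put`/`dht_get` names, the fixed-width binary keys, and the leaf-only storage are all simplifying assumptions. The point is that one range query decomposes into one get per candidate leaf, and each get is a network round trip.

```python
DHT = {}  # stand-in for the remote DHT; every dht_get() is a network round trip

def dht_put(key, value):
    DHT.setdefault(key, []).append(value)

def dht_get(key):
    return DHT.get(key)

BITS = 8  # toy key space: integers 0..255, stored under 8-bit binary keys

def pht_insert(x):
    # Store x under its full-width binary key. A real PHT also maintains
    # interior nodes and splits leaves on overflow, which is exactly where
    # the missing atomicity primitives bite.
    dht_put(format(x, f'0{BITS}b'), x)

def pht_range(lo, hi):
    # Descend only into subtrees whose key ranges intersect [lo, hi];
    # every leaf visit costs one dht_get(), i.e. one round trip.
    results, probes, stack = [], 0, ['']
    while stack:
        prefix = stack.pop()
        if len(prefix) == BITS:
            probes += 1
            results += [v for v in (dht_get(prefix) or []) if lo <= v <= hi]
            continue
        for bit in '01':
            p = prefix + bit
            lo_p = int(p.ljust(BITS, '0'), 2)  # smallest key under this prefix
            hi_p = int(p.ljust(BITS, '1'), 2)  # largest key under this prefix
            if hi_p >= lo and lo_p <= hi:
                stack.append(p)
    return sorted(results), probes

for v in (3, 12, 15, 200):
    pht_insert(v)
print(pht_range(10, 20))  # -> ([12, 15], 11): 11 round trips for an 11-key range
```

Eleven sequential round trips for an eleven-key range is how you end up with multi-second medians; fixing it (leaf splitting, caching, batching) is where the paper found the plain put/get interface too weak.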

Cassandra 0.3 release candidate and progress

We have a release candidate out for Cassandra 0.3. Grab the download and check out how to get started. The Facebook presentation from almost a year ago is also still a good intro to some of the features and the data model.

Cassandra in a nutshell:
- Scales writes very, very well: just add more nodes!
- Has a much richer data model than vanilla key/value stores -- closer to what you'd be used to in a relational db.
- Is pretty bleeding edge -- to my knowledge, Facebook is the only group running Cassandra in production. (Their largest cluster is 120 machines and 40TB of data.) At Rackspace we are working on a Cassandra-based app now that 0.3 has the extra features we need.
- Moved to the Apache Incubator about 40 days ago, at which point development greatly accelerated.

Changes in 0.3 include:
- Range queries on keys, including user-defined key collation.
- Remove support, which is nontrivial in an eventually consistent world (a sketch of the idea follows below).
- Workaround for a weird bug in JDK select/register that see…
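On the remove-support point: in an eventually consistent system you can't just delete a key locally, because a replica that missed the delete will hand the old value back during repair and "resurrect" it. The standard fix, and the one Cassandra uses, is to write a tombstone. Below is a minimal sketch assuming last-write-wins reconciliation by timestamp; the `Replica` and `repair` names are illustrative, not Cassandra's internals.

```python
TOMBSTONE = object()  # marker meaning "deleted"; it must replicate like any value

class Replica:
    """Toy last-write-wins store: each cell is (timestamp, value)."""
    def __init__(self):
        self.data = {}

    def write(self, key, value, ts):
        cur = self.data.get(key)
        if cur is None or ts > cur[0]:
            self.data[key] = (ts, value)

    def delete(self, key, ts):
        # A delete is just a write of a tombstone. If we removed the key
        # outright, repair from a stale replica would bring it back.
        self.write(key, TOMBSTONE, ts)

    def read(self, key):
        cur = self.data.get(key)
        return None if cur is None or cur[1] is TOMBSTONE else cur[1]

def repair(a, b):
    # Anti-entropy: merge the newer cell in each direction; the tombstone
    # wins over the older live value, which is what makes deletes stick.
    for key in set(a.data) | set(b.data):
        for src, dst in ((a, b), (b, a)):
            if key in src.data:
                ts, val = src.data[key]
                dst.write(key, val, ts)

# r1 deletes while r2 misses the delete; without the tombstone, repair
# would copy r2's stale value back into r1.
r1, r2 = Replica(), Replica()
r1.write('k', 'v', ts=1); r2.write('k', 'v', ts=1)
r1.delete('k', ts=2)
repair(r1, r2)
assert r1.read('k') is None and r2.read('k') is None
```

A real system also has to garbage-collect tombstones after a grace period, which reintroduces a (much longer, better-behaved) timeout of its own.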

A better analysis of Cassandra than most

Vladimir Sedach wrote a three-part dive into Cassandra. (Almost two months ago now. Guess I need to set up Google Alerts. Trouble is, there's a surprising amount of noise around the word `cassandra`.) Part 0 Part 1 Part 2

A few notes:
- We now have an order-preserving partitioner as well as the hash-based one.
- Yes, if you tell Cassandra to wait for all replicas to be ack'd before calling a write a success, then you would have traditional consistency (as opposed to "eventual"), but you'd also have no tolerance for hardware failures, which is a main point of this kind of system. (The overlap math is sketched below.)
- Zookeeper is not currently used by Cassandra, although we have plans to use it in the future.
- Load balancing is not implemented yet.
- The move to Apache is finished and development is active there now.
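To put a number on that consistency/availability trade: with N replicas, W acks required per write, and R replicas consulted per read, a read is guaranteed to see the latest write exactly when R + W > N. The brute-force check below is a toy illustration of that overlap argument, not Cassandra code.

```python
from itertools import combinations

def always_overlaps(n, r, w):
    # A read is guaranteed consistent if every possible set of W write-acks
    # intersects every possible set of R read-responses; that holds exactly
    # when r + w > n (pigeonhole).
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

assert always_overlaps(3, r=2, w=2)      # quorum reads + quorum writes
assert not always_overlaps(3, r=1, w=2)  # r + w == n: a read can miss the write
assert always_overlaps(3, r=1, w=3)      # w == n: consistent, but one dead
                                         # node blocks every write
```

The last case is the one from the note above: waiting for all replicas buys you traditional consistency at the cost of any single-node failure stalling writes.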

Consistent hashing vs order-preserving partitioning in distributed databases

The Cassandra distributed database now supports two partitioning schemes: the traditional consistent-hashing scheme, and an order-preserving partitioner. The reason almost all similar systems use consistent hashing (the Dynamo paper has the best description; see sections 4.1-4.3) is that it provides a kind of brain-dead load balancing from the hash algorithm spreading keys across the ring. But the Dynamo authors go into some detail about how this by itself doesn't actually give good results in practice; their solution was to assign multiple tokens to each node in the cluster, and they describe several approaches to that. But Cassandra's original designer considers this a hack and prefers real load balancing. An order-preserving partitioner, where keys are distributed to nodes in their natural order, has huge advantages over consistent hashing, particularly the ability to do range queries across the keys in the system (which has also been committed to Cassandra now…
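A toy sketch of the contrast (illustrative only; the token types, hash function, and node placement all differ in Cassandra): both partitioners place keys on a ring of node tokens, but the hash partitioner scatters lexically adjacent keys while the order-preserving one keeps them together.

```python
import bisect
import hashlib

def owner(ring, token):
    # ring: sorted (token, node) pairs; a key belongs to the first node
    # whose token is >= the key's token, wrapping around the ring.
    tokens = [t for t, _ in ring]
    return ring[bisect.bisect_left(tokens, token) % len(ring)][1]

def hash_token(key):
    # Consistent hashing: ring position is a hash of the key,
    # so adjacent keys land in unrelated places.
    return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)  # 32-bit token

def order_token(key):
    # Order-preserving: ring position follows the key's natural byte
    # order, so a key range maps to a contiguous arc of the ring.
    return int.from_bytes(key.encode().ljust(8, b'\0')[:8], 'big')  # 64-bit

keys = ['user:%03d' % i for i in range(100, 110)]

for name, token_fn, space in (('hash', hash_token, 2**32),
                              ('order-preserving', order_token, 2**64)):
    # Four nodes with tokens spaced evenly across the token space.
    ring = [(space // 4 * i, 'node%d' % i) for i in range(1, 5)]
    placement = {k: owner(ring, token_fn(k)) for k in keys}
    print(name, sorted(set(placement.values())))
```

The output shows both sides of the trade: the hash ring typically spreads the ten adjacent keys across several nodes for free, while the order-preserving ring puts them all on one node, which is what makes range queries cheap and also exactly why it needs the real load balancing discussed above.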