2010/Cassandra: Strategies for Distributed Data Storage
Cassandra is an open source, highly scalable distributed database that brings together Dynamo’s fully distributed design and Bigtable’s ColumnFamily-based data model. In this talk we’ll discuss the strategies Cassandra employs to provide an eventually consistent data model.
Speaker: Kelvin Kakugawa
Return to this session's details
Contributed notesHorizontal scaling with SQL means sharding. That pushes data logic into the application layer.
Availability versus Consistency - the CAP Therom, pick two out of consistency, availability, and partition tolerance. Availability means that CRUD access is available even when some nodes are down.
SQL replication without sharding means a write-master and read-only slaves. Not good for availability.
Hinted Handoff - how to write to an unavailable node? Write a hinted write to an available node, which will deliver the message to the dead node when it comes back.
How does a node know when another node is available? Pinging every node does not scale. The answer is Gossip.
With Cassandra a "coordinator" node proxies requests among a replicated cluster.
Anti-Entropy Gossip Protocol - distinct from Rumor Mongering. Communicates state to random selection of other nodes. State information propagates in logarithmic time. Nodes broadcast state continuously.
Phi-Accrual Failure Detector - as received heartbeat intervals increase "suspicion" level that the sender might be dead increases.
Read-Repair - when a write has not propagated Fully, repair outdated nodes on read. Quorum read means writing to at least 3 nodes at once. Check digests from the first two writes and on mismatches do a full read on all nodes and repair as necessary. Use timestamps on data to determine which data is most current. Depends on clock synchronization; vector clocks may help.
Anti-Entropy Service - detect inconsistency via hash trees and repair as necessary. Nodes hash the hashes of their data blocks. Exchange hash trees between replicas of a data set. Each replica walks received trees for mismatched hashes and sends updates for data blocks that are out of date. Again, use timestamps to determine which data is good.
Consistency levels include zero, write but don't wait for success response; one, read to or write from one node, data may not be fully up-to-date; quorum, write to or read from two (or more?) nodes at a time; all, ensures that reads and writes are fully consistent across every node.