Recently, I did a webinar on the 15 Commandments of Cassandra Admin. I can’t claim responsibility for all of these. I worked with Rachel Padreschi (@RachelPadreschi) on this presentation. Unfortunately, she wasn’t able to make it to the live broadcast, but I want to make sure she gets all the credit she deserves. We have seen quite a few installations in the wild and, like anything you do over and over, you notice you are repeating yourself. It began with 10 commandments, but soon grew to 15. Luckily, we stopped ourselves there. Maybe I’ll revise these someday but, as of February 2014, these look pretty good.
So what about these? If you are using or running Cassandra, what we are saying is: pay attention to this list. If I saw you hitting yourself in the head with a hammer, I would really want to take it away. You then might say “Wow dude. Thanks for taking away that hammer. I feel so much better.” That makes me happy. I know not every one of these is straight admin, but trust me, they are all related to administering a running cluster.
My plan in this blog is to introduce the commandments and then follow up with an in-depth article on each. That’s going to be a lot of blog posts I just created for myself. What am I thinking? Oh I know! I’m thinking you might be able to use these. Here is the list, and don’t hesitate to ask questions. Feedback is always appreciated.
Commandment 1 – Great data models start with the queries
Cassandra is not a relational system and, because of that, doesn’t do joins. You ingest data into a data model with a preconception of just what will be asked of that data. Unlike relational modeling, which goes Data->Model->Queries, we reverse things and go Queries->Model->Data.
Commandment 2 – It’s ok to duplicate data
If you have many questions of the same data, you might need to build different views. This will result in duplicate data, but no worries! Speed isn’t going to be a problem. Volume isn’t going to be a problem. Go ahead and duplicate away.
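To make the idea concrete, here is a minimal sketch in plain Python (no Cassandra driver involved). The two dictionaries stand in for two tables, each shaped for exactly one query; the names `videos_by_user`, `videos_by_tag` and `add_video` are made up for illustration.

```python
from collections import defaultdict

# Two in-memory "tables", each shaped for exactly one query.
videos_by_user = defaultdict(list)   # query: all videos uploaded by a user
videos_by_tag = defaultdict(list)    # query: all videos carrying a tag

def add_video(user, title, tags):
    """Write the same video to every view that a query needs."""
    videos_by_user[user].append(title)
    for tag in tags:
        videos_by_tag[tag].append(title)

add_video("alice", "Intro to CQL", ["cassandra", "tutorial"])
add_video("bob", "Tuning compaction", ["cassandra"])

print(videos_by_user["alice"])     # answered from one "partition"
print(videos_by_tag["cassandra"])  # same data, stored again for a different question
```

Each question is answered by a single key lookup, and the price is that every title is stored more than once.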
Commandment 3 – Disk IO is your first problem
Almost every major problem I have had to diagnose on a Cassandra cluster has come down to storage. Misconfigured, underpowered or just plain wrong. Cassandra disk IO patterns are very different from those of other databases. Good thing to understand.
Commandment 4 – Secondary Indexes are for convenience not speed
In relational databases we add indexes (mostly) for speed. Secondary indexes in Cassandra are used to find data stored in columns on various rows. This results in a distributed query that, in some cases, can be much slower than creating our own index tables. There are good uses for them but just understand the tradeoffs and alternatives. (See Commandment 5 for an alternative)
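Here is a toy model of the tradeoff, assuming a hypothetical 3-node cluster held in plain Python dictionaries. It is a simplification: it treats the secondary index query as a fan-out to every node, which is the worst case, and it ignores replication entirely. All names are invented for the example.

```python
# Hypothetical 3-node cluster: each node holds user rows keyed by user id.
nodes = [
    {"u1": {"email": "a@example.com"}, "u4": {"email": "d@example.com"}},
    {"u2": {"email": "b@example.com"}},
    {"u3": {"email": "c@example.com"}},
]

def find_by_secondary_index(email):
    """Ask every node to check its local index; count the nodes contacted."""
    contacted, hits = 0, []
    for node in nodes:
        contacted += 1
        hits += [uid for uid, row in node.items() if row["email"] == email]
    return hits, contacted

# Index table maintained by the application: email acts as the partition key.
users_by_email = {row["email"]: uid for node in nodes for uid, row in node.items()}

def find_by_index_table(email):
    """One partition lookup, so only one node is contacted."""
    uid = users_by_email.get(email)
    return ([uid] if uid else []), 1

print(find_by_secondary_index("c@example.com"))
print(find_by_index_table("c@example.com"))
```

Both calls find the same user, but the first one had to touch every node to do it.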
Commandment 5 – Embrace large partitions and de-normalization
The storage model of Cassandra lends itself to really fast access of large partitions. Understanding how and why those work can lead to some very fast and server efficient data models. If you are looking for speed and found yourself at Commandment 4, this is the way to do it right.
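A rough way to picture why this is fast: within a partition, rows are kept sorted by their clustering key, so a range read is one contiguous slice. The sketch below imitates that with Python's `bisect` module. The `Partition` class and the sensor data are hypothetical, and this is an analogy for the on-disk layout, not how Cassandra is implemented.

```python
import bisect

class Partition:
    """One partition: rows kept sorted by clustering key (here, a timestamp)."""

    def __init__(self):
        self._keys = []
        self._values = []

    def insert(self, ts, value):
        # Find the sorted position and insert there, keeping both lists aligned.
        i = bisect.bisect_left(self._keys, ts)
        self._keys.insert(i, ts)
        self._values.insert(i, value)

    def slice(self, start, end):
        """Return values with start <= ts < end as one contiguous range."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return self._values[lo:hi]

readings = {}  # partition key (sensor id) -> Partition
for ts, temp in [(30, 21.5), (10, 20.1), (20, 20.9), (40, 22.0)]:
    readings.setdefault("sensor-1", Partition()).insert(ts, temp)

print(readings["sensor-1"].slice(10, 31))
```

Rows arrive out of order but are read back in order, with no sorting at query time.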
Commandment 6 – Don’t be afraid to add nodes
Too many times I see people trying to vertically scale Cassandra. There is some call for that when trying to achieve data density, but the ultimate scaling story is in the horizontal. More nodes! This really makes capacity planning much easier and when the time comes to add capacity, don’t be afraid!
Commandment 7 – Mind your compactions
Compaction is the background process of combining SSTables in Cassandra. Completely normal operation. It serves to merge and sort row data spread across multiple files, and also to clean out old data. It is the most impactful IO event Cassandra can create and, if not managed or considered, old files can build up. To keep things smooth and efficient, mind those compactions!
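If it helps to see the merge itself, here is a minimal sketch that treats each SSTable as a Python dictionary of key -> (write timestamp, value). It is heavily simplified: real compaction streams sorted files from disk, and deleted data (tombstones) is only purged after the table's gc_grace_seconds has passed, which this sketch ignores. The `compact` function and `TOMBSTONE` marker are illustrative names.

```python
TOMBSTONE = None  # marker for a deleted value in this sketch

def compact(sstables):
    """Merge SSTables into one: newest write per key wins, tombstones dropped.

    Each SSTable is a dict of key -> (write_timestamp, value).
    """
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # Drop deleted rows and emit keys in sorted order.
    return {k: merged[k] for k in sorted(merged) if merged[k][1] is not TOMBSTONE}

old = {"a": (1, "v1"), "b": (1, "v1")}
new = {"a": (2, "v2"), "b": (3, TOMBSTONE), "c": (2, "v1")}
compacted = compact([old, new])
print(compacted)
```

Two files with five entries become one file with two live rows, which is why falling behind on compaction costs both disk space and read performance.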
Commandment 8 – Never use shared storage
Yeah I said that. Just don’t. There isn’t yet a shared storage system that can bring the latency, and throughput, that local disk can deliver. Not to mention, if you are using a distributed system, why create a single point of failure in a single shared storage system? Spread the risk!
Commandment 9 – Understand cache. Disk, Key and Row
Cassandra cache is often misunderstood. To better understand how it all relates, you really need to know some internals. Each cache layer is important in its own way, and there can even be a downside if one is improperly used.
Commandment 10 – Always use JNA
JNA (Java Native Access) is a library that lets the running JVM call native code outside of itself. We use it extensively with Cassandra to enable things like off-heap storage. This opens up a lot of efficient memory access patterns that you are really going to miss if it is not present. Miss, as in you’ll be crying like a baby and wishing you had them. Just do it.
Commandment 11 – Learn how to load data. Bulk Load. Insert. Copy
Chances are you will need to load a lot of data into a running Cassandra cluster at some point. There are different methods based on the volume and how much control you need. Copy is good for fixed columns up to around a million rows. Insert is good for controlling any conversions of source data and can scale to any size. Bulk Load is for when you need to get a million plus rows in your cluster fast.
Commandment 12 – Repair isn’t just for broken data
Way back in the original Dynamo paper, it was known that bad things happen to good data. Consistency can vary over large clusters just through entropy. You can check and correct that consistency with the repair command in Cassandra. Also known as anti-entropy repair, it’s a good thing to run on a regular basis.
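The core idea can be sketched in a few lines: compare a digest of each replica, and if they differ, bring every replica up to the newest version of each value. This is only an illustration. Real repair compares Merkle trees over token ranges rather than one hash per replica, and the `digest` and `repair` functions below are invented for the example.

```python
import hashlib

def digest(replica):
    """Hash a replica's contents so replicas can be compared cheaply."""
    h = hashlib.sha256()
    for key in sorted(replica):
        h.update(repr((key, replica[key])).encode())
    return h.hexdigest()

def repair(replicas):
    """If digests differ, push the newest (timestamp, value) per key everywhere."""
    if len({digest(r) for r in replicas}) == 1:
        return False  # already consistent, nothing to do
    newest = {}
    for r in replicas:
        for key, cell in r.items():
            if key not in newest or cell[0] > newest[key][0]:
                newest[key] = cell
    for r in replicas:
        r.update(newest)
    return True

r1 = {"k": (2, "new")}
r2 = {"k": (1, "old")}   # missed the latest write
r3 = {}                  # missed the row entirely
first_run = repair([r1, r2, r3])
second_run = repair([r1, r2, r3])
print(first_run, second_run, r2, r3)
```

Nothing in this example was broken in the usual sense. Two replicas simply missed a write, which is exactly the kind of drift repair exists to clean up.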
Commandment 13 – Know the relationship between Consistency Level and Replication Factor
Two things that go hand in hand, skipping down the trail of your data model. Replication Factor is how many copies of your data exist per keyspace and data center. Consistency Level is set by the client per read and write, and specifies how many replicas must acknowledge or respond to the request. Knowing how these work together is critical for good performance and uptime.
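The arithmetic behind that relationship is small enough to write down. The sketch below uses the usual rules of thumb: QUORUM is a majority of the replicas, and a read is guaranteed to overlap the latest write when read replicas plus write replicas exceed the Replication Factor. The function names are my own, and the sketch assumes a single data center.

```python
def quorum(replication_factor):
    """Replicas needed for QUORUM: a strict majority of RF."""
    return replication_factor // 2 + 1

def reads_see_latest_write(rf, write_replicas, read_replicas):
    """True when read and write replica sets must overlap (R + W > RF)."""
    return read_replicas + write_replicas > rf

def tolerated_failures(rf, required_replicas):
    """Replicas that can be down while requests at this level still succeed."""
    return rf - required_replicas

rf = 3
q = quorum(rf)
print(q)                                  # replicas needed for QUORUM at RF=3
print(reads_see_latest_write(rf, q, q))   # QUORUM writes + QUORUM reads
print(reads_see_latest_write(rf, 1, 1))   # ONE writes + ONE reads
print(tolerated_failures(rf, q))          # nodes you can lose at QUORUM
```

With RF=3, QUORUM reads and writes overlap and survive one node down, while ONE/ONE is faster but can return stale data.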
Commandment 14 – More than 8G heap doesn’t mean better performance
Cassandra is written in Java so, as a result, knowing a bit about the JVM when administering isn’t a bad thing. There is a dark art to working with the JVM heap and the various settings. One common mistake is to just give the JVM more memory, in hopes it will increase the performance by giving it some sort of headroom. Unfortunately, this can backfire and result in painfully long garbage collection pauses. Keep it at or below 8G and be happy.
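For reference, the default heap calculation in the cassandra-env.sh script of this era already bakes in that ceiling. As I recall it from the script's comments, the rule is max(min(1/2 RAM, 1024MB), min(1/4 RAM, 8GB)). The Python below is just a restatement of that rule for illustration, so check the script shipped with your version before relying on it.

```python
def default_max_heap_mb(system_memory_mb):
    """Restate the assumed default rule: max(min(RAM/2, 1GB), min(RAM/4, 8GB))."""
    half_capped = min(system_memory_mb // 2, 1024)
    quarter_capped = min(system_memory_mb // 4, 8192)
    return max(half_capped, quarter_capped)

for ram_gb in (2, 16, 64, 256):
    print(ram_gb, "GB RAM ->", default_max_heap_mb(ram_gb * 1024), "MB heap")
```

Notice that a 256GB machine gets the same heap as a 64GB one. The rest of the memory is not wasted, since the operating system uses it to cache disk reads.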
Commandment 15 – Get involved in the community
Ok, so not a real rule but something that will definitely improve the administration of your cluster. How? There are so many good resources for Cassandra out there and one of the best is the other people doing it with you. Be part of the community and learn from each other. Want to know the best way to model certain kinds of data? Ask. Found some great tuning parameters for a particular data access pattern? Tell the world in a tweet, a blog or, even better, a community webinar! It’s how I got started and hopefully why you stay.