Patrick McFadin

15 Commandments of Cassandra Admin

24th February, 2014 · Patrick McFadin

Recently, I did a webinar on the 15 Commandments of Cassandra Admin. I can’t claim responsibility for all of these. I worked with Rachel Padreschi (@RachelPadreschi) on this presentation. Unfortunately, she wasn’t able to make it to the live broadcast, but I want to make sure she gets all the credit she deserves. We have seen quite a few installations in the wild and, like anything you do over and over, you notice you are repeating yourself. It began with 10 commandments, but soon grew to 15. Luckily, we stopped ourselves there. Maybe I’ll revise these someday but, as of February 2014, these look pretty good.

So what about these? If you are using or running Cassandra, what we are saying is pay attention to this list. If I saw you hitting yourself in the head with a hammer, I would really want to take it away. You then might say “Wow dude. Thanks for taking away that hammer. I feel so much better.” That makes me happy. I know not every commandment is strictly about administration but, trust me, they all relate to administering a running cluster.

My plan in this blog is to introduce the commandments and then follow up with an in-depth article on each. That’s going to be a lot of blog posts I just created for myself. What am I thinking? Oh I know! I’m thinking you might be able to use these. Here is the list, and don’t hesitate to ask questions. Feedback is always appreciated.

Commandment 1 – Great data models start with the queries

Cassandra is not a relational system and, because of that, doesn’t do joins. You design a data model with a preconception of just what will be asked of that data. Unlike relational modeling, which goes Data->Model->Queries, we reverse things and go Queries->Model->Data.

Commandment 2 – It’s ok to duplicate data

If you need to ask many different questions of the same data, you might need to build a different view for each one. This will result in duplicate data, but no worries! Speed isn’t going to be a problem. Volume isn’t going to be a problem. Go ahead and duplicate away.
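To make the idea concrete, here is a minimal sketch of writing the same record into two query-specific views. It uses plain Python dictionaries as stand-ins for Cassandra tables; the names `users_by_id`, `users_by_city` and `save_user` are made up for illustration and are not part of any Cassandra API.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two tables, one per query.
users_by_id = {}                     # serves "find the user with this id"
users_by_city = defaultdict(dict)    # serves "list the users in this city"


def save_user(user_id, name, city):
    """Write the same row into every view that has to answer a query."""
    row = {"user_id": user_id, "name": name, "city": city}
    users_by_id[user_id] = row
    users_by_city[city][user_id] = row  # duplicated on purpose


save_user(1, "alice", "Austin")
save_user(2, "bob", "Austin")
save_user(3, "carol", "Boston")
```

Each query is then a single lookup by key, with no join and no scan. The cost is one extra write per view, which is the trade this commandment asks you to accept.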

Commandment 3 – Disk IO is your first problem

Almost every major problem I have had to diagnose on a Cassandra cluster has come down to storage. Misconfigured, underpowered or just plain wrong. Cassandra disk IO patterns are very different from those of other databases. Good thing to understand.

Commandment 4 – Secondary Indexes are for convenience not speed

In relational databases we add indexes (mostly) for speed. Secondary indexes in Cassandra are used to find data stored in columns on various rows. This results in a distributed query that, in some cases, can be much slower than creating our own index tables. There are good uses for them, but just understand the tradeoffs and alternatives. (See Commandment 5 for an alternative.)

Commandment 5 – Embrace large partitions and de-normalization

The storage model of Cassandra lends itself to really fast access of large partitions. Understanding how and why those work can lead to some very fast and server-efficient data models. If you are looking for speed and found yourself at Commandment 4, this is the way to do it right.
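As a rough illustration of why a large partition reads quickly, the sketch below keeps all rows for one partition key together and sorted by a clustering value, so a range query becomes one contiguous slice. The sensor/day layout, the integer timestamps and the function names are assumptions made for the example, not something Cassandra prescribes.

```python
import bisect
from collections import defaultdict

# Hypothetical layout: partition key -> rows kept sorted by timestamp.
partitions = defaultdict(list)


def insert(sensor_id, day, ts, value):
    """Add a (ts, value) row to its partition, keeping clustering order."""
    bisect.insort(partitions[(sensor_id, day)], (ts, value))


def read_range(sensor_id, day, start_ts, end_ts):
    """Return rows with start_ts <= ts < end_ts as one contiguous slice."""
    rows = partitions.get((sensor_id, day), [])
    lo = bisect.bisect_left(rows, (start_ts,))
    hi = bisect.bisect_left(rows, (end_ts,))
    return rows[lo:hi]


insert("fridge-1", "2014-02-24", 30, 4.0)
insert("fridge-1", "2014-02-24", 10, 4.1)
insert("fridge-1", "2014-02-24", 20, 3.9)
```

Here `read_range("fridge-1", "2014-02-24", 10, 30)` finds the start of the range once and then walks forward, which is the access pattern that makes wide, de-normalized partitions pay off.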

Commandment 6 – Don’t be afraid to add nodes

Too many times I see people trying to vertically scale Cassandra. There is some call for that when trying to achieve data density, but the ultimate scaling story is in the horizontal. More nodes! This really makes capacity planning much easier and when the time comes to add capacity, don’t be afraid!

Commandment 7 – Mind your compactions

Compaction is the background process of combining SSTables in Cassandra. Completely normal operation. It serves to merge/sort row data in multiple files, but also to clean out old data. It is the most impactful IO event Cassandra can create and, if it is not managed or considered, old files can build up. To keep things smooth and efficient, mind those compactions!
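For intuition only, here is a toy version of that merge in Python. It is a simplification under stated assumptions: each "SSTable" is a list of `(key, write_time, value)` tuples sorted by key, `None` stands in for a delete marker, and deleted rows are dropped immediately, whereas real Cassandra keeps them around for a grace period. The function name `compact` is made up for this sketch.

```python
import heapq

TOMBSTONE = None  # assumed marker for a deleted value in this sketch


def compact(*sstables):
    """Merge sorted runs, keep the newest write per key, drop deletes."""
    merged = []
    ordered = heapq.merge(*sstables, key=lambda row: (row[0], row[1]))
    for row in ordered:
        if merged and merged[-1][0] == row[0]:
            merged[-1] = row       # same key, later write_time wins
        else:
            merged.append(row)
    return [row for row in merged if row[2] is not TOMBSTONE]


old = [("a", 1, "x"), ("b", 1, "y")]
new = [("a", 2, "z"), ("b", 2, TOMBSTONE), ("c", 2, "w")]
result = compact(old, new)
```

Two input files become one smaller file: the stale value for `a` and the deleted `b` are gone. Every byte still has to be read and rewritten to get there, which is why compaction shows up so clearly in your disk IO.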

Commandment 8 – Never use shared storage

Yeah I said that. Just don’t. There isn’t yet a shared storage system that can bring the latency, and throughput, that local disk can deliver. Not to mention, if you are using a distributed system, why create a single point of failure in a single shared storage system? Spread the risk!

Commandment 9 – Understand cache. Disk, Key and Row

Cassandra cache is often misunderstood. To better understand how it all relates, you really need to know some internals. Each cache layer is separately important and there can even be a downside if improperly used.

Commandment 10 – Always use JNA

JNA (Java Native Access) is the library that lets the running JVM call native system libraries. We use it extensively with Cassandra to enable things like off-heap storage. This opens up a lot of efficient memory access patterns that you are really going to miss if it is not present. Miss, as in you’ll be crying like a baby, and wish you had them. Just do it.

Commandment 11 – Learn how to load data. Bulk Load. Insert. Copy

Chances are you will need to load a lot of data into a running Cassandra cluster at some point. There are different methods based on the volume and how much control you need. Copy is good for fixed columns up to around a million rows. Insert is good for controlling any conversions of source data and can scale to any size. Bulk Load is for when you need to get a million plus rows in your cluster fast.

Commandment 12 – Repair isn’t just for broken data

Way back in the original Dynamo paper, it was known that bad things happen to good data. Consistency can vary over large clusters just through entropy. You can check and correct that consistency with the repair command in Cassandra. Also known as anti-entropy repair, it’s a good thing to run all the time.

Commandment 13 – Know the relationship between Consistency Level and Replication Factor

Two things that go hand in hand, skipping down the trail of your data model. Replication Factor is how many copies of your data exist per keyspace and data center. Consistency Level is set by the client per read and write, and specifies how many replicas must acknowledge or respond to the request. Knowing how these work together is critical for good performance and uptime.
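The arithmetic behind that relationship is small enough to sketch. The helper names below are invented for illustration; the rules they encode are the usual ones: a quorum is a majority of replicas, and a read is guaranteed to overlap the latest write when the read and write replica counts add up to more than the Replication Factor.

```python
def quorum(replication_factor):
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set."""
    return write_replicas + read_replicas > replication_factor


def tolerated_failures(replication_factor, required_replicas):
    """Replicas that can be down while requests still succeed."""
    return replication_factor - required_replicas


rf = 3
q = quorum(rf)
```

With a Replication Factor of 3, a quorum is 2, so quorum reads plus quorum writes overlap and you can still lose one node. Drop both to a single replica and you gain speed and availability but give up that overlap.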

Commandment 14 – More than 8G heap doesn’t mean better performance

Cassandra is written in Java so, as a result, knowing a bit about the JVM when administering isn’t a bad thing. There is a dark art to working with the JVM heap and the various settings. One common mistake is to just give the JVM more memory, in hopes it will increase the performance by giving it some sort of headroom. Unfortunately, this can backfire and result in painfully long garbage collection pauses. Keep it at or below 8G and be happy.
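As a rough sketch, the default heap sizing in the stock cassandra-env.sh script of this era follows a rule like the one below: take the larger of "half of RAM, capped at 1G" and "a quarter of RAM, capped at 8G". Treat the exact thresholds as an assumption and check the script that ships with your version; the function name is made up for this example.

```python
def default_max_heap_mb(system_ram_mb):
    """Assumed rule: max(min(RAM/2, 1024 MB), min(RAM/4, 8192 MB))."""
    half_capped = min(system_ram_mb // 2, 1024)
    quarter_capped = min(system_ram_mb // 4, 8192)
    return max(half_capped, quarter_capped)


small_box = default_max_heap_mb(2 * 1024)    # 2G of RAM
medium_box = default_max_heap_mb(16 * 1024)  # 16G of RAM
big_box = default_max_heap_mb(64 * 1024)     # 64G of RAM
```

Notice that the result never goes above 8G no matter how much RAM the machine has. The rest of the memory is not wasted, since the operating system can use it for disk cache.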

Commandment 15 – Get involved in the community

Ok, so not a real rule but something that will definitely improve the administration of your cluster. How? There are so many good resources for Cassandra out there and one of the best is the other people doing it with you. Be part of the community and learn from each other. Want to know the best way to model certain kinds of data? Ask. Found some great tuning parameters for a particular data access pattern? Tell the world in a tweet, blog or even better, community webinar! It’s how I got started and hopefully why you stay.

Strata – Santa Clara 2014 Wrap Up

17th February, 2014 · Patrick McFadin

Last week was the annual pilgrimage of data nerds everywhere to Strata – Santa Clara. There is now more than one of these events, but this is the original and still the go-to for many. The Santa Clara Convention Center is home to so many conferences, they tend to blur a bit for me. Am I at Velocity or Strata? No wait. Cloud Connect? What differentiates it quickly for me is the sheer volume of friends and colleagues I meet every year.

DataStax Strata Booth

We brought couches. Come by and let’s talk!

DataStax sponsored a booth and it turned out to be a great gathering place for conversations and catching up. I mean look. We brought couches!

Raspberry Pi Cassandra Cluster

Not only did we bring couches, we brought a cluster of Cassandra! Ok, it’s all running on Raspberry Pis all stacked up, but it’s still pretty cool. With three nodes running we can show off the failure modes of Cassandra by simply removing one… or two. Professor Andy Cobely from University of Dundee did a great talk at our summit last year on using the Raspberry Pi with Cassandra. You aren’t going to break any speed records but it’s a really cheap way of creating a cluster of real machines.

One thing I can say about most conferences: the value isn’t always in the great presentations lined up. I get so much out of the conversations in the hallways. Seeing where other open source projects are. Catching up with friends on their latest gigs. All good stuff. I ran into the guys from the Apache Mesos project. They are doing a lot of great work automating the deployment of Cassandra. As you can read in this tutorial, getting it done isn’t as hard as you would think. More of the great ecosystem forming around Apache Cassandra.

My talk this year was on Time Series with Apache Cassandra. Every year I hear sub-themes around Strata and I would say this year that was the Internet of Things or IoT. Let’s face it. Everything is getting an internet connection today. CES in January showed off new appliances and gadgets for the home. Almost all of them had an internet connection. That’s a lot of devices out there all talking and dumping data into the interwebs. Where is all that going to get stored? More importantly, how do you keep up when your refrigerator is just going on and on about how it’s exactly the same temperature? Always. So my talk was an answer to those questions.

Time Series with Apache Cassandra

The ability of Cassandra to manage large volumes of data at ridiculous speeds is pretty well proven. What you may not know is why. There is a bit of a strategy for this type of data storage and with Cassandra it’s all in the way the storage engine works. As I outlined in my talk, it’s all about ingesting data into memtables first, merge sorting partitions and then flushing in a single sequential write to the file system. This write pattern serves two purposes. First, it is the most efficient way of getting data put to disk. Second, it sets up that data to be read in a single-seek, sequential-access fashion. Check out my slides for some pretty pictures on the topic and further explanation.
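Here is a toy sketch of that write path, under heavy simplification: writes land in an in-memory table, and a flush emits everything as one key-sorted run, which stands in for the single sequential write. The `Memtable` class and its methods are illustrative names for this post, not Cassandra’s actual internals.

```python
class Memtable:
    """Toy in-memory write buffer that flushes as one sorted run."""

    def __init__(self):
        self.rows = {}

    def write(self, key, value):
        # Writes only touch memory; the latest value per key wins.
        self.rows[key] = value

    def flush(self):
        # Sort once, then hand back the whole run to be written sequentially.
        sstable = sorted(self.rows.items())
        self.rows = {}
        return sstable


memtable = Memtable()
memtable.write("sensor-b", 1)
memtable.write("sensor-a", 1)
memtable.write("sensor-a", 2)
flushed = memtable.flush()
```

Because the flushed run is already sorted, a later read for a key or a range of keys can seek once and then read forward, which is the second benefit described above.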

My next Strata will be New York in October. This time I’ll be doing a 3-hour tutorial on how to do what I say you should do. (Not like the Bobs from Office Space.) I hope to see you there. And if you see me in the halls, stop me and let’s talk about what you are doing. I would love to hear what you are up to and how I can help.

Storm Trooper at Strata

These are not the data nerds you are looking for

 
