Big Data Architect’s Handbook

Apache Cassandra

Cassandra is an open source NoSQL database and an Apache project. It was initially built at Facebook for internal use, specifically to power search over the messages in users' inboxes. Facebook later released it as open source, and it subsequently became an Apache project.

Cassandra runs in a distributed environment. Its architecture was designed on the assumption that software and hardware will fail, so a failure somewhere in a cluster does not have to bring down the whole operation.

Some of the key features of Apache Cassandra are as follows:

  • All nodes in a Cassandra cluster have the same role; in other words, there is no master node or leader node. Every node can service a client request. This gives the system high availability with no single point of failure.
  • Cassandra can run on many nodes spread across multiple data centers. It replicates data across these data centers so that the loss of a node, or even of a whole data center, does not cause downtime. This makes it a highly fault-tolerant system.
  • Cassandra supports integration with Apache Hadoop, so MapReduce jobs can read from and write to data stored in Cassandra.

Cassandra nodes exchange information through a peer-to-peer gossip protocol. Its primary purpose is to let the nodes in a cluster communicate with each other internally, without any central coordinator. Using gossip, nodes periodically share their current state, location, heartbeats, and more. Each node also persists the information it collects through gossip, so after a temporary failure or a restart that information is immediately available and the node can pick up where it left off.
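The idea of gossip can be illustrated with a small simulation. The following is a minimal sketch, not Cassandra's actual implementation: the `Node` and `NodeState` classes and the `run_rounds` function are hypothetical names introduced here, state is kept only in memory (nothing is persisted), and each round every node simply contacts one randomly chosen peer and both keep whichever entry has the higher heartbeat.

```python
import random
from dataclasses import dataclass, field


@dataclass
class NodeState:
    """What one node knows about a peer (hypothetical, heavily simplified)."""
    heartbeat: int = 0   # grows every time the described node "beats"
    status: str = "UP"


@dataclass
class Node:
    name: str
    view: dict = field(default_factory=dict)  # peer name -> NodeState

    def __post_init__(self):
        # A node always knows about itself.
        self.view[self.name] = NodeState()

    def beat(self):
        """Bump this node's own heartbeat counter."""
        self.view[self.name].heartbeat += 1

    def gossip_with(self, peer):
        """Exchange views; both sides keep the entry with the higher heartbeat."""
        for name in set(self.view) | set(peer.view):
            mine = self.view.get(name)
            theirs = peer.view.get(name)
            if mine is None or (theirs is not None and theirs.heartbeat > mine.heartbeat):
                newest = theirs
            else:
                newest = mine
            self.view[name] = NodeState(newest.heartbeat, newest.status)
            peer.view[name] = NodeState(newest.heartbeat, newest.status)


def run_rounds(nodes, rounds, seed=0):
    """Each round, every node beats once and gossips with one random peer."""
    rng = random.Random(seed)
    for _ in range(rounds):
        for node in nodes:
            node.beat()
            peer = rng.choice([n for n in nodes if n is not node])
            node.gossip_with(peer)
```

Because every exchange merges both views, knowledge about a node spreads through the cluster over a few rounds even though no node ever talks to all of the others directly.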

The component that decides where data lives in Cassandra is the partitioner. The partitioner determines which node receives the first copy of a row, and the replication strategy then determines which other nodes receive the remaining copies. To do this, the partitioner hashes the row's partition key, which is the first part of the primary key, into a token. An interesting point to note is that the partition key does more than help identify a row uniquely: it also determines data locality. Each node owns a range of tokens, so the token computed from the partition key tells any node in the cluster which nodes hold the data.
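The mapping from a partition key to a set of nodes can be sketched as a token ring. This is a simplified illustration under stated assumptions: the `TokenRing` class and `token_for` function are hypothetical, MD5 stands in for the hash function (Cassandra's default partitioner uses Murmur3), each node gets a single token derived from its name (real clusters typically use several virtual nodes per machine), and the remaining replicas are simply the next nodes clockwise on the ring, ignoring racks and data centers.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a token (MD5 is only a stand-in hash here)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Hypothetical, simplified ring: one token per node."""

    def __init__(self, node_names):
        # Sort nodes by token so the ring can be searched with bisect.
        self._ring = sorted((token_for(name), name) for name in node_names)
        self._tokens = [token for token, _ in self._ring]

    def replicas(self, partition_key, replication_factor=3):
        """Return the names of the nodes that hold copies of this partition."""
        size = len(self._ring)
        rf = min(replication_factor, size)
        # First replica: the first node whose token is >= the key's token,
        # wrapping around to the start of the ring if necessary.
        start = bisect.bisect_left(self._tokens, token_for(partition_key)) % size
        # Remaining replicas: the next nodes clockwise on the ring.
        return [self._ring[(start + i) % size][1] for i in range(rf)]
```

For example, `TokenRing(["node-a", "node-b", "node-c", "node-d"]).replicas("user:42")` returns three distinct node names, and it returns the same three every time, which is why any node can locate a row from its partition key alone.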

Cassandra uses its own query language to access data across its nodes, called the Cassandra Query Language (CQL). Its syntax is similar to SQL, which is mainly used by relational databases. CQL statements can be run interactively in cqlsh, the command-line shell that ships with Cassandra. Client drivers are also available for building applications on Cassandra in many programming languages, including Java, C++, Python, and others.
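To give a feel for CQL, here is a small sketch of a few statements held as Python strings. The keyspace `demo`, the table `user_events`, and its columns are made up for illustration. The statements could be pasted into cqlsh one at a time (each terminated with a semicolon); the `run_statements` helper is a hypothetical wrapper that assumes the DataStax Python driver is installed (`pip install cassandra-driver`) and that a Cassandra node is reachable at the given address.

```python
# Illustrative CQL statements; the keyspace, table, and columns are made up.
CQL_STATEMENTS = [
    """
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """,
    # user_id is the partition key; event_time orders rows within a partition.
    """
    CREATE TABLE IF NOT EXISTS demo.user_events (
        user_id    text,
        event_time timestamp,
        action     text,
        PRIMARY KEY ((user_id), event_time)
    )
    """,
    """
    INSERT INTO demo.user_events (user_id, event_time, action)
    VALUES ('u42', toTimestamp(now()), 'login')
    """,
    "SELECT event_time, action FROM demo.user_events WHERE user_id = 'u42'",
]


def run_statements(contact_points=("127.0.0.1",)):
    """Execute the statements above against a running cluster (assumed reachable)."""
    # Imported lazily so this module still loads when the driver is not installed.
    from cassandra.cluster import Cluster

    cluster = Cluster(list(contact_points))
    session = cluster.connect()
    try:
        return [session.execute(statement) for statement in CQL_STATEMENTS]
    finally:
        cluster.shutdown()
```

Note how the `SELECT` filters on `user_id`: because that column is the partition key, the query can be routed straight to the nodes that own the partition.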

Apache Cassandra is a very popular, free, and open source database for managing unstructured data with no single point of failure. We will discuss Apache Cassandra in detail in the next chapter, which covers NoSQL and its related technologies and frameworks.