From charlesreid1

In Greek mythology, Cassandra was the daughter of King Priam and Queen Hecuba, of Illiad fame. Apollo granted her the power of prophecy in order to seduce her, and cursed her to never be believed when she rejected him.

It's also software by Apache designed to handle large amounts of data in a scalable way. Intended for cloud and cluster solutions/applications. Used by companies like Netflix.

About Cassandra

Cassandra is a scalable database tool. It is designed to achieve linear scaling for extremely large data sets.

NoSQL

Cassandra is a NoSQL database technology. It was originally intended to implement the Bigtable concept (published by Google in 2006). The Bigtable concept goes beyond relational databases, which are difficult to scale to extremely large data sets, and instead is built on the idea of a "sparse, distributed, persistent multidimensional map".

The fact that Cassandra is NoSQL means the database cannot be modified using SQL. However, Cassandra implements a "language" for querying the database called CQL; it is intended to lower the barrier for SQL developers to understand and interact with the Cassandra database. CQL can access/read the database, but cannot change or manipulate its contents.

CQL

The Cassandra Query Language (CQL) is a query language specific to Cassandra that looks a lot like SQL.

Link to CQL documentation: https://cassandra.apache.org/doc/latest/cql/index.html#cql

Difference between CQL and SQL

In spirit, CQL and SQL are designed to be similar. They both use similar syntax - "select things where condition". The entire point of adding CQL to Cassandra was to make it easier for developers familiar with SQL to be able to pick up CQL and deal with Cassandra databases more easily.

However, they are fundamentally different because CQL sits on top of Cassandra, a NoSQL "database" that is actually designed as a big key-value map, and SQL sits on top of a traditional relational database.

They are also different in practice - there are some operations that CQL does not support. CQL does not support the following:

  • No joins
  • No subqueries
  • No transactions
  • WHERE conditions can only be applied to columns where an index has been created on that column (unless the column is a primary key)

The following operators:

  • No OR operator
  • No NOT operators
  • The AND operator is the only operator available

CQL Examples

From the CQL example below, you can see that CQL and SQL are quite similar:

USE mydb;

/* Create Tables */
CREATE TABLE IF NOT EXISTS someTable (id INT PRIMARY KEY);

/* Alter Tables /*
ALTER TABLE someTable ADD someField INT;

/* Create Indexes */
CREATE INDEX someIndex ON someTable (myField);

/* Insert Data */
INSERT INTO someTable (id, someField) VALUES (1, 100);

/* Select Data */
SELECT * FROM someTable WHERE someField = 100;

/* Count Data */
SELECT COUNT(*) FROM someTable;

/* Delete Data */
DELETE FROM someTable WHERE someField = 100;

However, CQL uses Cassandra's concept of a keyspace - analogous to an SQL database - and Cassandra's concept of a column family - analogous to an SQL table.

Here's how we create a keyspace in Cassandra using CQL:

/* Cassandra - Create new keyspace with simple strategy, not replicating any data */
CREATE KEYSPACE mydb WITH replication = 
   {'class': 'SimpleStrategy', 'replication_factor': 1};

/* Cassandra - Create new keyspace with 3x replication in each data center */
CREATE KEYSPACE mydb WITH replication = 
   {'class': 'NetworkTopologyStrategy', 'data_center_1' : 3, 'data_center_2' : 3};

Note that the names of the data centers are configured when you create the snitch. To print out the data center names and rack numbers of each node, use the command:

nodetool status

(Link to details of keyspace creation: [1])

Compare this to the way we create a new database in SQL, which is simpler:

/* SQL - Create new database */
CREATE DATABASE mydb;

Scaling

Cassandra uses a similar architecture to Hadoop, in that it distributes data over multiple nodes, and uses replication to safeguard against data loss due to the failure of nodes in the cluster. A single row that is written to the database is the responsibility of a single node (row-to-node mapping is determined by your partitioning scheme, see info about "snitch" above). However, that piece of data is mirrored to replica nodes in the cluster. The user can configure the replication factor, but by default it is 3. If/when the primary node fails, the data that was stored on it can still be accessed via a replica node.

Snitches

Cassandra configures data center, rack, and node layouts using "snitches".

Link: http://docs.datastax.com/en/cassandra/2.1/cassandra/architecture/architectureSnitchesAbout_c.html

Basic Usage

Installing

Installation works like it does for most software: extract and build from source, or use a pre-built package from an aptitude repository.

Cassandra requires Java and Python.

Cassandra works a bit like Hadoop - you set a configuration file that sets options (such as, how much replication of data should Cassandra use, and where are the various data nodes going to live, etc.).

Once you start Cassandra, it uses something called the "snitch" to figure out how to efficiently route requests over all of the nodes that comprise the Cassandra "data universe". Like Hadoop, Cassandra divides nodes by data centers and racks.

Links:

Command Line Client

Cassandra can be used from the command line via the cqlsh tool, which provides a Cassandra shell from which CQL commands can be executed.

Like MySQL, you start up the command line program, which starts a client that tries to connect to a server instance. To connect to a local instance, specify localhost as the host:

$ bin/cqlsh localhost
Connected to Test Cluster at localhost:9042.
[cqlsh 5.0.1 | Cassandra 3.8 | CQL spec 3.4.2 | Native protocol v4]
Use HELP for help.
cqlsh>

You can specify various command line options (e.g., username/password, script file, startup file, etc.)

A simple hello world query in CQL:

cqlsh> SELECT cluster_name, listen_address FROM system.local;

 cluster_name | listen_address
--------------+----------------
 Test Cluster |      127.0.0.1

(1 rows)

Full cqlsh documentation: https://cassandra.apache.org/doc/latest/tools/cqlsh.html#cqlsh

Using Other Languages

You can interface with Cassandra from a number of different languages (Java, Python, Node, Scala, Go, Rust, etc.). Full list of client drivers: https://cassandra.apache.org/doc/latest/getting_started/drivers.html

DataStax (company that contributes to Cassandra development) provides several good drivers in various languages:

Using on Google Cloud

Google Cloud offers several pre-configured virtual machine instances that run Cassandra, including one created by Google, and one created by Bitnami. See Google Cloud/Cassandra

Resources

DataStax provides managed cloud solutions involving Cassandra (also involved in developing/contributing to Cassandra): https://www.datastax.com/

DataStax training on Cassandra: https://academy.datastax.com/

Nice blog writeup on results of Cassandra benchmarking: http://blog.markedup.com/2013/02/cassandra-hive-and-hadoop-how-we-picked-our-analytics-stack/

Medium essay on CQL: https://medium.com/@alexbmeng/cassandra-query-language-cql-vs-sql-7f6ed7706b4c