Cassandra with Java: A Minimal Guide

As briefly as possible, this page describes how to create and access a Cassandra database using Java. For anything else - Cassandra architecture, configuration, optimization, maintenance - you'll need to look elsewhere. The Cassandra Wiki is a good place to start.

1. The Logical Organization of Data
2. Installing Cassandra
3. Getting a Java Driver
4. Launching Cassandra (and optionally cqlsh)
5. Connecting to Cassandra
6. Creating a keyspace
7. Creating a table
8. Adding data
9. Retrieving data
10. Everything else

1. The Logical Organization of Data

For a highly distributed system, Cassandra has a relatively simple architecture (see, for example, Node of the Rings). However, when it comes to explaining how data is arranged with respect to programmatic access, the descriptions we've seen are a bit opaque. The problem arises because the logical organization of data and its physical storage are not independent of each other.

Logically, Cassandra stores data in rows of a table. Each table has a defined set of columns and each column has a name and a data type. Columns can be added to existing tables. So far, this is much like a standard SQL table, but retrieving rows is very different.

Each row is uniquely identified by a primary key, which is built from specified columns in the table. The primary key (and its components) is the main basis for retrieving rows. Retrieval is also possible based on the values of a secondary index and a secondary index may be created on any column. But most of the action is via the primary key.

The primary key consists of one or more columns. One of these columns is the partition key. Records with the same partition key value are stored in physical proximity, on the same node (we'll get to nodes shortly). The other columns in the primary key are clustering columns. Together, they define the order in which rows with the same partition key value are stored. These components of the primary key are used in specific ways for retrieving rows; their use is subject to some limitations.

The primary key uniquely identifies a row. Using the entire primary key (i.e. specifying values for all its components) is an efficient way to retrieve a single specific row.
Using the partition key is an efficient way to retrieve all the rows with the same partition key value. However, the partition key can be used to select by equality only; inequality tests on the partition key are not allowed.
The clustering columns can be used to efficiently retrieve a range of rows with the same partition key. Equality and inequality tests are supported, but there are limitations. When physically stored, rows are ordered by the combined clustering columns, i.e. the clustering column values are concatenated into a single sorting key. When retrieving rows using these keys, only tests that are consistent with this sorting key are allowed: a clustering column can be tested only if all the clustering columns before it are tested too. E.g. suppose the primary key columns were c0, c1, c2, c3, with c0 as the partition key. It would be legal to select

c0 = 20 and c1 = 21 and c2 = 22

or

c0 = 20 and c1 = 21 and c2 = 22 and c3 = 23

but

c0 = 20 and c1 = 21 and c3 = 23

would be illegal.

When retrieving rows using several of the primary key components, only the last clustering column used can be an inequality. I.e.

c0 = 20 and c1 > 10 // legal
c0 = 20 and c1 = 10 and c2 < 100 // legal
c0 = 20 and c1 < 10 and c2 < 100 // ILLEGAL
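The rules above can be sketched as a small checker. This is only a model of the restrictions as described on this page, not the logic Cassandra itself uses; the class name KeyRestrictionCheck, the Op enum and the where helper are made up for the illustration. It assumes the first key column is the partition key and ignores tests on columns that are not part of the primary key.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class KeyRestrictionCheck {

    /** Kinds of test that can appear on a key column in this sketch. */
    enum Op { EQ, RANGE }

    /**
     * Returns true if the tests follow the rules described above.
     * keyColumns lists the primary key columns in declaration order,
     * with the partition key first (an assumption of this sketch).
     */
    static boolean isLegal(List<String> keyColumns, Map<String, Op> tests) {
        // The partition key must be present and tested for equality.
        if (tests.get(keyColumns.get(0)) != Op.EQ) {
            return false;
        }
        boolean gapSeen = false;   // an earlier clustering column was skipped
        boolean rangeSeen = false; // an earlier clustering column used an inequality
        for (String column : keyColumns.subList(1, keyColumns.size())) {
            Op op = tests.get(column);
            if (op == null) {
                gapSeen = true;
                continue;
            }
            // A tested column may not follow a skipped column or an inequality.
            if (gapSeen || rangeSeen) {
                return false;
            }
            if (op == Op.RANGE) {
                rangeSeen = true;
            }
        }
        return true;
    }

    /** Builds a test map from alternating column names and operators. */
    static Map<String, Op> where(Object... pairs) {
        Map<String, Op> tests = new LinkedHashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            tests.put((String) pairs[i], (Op) pairs[i + 1]);
        }
        return tests;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("c0", "c1", "c2", "c3");
        // c0 = 20 and c1 = 21 and c2 = 22
        System.out.println(isLegal(keys, where("c0", Op.EQ, "c1", Op.EQ, "c2", Op.EQ)));
        // c0 = 20 and c1 = 21 and c3 = 23 (skips c2)
        System.out.println(isLegal(keys, where("c0", Op.EQ, "c1", Op.EQ, "c3", Op.EQ)));
        // c0 = 20 and c1 < 10 and c2 < 100 (two inequalities)
        System.out.println(isLegal(keys, where("c0", Op.EQ, "c1", Op.RANGE, "c2", Op.RANGE)));
    }
}
```

The first call corresponds to a legal selection and the other two to the illegal ones shown above.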

Using an indexed column is efficient if there are many rows with the same index column value. Limitation: in a select statement that involves an indexed column, there must be at least one equality test. Using inequality tests only is illegal. (I.e. indexes are accessed with something like a hashtable, not like a btree.)
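The hashtable/btree comparison can be made concrete with plain Java collections. This is an analogy only and says nothing about how Cassandra actually stores its indexes; the class name IndexLookupSketch and the row keys are invented for the example. A HashMap answers an equality test with one lookup, while an inequality test needs a sorted structure such as a TreeMap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IndexLookupSketch {

    /** Records that the row with the given key holds the given column value. */
    static void add(Map<Integer, List<String>> index, int value, String rowKey) {
        index.computeIfAbsent(value, v -> new ArrayList<>()).add(rowKey);
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> hashIndex = new HashMap<>();
        TreeMap<Integer, List<String>> treeIndex = new TreeMap<>();
        int[] values = {10, 20, 10, 30};
        String[] rowKeys = {"k1", "k2", "k3", "k4"};
        for (int i = 0; i < values.length; i++) {
            add(hashIndex, values[i], rowKeys[i]);
            add(treeIndex, values[i], rowKeys[i]);
        }
        // Equality: a single lookup in either structure.
        System.out.println("value = 10 : " + hashIndex.get(10));
        // Inequality: the sorted map answers it directly; the hash map
        // would have to examine every entry.
        System.out.println("value < 30 : " + treeIndex.headMap(30));
    }
}
```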

Rows are stored in nodes. A node is a physical repository for rows. It is possible to have one node per physical machine, but it is also possible to have multiple nodes on a single machine. Rows are assigned to nodes based on their partition key value. Rows with the same partition key value go on to the same node. Note that a table is typically distributed across multiple nodes and that a node usually contains records from more than one table.

Nodes are also part of the replication scheme. Based on a replication factor, a row can be stored on more than one node. E.g. if the replication factor is set to 3, each row will be stored on three nodes (i.e. there will be three copies of the row).
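To make the idea of a replication factor concrete, here is a much simplified model of placing a row on several nodes. It is not Cassandra's placement algorithm: the hash function is a stand-in for the real partitioner, the node names are invented, and the sketch simply takes the owning node plus the next ones in a fixed list. It also insists that the replication factor not exceed the node count, which is a restriction of the sketch only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReplicaPlacementSketch {

    /**
     * Picks the nodes that hold a row: the node that owns the partition key,
     * followed by the next nodes in the list until replicationFactor copies
     * have been placed.
     */
    static List<String> replicasFor(String partitionKey, List<String> nodes, int replicationFactor) {
        if (replicationFactor < 1 || replicationFactor > nodes.size()) {
            throw new IllegalArgumentException(
                    "this sketch needs a replication factor between 1 and the node count");
        }
        // Stand-in for the partitioner: any stable hash of the partition key will do here.
        int first = Math.floorMod(partitionKey.hashCode(), nodes.size());
        List<String> replicas = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            replicas.add(nodes.get((first + i) % nodes.size()));
        }
        return replicas;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node0", "node1", "node2", "node3", "node4");
        for (String key : Arrays.asList("str_0", "str_1", "str_2")) {
            // Every row with this partition key lands on the same three nodes.
            System.out.println(key + " -> " + replicasFor(key, nodes, 3));
        }
    }
}
```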

Nodes exist in data centers, which may be specific physical systems ("virtual data centers" are also possible). Nodes are, ideally, assigned to data centers based on workload considerations.

A collection of data centers that contain related nodes constitutes a cluster, which is the outermost container in a Cassandra system. Conceptually, a cluster is similar to a SQL database - it is intended to contain a set of related tables. For each cluster, there is typically one keyspace, which is the name space. Replication is specified per keyspace. Note that while there is usually one keyspace per cluster, there can be multiple keyspaces in a cluster. (Some documentation describes the keyspace as the outermost container, rather than the cluster. However, since a cluster can contain multiple keyspaces, that characterization is suspect.)

The hierarchy of data structures is (top down):

cluster

keyspace

node

table (defined as a set of columns)

row

2. Installing Cassandra

Cassandra is written in Java, so you'll need to have a JVM installed. The current version of Cassandra, 3.8.*, requires Java 8. The Apache Cassandra Wiki recommends using the Oracle/Sun JVM. To check your version of Java (and to make sure it's accessible), run java -version in a shell.

Cassandra packages are available for many systems (including most Linux distributions). However, I've found it simplest to use the Apache tar file. Download apache-cassandra-3.7-bin.tar.gz from the Apache Cassandra site and unpack it wherever you want it (with tar -xzvf apache-cassandra-3.7-bin.tar.gz). The file unpacks into a directory named after its version; the examples on this page use apache-cassandra-3.5. Executables are in apache-cassandra-3.5/bin. When you create tables, they will be stored in apache-cassandra-3.5/data.

For a simple test system, no configuration is required. The default installation creates a single node system, which is sufficient for our needs here.

3. Getting a Java Driver

Although Cassandra is written in Java, it does not have a native Java API and Apache Cassandra does not include a JDBC driver or any other means to access Cassandra from a Java program. As of this writing, the primary mechanism for accessing a Cassandra database is the query language CQL. (CQL closely resembles SQL - a rather odd choice, since Cassandra is definitely a NoSQL database.)

At present, DataStax's Java Driver 3.0 for Apache Cassandra appears to be the most commonly used means for accessing Cassandra from Java. It provides a mechanism for running CQL commands from a Java program (but it does not support JDBC). There are a few Cassandra JDBC drivers available but it is not clear if they are being actively supported or widely used. We'll stick to DataStax's Java Driver.

The Java Driver can be downloaded from http://downloads.datastax.com/java-driver/cassandra-java-driver-3.0.0.tar.gz

After unpacking, the jar files will be in cassandra-java-driver-3.0.0 and in cassandra-java-driver-3.0.0/lib. You'll need to put all of them in your classpath.

4. Launching Cassandra (and optionally cqlsh)

In a shell, cd to apache-cassandra-3.5/bin and run

./cassandra -f

The '-f' keeps the instance of Cassandra in the foreground, so it can be killed off with Ctrl-C, which is handy for testing (but of course not the way you'd run a production system).

During testing, it may be useful to run cqlsh, which is a command line utility for interactively executing CQL commands. In particular, cqlsh has a describe command that can list keyspaces, tables, table structures, etc. describe is not part of CQL itself and is not available in the Java driver (although the same information can be obtained with select commands).

cqlsh is included in the Apache tar file and can be found in apache-cassandra-3.5/bin. But note that it is written in Python and requires Python 2 - it will not work with Python 3. You can launch it with

./cqlsh

if you have Python 2 as the default python. If not, you'll get a rather obscure error message. The simplest solution is to explicitly execute the python file using the correct version of Python. E.g. I have both Python 2 and 3 on my system, so I use

python2 cqlsh.py

5. Connecting to Cassandra

Operations are performed on a specific cluster. To connect, we use the static method Cluster.builder() to build a Cluster object from the IP address of a node in the cluster (a contact point). To access a cluster on the local system, we use the standard local address "127.0.0.1". (Cluster, Session, ResultSet and Row are all in the package com.datastax.driver.core.)

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();

Commands are issued using a Session object that we get from the cluster:

Session session = cluster.connect();

To test the connection, let's retrieve a list of all the keyspaces in the cluster. A Cassandra cluster always has some system keyspaces used for management. The system keyspace system_schema has a table named keyspaces that lists all keyspaces in the cluster. To get a list of keyspaces:

ResultSet r = session.execute("select keyspace_name from system_schema.keyspaces");

The DataStax driver uses a ResultSet class that is similar to the result set class in JDBC. To see the results:

// Iterator is java.util.Iterator
int i = 0;
for ( Iterator<Row> it = r.iterator(); it.hasNext(); ) {
    Row row = it.next();
    String s = row.getString("keyspace_name");
    System.out.println((i++) + " " + s);
}

A new cluster will have five system keyspaces.

6. Creating a keyspace

To create a new keyspace:

String keySpaceName = "a_test_keyspace";
session.execute("CREATE KEYSPACE IF NOT EXISTS " + keySpaceName +
                " WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }");

Note that we've set the replication factor to 1, so each row is stored on exactly one node.
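Because the keyspace name is concatenated into the statement, it can be worth checking it first. Here is a small sketch of a helper that builds the same CREATE KEYSPACE string; the class and method names are invented for the example, and restricting names to simple unquoted identifiers is a choice made here, not a requirement of the driver.

```java
class KeyspaceCql {

    /** Builds a CREATE KEYSPACE statement that uses SimpleStrategy replication. */
    static String createKeyspace(String name, int replicationFactor) {
        // Accept only plain unquoted identifiers: a letter followed by
        // letters, digits or underscores.
        if (!name.matches("[a-zA-Z][a-zA-Z0-9_]*")) {
            throw new IllegalArgumentException("not a simple identifier: " + name);
        }
        if (replicationFactor < 1) {
            throw new IllegalArgumentException("replication factor must be at least 1");
        }
        return "CREATE KEYSPACE IF NOT EXISTS " + name
                + " WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : "
                + replicationFactor + " }";
    }

    public static void main(String[] args) {
        // With a live session, this string would be passed to session.execute(...).
        System.out.println(createKeyspace("a_test_keyspace", 1));
    }
}
```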

To avoid having to use fully qualified table names, we usually specify the keyspace we'll be using:

session.execute("USE " + keySpaceName);

7. Creating a table

String tableName = "table0";
session.execute("CREATE TABLE IF NOT EXISTS " +
                        tableName + " ( " +
                        " s0 text, " +
                        " i0 int, " +
                        " d0 double, " +
                        " b0 boolean, " +
                        " PRIMARY KEY ( s0, i0, d0 ) )" );

The first column in the primary key list (s0) is the partition key; the remaining columns (i0, d0) are clustering columns. There are Cassandra data types for all Java primitives and some other objects including lists. See CQL data types to Java types for a full list of the mappings.

Note that serialized Java objects can be stored in tables as hex strings. After serializing an object to a byte[], you can use Bytes.toHexString to create a string that can be stored in a Cassandra text column.
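As a rough sketch of that round trip, the following serializes an object to bytes, turns the bytes into a hex string, and reverses the process. To keep it runnable without the driver it uses a hand-written hex encoder as a stand-in for Bytes.toHexString, so the exact string format (for example any prefix) may differ from what the driver produces. The class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;

class HexSerializationSketch {

    /** Serializes an object and encodes the bytes as a lowercase hex string. */
    static String toHex(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes.toByteArray()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Decodes a hex string produced by toHex and deserializes the object. */
    static Object fromHex(String hex) {
        byte[] data = new byte[hex.length() / 2];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ArrayList<String> original = new ArrayList<>(Arrays.asList("a", "b"));
        // This string is what would be bound to a text column.
        String hex = toHex(original);
        System.out.println(hex.length() + " hex characters");
        System.out.println(fromHex(hex));
    }
}
```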

8. Adding data

Rows are added to a table using either the INSERT or UPDATE command. Note that both will overwrite an existing row if it has the specified primary key, and both will create the row if it does not exist. In either case, the columns named in the statement must already be defined in the table.
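The overwrite behaviour can be pictured with an ordinary Java map keyed by the primary key. This is only a model of the idea, using the columns of the table created above; it writes whole rows at once, which is a simplification, and the class name UpsertSketch is invented for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class UpsertSketch {

    // Rows keyed by their primary key values (s0, i0, d0); the value is the b0 column.
    private final Map<List<Object>, Boolean> rows = new HashMap<>();

    /** Writes a row; a row with the same primary key is silently replaced. */
    void write(String s0, int i0, double d0, boolean b0) {
        rows.put(Arrays.<Object>asList(s0, i0, d0), b0);
    }

    /** Number of distinct rows stored so far. */
    int rowCount() {
        return rows.size();
    }

    /** Returns b0 for the given primary key, or null if there is no such row. */
    Boolean b0(String s0, int i0, double d0) {
        return rows.get(Arrays.<Object>asList(s0, i0, d0));
    }

    public static void main(String[] args) {
        UpsertSketch table = new UpsertSketch();
        table.write("str_0", 0, 0.0, true);
        table.write("str_0", 0, 0.0, false); // same primary key: overwrites
        table.write("str_0", 1, 1.0, true);  // different primary key: new row
        System.out.println(table.rowCount() + " rows, b0 of first = " + table.b0("str_0", 0, 0.0));
    }
}
```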

Like JDBC, the Cassandra Java Driver supports prepared statements and typed 'set' methods. So while we can add data using session.execute("...."), we more often do something like this:

PreparedStatement ps = session.prepare(
        "INSERT INTO " + tableName +
        "( s0, i0, d0, b0 ) values ( ?, ?, ?, ? ) ");
for ( int i = 0; i < 3; i++ ) {
    for ( int j = 0; j < 5; j++ ) {
        BoundStatement bs = ps.bind();
        bs.setString("s0", "str_" + Integer.toString(i));
        bs.setInt("i0", j);
        bs.setDouble("d0", (double) (i+j));
        bs.setBool("b0", true);
        session.execute(bs);
    }
}

9. Retrieving data

CQL has a SELECT command that also looks a lot like SQL, although selection criteria are constrained as described above. Rows can be retrieved using session.execute("SELECT ...") commands. However, in many situations, it's more convenient to use the Select and QueryBuilder classes (both in the package com.datastax.driver.core.querybuilder), e.g.

Select s = QueryBuilder.select().all().from(tableName);
s.where(QueryBuilder.eq("s0","str_1"));
s.where(QueryBuilder.eq("i0",0));
s.where(QueryBuilder.lt("d0",4.0));
ResultSet rs = session.execute(s);

And to see the results:

int i = 0;
for ( Iterator<Row> it = rs.iterator(); it.hasNext(); ) {
    Row row = it.next();
    System.out.println((i++) + ": s0=" + row.getString("s0") +
                       " i0=" + row.getInt("i0") +
                       " d0=" + row.getDouble("d0") +
                       " b0=" + row.getBool("b0") );
}
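To see what to expect from this query without a running cluster, the same selection can be replayed in plain Java. The sketch below regenerates the 15 rows written in section 8 and applies the three tests from the query (s0 = 'str_1', i0 = 0, d0 < 4.0). It assumes the table contains only those rows; the class SelectSketch and its TestRow type are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class SelectSketch {

    /** One row of the example table. */
    static class TestRow {
        final String s0;
        final int i0;
        final double d0;
        final boolean b0;

        TestRow(String s0, int i0, double d0, boolean b0) {
            this.s0 = s0;
            this.i0 = i0;
            this.d0 = d0;
            this.b0 = b0;
        }
    }

    /** Generates the same rows as the nested insert loop in section 8. */
    static List<TestRow> insertedRows() {
        List<TestRow> rows = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 5; j++) {
                rows.add(new TestRow("str_" + i, j, (double) (i + j), true));
            }
        }
        return rows;
    }

    /** Applies the same three tests as the query: s0 = 'str_1', i0 = 0, d0 < 4.0. */
    static List<TestRow> select(List<TestRow> rows) {
        List<TestRow> matches = new ArrayList<>();
        for (TestRow r : rows) {
            if (r.s0.equals("str_1") && r.i0 == 0 && r.d0 < 4.0) {
                matches.add(r);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int i = 0;
        for (TestRow r : select(insertedRows())) {
            System.out.println((i++) + ": s0=" + r.s0 + " i0=" + r.i0
                    + " d0=" + r.d0 + " b0=" + r.b0);
        }
    }
}
```

Under those assumptions only one row matches, the one with s0=str_1, i0=0 and d0=1.0.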

10. Everything else

For the full set of possible operations, see Cassandra Query Language (CQL) v3.3.1 and of course the DataStax Java Driver API.