Cassandra with Java: A Minimal Guide
As briefly as possible, this page describes how to create and access a
Cassandra database using Java. For anything else - Cassandra
architecture, configuration, optimization, maintenance - you'll need to
look elsewhere. The Cassandra
Wiki is a good place to start.
1. The Logical Organization of Data
2. Installing Cassandra
3. Getting a Java Driver
4. Launching Cassandra (and optionally cqlsh)
5. Connecting to Cassandra
6. Creating a keyspace
7. Creating a table
8. Adding data
9. Retrieving data
10. Everything else
1. The Logical Organization of Data
For a highly distributed system, Cassandra has a relatively simple
architecture (see, for example, Node
of the Rings). However, when it comes to explaining how data
is arranged with respect to programmatic access, the descriptions we've
seen are a bit opaque. The problem arises because the logical
organization of data and its physical storage are not
independent of each other.
Logically, Cassandra stores data in rows of a table.
Each table has a defined set of columns and each column has a name and a
data type. Columns can be added to existing tables.
Except for the ability to add columns, this is no different from a
standard SQL table, but retrieving rows is very different.
Each row is uniquely identified by a primary key, which
is built from specified columns in the table. The primary key (and
its components) is the main basis for retrieving rows. Retrieval
is also possible based on the values of a secondary index and
a secondary index may be created on any column. But most of the
action is via the primary key.
The primary key consists of one or more columns. One of these
columns is the partition key. Records
with the same partition key value are stored in physical proximity, on
the same node (we'll get to nodes shortly). The
other columns in the primary key are clustering columns.
Together, they define the order in which rows with the same
partition key value are stored. These components of the primary
key are used in specific ways for retrieving rows; their use is subject
to some limitations.
- The primary key uniquely identifies a row. Using the entire primary key (i.e. specifying values for all its components) is an efficient way to retrieve a single specific row.
- Using the partition key is an efficient way to retrieve all the rows with the same partition key value. However, the partition key can be used to select by equality only; inequality tests on the partition key are not allowed.
- The clustering columns can be used to efficiently retrieve a range of rows with the same partition key. Equality and inequality tests are supported, but there are limitations. When physically stored, rows are ordered by the combined clustering columns, i.e. the clustering column values are concatenated into a single sorting key. When retrieving rows using these keys, only tests that are consistent with this sorting key are allowed. E.g. suppose the primary key columns were c0, c1, c2, c3, with c0 as the partition key. It would be legal to select
c0 = 20 and c1 = 21 and c2 = 22
or
c0 = 20 and c1 = 21 and c2 = 22 and c3 = 23
but
c0 = 20 and c1 = 21 and c3 = 23
would be illegal, because it skips c2.
When retrieving rows using several of the primary key components, only the last clustering column used can be an inequality. I.e.
c0 = 20 and c1 > 10 // legal
c0 = 20 and c1 = 10 and c2 < 100 // legal
c0 = 20 and c1 < 10 and c2 < 100 // ILLEGAL
- Using an indexed column is efficient if there are many rows with the same index column value. Limitation: in a select statement that involves an indexed column, there must be at least one equality test. Using inequality tests only is illegal. (I.e. indexes are accessed with something like a hashtable, not like a btree.)
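The primary key rules above can be condensed into a few lines of plain Java. This is only a sketch of the rules as stated on this page, not of Cassandra's actual query validation. The class KeyRestrictionCheck and the Op enum are made up for illustration, and the sketch assumes the query always restricts the partition key.

```java
import java.util.List;

/** Checks restrictions on primary key columns against the rules described above. */
class KeyRestrictionCheck {

    enum Op { EQ, RANGE, NONE }   // NONE: the column is not mentioned in the query

    /**
     * ops.get(0) is the restriction on the partition key; the remaining
     * entries are the clustering columns in primary key order.
     */
    static boolean isLegal(List<Op> ops) {
        if (ops.isEmpty() || ops.get(0) != Op.EQ) {
            return false;               // partition key: equality only
        }
        boolean closed = false;         // true once a column is skipped or range-tested
        for (Op op : ops.subList(1, ops.size())) {
            if (op == Op.NONE) {
                closed = true;          // nothing after a gap may be restricted
            } else if (closed) {
                return false;           // restriction after a gap or after an inequality
            } else if (op == Op.RANGE) {
                closed = true;          // an inequality must be the last restriction
            }
        }
        return true;
    }
}
```

With columns c0, c1, c2, c3, the illegal example "c0 = 20 and c1 = 21 and c3 = 23" corresponds to isLegal(Arrays.asList(Op.EQ, Op.EQ, Op.NONE, Op.EQ)), which returns false, while "c0 = 20 and c1 = 10 and c2 < 100" corresponds to (EQ, EQ, RANGE, NONE) and returns true.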
Rows are stored in nodes. A node is a physical repository for rows. It is possible to have one node per physical machine, but it is also possible to have multiple nodes on a single machine. Rows are assigned to nodes based on their partition key value. Rows with the same partition key value go on to the same node. Note that a table is typically distributed across multiple nodes and that a node usually contains records from more than one table.
Nodes are also part of the replication scheme. Based on a replication factor, a row can be stored on more than one node. E.g. if the replication factor is set to 3, each row will be stored on three nodes (i.e. there will be three copies of the row).
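To make the placement of rows and their copies concrete, here is a deliberately simplified sketch in plain Java. It hashes the partition key to pick a first node and then walks around the ring of nodes to place the remaining copies. The class and method names are hypothetical, String.hashCode() stands in for Cassandra's real partitioner, and the sketch ignores tokens, virtual nodes and data centers.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified replica placement: hash the partition key, then walk the ring. */
class ReplicaPlacement {

    /**
     * Returns the indexes (0 .. nodeCount-1) of the nodes that hold a copy of
     * any row with the given partition key.
     * Assumes replicationFactor <= nodeCount.
     */
    static List<Integer> replicasFor(String partitionKey, int nodeCount, int replicationFactor) {
        // Stand-in for the partitioner: map the key to a starting node.
        int first = Math.floorMod(partitionKey.hashCode(), nodeCount);
        List<Integer> replicas = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            replicas.add((first + i) % nodeCount);   // next nodes around the ring
        }
        return replicas;
    }
}
```

Because the result depends only on the partition key, all rows with the same partition key value land on the same set of nodes, and a replication factor of 3 yields three distinct nodes.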
Nodes exist in data centers, which may be specific physical systems ("virtual data centers" are also possible). Nodes are, ideally, assigned to data centers based on workload considerations.
A collection of data centers that contain related nodes constitutes a cluster, which is the outermost container in a Cassandra system. Conceptually, a cluster is similar to a SQL database - it is intended to contain a set of related tables. For each cluster, there is typically one keyspace, which serves as the namespace. Replication is specified per keyspace. Note that while there is usually one keyspace per cluster, there can be multiple keyspaces in a cluster. (Some documentation describes the keyspace as the outermost container, rather than the cluster. However, since a cluster can contain multiple keyspaces, that characterization is suspect.)
The hierarchy of data structures is (top down):
- cluster
- keyspace
- node
- table (defined as a set of columns)
- row
2. Installing Cassandra
Cassandra is written in Java, so you'll need to have a JVM
installed. The current version of Cassandra, 3.8.*, requires
Java 8. The Apache Cassandra
Wiki recommends using the Oracle/Sun JVM. To check your
version of Java (and to make sure it's accessible), run java
-version in a shell.
Cassandra packages are available for many systems (including most Linux
distributions). However, I've found it simplest to use the Apache
tar file. Download the binary tar file (e.g. apache-cassandra-3.7-bin.tar.gz)
from the Apache Cassandra site and unpack it wherever you want it (with
tar -xzvf followed by the file name). The file unpacks into a directory
named for its version; the examples on this page use
apache-cassandra-3.5. Executables are in apache-cassandra-3.5/bin.
When you create tables, they will be stored in apache-cassandra-3.5/data.
For a simple test system, no configuration is required. The
default installation creates a single node system, which is sufficient
for our needs here.
3. Getting a Java Driver
Although Cassandra is written in Java, it does not have a native Java API, and Apache Cassandra does not include a JDBC driver or any other means to access Cassandra from a Java program. As of this writing, the primary mechanism for accessing a Cassandra database is the query language CQL. (CQL closely resembles SQL - a rather odd choice, since Cassandra is definitely a NoSQL database.)
At present, DataStax's Java
Driver 3.0 for Apache Cassandra appears to be the most commonly
used means for accessing Cassandra from Java. It provides a
mechanism for running CQL commands from a Java program (but it does not
support JDBC). There are a few Cassandra JDBC drivers
available, but it is not clear if they are being actively supported or
widely used. We'll stick to DataStax's Java Driver.
The Java Driver can be downloaded from http://downloads.datastax.com/java-driver/cassandra-java-driver-3.0.0.tar.gz
After unpacking, the jar files will be in cassandra-java-driver-3.0.0
and in cassandra-java-driver-3.0.0/lib. You'll need to
put all of them in your classpath.
4. Launching Cassandra (and optionally cqlsh)
In a shell, cd to apache-cassandra-3.5/bin and run
./cassandra -f
The '-f' keeps the instance of Cassandra in the foreground, so it can
be killed off with Ctrl-C, which is handy for testing (but of course
not the way you'd run a production system).
During testing, it may be useful to run cqlsh,
which is a command line utility for interactively executing CQL
commands. In particular, cqlsh has a describe command
that can list keyspaces, tables, table structures, etc. describe
is not part of CQL itself and is not available in the Java driver
(although the same information can be obtained with select commands).
cqlsh is included in the Apache tar file and can be found in apache-cassandra-3.5/bin.
But note that it is written in Python and requires Python 2
- it will not work with Python 3. You can launch it
with
./cqlsh
if you have Python 2 as the default Python. If not,
you'll get a rather obscure error message. The simplest solution
is to explicitly execute the Python file using the correct version of
Python. E.g. I have both Python 2 and 3 on my system, so I use
python2 cqlsh.py
5. Connecting to Cassandra
Operations are performed on a specific cluster. To connect, we use
the static method Cluster.builder() to get a builder, give it the IP
address of a node in the cluster as a contact point, and call build()
to get a Cluster object. To access a cluster on the
local system, we use the standard local address "127.0.0.1":
Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Commands are issued using a Session object that we get from the cluster:
Session session = cluster.connect();
To test the connection, let's retrieve a list of all the keyspaces in the cluster. A Cassandra cluster always has some system keyspaces used for management. The system keyspace system_schema has a table named keyspaces that lists all keyspaces in the cluster. To get a list of keyspaces:
ResultSet r = session.execute("select keyspace_name from system_schema.keyspaces");
The DataStax driver uses a ResultSet class that is similar to the result set class in JDBC. To see the results:
int i = 0;
for ( Iterator<Row> it = r.iterator(); it.hasNext(); ) {
    Row row = it.next();
    String s = row.getString("keyspace_name");
    System.out.println((i++) + " " + s);
}
A new cluster will have five system keyspaces.
6. Creating a keyspace
To create a new keyspace:
String keySpaceName = "a_test_keyspace";
session.execute("CREATE KEYSPACE IF NOT EXISTS " +
keySpaceName +
" WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }");
Note that we've set the replication factor to 1.
To avoid having to use fully qualified table names, we usually specify
the keyspace we'll be using:
session.execute("USE " + keySpaceName);
7. Creating a table
The first column in the primary key list (s0) is the partition key; the remaining columns in the list (i0, d0) are clustering columns. There are Cassandra data types for the common Java primitives and some other objects including lists. See CQL data types to Java types for a full list of the mappings.
String tableName = "table0";
session.execute("CREATE TABLE IF NOT EXISTS " +
tableName + " ( " +
" s0 text, " +
" i0 int, " +
" d0 double, " +
" b0 boolean, " +
" PRIMARY KEY ( s0, i0, d0 ) )" );
Note that serialized Java objects can be stored in tables as hex strings. After serializing an object to a byte[], you can use Bytes.toHexString to create a string that can be stored in a Cassandra text column.
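As a rough illustration of the round trip, the sketch below serializes an object to a byte[] and converts it to and from a hex string using only the standard library. The class HexSerialization and its methods are hypothetical stand-ins: the hex conversion here is written by hand in place of the driver's Bytes.toHexString, so the exact string format may differ from what the driver produces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

/** Serializes objects to hex strings and back (standard library only). */
class HexSerialization {

    /** Serializes obj and returns its bytes as lowercase hex digits. */
    static String toHex(Serializable obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : buffer.toByteArray()) {
            hex.append(String.format("%02x", b & 0xff));   // two hex digits per byte
        }
        return hex.toString();
    }

    /** Reverses toHex: decodes the hex digits and deserializes the object. */
    static Object fromHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The string returned by toHex could then be bound to a text column with setString, and the value read back with getString passed to fromHex.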
8. Adding data
Rows are added to a table using either the INSERT or UPDATE
command. Note that both will overwrite an existing row if it
has the specified primary key. In CQL, both commands are limited to
columns that already exist in the table definition; new columns are
added to a table with ALTER TABLE.
Like JDBC, the Cassandra Java Driver supports prepared statements
and typed 'set' methods. So while we can add data using session.execute("...."),
we more often do something like this:
PreparedStatement ps =
    session.prepare(
        "INSERT INTO " + tableName +
        "( s0, i0, d0, b0 ) values ( ?, ?, ?, ? ) ");
for ( int i = 0; i < 3; i++ ) {
    for ( int j = 0; j < 5; j++ ) {
        BoundStatement bs = ps.bind();
        bs.setString("s0", "str_" + Integer.toString(i));
        bs.setInt("i0", j);
        bs.setDouble("d0", (double) (i+j));
        bs.setBool("b0", true);
        session.execute(bs);
    }
}
9. Retrieving data
CQL has a SELECT command that also looks a lot like SQL, although
selection criteria are constrained as described above. Rows can be
retrieved using session.execute("SELECT ...") commands.
However, in many situations, it's more convenient to use the Select
and QueryBuilder
classes, e.g.
Select s = QueryBuilder.select().all().from(tableName);
s.where(QueryBuilder.eq("s0","str_1"));
s.where(QueryBuilder.eq("i0",0));
s.where(QueryBuilder.lt("d0",4.0));
ResultSet rs = session.execute(s);
And to see the results:
int i = 0;
for ( Iterator<Row> it = rs.iterator(); it.hasNext(); ) {
Row row = it.next();
System.out.println((i++) + ": s0=" + row.getString("s0") +
" i0=" + row.getInt("i0") + " d0=" + row.getDouble("d0") +
" b0=" + row.getBool("b0") );
}
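To check what this query should return, the insert loop from section 8 and the three conditions of the Select can be replayed in plain Java, without touching Cassandra. This is only a simulation under the assumption that the table contains exactly the 15 rows inserted above; the class ExpectedRows is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Replays the insert loop and the query conditions in plain Java (no Cassandra involved). */
class ExpectedRows {

    static List<String> matchingRows() {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 5; j++) {
                // Same column values as the prepared INSERT in section 8
                String s0 = "str_" + i;
                int i0 = j;
                double d0 = (double) (i + j);
                // Same conditions as the Select: s0 = 'str_1', i0 = 0, d0 < 4.0
                if (s0.equals("str_1") && i0 == 0 && d0 < 4.0) {
                    result.add("s0=" + s0 + " i0=" + i0 + " d0=" + d0 + " b0=" + true);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        matchingRows().forEach(System.out::println);
    }
}
```

Only one of the inserted rows satisfies all three conditions (s0=str_1, i0=0, d0=1.0), so the loop over the ResultSet should print a single line.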
10. Everything else
For the full set of possible operations, see Cassandra
Query Language (CQL) v3.3.1 and of course the DataStax
Java Driver API.
© 2016 Barnet Wagman