Recently, I got a rush request to get a three-node Apache Cassandra cluster with a replication factor of two working for a development job. I had little idea what that meant but needed to figure it out quickly—a typical day in a sysadmin’s job.
Here’s how to set up a basic three-node Cassandra cluster from scratch with some extra bits for replication and future node expansion.
Basic nodes needed
To start, you need some basic Linux machines. For a production install, you would likely put physical machines into racks, data centers, and diverse locations. For development, you just need something suitably sized for the scale of your development. I used three CentOS 7 virtual machines on VMware that have 20GB thin provisioned disks, two processors, and 4GB of RAM. These three machines are called: CS1 (192.168.0.110), CS2 (192.168.0.120), and CS3 (192.168.0.130).
First, do a minimal install of CentOS 7 as an operating system on each machine. To run this in production with CentOS, consider tweaking your firewalld and SELinux. Since this cluster would be used just for initial development, I turned them off.
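If you take the same shortcut on a throwaway development cluster, these standard CentOS 7 commands (run as root on each node) do it; setenforce takes effect immediately, the sed edit makes the SELinux change survive a reboot, and both settings deserve proper configuration before anything production-like:
$ systemctl stop firewalld
$ systemctl disable firewalld
$ setenforce 0
$ sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config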
The only other requirement is an OpenJDK 1.8 installation, which is available from the CentOS repository.
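On CentOS 7, that is a one-line install from the base repository:
$ yum install -y java-1.8.0-openjdk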
Installation
Create a cass user account on each machine. To ensure no variation between nodes, force the same UID on each install:
$ useradd --create-home \
--uid 1099 cass
$ passwd cass
Download the current version of Apache Cassandra (3.11.4 as I’m writing this) as the cass user.
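If the node has internet access, pulling the tarball straight from the Apache archive works; the URL below is an assumption based on the archive's usual layout, so substitute your preferred mirror or version as needed:
$ wget https://archive.apache.org/dist/cassandra/3.11.4/apache-cassandra-3.11.4-bin.tar.gz
Then extract the archive in the cass home directory: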
$ tar xzvf apache-cassandra-3.11.4-bin.tar.gz
The complete software is contained in ~cass/apache-cassandra-3.11.4. For a quick development trial, this is fine. The data files are there, and the conf/ directory has the important bits needed to tune these nodes into a real cluster.
Configuration
Out of the box, Cassandra runs as a localhost one-node cluster. That is convenient for a quick look, but the goal here is a real cluster that external clients can access and that can take on additional nodes as development and testing expand. The two configuration files to look at are conf/cassandra.yaml and conf/cassandra-rackdc.properties.
First, edit conf/cassandra.yaml to set the cluster name, network, and remote procedure call (RPC) interfaces; define peers; and change the strategy for routing requests and replication.
Edit conf/cassandra.yaml on each of the cluster nodes.
Change the cluster name to be the same on each node:
cluster_name: 'DevClust'
Change the following two entries to match the primary IP address of the node you are working on:
listen_address: 192.168.0.110
rpc_address: 192.168.0.110
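These two values are the only per-node differences in the file, so you can script them instead of editing by hand. Assuming the stock cassandra.yaml, where both entries default to localhost, a pair of sed substitutions run from the top of the Cassandra directory does it (shown here for CS1; use each node's own IP on CS2 and CS3):
$ sed -i 's/^listen_address:.*/listen_address: 192.168.0.110/' conf/cassandra.yaml
$ sed -i 's/^rpc_address:.*/rpc_address: 192.168.0.110/' conf/cassandra.yaml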
Find the seed_provider entry and look for the - seeds: configuration line. Edit it on each node to include all of your nodes:
- seeds: "192.168.0.110, 192.168.0.120, 192.168.0.130"
This enables the local Cassandra instance to see all its peers (including itself).
Look for the endpoint_snitch setting and change it to:
endpoint_snitch: GossipingPropertyFileSnitch
The endpoint_snitch setting gives you flexibility later on if new nodes need to be joined. The Cassandra documentation indicates that GossipingPropertyFileSnitch is the preferred setting for production use; it is also required by the replication strategy used later in this article.
Save and close the cassandra.yaml file.
Open the conf/cassandra-rackdc.properties file and change the default values for dc= and rack=. They can be anything that is unique and does not conflict with other local installs. For production, you would put more thought into how to organize your racks and data centers. For this example, I used generic names like:
dc=NJDC
rack=rack001
Start the cluster
On each node, log into the account where Cassandra is installed (cass in this example), enter cd apache-cassandra-3.11.4/bin, and run ./cassandra. A long list of messages will print to the terminal, and the Java process will run in the background.
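Condensed into commands, starting a node looks like this (repeat on each of the three machines):
$ su - cass
$ cd apache-cassandra-3.11.4/bin
$ ./cassandra
If you would rather keep the process attached to the terminal and watch the logs scroll by, ./cassandra -f runs it in the foreground instead.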
Confirm the cluster
While logged into the Cassandra user account, go to the bin directory and run ./nodetool status. If everything went well, you should see something like:
$ ./nodetool status
INFO [main] 2019-08-04 15:14:18,361 Gossiper.java:1715 - No gossip backlog; proceeding
Datacenter: NJDC
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.0.110  195.26 KiB  256     69.2%             0abc7ad5-6409-4fe3-a4e5-c0a31bd73349  rack001
UN  192.168.0.120  195.18 KiB  256     63.0%             b7ae87e5-1eab-4eb9-bcf7-4d07e4d5bd71  rack001
UN  192.168.0.130  117.96 KiB  256     67.8%             b36bb943-8ba1-4f2e-a5f9-de1a54f8d703  rack001
This means the cluster sees all three nodes (UN stands for up and in normal state) and reports some useful details about each.
Note that if cassandra.yaml uses the default endpoint_snitch: SimpleSnitch, the nodetool command above reports the default location names Datacenter: datacenter1 and rack1. In the example output above, the cassandra-rackdc.properties values are evident.
Run some CQL
This is where the replication factor setting comes in.
Create a keyspace with a replication factor of two. From any one of the cluster nodes, go to the bin directory and run ./cqlsh 192.168.0.130 (substitute the appropriate cluster node IP address). You can see the default administrative keyspaces with the following:
cqlsh> SELECT * FROM system_schema.keyspaces;

 keyspace_name      | durable_writes | replication
--------------------+----------------+-------------------------------------------------------------------------------------
        system_auth |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
      system_schema |           True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
 system_distributed |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
             system |           True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
      system_traces |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}
Create a new keyspace with a replication factor of two, insert some rows, then read the data back:
cqlsh> CREATE KEYSPACE TestSpace WITH replication = {'class': 'NetworkTopologyStrategy', 'NJDC' : 2};
cqlsh> select * from system_schema.keyspaces where keyspace_name='testspace';

 keyspace_name | durable_writes | replication
---------------+----------------+--------------------------------------------------------------------------------
     testspace |           True | {'NJDC': '2', 'class': 'org.apache.cassandra.locator.NetworkTopologyStrategy'}
cqlsh> use testspace;
cqlsh:testspace> create table users ( userid int PRIMARY KEY, email text, name text );
cqlsh:testspace> insert into users (userid, email, name) VALUES (1, 'jd@somedomain.com', 'John Doe');
cqlsh:testspace> select * from users;

 userid | email             | name
--------+-------------------+----------
      1 | jd@somedomain.com | John Doe
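As a quick sanity check on the replication factor, nodetool can report which nodes hold the replicas for a given partition key. From the bin directory on any node, a command along these lines should print exactly two of the three node addresses for the row just inserted (which two depends on how the token ranges landed):
$ ./nodetool getendpoints testspace users 1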
Now you have a basic three-node Cassandra cluster running and ready for some development and testing work. The CQL syntax is similar to standard SQL, as you can see from the familiar commands to create a table, insert, and query data.
Conclusion
Apache Cassandra seems like an interesting NoSQL clustered database, and I’m looking forward to diving deeper into its use. This simple setup only scratches the surface of the options available. I hope this three-node primer helps you get started with it, too.