Short- and Long-Lived Cassandra Sessions

tl;dr

When using the DataStax driver, use long-lived sessions!

Symptom

We saw errors in our Java services coming from the Datastax driver trying (and retrying) to connect to Cassandra. The most frequent error was the following:

ERROR [2016-09-07 21:07:37,833] com.datastax.driver.core.Cluster: Unknown error during reconnection to /xx.x.x.xxx:9042, scheduling retry in 600000 milliseconds  
java.lang.IllegalArgumentException: rpc_address is not a column defined in this metadata  
at com.datastax.driver.core.ColumnDefinitions.getAllIdx(ColumnDefinitions.java:273) ~[cassandra-driver-core-2.1.2.jar:na]  
at com.datastax.driver.core.ColumnDefinitions.getFirstIdx(ColumnDefinitions.java:279) ~[cassandra-driver-core-2.1.2.jar:na]  

Our monitoring showed Cassandra was operating normally. A Google search turned up a bug in Cassandra 2.1.3 that arises when Cassandra is under load, but our web services were receiving very few requests. Digging further into the service logs, we found that the initial error was the service failing to connect to Cassandra:

ERROR [2016-09-08 14:22:47,196] com.datastax.driver.core.Session: Error creating pool to /xx.xx.x.xx:9042  
 java.net.ConnectException: Connection refused: /xx.xx.x.xx:9042

Problem

One of my colleagues mentioned an issue he had previously encountered when load testing one of his services. He kindly pointed me to the DataStax driver docs (always good to read!):

"While the API of Session is centered around query execution, the Session does some heavy lifting behind the scenes as it manages the per-node connection pools. The Session instance is a long-lived object and it should not be used in a request/response short-lived fashion. Basically, you will want to share the same cluster and session instances across your application."

Sure enough, we were naively opening a session for each query to the Cassandra cluster:

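import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.exceptions.DriverException;
import java.util.function.Function;

// Anti-pattern: every call opens a new Session (with its own per-node connection pools)
// and closes it again, so each query pays the full connection set-up cost.
public static <T> T doInCassandraSession(Function<Session, T> f, Cluster cluster) {
   Session session = cluster.connect();
   try {
      return f.apply(session);
   } catch (DriverException e) {
      LOGGER.error("Failed to load data from cassandra", e); // LOGGER: the enclosing class's logger
      // Rethrow unchecked so the method compiles, keeping the original cause
      throw new RuntimeException("Failed to load data from cassandra", e);
   } finally {
      session.close();
   }
}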

Reproduce

So was our repeated opening of sessions, with all their “heavy lifting”, overloading Cassandra? We recreated the errors in our test environment with a quick load test using the excellent Apache Bench tool:

ab -n 10000 -c 10 http://localhost:8080/xxxxx

During the test, the Cassandra machine's logs also showed that the firewall's SYN flood protection had been tripped by the aggressive client connections, and it was blocking packets to the Cassandra port (9042)!

1] TCP SYNFLOOD: IN=eth0 OUT= MAC=xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx SRC=xx.xx.xx.xxx DST=xx.xx.xx.xx LEN=52 TOS=0x00 PREC=0x00 TTL=63 ID=34028 DF PROTO=TCP SPT=40196 DPT=9042 WINDOW=29200 RES=0x00 SYN URGP=0

The cluster was definitely being hammered!

Fix

We fixed the services to use long-lived sessions, created when the service starts up and shared throughout the service, as the docs recommend. After deploying to the test environment and rerunning the same load test, we happily saw no errors, and the firewall logs were clean too.
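A minimal sketch of the long-lived pattern, assuming Java 8. `SharedSession` is a hypothetical helper, not a driver class: it opens one session when the service starts, reuses it for every query, and closes it once at shutdown. The type parameter `S` stands in for the driver's `Session`, an assumption made so the sketch compiles and can be exercised without a running cluster.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/**
 * Hypothetical holder for one long-lived session shared across the whole service.
 * S stands in for com.datastax.driver.core.Session (assumption: it is AutoCloseable).
 */
final class SharedSession<S extends AutoCloseable> implements AutoCloseable {
    private final S session;

    SharedSession(Supplier<S> connect) {
        // Open the session exactly once, e.g. when the service starts up.
        // With the real driver the supplier would be something like cluster::connect.
        this.session = connect.get();
    }

    /** Per-query helper: reuses the shared session instead of opening and closing one per call. */
    <T> T doInCassandraSession(Function<S, T> f) {
        return f.apply(session);
    }

    /** Closes the shared session once, when the service shuts down. */
    @Override
    public void close() {
        try {
            session.close();
        } catch (Exception e) {
            throw new IllegalStateException("Failed to close shared session", e);
        }
    }
}
```

In a service this would be built once at startup, for example `new SharedSession<>(cluster::connect)` (assuming the driver version in use has a `Session` that implements `AutoCloseable`), and closed during shutdown before the `Cluster` itself is closed. Every request then goes through the same instance, so however many queries the load test fires, only one set of connection pools is ever created.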

Kevin Duggan

Kevin Duggan is Technical Team Lead on Newsweaver’s new Cross-Channel Analytics product. He has 10 years’ experience delivering enterprise solutions for the Energy, Finance and Pharmaceutical sectors.

Cork
