Big Data Analytics with Apache Spark

Over the past two years, Newsweaver has taken our new Cross-Channel Analytics Platform from proof of concept to production. Our customers publish hundreds of thousands of communications daily across a variety of different channels, generating an explosion of events on that content. Email opens, page views, video plays and social comments (to mention but a few) are all captured by our platform and we process them across multiple different metrics and dimensions to deliver insights and actionable reports and dashboards to our end users. We do all this at scale using Apache Spark.

Why Spark

Spark has enjoyed a meteoric rise in popularity, going from humble beginnings as a Ph.D. research project to the largest open source big data project (with 1000+ contributors) in five years. During the initial project phase we investigated a number of big data processing frameworks (Hive, Storm, Pig) but in the end went with Spark because it is:

  • Fast: Holds the current world record for large-scale sorting (100TB in 23 minutes). Sounds good enough for me!
  • General: Great language support with API offerings in Scala, Java, Python, and R.
  • Flexible: Has pluggable support for all the common data sources you might want to read from and write to.
  • Streaming and Batch: Our use case requires the ability to ingest data quickly and process it efficiently; Spark supports both models.

Deliver Value Quickly

To me, the great power of Spark lies not in its speed and distributed magic but in its ability to get you thinking more about the latent value that your data contains and less about how to extract that value. The barrier to entry (both operationally and programmatically) into MapReduce processing, machine learning etc. is lessened as Spark does a great job of abstracting away the underlying complexity.

Its simple yet elegant API is based on mainstream Scala functional transformations, which makes it especially familiar to those who have some experience in that area. I found myself getting up to speed with the API very quickly, almost forgetting I was dealing with large distributed volumes of data and instead focusing on getting answers from it.

Spark Core

Spark provides the developer with an API centered on a data structure called the resilient distributed dataset (RDD). RDDs are the main abstraction upon which all the upper layers of Spark (Streaming, MLlib, Spark SQL, etc.) are built. They can be created by parallelizing a collection or by reading data from an external source (file, database, queue, etc.).

// Distribute a local collection across the cluster as an RDD
val data = Array(1, 2, 3, 4, 5)
val distData = sparkContext.parallelize(data)
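
Creating an RDD from an external source looks much the same; for example, reading a text file into an RDD of lines (the path below is purely illustrative):

// Each element of the resulting RDD is one line of the file
val lines = sparkContext.textFile("hdfs:///data/events/sample.log")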

Being immutable means they can be easily cached, shared across processes and replicated across machines. They are lazy meaning they are only materialized when needed. It is easier to think of them not as traditional data structures but as a recipe for making data from other data. This also means that they are fault tolerant. In the event of a partition (a portion of the RDD) being lost, it can be recreated from this recipe (or lineage in Spark terminology).
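
Laziness and lineage are easy to see for yourself. Here is a minimal sketch reusing distData from above: nothing is computed until the action at the end, and toDebugString prints the lineage (the recipe) Spark would use to rebuild lost partitions.

val doubled = distData.map(_ * 2)       // lazy: no work done yet
val evens = doubled.filter(_ % 4 == 0)  // still lazy
println(evens.toDebugString)            // prints the lineage (the "recipe") of the RDD
println(evens.count())                  // an action finally materializes the data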

Operations

RDDs can be operated on in parallel by either:

  • Transformations: Create one RDD from another. The original RDD is not mutated; rather, a brand new RDD results from the transformation. Examples are map, flatMap, filter, groupByKey and reduceByKey.

  • Actions: Return a value after running a computation. These are not lazy and cause all RDDs to be materialized. At the point of an action, Spark works backward, breaking down and distributing all intermediate tasks and data across the cluster, carrying out the various transformations and bringing it all back together to compute the end result. Examples of actions are: count, first, take(n), collect and saveToCassandra.
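
Putting transformations and actions together, here is a small self-contained example (the input collection is made up for illustration): the first two operations just build up the recipe, and nothing runs until collect is called.

val eventTypes = sparkContext.parallelize(Seq("open", "click", "open", "view", "open"))
val counts = eventTypes
  .map(event => (event, 1))   // transformation: pair each event with a count of 1
  .reduceByKey(_ + _)         // transformation: sum the counts per event type, still lazy
counts.collect().foreach(println)  // action: triggers the computation and returns the results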

Life Cycle of a Spark App at Newsweaver

Given the simplicity of the API, you find yourself very eager to show off how powerful it is through developing some cool apps. The process we have arrived at to bring an app from an idea through to production is as follows:

1. Complete a Dev Spike

Ideas are great but they can be based on some dodgy assumptions. We mitigate this risk by doing a small proof of concept. We don’t worry too much about the code we produce at this point. Instead, we focus on verifying the basis for our ideas. We might use the Spark REPL, or write a quick app to process a subset of our data to map out roughly (if possible) what source data and what operations will get us to our desired end result.
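
A typical spike might be nothing more than a few lines in the Spark REPL (spark-shell); the file path and event format below are made up for illustration, not our actual schema:

// Roughly: how many email opens did we see per day in this sample?
val rawEvents = sc.textFile("hdfs:///tmp/sample-events.csv")
val opensPerDay = rawEvents
  .map(_.split(","))                      // e.g. eventType,channel,date
  .filter(fields => fields(0) == "open")  // keep only email opens
  .map(fields => (fields(2), 1))
  .reduceByKey(_ + _)
opensPerDay.take(10).foreach(println)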

2. Develop the App

We then develop our apps against an in-memory test harness (both unit and acceptance tests). This provides fast feedback as to the correctness of what we are doing. A key point here (that we sometimes forget) is that we might get the right answers from our operations and the tests may pass, but the app could still perform very badly when distributed across a real-life cluster!
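
A minimal sketch of such a test, assuming ScalaTest and a local-mode SparkContext (the open-counting logic under test is made up for illustration, not our production code):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class OpenCountsSpec extends FunSuite with BeforeAndAfterAll {
  private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    // local[2] runs Spark in-process, giving fast feedback without a cluster
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("open-counts-test"))
  }

  override def afterAll(): Unit = sc.stop()

  test("counts email opens per day") {
    val events = sc.parallelize(Seq(("2016-01-01", "open"), ("2016-01-01", "open"), ("2016-01-02", "open")))
    val counts = events.mapValues(_ => 1).reduceByKey(_ + _).collectAsMap()
    assert(counts("2016-01-01") == 2)
    assert(counts("2016-01-02") == 1)
  }
}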

3. Run Distributed Tests

Keen to see how Spark takes our single-threaded driver code and splits and distributes the RDD operations across a cluster, we run an integration test on a small three-node Docker-based cluster. We apply some load to the app and examine the Spark UI. This shows the number (and sequence) of jobs, stages and tasks (and their locality) generated, and the volume of data shuffled across the cluster. It is in tuning the code to optimize all of these concerns that we have seen the greatest performance improvements in our apps. This has also helped educate us in the best way to write and order our Spark code.
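
As a concrete illustration of the kind of tuning involved (the data here is a made-up pair RDD of (contentId, eventType)): filtering before the shuffle and preferring reduceByKey over groupByKey both cut down how much data has to move across the cluster.

// A hypothetical pair RDD of (contentId, eventType); in reality this comes from our event sources
val pageEvents = sparkContext.parallelize(Seq(("page-1", "open"), ("page-1", "view"), ("page-2", "open")))

// groupByKey ships every event for a key across the network before we count
val opensPerContentSlow = pageEvents
  .filter { case (_, eventType) => eventType == "open" }
  .groupByKey()
  .mapValues(_.size)

// reduceByKey pre-aggregates on each node first, so far less data is shuffled
val opensPerContentFast = pageEvents
  .filter { case (_, eventType) => eventType == "open" }  // drop unwanted rows before shuffling
  .mapValues(_ => 1)
  .reduceByKey(_ + _)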

4. Deploy to Production Cluster

We then deploy our app to production. This is not "fire and forget". We monitor our apps via the Graphite metrics sink that Spark offers out of the box, displaying dashboards in Grafana. Any degradation is investigated (again using the Spark UI) and the code is tuned, or resources are added to the cluster, to improve the situation.
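
For reference, enabling the Graphite sink is just a few lines in Spark's conf/metrics.properties; the host, port and prefix below are placeholders for your own environment:

# conf/metrics.properties
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.internal
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark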

Tips

Through following this process for a variety of Spark apps we have made many mistakes and learned a lot. Here are some of our recommendations:

  • Remember it is not Scala (or whatever language you choose)!
    • You are writing MapReduce code, not just using Scala higher-order functions.
    • Your job is to reduce the number of shuffles and the amount of data shuffled.
    • The ordering of operations matters in this respect.
  • Again remember it is distributed!
    • Be careful with variables, closures, and scope.
    • Better still, use no mutable variables!
  • Use BDD
    • Spark Apps can grow large and be hard to reason about when revisiting them for bug fixing or refactoring purposes.
    • Using ScalaTest's FeatureSpec to write high-level tests helps outline an app's intended behaviors.
  • Use TDD
    • Spark lends itself nicely to TDD
    • Test your transformations independently
    • Build up piece by piece to the end results
  • Streaming and Batch share a common API
    • We have found this very useful for combining both!
    • Your expertise in one is transferable to the other.
    • You can convert your stream into traditional RDDs (the RDD API is richer), but fully investigate what options are available in the streaming API before doing so; see the sketch below.
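
A small sketch of what we mean, assuming a DStream of (contentId, eventType) pairs and a count function we could equally use in a batch job: DStream.transform lets the same RDD-level code run on every micro-batch.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Batch-style logic, written once against the RDD API
def countOpens(events: RDD[(String, String)]): RDD[(String, Int)] =
  events.filter { case (_, eventType) => eventType == "open" }
        .mapValues(_ => 1)
        .reduceByKey(_ + _)

// The same logic applied, unchanged, to each micro-batch of the stream
def countOpensStreaming(stream: DStream[(String, String)]): DStream[(String, Int)] =
  stream.transform(rdd => countOpens(rdd))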

Beyond Production

Over the last two years, I have been very impressed with Spark's scalability and flexibility in tackling a variety of use cases. We now have a full data pipeline that ingests data from multiple different sources and processes it for end-user consumption.

On the flipside, there is a large operational effort in running a cluster and resourcing (memory/cores) the apps that run on it. Troubleshooting can be interesting with the number of moving parts in play. You must have a working knowledge of drivers, executors, workers, machines and networking to diagnose the failures that can arise. That being said, these issues are not unique to Spark, and any distributed data processing framework would require similar expertise. For me, this is not a blocker for adopting Spark. You can organically acquire these skills through writing and tuning your apps. It has also been a real positive for our DevOps effort, in that you must have an appreciation of what is going on in the code to fully get to the bottom of things.

We now have two teams using Spark and we hope to expand that further. We also want to extend our use of Spark into other areas, such as data science and ad-hoc data investigations. These investigations will hopefully yield more insights into our data and drive many more features onto our product roadmap in the future.

Learn more…

If you are interested in learning more about developing Spark apps, I have put together a small introduction to programming with the Spark API, which you can try out here.

Kevin Duggan

Kevin Duggan is Technical Team Lead on Newsweaver’s new Cross-Channel Analytics product. He has 10 years' experience delivering enterprise solutions for the Energy, Finance and Pharmaceutical sectors.

