Typesafe Activator

Spark Streaming with Scala and Akka

Jacek Laskowski
February 1, 2015

Apache Spark (http://spark.apache.org/) is a fast and general engine for large-scale data processing. This Typesafe Activator template demonstrates Apache Spark for near-real-time data streaming with Scala and Akka using the Spark Streaming (http://spark.apache.org/docs/latest/streaming-programming-guide.html) extension.

How to get "Spark Streaming with Scala and Akka" on your computer

There are several ways to get this template.

Option 1: Choose spark-streaming-scala-akka in the Typesafe Activator UI.

Already have Typesafe Activator (get it here)? Launch the UI then search for spark-streaming-scala-akka in the list of templates.

Option 2: Download the spark-streaming-scala-akka project as a zip archive

If you haven't installed Activator, you can get the code by downloading the template bundle for spark-streaming-scala-akka.

  1. Download the Template Bundle for "Spark Streaming with Scala and Akka"
  2. Extract the downloaded zip file to your system
  3. The bundle includes a small bootstrap script that can start Activator. To start Typesafe Activator's UI:

    In your File Explorer, navigate into the directory that the template was extracted to, right-click the file named "activator.bat", select "Open", and, if prompted with a warning, click to continue.

    Or from a command line:

     C:\Users\typesafe\spark-streaming-scala-akka> activator ui 
    This will start Typesafe Activator and open this template in your browser.

Option 3: Create a spark-streaming-scala-akka project from the command line

If you have Typesafe Activator, use its command line mode to create a new project from this template. Type activator new PROJECTNAME spark-streaming-scala-akka on the command line.

Option 4: View the template source

The creator of this template maintains it at https://github.com/jaceklaskowski/spark-activator#master.

Option 5: Preview the tutorial below

We've included the text of this template's tutorial below, but it may work better if you view it inside Activator on your computer. Activator tutorials are often designed to be interactive.

Preview the tutorial

Spark Streaming with Scala and Akka

Apache Spark is a fast and general engine for large-scale data processing. This Typesafe Activator template demonstrates Apache Spark for near-real-time data streaming with Scala and Akka using the Spark Streaming extension.

This tutorial demonstrates how to use Spark Streaming together with the Akka actor system, so that actors serve as receivers for an incoming stream. Since actors can receive data from any input source, this approach extends Spark Streaming's built-in streaming sources.

Develop Spark Streaming application

You start developing a Spark Streaming application by creating a SparkConf, followed by a StreamingContext.


// driverPort and driverHost are not shown in the tutorial text; the values
// below are illustrative (the template's StreamingApp defines similar ones)
val driverPort = 7777
val driverHost = "localhost"
val conf = new SparkConf(false) // skip loading external settings
  .setMaster("local[*]") // run locally with enough threads
  .setAppName("Spark Streaming with Scala and Akka") // name in Spark web UI
  .set("spark.logConf", "true")
  .set("spark.driver.port", s"$driverPort")
  .set("spark.driver.host", s"$driverHost")
  .set("spark.akka.logLifecycleEvents", "true")
val ssc = new StreamingContext(conf, Seconds(1))
        
The context lets you create an input stream backed by an actor; the resulting stream is of type ReceiverInputDStream.

val actorName = "helloer"
val actorStream: ReceiverInputDStream[String] = ssc.actorStream[String](Props[Helloer], actorName)
        
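The tutorial references a Helloer actor but does not show its definition. A minimal sketch, assuming the actor only forwards strings, mixes in Spark Streaming's ActorHelper trait and calls store() to push each received message into the stream:

```scala
import akka.actor.Actor
import org.apache.spark.streaming.receiver.ActorHelper

// A minimal receiver actor: every String it receives is pushed
// into the DStream via ActorHelper's store() method.
class Helloer extends Actor with ActorHelper {
  def receive = {
    case s: String => store(s)
  }
}
```

Any actor mixing in ActorHelper can be passed to ssc.actorStream this way, which is what makes arbitrary input sources possible.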
Having DStream lets you define a high-level processing pipeline in Spark Streaming.

actorStream.print()
        
In the above case, the print() method prints the first ten elements of each RDD generated in this DStream.

Nothing happens until start() is executed.

ssc.start()
        
With the context up and running, the code looks up the helloer actor in the remote Akka actor system hosted by Spark Streaming, and sends it messages which, as the code above shows, are printed to standard output.

import scala.concurrent.Await
import scala.concurrent.duration._
val actorSystem = SparkEnv.get.actorSystem
val url = s"akka.tcp://spark@$driverHost:$driverPort/user/Supervisor0/$actorName"
val timeout = 100.seconds
val helloer = Await.result(actorSystem.actorSelection(url).resolveOne(timeout), timeout)
helloer ! "Hello"
helloer ! "from"
helloer ! "Apache Spark (Streaming)"
helloer ! "and"
helloer ! "Akka"
helloer ! "and"
helloer ! "Scala"
        
The complete Scala version is available in StreamingApp.scala.
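The tutorial stops after sending the messages; in a complete application you would also shut the streaming context down. A minimal sketch (the timeout value is an arbitrary choice, not from the template):

```scala
// Give the batches a moment to be processed, then stop the
// streaming context together with the underlying SparkContext.
ssc.awaitTerminationOrTimeout(3000) // milliseconds
ssc.stop(stopSparkContext = true)
```

Stopping the StreamingContext without stopping the SparkContext (stopSparkContext = false) is also possible when the same SparkContext should be reused afterwards.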

Run the App

Let's run the sample application.

In Run, select the application to run from the drop-down list under Main Class, and select Start. Feel free to modify, compile and re-run the sample.

Conclusion

This tutorial has introduced Apache Spark with the Spark Streaming extension and demonstrated how it can be used for processing near-real-time data streams that are handled by a single actor in an Akka actor system.

Next Steps

The Spark documentation offers setup instructions, programming guides, and other reference material.

If you have questions, don't hesitate to post them to the user@spark.apache.org mailing list or to contact the author of this template.
