Productionization of a Spark Streaming Application

My recent Spark Streaming project ingests streaming data from various IoT devices; the streaming application enriches the data and produces analytical outcomes. Our Spark Streaming infrastructure comprises the following components.

  • Apache Spark 1.6.1 – for stream processing
  • Kafka broker – to queue the data from the various IoT devices
  • ZooKeeper
  • Python – my language of choice


Even with limited experience in Linux and open source, I was able to install the required components (explained in an earlier post) and start my development. Developing the use cases completely took its own time, but the real struggle began when I started thinking about productionizing the streaming application. I stumbled over some basic productionization questions, such as the ones below.

  • How do I install Spark on a production system?
    • What kind of cluster manager should I use? Stand-alone / Mesos / YARN
  • How do I automatically start/stop Spark and Kafka?
  • How do I get notified if one of these necessary services fails?
  • How do I submit a Spark job to the cluster?
  • How do I achieve zero data loss if one of the services (Spark / Kafka / ZooKeeper) fails?
  • How do I test my production infrastructure and run a load test?

Through the next series of posts, I will share the solutions we came up with for each of the items above. I hope you will find some useful information along the way.

A word of caution: this post shares my experience (and googling) and the approach taken for our project. Better, more standard approaches may exist, and I hope readers will suggest improvements so that they can be incorporated into our project as well.


Spark Stand-alone Cluster as a systemd Service (Ubuntu 16.04/CentOS 7)

Introduction

Once you have completed a stand-alone Spark cluster installation, you can start and stop the cluster using the commands below.

$SPARK_HOME/sbin/start-all.sh
$SPARK_HOME/sbin/stop-all.sh

But this is not feasible for a production-level system. You will probably want the Spark cluster to

  • start whenever your system starts / reboots
  • automatically restart in case of failures

This can be achieved by adding Spark to Linux's init system. If you are not familiar with Linux's new init system, systemd, please check the reference links below.

Spark systemd unit file

Every systemd service is driven by a systemd unit file. With your login user, create a unit file in /etc/systemd/system:

cd /etc/systemd/system
sudo nano spark.service

Put the content below in the spark.service unit file.

[Unit]
Description=Apache Spark Master and Slave Servers
After=network.target
After=systemd-user-sessions.service
After=network-online.target

[Service]
User=spark
Type=forking
ExecStart=/opt/spark-1.6.1-bin-hadoop2.6/sbin/start-all.sh
ExecStop=/opt/spark-1.6.1-bin-hadoop2.6/sbin/stop-all.sh
TimeoutSec=30
Restart=on-failure
RestartSec=30
StartLimitInterval=350
StartLimitBurst=10

[Install]
WantedBy=multi-user.target

As you can see, we are using Spark's start-all and stop-all scripts to start and stop the service. The Restart, RestartSec, StartLimitInterval, and StartLimitBurst settings in the unit file take care of the automatic restart on failure.

After creating or modifying the unit file, reload systemd itself so that it picks up your changes:

sudo systemctl daemon-reload

From now on, if you want to start or stop the Spark stand-alone cluster manually, use the commands below.

sudo systemctl start spark.service
sudo systemctl stop spark.service

Once the service has started, check /var/spark/logs for the logs. Execute the command below to check the status of the systemd service.

sudo systemctl status spark.service
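
systemd also records the unit's state changes and any output from the start/stop scripts in its journal, so you can follow them there in addition to /var/spark/logs:

sudo journalctl -u spark.service -f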

By default, systemd units are not started automatically at boot. To configure this behavior, you need to “enable” the unit. To have the Spark service start automatically at boot, type:

sudo systemctl enable spark.service

With this, you have configured a Spark stand-alone cluster as a systemd service with automatic restart. Hope you enjoyed the post.

References

https://www.digitalocean.com/community/tutorials/systemd-essentials-working-with-services-units-and-the-journal

https://github.com/thiagowfx/PKGBUILDs/blob/master/apache-spark/apache-spark-standalone.service

 

Configure Spark Stand-Alone Cluster (Ubuntu 16.04/CentOS 7)

Before we look at how to proceed with a stand-alone installation, let us look at the ways a Spark application can run. The choices are below (a quick spark-submit comparison follows the list):

  • Local mode: the driver and the worker are not only on the same machine but run in the same Java process. You can run on a single core, a specified number of cores, or as many cores as the machine has. This installation is covered in my earlier post.
  • The Spark Standalone scheduler
  • A Mesos cluster manager
  • A YARN cluster
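
The choice of cluster manager later shows up as the --master argument you pass to spark-submit. The commands below are only a quick sketch for comparison; the host names, ports, and my_app.py are placeholders.

# Local mode, using all available cores
spark-submit --master local[*] my_app.py

# Stand-alone cluster (the mode covered in this post)
spark-submit --master spark://MASTER_HOST:7077 my_app.py

# Mesos cluster manager
spark-submit --master mesos://MESOS_HOST:5050 my_app.py

# YARN cluster
spark-submit --master yarn --deploy-mode cluster my_app.py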

Now let us jump into the installation of a stand-alone cluster.

Prerequisite – sshd Service

In a Spark stand-alone cluster, the master and worker nodes need to communicate with each other without passwords. For that we need to make sure passwordless SSH is installed and configured correctly.

First, check whether the OpenSSH server is installed:

systemctl status sshd

If the OpenSSH server is not available, install it:

sudo yum install openssh-server #CentOS 7
sudo apt-get install openssh-server #Ubuntu (most Ubuntu versions ship only the SSH client)

To start the SSH daemon (sshd):

systemctl start sshd

After you have successfully started the sshd daemon, check the service status with:

systemctl status sshd

The next time you reboot your Linux box, you would otherwise need to start the SSH service manually again. To start the sshd service automatically after reboot, enter the following command in your terminal:

systemctl enable sshd

With this, you have installed the OpenSSH server.

In case you are going to run multiple worker nodes on different machines, make sure that all the worker machines have sshd installed.

Before going any further, ensure that /etc/ssh/sshd_config on the host computer contains the following lines and that they are uncommented:

PubkeyAuthentication yes
RSAAuthentication yes
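
If you did edit /etc/ssh/sshd_config, restart the daemon so the change takes effect:

sudo systemctl restart sshd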

Create Spark user/group

Linux best practice recommends minimal permissions, so it is better to run Spark under its own user. For this, add a new user and group named “spark”:

sudo groupadd spark
sudo useradd -g spark spark

By default, a home directory for spark should be created at /home/spark. If it is not created, do it manually:

cd /home
sudo mkdir spark
sudo chown -R spark:spark /home/spark/

Now you can switch to the spark user using the commands below:

sudo su spark
cd /home/spark

Now you should be able to connect to localhost without a password. Try the following command to check whether ssh can access localhost without prompting for one.

ssh localhost

If this asks for a password (the local machine user's password), execute the following to generate a public/private RSA key pair and allow passwordless SSH. Make sure you are logged in as the “spark” user and are in the “spark” user's home directory.

rm -f ~/.ssh/id_rsa
ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 644 ~/.ssh/authorized_keys

Try the following command again to check whether ssh can now access localhost without a password.

ssh localhost

Install Spark

Download Spark, move the archive to /opt, and extract it. I have used /opt for the Spark home; you do not need to use the same folder.

sudo cp spark-1.6.1-bin-hadoop2.6.tgz /opt/
cd /opt/
sudo tar -xvf spark-1.6.1-bin-hadoop2.6.tgz

Now give the spark user ownership of the Spark directory:

sudo chown -R spark:spark /opt/spark-1.6.1-bin-hadoop2.6

Now set up a symbolic link for the Spark directory. This will be helpful if you are going to maintain multiple Spark versions:

sudo ln -s /opt/spark-1.6.1-bin-hadoop2.6/ /opt/spark

Also, set up the spark working directories with appropriate permissions.

sudo mkdir -p /var/spark/{logs,work,tmp,pids}
sudo chown -R spark:spark /var/spark
sudo chmod 4755 /var/spark/tmp

Spark configurations

Now we need to configure Spark as a stand-alone cluster. I have created a stand-alone cluster with one master and two worker nodes on a single machine. Even if you are planning to have separate machines for the master and worker nodes, the instructions below should still hold good.

To create a stand-alone cluster, create the spark-env.sh file from its template:

cd /opt/spark-1.6.1-bin-hadoop2.6/conf/
sudo cp spark-env.sh.template spark-env.sh

Then add the entries below to the file. Note that you can configure cores, memory, and ports for the master and workers here.

export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY="2g"
export SPARK_WORKER_PORT=5000
export SPARK_EXECUTOR_INSTANCES=2
#export SPARK_WORKER_INSTANCES=2 #deprecated in Spark 1.0+; use spark-submit with --num-executors to specify the number of executors
export SPARK_CONF_DIR="/opt/spark-1.6.1-bin-hadoop2.6/conf"
#export SPARK_TMP_DIR="/var/spark/tmp"
export SPARK_PID_DIR="/var/spark/pids"
export SPARK_LOG_DIR="/var/spark/logs"
export SPARK_WORKER_DIR="/var/spark/work"
export SPARK_MASTER_IP="YOUR_MASTER_IP" #please give your master node's IP address here.
export SPARK_MASTER_PORT=7077
export SPARK_LOCAL_IP="127.0.0.1"

Now create a spark-defaults.conf file from its template. We will use this later.

sudo cp spark-defaults.conf.template spark-defaults.conf

To configure the worker nodes, create a slaves file from its template.

sudo cp slaves.template slaves
sudo nano slaves

By default, the template contains an entry for one worker node. Since I wanted two worker nodes, I added one more entry (localhost in my case, since I was using the same machine for the master and workers) to the slaves file.

So the slaves file content should be

localhost
localhost

Additional Configurations

If a MySQL connection is needed from Spark

If you are planning to connect to MySQL from Spark, you need to add the MySQL JDBC driver (otherwise you will get the exception java.sql.SQLException: No suitable driver). For that, download the latest mysql-connector-java-xxx-bin.jar. I have placed it in /usr/share/java.

Add this path to the spark-defaults.conf file as below:

spark.driver.extraClassPath /usr/share/java/mysql-connector-java-5.1.37-bin.jar
spark.executor.extraClassPath /usr/share/java/mysql-connector-java-5.1.37-bin.jar
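
With the driver on the classpath, a JDBC read from PySpark looks roughly like the sketch below. This is only an illustration: the host, database, table, and credentials are placeholders, and it assumes you run it from the pyspark shell (Spark 1.6), where sqlContext is already defined.

# Minimal sketch: read a MySQL table over JDBC from the pyspark shell.
# All connection details below are placeholders.
df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/mydb",
    driver="com.mysql.jdbc.Driver",
    dbtable="my_table",
    user="my_user",
    password="my_password").load()
df.show()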

If you are using spark streaming with Kafka

If you are using Spark Streaming with Kafka (I do), then you need an additional library, spark-streaming-kafka-assembly-XXXX.jar. Its version should match your Spark version, and it can be downloaded from the Maven repository. Download it, move it to the $SPARK_HOME/lib folder (just my preference), and reference it in your spark-submit command, as shown in the sketch below.
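
As an illustration, a minimal PySpark streaming job that reads from a Kafka topic could look like the sketch below (Spark 1.6 API). The application name, topic, and broker address are just examples, not values from this setup.

# kafka_stream_sketch.py - minimal Kafka direct stream that prints message values (illustrative only)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaStreamSketch")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Direct stream from a local Kafka broker; the topic name is an example
stream = KafkaUtils.createDirectStream(
    ssc, ["test"], {"metadata.broker.list": "localhost:9092"})

stream.map(lambda kv: kv[1]).pprint()  # print the message values of each batch

ssc.start()
ssc.awaitTermination()

You would then submit it with the assembly jar on the --jars list; the exact jar file name depends on the Scala and Spark versions you downloaded:

spark-submit --master spark://YOUR_MASTER_IP:7077 \
  --jars /opt/spark-1.6.1-bin-hadoop2.6/lib/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
  kafka_stream_sketch.py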

Since the new files will have been created with root as the owner and group, make sure you grant ownership to the “spark” user once again:

sudo chown -R spark:spark /opt/spark-1.6.1-bin-hadoop2.6

Testing Stand-alone Installation

Now the Spark installation and all configurations are complete. It is time to test and make sure everything works as expected. Let us start the cluster and bring the master and worker nodes online.

sudo su spark
cd /opt/spark-1.6.1-bin-hadoop2.6/sbin/

Now start the cluster using the command below.

./start-all.sh

Your Spark stand-alone cluster will start with --ip YOUR_MASTER_IP --port 7077 (i.e., at spark://YOUR_MASTER_IP:7077). A corresponding web UI will be up at localhost:8080, where you can verify the master and slave nodes. You can also check the logs in /var/spark/logs/.
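
As an extra sanity check, you can submit the bundled SparkPi example against the running master. The master URL and the examples jar name below are assumptions based on the paths used in this post; adjust them to your installation.

/opt/spark-1.6.1-bin-hadoop2.6/bin/spark-submit \
  --master spark://YOUR_MASTER_IP:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar 10

If it finishes with a line like "Pi is roughly 3.14...", the cluster is accepting and running jobs.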

To stop the master and worker nodes,

./stop-all.sh

With this, you have installed and configured a Spark stand-alone cluster on Ubuntu/CentOS. Hope you have enjoyed it.

References

https://linuxconfig.org/how-to-install-manage-start-and-autostart-ssh-service-on-rhel-7-start

http://serverfault.com/questions/41541/how-to-add-a-linux-user-with-a-random-or-invalid-password-from-a-script

https://help.ubuntu.com/community/SSH/OpenSSH/Keys#Troubleshooting

https://www.server-world.info/en/note?os=CentOS_7&p=ssh&f=4

https://fluxcapacitor.zendesk.com/hc/en-us/articles/216056637-Best-Practices-for-Deploying-Spark-Applications

Installing Kafka / Spark on Ubuntu 14.04 / 16.04 LTS

I recently started working on a stream analytics project, and after a lot of googling I zeroed in on Kafka + Spark as my tools of choice. Then came the question: how would I try this out? As a newbie in this field, I tried several options.

Initially I tried the Hortonworks HDP stack and installed it. But:

  • I required only ZooKeeper, Kafka, and Spark, but the HDP stack comes with many more components
  • Even though HDP provides Ambari-based installation, it was hard by my standards
  • HDP required integration with other components like MySQL
  • The network configuration personally gave me a really hard time
  • Ambari provides a wonderful admin/monitoring tool, but it is a heavy background process
  • All this for a simple development environment with a stand-alone cluster

Due to the above reasons, I decided to install Kafka and Spark separately. This blog post captures what I learned about setting up Kafka and Spark on Ubuntu Linux. It is applicable to both the 14.04 LTS and 16.04 versions.

The major steps I followed are below:

  1. Install Prerequisites
  2. Install Kafka
  3. Install Spark

Prerequisites

Java

Both Kafka and Spark need Java, so install it first in case it is not already available on your machine. You can install either OpenJDK or Oracle Java. I installed Oracle Java 8 using the commands below.


$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

It will download about 180 MB and install Java automatically. To automatically set up the Java 8 environment variables, you can install the following package:

$ sudo apt-get install oracle-java8-set-default

Once the installation has completed, make sure the specified Java version is installed correctly by checking its version:

$ java -version
java version '1.8.0_91'
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)

Please find the link I referred to for this here.

Install Kafka

Initially I was using plain Apache Kafka; later on I came to know about Confluent Kafka. It was more impressive because:

  • Its base version is completely open source
  • It is Apache Kafka plus some additional features
  • It is well documented, with more examples available

Confluent Kafka installation instructions can be found in their documentation, but I will still state them here.

Installation

First install Confluent’s public key, which is used to sign the packages in the apt repository.

$ wget -qO - http://packages.confluent.io/deb/3.0/archive.key | sudo apt-key add -

 

Add the repository to your /etc/apt/sources.list:

$ sudo add-apt-repository 'deb [arch=amd64] http://packages.confluent.io/deb/3.0 stable main'

Run apt-get update and install the Confluent Platform:

$ sudo apt-get update
$ sudo apt-get install confluent-platform-2.11

This completes the installation of Kafka; the packages will be installed in the locations below.

/usr/bin/                  # Driver scripts for starting/stopping services, prefixed with <package> names
/etc/<package>/            # Configuration files
/usr/share/java/<package>/ # Jars

Testing

Now let us make sure our installation is fine by starting Kafka and producing and consuming messages.

1. Start ZooKeeper: since ZooKeeper is a long-running service, you should run it in its own terminal.

$ sudo zookeeper-server-start /etc/kafka/zookeeper.properties
#used sudo because you need write access to /var/lib

2. Start Kafka: also in its own terminal.

$ sudo kafka-server-start /etc/kafka/server.properties

This brings our needed services up and running. Now we are ready to produce messages to topics and consume them.
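
Optionally, instead of relying on the console producer to auto-create the topic in the next step, you can create it explicitly and list the existing topics:

$ kafka-topics --zookeeper localhost:2181 --create --topic test --partitions 1 --replication-factor 1
$ kafka-topics --zookeeper localhost:2181 --list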

3. Start a producer: use a new terminal.

Kafka supports several kinds of producers. Let us test with the console producer, because it is easy to create and test. Type the command below in its own terminal:

$ kafka-console-producer --broker-list localhost:9092 --topic test
#This creates a topic 'test'

The above command creates a producer against our Kafka broker (server) that can publish messages to a topic called “test”. Once started, the process waits for you to enter messages, one per line, and sends each one to the Kafka queue as soon as you hit Enter. Let us try this by entering a couple of messages:

My name xyz
I installed kafka
It is working now

4. Start a Kafka consumer: in a new terminal.

Since we have already produced some messages on the test topic, let us consume them by typing the command below. The --from-beginning flag starts consuming from the beginning of the topic; by default the consumer only reads messages published after it starts.

$ kafka-console-consumer --zookeeper localhost:2181 --topic test --from-beginning

You should see all the messages you typed into the producer terminal in step 3. Now you can type more messages into the producer terminal, and you will see each one delivered to the consumer immediately after you hit Enter.
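
Since Python is my language of choice, it is worth noting that the same test can also be driven from code instead of the console tools. The sketch below assumes you have installed the confluent-kafka Python client separately (for example with pip install confluent-kafka); it is not part of the console-tool test above.

# Minimal sketch: publish one message to the 'test' topic from Python
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'localhost:9092'})
producer.produce('test', value='hello from python')
producer.flush()  # block until the message is actually delivered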

When you're done testing, use Ctrl+C to shut down each service, in the reverse order in which you started them.

Install Spark

1. Download and install: download Spark from here.

The latest version of Spark is 1.6.2. I have chosen the pre-built package for Hadoop 2.6, with the file name spark-1.6.2-bin-hadoop2.6.tgz. It is around a 300 MB file.

Once downloaded, move the file to /opt and extract it. (You can move it to any location; /opt was my preference.)

$ sudo cp ~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz /opt/
$ cd /opt/
$ sudo tar -xvf spark-1.6.2-bin-hadoop2.6.tgz

This will extract to a new folder, spark-1.6.2-bin-hadoop2.6/. This is your Spark directory.

2. Configure

Since we want to run Spark from anywhere, it is better to add its binary directory to the PATH variable in the bash startup file. For that, use the commands below.

$ nano ~/.bashrc

Then add the lines below to the .bashrc file and save it.

export SPARK_HOME=/opt/spark-1.6.2-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH

Once done, make it take effect using:

$ source ~/.bashrc

And you’re done! You can now run the Spark binaries from the terminal no matter where you are.
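
A quick way to confirm the PATH change took effect is to ask the shell where it now finds the Spark binaries:

$ which spark-submit
$ which pyspark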

3. Testing

Now let us test the installation and make sure everything is working as expected. Since I am not familiar with Scala, I planned to use Spark's Python API, pyspark.

$ pyspark

It will bring up the Spark Python command line (REPL).

Now, using pyspark, let us load a local text file into Spark. Create a text file with some text in your home directory ($HOME) using gedit or any editor.

In my file I have inserted the text

My name xyz
I installed Spark
It is working now

I saved the file as textData.txt. To load the local text file, type the commands below at the pyspark command line.

textFile = sc.textFile('textData.txt')
textFile.count()

It will return the count as 3. Don't worry about the verbose logs shown on the screen; that is Spark's default behavior.
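
While you are in the REPL, you can also run a quick word count over the same file to see a transformation chain and an action together (this is a generic Spark example, not something specific to this setup):

# Split each line into words, pair each word with 1, and sum the counts per word
counts = textFile.flatMap(lambda line: line.split()) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
counts.collect()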

We can also inspect the Spark web UI at http://localhost:4040/

To exit the pyspark REPL, use the command below.

exit()

Conclusion

With this, I hope you will be able to install, configure, and test Kafka and Spark environments on your machine.