I recently started working on a stream analytics project, and after a lot of googling I zeroed in on Kafka + Spark as my tools of choice. Then came the question: how do I try this out? As a newbie in this field, I tried several options.
Initially I installed the Hortonworks HDP stack. But:
- I required only ZooKeeper, Kafka, and Spark, while the HDP stack bundles many more components
- Even though HDP provides Ambari for installation, it was hard by my standards
- HDP requires integration with other components such as MySQL
- The network configuration in particular gave me a hard time
- Ambari provides a wonderful admin/monitoring tool, but it is a heavy background process
- All this for a simple development environment with a stand-alone cluster
For the above reasons, I decided to install Kafka and Spark separately. This blog post captures what I learned setting up Kafka and Spark on Ubuntu Linux. It applies to both the 14.04 LTS and 16.04 versions.
Please find below the major steps I followed:
- Install Prerequisites
- Install Kafka
- Install Spark
Both Kafka and Spark need Java, so please install it first in case it is not already available on your machine. You can install either OpenJDK or Oracle Java. I installed Oracle Java 8 using the commands below.
$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
It will download about 180MB and install Java automatically. To set up the Java 8 environment variables automatically, you can install the following package.
$ sudo apt-get install oracle-java8-set-default
Once the installation is complete, please make sure the expected Java version is installed by checking it.
$ java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
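Version strings like 1.8.0_91 use the legacy Java numbering, where the major version is the second field (so this is Java 8). A small sketch of extracting it, just for illustration (the java_major helper is made up here, not a standard tool):

```python
def java_major(version_string):
    """Extract the major Java version from a version string.

    Legacy strings look like "1.8.0_91" (major version is the second
    field); Java 9+ strings look like "9.0.4" (major version first).
    """
    parts = version_string.split(".")
    return int(parts[1]) if parts[0] == "1" else int(parts[0])

print(java_major("1.8.0_91"))  # -> 8
```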
Please find the link I referred to for this step here.
Initially I was using Apache Kafka; later I came to know about Confluent Kafka. It was more impressive because:
- Its base version is completely open source
- It is Apache Kafka plus some additional features
- It is well documented, with more examples available
Confluent Kafka installation instructions can be found in their documentation, but I will still state them here.
First install Confluent’s public key, which is used to sign the packages in the apt repository.
$ wget -qO - http://packages.confluent.io/deb/3.0/archive.key | sudo apt-key add -
Add the repository to your /etc/apt/sources.list:
$ sudo add-apt-repository 'deb [arch=amd64] http://packages.confluent.io/deb/3.0 stable main'
Run apt-get update and install the Confluent Platform:
$ sudo apt-get update
$ sudo apt-get install confluent-platform-2.11
This completes the installation of Kafka; the packages are installed in the locations below.
/usr/bin/                   # Driver scripts for starting/stopping services, prefixed with <package> names
/etc/<package>/             # Configuration files
/usr/share/java/<package>/  # Jars
Now let us make sure our installation is fine by starting Kafka and then producing and consuming messages.
1. Start ZooKeeper :- since ZooKeeper is a long-running service, you should run it in its own terminal
$ sudo zookeeper-server-start /etc/kafka/zookeeper.properties #used sudo because you need write access to /var/lib
2. Start Kafka :- also in its own terminal.
$ sudo kafka-server-start /etc/kafka/server.properties
This brings up the services we need. Now we are ready to produce messages to topics and consume them.
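To double-check that both services are really listening before moving on, a quick Python sketch using only the standard library can probe the default ports (2181 for ZooKeeper, 9092 for the Kafka broker; adjust these if you changed the properties files):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Default ports: ZooKeeper on 2181, Kafka broker on 9092.
    for name, port in [("zookeeper", 2181), ("kafka", 9092)]:
        print(name, "up" if port_open("localhost", port) else "down")
```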
3. Start a producer :- use a new terminal
Kafka supports several kinds of producers. Let us test with the console producer because it is easy to create and use. Type the command below in its own terminal.
$ kafka-console-producer --broker-list localhost:9092 --topic test #This creates a topic 'test'
The above command creates a producer on our Kafka broker (server) that can publish messages to a topic called "test". Once started, the process waits for you to enter messages, one per line, and sends each to the Kafka queue as soon as you hit the Enter key. Let us try this by entering a couple of messages.
My name xyz
I installed kafka
It is working now
4. Start a Kafka consumer :- in a new terminal
Since we have already produced some messages on the test topic, let us consume them by typing the command below. The --from-beginning flag starts consuming from the beginning of the topic; by default, the consumer only reads messages published after it starts.
$ kafka-console-consumer --zookeeper localhost:2181 --topic test --from-beginning
You should see all the messages you typed in the producer terminal in step 3. Now you can type more messages in the producer terminal, and they will be delivered to the consumer as soon as you hit Enter for each one.
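The difference between --from-beginning and the default behavior can be pictured with a toy in-memory model of a topic. This is only an illustration of the offset semantics (the ToyTopic class is made up for this sketch, not Kafka code):

```python
class ToyTopic:
    """A minimal stand-in for a Kafka topic: an append-only log."""

    def __init__(self):
        self.log = []

    def produce(self, msg):
        self.log.append(msg)

    def consume_from(self, from_beginning=False):
        # --from-beginning starts a consumer at offset 0; the default
        # starts at the current end of the log (future messages only).
        start = 0 if from_beginning else len(self.log)
        return self.log[start:]

topic = ToyTopic()
for msg in ["My name xyz", "I installed kafka", "It is working now"]:
    topic.produce(msg)

print(topic.consume_from(from_beginning=True))  # all three messages
print(topic.consume_from())                     # [] -- only future messages
```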
When you're done testing, you can use Ctrl+C to shut down each service, in the reverse order you started them.
1. Download and install :- download Spark from here
The latest version of Spark is 1.6.2. I chose the pre-built package for Hadoop 2.6, with the file name spark-1.6.2-bin-hadoop2.6.tgz. It is around a 300MB file.
Once downloaded, move the file to /opt and extract it there. (You can move it to any location; /opt was my preference.)
$ sudo cp ~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz /opt/
$ cd /opt/
$ sudo tar -xvf spark-1.6.2-bin-hadoop2.6.tgz
This extracts to a new folder, spark-1.6.2-bin-hadoop2.6/. This is your Spark directory.
Since we want to run Spark from anywhere, it is better to add its bin directory to the PATH variable in the bash startup file. Use the commands below for that.
$ nano ~/.bashrc
Then add the lines below to the .bashrc file and save it.
export SPARK_HOME=/opt/spark-1.6.2-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH
Once done, make it effective using
$ source ~/.bashrc
And you’re done! You can now run the Spark binaries from the terminal no matter where you are.
Now let us test the installation and make sure everything works as expected. Since I am not familiar with Scala, I planned to use Spark's Python API, pyspark. Run the pyspark command, and it will show the Spark Python command line (REPL).
Now, using pyspark, let us load a local text file into Spark. Create a text file with some text in your home directory ($HOME) using gedit or any other editor.
In my file I inserted the text
My name xyz
I installed Spark
It is working now
I saved the file as textData.txt. To load the local text file, we need to type the command below at the pyspark prompt.
textFile = sc.textFile('textData.txt')
textFile.count()
It will return the count 3. Please don't worry about the verbose logs on the screen; that is Spark's default behavior.
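For a local file, count() is simply the number of lines; a plain-Python sketch of the same computation (independent of Spark, recreating the sample file in a temporary directory) shows why the answer is 3:

```python
import os
import tempfile

def line_count(path):
    """Plain-Python equivalent of sc.textFile(path).count() for a local file."""
    with open(path) as f:
        return sum(1 for _ in f)

# Recreate the three-line sample file and count its lines.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "textData.txt")
    with open(path, "w") as f:
        f.write("My name xyz\nI installed Spark\nIt is working now\n")
    print(line_count(path))  # -> 3
```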
We can also inspect the Spark web UI at http://localhost:4040/.
To come out of the pyspark REPL, use exit() or press Ctrl+D.
With this, I hope you will be able to install, configure, and test your Kafka and Spark environments on your machine.