Installing Kafka / Spark on Ubuntu 14.04 / 16.04 LTS

I recently started working on a stream analytics project, and after a lot of googling I zeroed in on Kafka + Spark as my tools of choice. Then came the question: how would I try this out? As a newbie in this field, I tried several options.

Initially I tried the Hortonworks HDP stack and installed it. But:

  • I required only Zookeeper, Kafka, and Spark, but the HDP stack bundles many more components
  • Even though HDP provides Ambari-based installation, it was still hard by my standards
  • HDP required integration with other components such as MySQL
  • The network configuration really gave me a hard time
  • Ambari provides a wonderful admin/monitoring tool, but it is a heavy background process
  • All this for a simple development environment with a stand-alone cluster

For these reasons, I decided to install Kafka and Spark separately. This blog post captures what I learned while setting up Kafka and Spark on Ubuntu Linux. It applies to both the 14.04 LTS and 16.04 LTS versions.

The major steps I followed are:

  1. Install Prerequisites
  2. Install Kafka
  3. Install Spark

Prerequisites

Java

Both Kafka and Spark need Java, so install it first in case it is not already available on your machine. You can install either OpenJDK or Oracle Java. I installed Oracle Java 8 using the commands below.


$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer

This downloads around 180 MB and installs Java automatically. To set up the Java 8 environment variables automatically, you can install the following package.

$ sudo apt-get install oracle-java8-set-default

Once the installation completes, make sure the expected Java version was installed by checking it:

$ java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
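
If you also installed the oracle-java8-set-default package, you can confirm the environment variables from a new terminal. The path below is an assumption based on where the installer usually puts the JDK; yours may differ.

$ echo $JAVA_HOME
/usr/lib/jvm/java-8-oracle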

The guide I referred to for this step is linked here.

Install Kafka

Initially I used Apache Kafka; later I came to know about Confluent Kafka. It was more impressive because:

  • Its base version is completely open source
  • It is Apache Kafka plus some additional features
  • It is well documented, with plenty of examples available

Confluent Kafka installation instructions can be found in their documentation, but I will restate them here.

Installation

First install Confluent’s public key, which is used to sign the packages in the apt repository.

$ wget -qO - http://packages.confluent.io/deb/3.0/archive.key | sudo apt-key add -

Add the repository to your /etc/apt/sources.list:

$ sudo add-apt-repository 'deb [arch=amd64] http://packages.confluent.io/deb/3.0 stable main'

Run apt-get update and install the Confluent Platform:

$ sudo apt-get update
$ sudo apt-get install confluent-platform-2.11

This completes the installation of Kafka. The packages are installed in the locations below.

/usr/bin/                  # Driver scripts for starting/stopping services, prefixed with <package> names
/etc/<package>/            # Configuration files
/usr/share/java/<package>/ # Jars

Testing

Now let us make sure our installation is fine by starting Kafka, then producing and consuming messages.

1. Start Zookeeper :- Since Zookeeper is a long-running service, you should run it in its own terminal

$ sudo zookeeper-server-start /etc/kafka/zookeeper.properties
# sudo is used because the service needs write access to /var/lib

2. Start Kafka :- also in its own terminal.

$ sudo kafka-server-start /etc/kafka/server.properties

This brings the services we need up and running. Now we are ready to produce messages to topics and consume them.
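
With both services up, you can optionally sanity-check the installation by listing the topics known to the broker. On a fresh install the list will be empty.

$ kafka-topics --list --zookeeper localhost:2181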

3. Start a producer :- use a new terminal

Kafka supports several kinds of producers. Let us test with the console producer because it is easy to run. Type the command below in its own terminal.

$ kafka-console-producer --broker-list localhost:9092 --topic test
#This auto-creates the topic 'test' if it does not already exist

The above command creates a producer against our Kafka broker (server) that can publish messages to a topic called “test”. Once started, the process waits for you to enter messages, one per line, and sends each one to the Kafka queue as soon as you hit the Enter key. Let us try this by entering a couple of messages.

My name xyz
I installed kafka
It is working now
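
As the comment above notes, the console producer relies on Kafka's automatic topic creation. If you would rather create the topic explicitly, for example to control partitions and replication, you can do so before starting the producer:

$ kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test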

4. Start Kafka Consumer :- in a new terminal

Since we have already produced some messages on the test topic, let us consume them by typing the command below. The --from-beginning flag starts consuming messages from the beginning of the topic; by default the consumer only reads messages published after it starts.

$ kafka-console-consumer --zookeeper localhost:2181 --topic test --from-beginning

You should see all the messages you typed into the producer terminal in step 3. You can now type more messages in the producer terminal, and each one will be delivered to the consumer as soon as you hit Enter.

When you’re done testing, you can use Ctrl+C to shut down each service, in the reverse order in which you started them.
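
The console tools are enough to verify the installation, but in a real project you will produce and consume from code. As an illustration only, here is a minimal sketch using the third-party kafka-python library; it is not part of the Confluent packages above and would first need a pip install kafka-python.

from kafka import KafkaConsumer, KafkaProducer

# Assumes the broker started above is listening on localhost:9092.
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('test', b'hello from python')
producer.flush()  # block until the message is actually sent

# Read the topic from the beginning, mirroring --from-beginning.
consumer = KafkaConsumer('test',
                         bootstrap_servers='localhost:9092',
                         auto_offset_reset='earliest',
                         consumer_timeout_ms=5000)  # give up after 5s of silence
for message in consumer:
    print(message.value)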

Install Spark

1. Download and Install :- download Spark from here

The latest version of Spark is 1.6.2. I chose the pre-built package for Hadoop 2.6, with the file name spark-1.6.2-bin-hadoop2.6.tgz. It is around a 300 MB file.

Once downloaded, move the file to /opt and extract it there. (You can move it to any location; /opt was my preference.)

$ sudo cp ~/Downloads/spark-1.6.2-bin-hadoop2.6.tgz /opt/
$ cd /opt/
$ sudo tar -xvf spark-1.6.2-bin-hadoop2.6.tgz

This extracts everything into a new folder, spark-1.6.2-bin-hadoop2.6/. This is your Spark directory.

2. Configure

Since we want to run Spark from anywhere, it is better to add its bin directory to the PATH variable in the bash startup file. Use the commands below for that.

$ nano ~/.bashrc

Then add the lines below to the .bashrc file and save it.

export SPARK_HOME=/opt/spark-1.6.2-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH

Once done, make it take effect using

$ source ~/.bashrc

And you’re done! You can now run the Spark binaries from the terminal no matter where you are.
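
To verify that the PATH change took effect, you can ask one of the Spark binaries for its version from a fresh terminal:

$ spark-submit --version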

3. Testing

Now let us test the installation and make sure everything is working as expected. Since I am not familiar with Scala, I planned to use Spark’s Python API, pyspark.

$ pyspark

It will show the Spark Python command line (REPL).

Now, using pyspark, let us load a local text file into Spark. Create a text file with some text in your home directory ($HOME) using gedit or any editor of your choice.

In my file I inserted the text

My name xyz
I installed Spark
It is working now

I saved the file as textData.txt. To load the local text file, type the commands below at the pyspark command line.

textFile = sc.textFile('textData.txt')
textFile.count()

It will return the count as 3. Please don’t worry about the verbose logs shown on the screen; that is Spark’s default behavior.
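
The line count works, but the classic first Spark program is a word count, since it shows the map/reduce style of the RDD API. Continuing in the same pyspark session, with textFile defined as above:

# Split each line into words, pair each word with 1, then sum per word.
words = textFile.flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())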

We can also inspect the Spark web UI at the URL http://localhost:4040/

To come out of the pyspark REPL, use the command below

exit()

Conclusion

With this, I hope you will be able to install, configure, and test Kafka and Spark environments on your machine.
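
Since the original goal was stream analytics, the natural next step is to wire the two together. As a teaser, here is a minimal sketch (not a production job) of a Spark Streaming word count over the 'test' topic; it assumes the broker from the Kafka section is still running on localhost:9092.

# kafka_wordcount.py
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaWordCount")
ssc = StreamingContext(sc, 2)  # 2-second micro-batches

# Messages arrive as (key, value) pairs; we only need the value.
stream = KafkaUtils.createDirectStream(
    ssc, ["test"], {"metadata.broker.list": "localhost:9092"})
counts = (stream.map(lambda kv: kv[1])
                .flatMap(lambda line: line.split())
                .map(lambda w: (w, 1))
                .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()

Submit it with the matching Kafka integration package for this Spark version, then type lines into the console producer and watch the counts appear in two-second batches:

$ spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.2 kafka_wordcount.py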
