Configure Spark Stand-Alone Cluster (Ubuntu 16.04/CentOS 7)

Before we start looking at how to proceed with a stand-alone installation, let us look at the ways a Spark application can run. The choices are:

  • Local mode: The driver and the worker not only run on the same machine, but in the same Java process. You can run on a single core, specify a number of cores, or request as many cores as the machine has (see the examples after this list). This installation is covered in my earlier post.
  • The Spark Standalone scheduler
  • A Mesos cluster manager
  • A YARN cluster
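
For reference, the local-mode variants map directly to the --master URL accepted by spark-shell and spark-submit (standard Spark master URLs, shown here purely as illustration, run from the Spark installation directory):

./bin/spark-shell --master local      # driver and executor share one core
./bin/spark-shell --master local[4]   # use four cores
./bin/spark-shell --master local[*]   # use as many cores as the machine has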

Now let us jump into the installation of a stand-alone cluster.

Prerequisite: sshd Service

In a Spark stand-alone cluster, the master and worker nodes need to communicate with each other without passwords. For that we need to make sure passwordless SSH is installed and configured correctly.

First, check whether the OpenSSH server is installed:

systemctl status sshd

If the OpenSSH server is not available, install it:

sudo yum install openssh-server #CentOS 7
sudo apt-get install openssh-server #Ubuntu (most Ubuntu versions ship with only the SSH client)

To start the SSH daemon (sshd):

systemctl start sshd

After you have successfully started the sshd daemon, check the service status:

systemctl status sshd

The next time you reboot your Linux box, you would have to start the ssh service manually again. To start the sshd service automatically after a reboot, enter the following command in your terminal:

systemctl enable sshd
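
To confirm that the service is now set to start at boot, you can additionally run:

systemctl is-enabled sshd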

With this, the OpenSSH server is installed.

If you are going to install multiple worker nodes on different machines, make sure that all the worker machines have sshd installed.

Before going any further, ensure that /etc/ssh/sshd_config on the host contains the following lines, and that they are uncommented:

PubkeyAuthentication yes
RSAAuthentication yes
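
A quick way to verify both settings at once is the grep below, which prints nothing if the lines are missing or commented out; if you do end up editing the file, restart sshd afterwards:

grep -E '^(PubkeyAuthentication|RSAAuthentication)' /etc/ssh/sshd_config
sudo systemctl restart sshd   # only needed if you changed the file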

Create Spark user/group

Linux best practice recommends minimal permissions, so it is better to run Spark under its own user. For this, add a new user and group named “spark”:

sudo groupadd spark
sudo useradd -g spark spark

By default, a home directory for spark should be created at /home/spark. If it is not created, do it manually:

cd /home
sudo mkdir spark
sudo chown -R spark:spark /home/spark/
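
As an aside, if you have not created the user yet, useradd can create the home directory in the same step, making the manual steps above unnecessary:

sudo useradd -m -g spark spark   # -m creates /home/spark with correct ownership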

Now you can switch to the spark user without a password using the commands below:

sudo su spark
cd /home/spark/

Now you should be able to connect to localhost without a password. Try the following command to check:

ssh localhost

If this asks for a password (the local machine user's password), execute the following to generate a public/private RSA key pair and allow passwordless SSH. Make sure you are logged in as the “spark” user and are in the “spark” user's home directory.

rm -f ~/.ssh/id_rsa
ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 644 ~/.ssh/authorized_keys
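
If your workers run on separate machines, the same public key must also end up in the spark user's ~/.ssh/authorized_keys on every worker. A minimal sketch, assuming a hypothetical worker host named worker1 with the spark user already created there:

ssh-copy-id spark@worker1   # repeat for each worker host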

Try the following command again to check whether ssh can now access localhost without a password.

ssh localhost

Install Spark

Download the Spark tarball, move it to /opt, and extract it. I have used /opt for the Spark home; it is not necessary that you use the same folder.
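
If you have not downloaded the tarball yet, it is available from the Apache archive (URL assumed from the standard archive layout):

wget https://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz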

sudo cp spark-1.6.1-bin-hadoop2.6.tgz /opt/
cd /opt
sudo tar -xvf spark-1.6.1-bin-hadoop2.6.tgz

Now give the spark user ownership of the Spark directory:

sudo chown -R spark:spark /opt/spark-1.6.1-bin-hadoop2.6

Now set up a symbolic link for the Spark directory. This will be helpful if you are going to run multiple versions of Spark:

sudo ln -s /opt/spark-1.6.1-bin-hadoop2.6/ /opt/spark
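
Optionally, you can point the spark user's environment at the symlink so a future upgrade only requires re-pointing the link (a convenience; nothing below depends on it). Run these as the spark user:

echo 'export SPARK_HOME=/opt/spark' >> /home/spark/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> /home/spark/.bashrc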

Also, set up the Spark working directories with appropriate permissions:

sudo mkdir -p /var/spark/{logs,work,tmp,pids}
sudo chown -R spark:spark /var/spark
sudo chmod 4755 /var/spark/tmp

Spark Configurations

Now we need to configure Spark as a stand-alone cluster. I have created a stand-alone cluster with one master and two worker nodes on a single machine. Even if you are planning to have separate machines for the master and worker nodes, the instructions below should still hold good.

To create a stand-alone cluster, create the spark-env.sh file from the template:

cd /opt/spark-1.6.1-bin-hadoop2.6/conf/
sudo cp spark-env.sh.template spark-env.sh

And add the entries below to the file. Note that you can configure the cores, memory, and ports for the master and workers here.

export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY="2g"
export SPARK_WORKER_PORT=5000
export SPARK_EXECUTOR_INSTANCES=2
#export SPARK_WORKER_INSTANCES=2 #deprecated in Spark 1.0+; use spark-submit with --num-executors to specify the number of executors
export SPARK_CONF_DIR="/opt/spark-1.6.1-bin-hadoop2.6/conf"
#export SPARK_TMP_DIR="/var/spark/tmp"
export SPARK_PID_DIR="/var/spark/pids"
export SPARK_LOG_DIR="/var/spark/logs"
export SPARK_WORKER_DIR="/var/spark/work"
export SPARK_MASTER_IP="YOUR_MASTER_IP" #please give your master node's IP address here.
export SPARK_MASTER_PORT=7077
export SPARK_LOCAL_IP="127.0.0.1" #on a multi-machine setup, set this to each node's own IP address

Now create the spark-defaults.conf file from its template. We will use this later:

sudo cp spark-defaults.conf.template spark-defaults.conf

To configure worker nodes, create a slaves file from its template:

sudo cp slaves.template slaves
sudo nano slaves

By default, the template contains an entry for one worker node. Since I have created two worker nodes, we have to add one more entry (localhost in my case, since I am using the same machine for the master and the workers) to the slaves file.

So the slaves file content should be:

localhost
localhost
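
For comparison, if the two workers lived on separate machines, the slaves file would instead list one hostname or IP address per line (worker1 and worker2 are hypothetical names):

worker1
worker2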

Additional Configurations

If a MySQL connection is needed from Spark

If you are planning to connect to MySQL from Spark, you need to add the MySQL JDBC driver (otherwise you will get java.sql.SQLException: No suitable driver). For that, download the latest mysql-connector-java-xxx-bin.jar. I have placed it in /usr/share/java.

Add this path to the spark-defaults.conf file like below:

spark.driver.extraClassPath /usr/share/java/mysql-connector-java-5.1.37-bin.jar
spark.executor.extraClassPath /usr/share/java/mysql-connector-java-5.1.37-bin.jar

If you are using Spark Streaming with Kafka

If you are using Spark Streaming with Kafka (I do), then you need an additional library, spark-streaming-kafka-assembly.XXXX.jar. Its version should be the same as your Spark version, and it can be downloaded from the Maven repository. Download it, move it to the $SPARK_HOME/lib folder (just my preference), and reference it in your spark-submit command, as shown below.
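
For illustration, a spark-submit invocation referencing the jar might look like the sketch below; the assembly jar version, application class, and application jar are placeholders you would replace with your own:

/opt/spark/bin/spark-submit \
  --master spark://YOUR_MASTER_IP:7077 \
  --jars /opt/spark/lib/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
  --class com.example.MyStreamingApp \
  /path/to/my-streaming-app.jar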

Since the new files will be created with root as the owner and group, make sure all permissions are given to the “spark” user once again:

sudo chown -R spark:spark /opt/spark-1.6.1-bin-hadoop2.6

Testing the Stand-Alone Installation

Now the Spark installation is ready and all configurations are complete. It is time to test and make sure everything is working as expected. Let us start the cluster and bring the master and worker nodes online:

sudo su spark
cd /opt/spark-1.6.1-bin-hadoop2.6/sbin/

Now start the cluster using the command below.

./start-all.sh

Your Spark stand-alone cluster will start at spark://YOUR_MASTER_IP:7077. A corresponding web UI will be up at localhost:8080, where you can verify the master and slave nodes. You can also go ahead and check the logs in /var/spark/logs/.
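
To exercise the cluster end to end, you can also submit the bundled SparkPi example against the master (the examples jar ships under lib/ in the 1.6.1 binary distribution; the exact file name may differ on your machine):

/opt/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://YOUR_MASTER_IP:7077 \
  /opt/spark/lib/spark-examples-1.6.1-hadoop2.6.0.jar 10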

To stop the master and worker nodes:

./stop-all.sh

With this, you have installed and done the necessary configuration of a Spark stand-alone cluster on Ubuntu/CentOS. Hope you have enjoyed it.

