My recent Spark Streaming project accepts streaming data from various IoT devices; the streaming application enriches that data and produces analytical outcomes. Our Spark Streaming infrastructure comprises the components below.
- Apache Spark 1.6.1 – for stream processing
- Kafka broker – to queue the data from the various IoT devices
- Python – my language of choice
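To make the setup concrete, here is a minimal sketch of how these three components could fit together. It is not our project code: the topic name `iot-events`, the broker address `localhost:9092`, the 10-second batch interval, the `enrich` helper, and the `device_locations` lookup are all assumptions made up for illustration.

```python
import json


def enrich(raw_message, device_locations):
    """Parse one JSON reading and attach the device's location (hypothetical lookup)."""
    reading = json.loads(raw_message)
    reading["location"] = device_locations.get(reading.get("device_id"), "unknown")
    return reading


def main():
    # Imported here so enrich() can be tried out without a Spark installation.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils  # Spark 1.x Kafka integration

    device_locations = {"sensor-1": "plant-a"}  # hypothetical reference data

    sc = SparkContext(appName="IotEnrichmentSketch")
    ssc = StreamingContext(sc, 10)  # 10-second batches (assumed)

    stream = KafkaUtils.createDirectStream(
        ssc,
        ["iot-events"],  # hypothetical topic name
        {"metadata.broker.list": "localhost:9092"},  # assumed broker address
    )
    # Each Kafka record arrives as a (key, value) pair; the value holds the payload.
    enriched = stream.map(lambda kv: enrich(kv[1], device_locations))
    enriched.pprint()

    ssc.start()
    ssc.awaitTermination()
```

To run it as a job, add a call to `main()` at the end of the script and submit it with `spark-submit`. The Kafka integration is not bundled with Spark itself, so the matching `spark-streaming-kafka` package has to be supplied at submit time.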
Even with limited experience in Linux and open source, I was able to install the required components (explained in an earlier post) and start development. Developing the use cases completely took its own time, but the real struggle began when I started thinking about productionizing the streaming application. I was stumped by some basic productionization questions, such as:
- How do I actually install Spark on a production system?
- What kind of clustering should I use: Standalone, Mesos, or YARN?
- How do I start and stop Spark and Kafka automatically?
- How do I get notified if one of these essential services fails?
- How do I submit a Spark job to the Spark cluster?
- How do I achieve zero data loss if one of the services (Spark, Kafka, or ZooKeeper) fails?
- How do I test my production infrastructure and run a load test?
Over the next series of posts, I will share the solutions we came up with for each of the items above. I hope you find some useful information along the way.
A word of caution: this post shares my own experience (and googling) and the approach we took for our project. Better, more standard approaches may well exist, and I hope someone can suggest them so that we can incorporate them into our project as well.