Implementing trace processing with Apache Flink
Recently, we decided to rewrite the delivery trace processing project from a new perspective.
Naturally, a few questions arise: What is a trace? What is its significance? What are the current challenges?
And what is the best strategy to overcome them?
What is a Trace?
A trace is a collection of latitude/longitude coordinates, i.e. a series of geolocations called pings, recorded during a
field executive's (FE's) visits for a dispatch.
Apart from these pings, a trace also contains events such as package delivered, customer called, replaced,
pickup, pending, failed, etc.
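The post does not show the record schema, so here is a hypothetical sketch of what a single trace record might look like; all field and type names below are illustrative assumptions, not the real schema.

```java
import java.time.Instant;

// Hypothetical shape of one trace record; the real schema may differ.
public class TracePing {
    public String dispatchId;   // dispatch this ping belongs to
    public double latitude;     // FE geolocation at the time of the ping
    public double longitude;
    public Instant capturedAt;  // when the ping was recorded
    public String event;        // optional event, e.g. "DELIVERED", "PICKUP", "FAILED"
}
```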
What is the significance of a Trace?
Trace is an important dataset for delivery. It gives a complete picture of what has been done on the ground:
the number of packages delivered per day, the number of packages that failed to be delivered,
the most prominent reasons for delivery failures, which locality was served the most in the past 10 days,
a list of the top 7 star facilities serving the maximum load, and so on.
Trace data not only gives cumulative numbers but also reveals areas of improvement.
This data is extremely valuable, as it can accelerate business growth by helping eliminate the root causes of failures.
The old way of processing traces, and its challenges
Traces were processed via Python cron jobs that ran daily on the data collected in AWS S3 for the current day minus two.
The processing took 7 hours on average, and it was a purely batch processing system.
There were two challenges in front of us: a) to reduce the trace processing time, and b) to make trace processing more real-time.
Deciding on the best strategy to overcome the current challenges
We decided to convert the batch system into more of a stream processing architecture.
What is Stream Processing?
Life is 10% what happens to you and 90% how you react to it.
In real life, we receive information, process it and react to it. Reaction time can make all the difference between success and failure.
For example, while driving, reaction time can make the difference between life and death. Reaction time is important for businesses too.
A system that receives, processes, and reacts to data in real time is called a stream processing system.
Stream Processing vs Batch Processing
In batch processing, data is collected over a period and processed all at once, so insights arrive only after the batch completes.
In stream processing, each record is processed as it arrives, so the system can react within seconds. The sketch below contrasts the two in Flink's own APIs.
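A minimal illustration, assuming a Flink version that still ships the batch DataSet API; the data and names are made up.

```java
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchVsStream {
    public static void main(String[] args) throws Exception {
        // Batch: the whole bounded dataset is available before processing starts;
        // print() on a DataSet triggers execution eagerly.
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        batchEnv.fromElements("ping-1", "ping-2", "ping-3")
                .map(p -> "processed " + p)
                .print();

        // Stream: each record is processed as it arrives; with an unbounded
        // source (e.g. Kinesis) this job would keep running and reacting forever.
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamEnv.fromElements("ping-1", "ping-2", "ping-3")
                 .map(p -> "processed " + p)
                 .print();
        streamEnv.execute("stream-vs-batch");
    }
}
```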
How does stream processing fit into the trace project?
We want to take immediate action on important events, such as a delivered mark, an address-invalid event, a pickup event, etc.,
and we also want to process the complete trace data as a batch once the dispatch is over.
Hence, we were looking for an architecture that supports both stream and batch processing capabilities.
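For the streaming half, the idea is that important events can be filtered out of the live stream and acted on immediately. A minimal sketch, with a hypothetical TraceEvent type and illustrative event names, not the actual production code:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ImmediateActions {
    // Hypothetical event record; the real schema may differ.
    public static class TraceEvent {
        public String dispatchId;
        public String type; // e.g. "DELIVERED", "ADDRESS_INVALID", "PICKUP"
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In production this would be a real source (S3 today, Kinesis later).
        DataStream<TraceEvent> events = env.fromElements(new TraceEvent());

        // React to important events right away instead of waiting for the batch run.
        events.filter(e -> "DELIVERED".equals(e.type) || "ADDRESS_INVALID".equals(e.type))
              .map(e -> "notify downstream systems for dispatch " + e.dispatchId)
              .print();

        env.execute("immediate-actions");
    }
}
```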
Trace processing thus became our dream project, and to fulfill this dream we decided to move step by step.
First, we had to decide which stream processing engine to choose.
At that time, these were the 7 most popular stream processing engines in the market:
Apache Spark
Apache Storm
Apache Samza
Apache Flink
Amazon Kinesis Stream
Apache Apex
Apache Flume
After long discussions, we finally chose Apache Flink for its popular “stream first” architecture, which is Flink’s USP (unique selling point).
Second, we had to decide on our approach to make trace processing more real-time.
We decided to rewrite trace processing in Apache Flink and, for our first release, keep AWS S3 as the only source.
About Apache Flink
Flink is a stream processing technology with the added capability to do many other things, such as batch processing, graph algorithms,
machine learning, etc. Using Flink, we can build applications that are highly responsive to the latest data, such as triggering trades
based on stock price movements in the stock market. Flink uses a streaming-first architecture:
data streams are first-class citizens in Flink. Since we want an end-to-end data processing pipeline that works with both streaming
and batch data, Apache Flink is a perfect fit for us.
Implementing trace using Apache Flink
With our goal decided, the next step was to implement the dream project.
We started by setting up a Flink cluster in a development environment with one JobManager and two TaskManagers.
Flink supports both batch and stream APIs.
We decided to go with Apache Flink's stream API and read trace files from S3 as a stream.
Trace processing has two stages: 1) preprocessing and 2) post-processing.
In the preprocessing stage, the data is cleaned and data from external sources is merged with the trace data.
In the post-processing stage, the geo-tracker API is called, which resolves coordinates and performs data-science-related work on the trace data.
A snippet of code
A DataStream of dispatches is created with the name fileKeyStream, and the preprocessor and postprocessor methods are called on it.
The preprocessor and postprocessor methods are sets of map functions applied to the stream.
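The original snippet is not reproduced here, so the following is a minimal reconstruction of that shape. The S3 path, the String payloads, and the placeholder bodies are assumptions; only fileKeyStream and the two map stages come from the description above.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TraceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stream of S3 object keys, one per dispatch trace file; readTextFile
        // works with s3:// paths once the S3 filesystem plugin is configured.
        DataStream<String> fileKeyStream = env.readTextFile("s3://trace-bucket/dispatch-keys");

        fileKeyStream
            .map(new PreProcessor())   // clean the data, merge external sources
            .map(new PostProcessor())  // call the geo-tracker API, run data-science steps
            .print();

        env.execute("trace-processing");
    }

    // Each stage is a map function on the stream, as described above.
    public static class PreProcessor implements MapFunction<String, String> {
        @Override
        public String map(String fileKey) {
            // Placeholder: fetch the trace file for this key, clean and enrich it.
            return fileKey;
        }
    }

    public static class PostProcessor implements MapFunction<String, String> {
        @Override
        public String map(String cleanedTrace) {
            // Placeholder: resolve coordinates via the geo-tracker API.
            return cleanedTrace;
        }
    }
}
```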
Production Setup
We set up a Flink cluster on the AWS cloud with one JobManager and 5 TaskManagers (2 dedicated EC2 machines and 3 spot instances).
Each TaskManager has 8 slots, hence 40 slots in total, so 40 dispatch trace files can be processed in parallel.
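For reference, the slots per TaskManager are set via taskmanager.numberOfTaskSlots in flink-conf.yaml, and a job claims them through its parallelism. A minimal sketch mirroring our 40-slot setup (the class name is illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismConfig {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 5 TaskManagers x 8 slots = 40 slots; a job parallelism of 40 runs
        // 40 parallel subtasks, i.e. 40 dispatch trace files at once.
        env.setParallelism(40);
    }
}
```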
Performance Optimizations
The batch task that previously took 7 hours can now be completed in one hour via Flink.
With Apache Flink, we were able to process 39 GB of trace data in one hour:
an ~85 percent reduction in processing time, which is certainly huge.
Please also note that the system is horizontally scalable provided downstream systems and APIs can also be scaled simultaneously.
Future Work
Our dream project is half done. In the future, we will switch the source from AWS S3 to AWS Kinesis to bring
trace processing closer to real-time.
We are also working on fallback mechanisms: if the cluster goes down, it should recover to the state it was in when the failure
occurred, making the system more robust and real-time.
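Flink's built-in checkpointing is the standard mechanism for this kind of recovery, and a Kinesis source is available via the flink-connector-kinesis module. A minimal sketch, assuming an illustrative stream name, AWS region, and checkpoint interval:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;

public class RealtimeTraceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot operator state every 60s; on failure the job restarts from
        // the last successful checkpoint instead of reprocessing everything.
        env.enableCheckpointing(60_000);

        Properties consumerConfig = new Properties();
        consumerConfig.put(AWSConfigConstants.AWS_REGION, "ap-south-1"); // illustrative region

        // Read raw trace records from a Kinesis stream instead of S3 files.
        env.addSource(new FlinkKinesisConsumer<>(
                "trace-stream",          // illustrative stream name
                new SimpleStringSchema(),
                consumerConfig))
           .print();

        env.execute("realtime-trace-processing");
    }
}
```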