Implementing trace processing with Apache Flink
Recently, we decided to rewrite the delivery trace processing project from a new perspective.
Naturally, a few questions arise: What is a trace? What is its significance? What are the current challenges?
And what is the best strategy to overcome them?
What is a Trace?
A trace is a collection of latitude/longitude coordinates, i.e. a series of geolocations called pings, recorded during a
field executive's (FE's) visits for a dispatch.
Apart from these pings, a trace also contains events such as package delivered, customer called, replaced,
pickup, pending, failed, etc.
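The post does not show the record schema, so here is a hypothetical sketch of what a single trace record might look like; all field and type names below are illustrative assumptions, not the real schema.

```java
import java.time.Instant;

// Hypothetical shape of one trace record; the real schema may differ.
public class TracePing {
    public String dispatchId;   // dispatch this ping belongs to
    public double latitude;     // FE geolocation at the time of the ping
    public double longitude;
    public Instant capturedAt;  // when the ping was recorded
    public String event;        // optional event, e.g. "DELIVERED", "PICKUP", "FAILED"
}
```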
What is the significance of a Trace?
Trace is an important dataset for delivery. It gives a complete picture of what has been done on the ground:
the number of packages delivered per day, the number of packages that failed to be delivered,
the most prominent reasons for delivery failures, which locality was served the most in the past 10 days,
a list of the top 7 star facilities serving the maximum load, and so on.
Trace data not only gives cumulative numbers but also reveals areas of improvement.
This data is extremely valuable, as it can accelerate business growth by helping eliminate the root causes of failures.
The old way of processing traces, and its challenges
Traces were processed via Python cron jobs that ran daily on the data collected in AWS S3 for the current day minus two.
The processing took 7 hours on average, and it was a purely batch processing system.
There were two challenges in front of us: a) to reduce the trace processing time, and b) to make trace processing more real-time.
Deciding on the best strategy to overcome the current challenges
We decided to convert the batch system into more of a stream processing architecture.
What is Stream Processing?
Life is 10% what happens to you and 90% how you react to it.
In real life, we receive information, process it and react to it. Reaction time can make all the difference between success and failure.
For example, while driving, reaction time can make the difference between life and death. Reaction time is important for businesses too.
A system that receives, processes, and reacts to data in real time is called a stream processing system.
Stream Processing vs Batch Processing
In batch processing, data is collected over a period and processed all at once, so insights arrive only after the batch completes.
In stream processing, each record is processed as it arrives, so the system can react within seconds. The sketch below contrasts the two in Flink's own APIs.
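A minimal illustration, assuming a Flink version that still ships the batch DataSet API; the data and names are made up.

```java
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchVsStream {
    public static void main(String[] args) throws Exception {
        // Batch: the whole bounded dataset is available before processing starts;
        // print() on a DataSet triggers execution eagerly.
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        batchEnv.fromElements("ping-1", "ping-2", "ping-3")
                .map(p -> "processed " + p)
                .print();

        // Stream: each record is processed as it arrives; with an unbounded
        // source (e.g. Kinesis) this job would keep running and reacting forever.
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamEnv.fromElements("ping-1", "ping-2", "ping-3")
                 .map(p -> "processed " + p)
                 .print();
        streamEnv.execute("stream-vs-batch");
    }
}
```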
How does stream processing fit into the trace project?
We want to take immediate action on important events, such as a delivered mark, an address-invalid event, a pickup event, etc.,
and we also want to process the complete trace data as a batch once the dispatch is over.
Hence, we were looking for an architecture that supports both stream and batch processing capabilities.
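For the streaming half, the idea is that important events can be filtered out of the live stream and acted on immediately. A minimal sketch, with a hypothetical TraceEvent type and illustrative event names, not the actual production code:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ImmediateActions {
    // Hypothetical event record; the real schema may differ.
    public static class TraceEvent {
        public String dispatchId;
        public String type; // e.g. "DELIVERED", "ADDRESS_INVALID", "PICKUP"
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In production this would be a real source (S3 today, Kinesis later).
        DataStream<TraceEvent> events = env.fromElements(new TraceEvent());

        // React to important events right away instead of waiting for the batch run.
        events.filter(e -> "DELIVERED".equals(e.type) || "ADDRESS_INVALID".equals(e.type))
              .map(e -> "notify downstream systems for dispatch " + e.dispatchId)
              .print();

        env.execute("immediate-actions");
    }
}
```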
Trace processing thus became our dream project, and to fulfill this dream we decided to move step by step.
First, we had to decide which stream processing engine to choose.
At that time, these were the 7 most popular stream processing engines in the market:
Apache Spark
Apache Storm
Apache Samza
Apache Flink
Amazon Kinesis Stream
Apache Apex
Apache Flume
After long discussions, we finally chose Apache Flink for its popular “stream first” architecture, which is Flink’s USP (unique selling point).
Second, we had to decide on our approach to make trace processing more real-time.
We decided to rewrite trace processing in Apache Flink and, for our first release, keep AWS S3 as the only source.
About Apache Flink
Flink is a stream processing technology with the added capability to do many other things, such as batch processing, graph algorithms,
machine learning, etc. Using Flink, we can build applications that are highly responsive to the latest data, such as triggering trades
based on stock price movements in the stock market. Flink uses a streaming-first architecture:
data streams are first-class citizens in Flink. Since we want an end-to-end data processing pipeline that works with both streaming
and batch data, Apache Flink is a perfect fit for us.
Implementing trace using Apache Flink
With our goal decided, the next step was to implement the dream project.
We started by setting up a Flink cluster in a development environment with one JobManager and two TaskManagers.
Flink supports both batch and stream APIs.
We decided to go with Apache Flink's stream API and read trace files from S3 as a stream.
Trace processing has two stages: 1) preprocessing and 2) post-processing.
In the preprocessing stage, the data is cleaned and data from external sources is merged with the trace data.
In the post-processing stage, the geo-tracker API is called, which resolves coordinates and performs data-science-related work on the trace data.
A snippet of code
A DataStream of dispatches is created with the name fileKeyStream, and the preprocessor and postprocessor methods are called on it.
The preprocessor and postprocessor methods are sets of map functions applied to the stream.
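The original snippet is not reproduced here, so the following is a minimal reconstruction of that shape. The S3 path, the String payloads, and the placeholder bodies are assumptions; only fileKeyStream and the two map stages come from the description above.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TraceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stream of S3 object keys, one per dispatch trace file; readTextFile
        // works with s3:// paths once the S3 filesystem plugin is configured.
        DataStream<String> fileKeyStream = env.readTextFile("s3://trace-bucket/dispatch-keys");

        fileKeyStream
            .map(new PreProcessor())   // clean the data, merge external sources
            .map(new PostProcessor())  // call the geo-tracker API, run data-science steps
            .print();

        env.execute("trace-processing");
    }

    // Each stage is a map function on the stream, as described above.
    public static class PreProcessor implements MapFunction<String, String> {
        @Override
        public String map(String fileKey) {
            // Placeholder: fetch the trace file for this key, clean and enrich it.
            return fileKey;
        }
    }

    public static class PostProcessor implements MapFunction<String, String> {
        @Override
        public String map(String cleanedTrace) {
            // Placeholder: resolve coordinates via the geo-tracker API.
            return cleanedTrace;
        }
    }
}
```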
Production Setup
We set up a Flink cluster on the AWS cloud with one JobManager and 5 TaskManagers (2 dedicated EC2 machines and 3 spot instances).
Each TaskManager has 8 slots, hence 40 slots in total, so 40 dispatch trace files can be processed in parallel.
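For reference, the slots per TaskManager are set via taskmanager.numberOfTaskSlots in flink-conf.yaml, and a job claims them through its parallelism. A minimal sketch mirroring our 40-slot setup (the class name is illustrative):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismConfig {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 5 TaskManagers x 8 slots = 40 slots; a job parallelism of 40 runs
        // 40 parallel subtasks, i.e. 40 dispatch trace files at once.
        env.setParallelism(40);
    }
}
```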
Performance Optimizations
The batch task that previously took 7 hours can now be completed in one hour via Flink.
With Apache Flink, we were able to process 39 GB of trace data in one hour:
an ~85 percent reduction in processing time, which is certainly huge.
Please also note that the system is horizontally scalable provided downstream systems and APIs can also be scaled simultaneously.
Future Work
Our dream project is half done. In the future, we will switch the source from AWS S3 to AWS Kinesis to bring
trace processing closer to real-time.
We are also working on fallback mechanisms: if the cluster goes down, it should recover to the state it was in when the failure
occurred, making the system more robust and real-time.
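Flink's built-in checkpointing is the standard mechanism for this kind of recovery, and a Kinesis source is available via the flink-connector-kinesis module. A minimal sketch, assuming an illustrative stream name, AWS region, and checkpoint interval:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;

public class RealtimeTraceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot operator state every 60s; on failure the job restarts from
        // the last successful checkpoint instead of reprocessing everything.
        env.enableCheckpointing(60_000);

        Properties consumerConfig = new Properties();
        consumerConfig.put(AWSConfigConstants.AWS_REGION, "ap-south-1"); // illustrative region

        // Read raw trace records from a Kinesis stream instead of S3 files.
        env.addSource(new FlinkKinesisConsumer<>(
                "trace-stream",          // illustrative stream name
                new SimpleStringSchema(),
                consumerConfig))
           .print();

        env.execute("realtime-trace-processing");
    }
}
```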