Why Stream Processing?

 

 

In today’s world of data processing, everyone is talking about the cloud and the latest tools for processing big, unstructured data. The cloud is a great tool in your toolbox, but it can’t handle all your big data needs. What if your large volume of data is not stored on one or several machines waiting to be processed, but is instead delivered continuously from one or more high-volume sources? Instead of a cloud full of data, you have a stream of data. This article delves into what Stream Processing is and when you need to use it.

 
Brief History of Stream Processing

Stream Processing[i] has roots in the 1970s, in what was called Single Instruction, Multiple Data (SIMD) computing. The first use of SIMD[ii] was within vector supercomputers of the early 1970s. Essentially, the same operation is applied to many data elements in parallel. SIMD is still in use today and can be seen in graphics and multimedia applications. Stream Processing is also closely associated with Real Time Computing and Reactive Computing[iii]. It is sometimes called Real Time Computing because it performs many small operations on data in sequence, so that the data keeps flowing through the system as close to real time as possible. It is sometimes called Reactive Computing because it lets you propagate changes through the data flow.

In the 1980s, stream processing was explored as dataflow programming[iv]. Dataflow programming models a program as a directed graph of the data flowing between operations. This becomes apparent in stream processing platforms such as InfoSphere Streams and Storm, where operators (in InfoSphere Streams) or spouts and bolts (in Storm) are connected together in directed graphs.
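To make the directed-graph idea concrete, here is a minimal sketch in plain Python. It does not use InfoSphere Streams or Storm; the stage names (source, to_fahrenheit, above) and the connect helper are hypothetical, chosen only for illustration. Each generator acts as a node, and chaining them forms the edges along which data flows.

```python
from typing import Callable, Iterable, Iterator

Stage = Callable[[Iterable[float]], Iterator[float]]

def source(readings: Iterable[float]) -> Iterator[float]:
    """Source node: emits readings one at a time (hypothetical stand-in for a spout)."""
    for reading in readings:
        yield reading

def to_fahrenheit(stream: Iterable[float]) -> Iterator[float]:
    """Operator node: converts each Celsius value to Fahrenheit."""
    for celsius in stream:
        yield celsius * 9 / 5 + 32

def above(threshold: float) -> Stage:
    """Builds an operator node that passes only values greater than threshold."""
    def stage(stream: Iterable[float]) -> Iterator[float]:
        for value in stream:
            if value > threshold:
                yield value
    return stage

def connect(src: Iterable[float], *stages: Stage) -> Iterable[float]:
    """Wires the nodes together: each stage consumes the output of the previous one."""
    stream = src
    for stage in stages:
        stream = stage(stream)
    return stream

# Graph: source -> to_fahrenheit -> above(100.0)
pipeline = connect(source([20.0, 35.0, 40.0]), to_fahrenheit, above(100.0))
alerts = list(pipeline)
print(alerts)
```

Because every stage is a generator, each value moves through the whole graph before the next one is read, which is the flow-through behavior the dataflow model describes. A real platform would also distribute the nodes across machines, which this sketch does not attempt.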

 

When Is the Best Time to Use a Stream Processing Language?

The reasons to use a Stream Processing language can be simplified into three main ones: you have high throughput, the pieces of data are independent of each other, and you want to perform the same operation on all of the data.

 
1. High Throughput
When you have a lot of data coming at you quickly and continuously, you don’t have the time or space to save all of it to a hard drive and then iterate over it. Sometimes you don’t even want to keep the data around.

If data is arriving very fast and you need to make quick decisions on it, then Stream Processing might be right for you. Examples include sensor readings and video or audio streams.
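One way to picture processing data without saving it is a running aggregate that keeps only a small, fixed amount of state. The sketch below is an illustration under that assumption, not taken from any particular Stream Processing language; the function name running_mean is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

def running_mean(stream: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yields (count, mean so far) after each value, storing only two numbers."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: no earlier values need to be kept around.
        mean += (value - mean) / count
        yield count, mean

for count, mean in running_mean([10.0, 20.0, 30.0]):
    print(count, mean)
```

The input can be any iterable, including one that never ends, because nothing here needs the full data set before producing a result.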

 
2. Pieces of data are independent of each other
Many Stream Processing languages promise to operate on all the data you have, but that isn’t always possible. Sometimes the throughput is too high: time-consuming operations become a bottleneck, the data slows down and backs up, and eventually some of it is lost. Things outside of your control, such as connection loss or hardware failures, can also cause you to lose data. Because any individual piece of data may never arrive, each piece needs to be useful on its own.

If the data you are dealing with can be operated on independently of the data that came before or after it, then Stream Processing might be what you need. Examples include checking whether a temperature reading is greater than a certain value, or whether a heart rate is lower than a certain number.
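The temperature and heartbeat checks can be sketched as a function that looks at one reading at a time and nothing else. The limits and the names check_reading, TEMP_LIMIT_C, and MIN_HEART_RATE_BPM are hypothetical values chosen for illustration.

```python
from typing import List, Optional, Tuple

TEMP_LIMIT_C = 30.0        # hypothetical alert threshold
MIN_HEART_RATE_BPM = 50.0  # hypothetical alert threshold

def check_reading(kind: str, value: float) -> Optional[str]:
    """Decides on a single reading with no reference to earlier or later ones."""
    if kind == "temperature" and value > TEMP_LIMIT_C:
        return f"temperature high: {value}"
    if kind == "heartbeat" and value < MIN_HEART_RATE_BPM:
        return f"heartbeat low: {value}"
    return None

def alerts_for(readings: List[Tuple[str, float]]) -> List[str]:
    """Applies check_reading to each reading and keeps only the alerts."""
    results = (check_reading(kind, value) for kind, value in readings)
    return [alert for alert in results if alert is not None]

readings = [("temperature", 31.5), ("heartbeat", 72.0), ("heartbeat", 45.0)]
print(alerts_for(readings))
```

Since no reading depends on its neighbors, dropping the middle reading from the list leaves the alerts for the other two unchanged, which is why lost data is tolerable in this kind of workload.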

 

3. Same operation on all of the data
With Stream Processing, you don’t have time to perform expensive, computationally intensive operations over all the data. You don’t even have all the data at once. Most of the time you are just looking at one small piece of data, and you need to make a quick decision about it.

For example: is the speed from this traffic sensor greater or less than the speed limit? Did this stock price increase or decrease by a certain percentage compared to the last interval?
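The stock price question can be sketched as one small operation applied to every price as it arrives, remembering only the previous interval’s price. The function name percent_moves and the sample prices are hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple

def percent_moves(
    prices: Iterable[float], threshold_pct: float
) -> Iterator[Tuple[float, float]]:
    """Yields (price, percent change) when the move from the last interval
    is at least threshold_pct in either direction."""
    previous: Optional[float] = None
    for price in prices:
        if previous is not None and previous != 0:
            change = (price - previous) / previous * 100
            if abs(change) >= threshold_pct:
                yield price, change
        previous = price  # the only state carried between items

for price, change in percent_moves([100.0, 101.0, 110.0, 99.0], 5.0):
    print(f"{price}: {change:+.1f}%")
```

The same cheap comparison runs on every item, so the cost per item stays constant no matter how long the stream runs.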

 

Why Stream Processing?
There are many different Stream Processing languages, but they all operate on the same principles: high throughput, independent data, and similar small operations on the data. Stream Processing is meant for systems with high volumes of data, where the order in which the data is processed does not matter, and where small, quick operations keep the flow from slowing down and backing up. It is used in systems that need near real-time alerting or decisions based on one or many data sources. It is not meant to replace cloud computing or other types of processing; depending on the situation, Stream Processing just might be the right tool for the job.

 

——

Cas Stulberger, Technical Consultant at CollabraSpace

 

[i] http://en.wikipedia.org/wiki/Stream_processing

[ii] http://en.wikipedia.org/wiki/SIMD

[iii] http://en.wikipedia.org/wiki/Real_Time_Processing

[iv] http://en.wikipedia.org/wiki/Dataflow_programming