This is the fifth in a series of blog posts on Data Platforms for Manufacturing. Throughout this series, I am trying to explain my own learning process and how we are applying these technologies in an integrated offering at Critical Manufacturing. As part of this education, I recently did a webinar with a high-level summary of my thoughts on the topic; you can find it here (requires registration).

As I started to explore in the first post of this series, many companies seem to be strategically lost among the various digitization initiatives and are experiencing severe difficulties in achieving results from the investments made. Where IoT and related data are concerned, companies are storing significant amounts of data with no immediate plans to use it; this is called dark data.

Edge solutions will decisively contribute to the convergence of IT and OT, although to be effective they will have to meet a number of requirements related to architecture, automation and central management. Edge solutions are data producers for the Ingestion, Event Data Lakes and Message Brokering blocks of the ideal data platform.

Kafka as a standard in message brokering

In the last post in this series, we discussed Kafka as a highly efficient message broker solution. Kafka was created by a team at LinkedIn, which couldn’t find a solution in the market for what they needed: to process and derive insights from massive amounts of social data. The team was led by Jay Kreps, who then founded a company called Confluent with two additional engineers, Neha Narkhede and Jun Rao. They also continue to contribute to the development of Kafka under the stewardship of Apache.

The project was extremely successful. Kreps describes Kafka as a “central nervous system” managing the streams of information that flow through a company: every customer interaction, API request or database change can be represented as a real-time stream that anything else can tap into, process, or react to.

The number of companies using Kafka keeps increasing as more companies adopt event streaming. Here are some of the more well-known cases:


Uber uses Apache Kafka as a message bus for connecting different parts of their ecosystem. The company collects all logs from the application as well as GPS logs and Map services, in addition to event data from the rider and driver apps. They make all of this data available to several different applications (downstream consumers) via Kafka, feeding both what they call real-time pipelines (streaming) and batch pipelines. Uses may include business metrics, debugging, alerting and dashboarding. Uber manages more than a trillion events per day using Kafka, over tens of thousands of topics, which gives a very clear picture of how well the system can scale.
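To make the idea of one event stream feeding several downstream consumers more concrete, here is a minimal sketch in plain Python. It is a toy, not the Kafka client API: the `ToyTopic` class, the consumer names and the trip events are all hypothetical, and the only point is to show that each consumer keeps its own read position over the same append-only log.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a single topic: an append-only list of events.

    Hypothetical teaching aid; this is NOT the Kafka client API.
    """

    def __init__(self):
        self._log = []                    # events in arrival order
        self._offsets = defaultdict(int)  # next position to read, per consumer

    def publish(self, event):
        self._log.append(event)

    def poll(self, consumer, max_events=None):
        """Return the events this consumer has not seen yet and advance its offset."""
        start = self._offsets[consumer]
        end = len(self._log)
        if max_events is not None:
            end = min(end, start + max_events)
        self._offsets[consumer] = end
        return self._log[start:end]


trips = ToyTopic()
for fare in (12.5, 8.0, 20.0):  # made-up events
    trips.publish({"type": "trip_completed", "fare": fare})

# A "real-time" consumer reads one event at a time...
first = trips.poll("realtime-dashboard", max_events=1)
# ...while a "batch" consumer reads everything accumulated so far.
batch = trips.poll("nightly-report")
# The real-time consumer later resumes exactly where it left off.
rest = trips.poll("realtime-dashboard")
```

Because the log is never consumed destructively, adding a new downstream application does not affect the existing ones; it simply starts reading with its own offset.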

One example of Uber’s real-time applications is demand-based ride pricing: upon gathering data from user vehicles, destinations, trips, and traffic forecasts, it performs driver/rider matching and computes surge prices in real-time (those very annoying price increases of 1.2x, 5x or more).


Another very well-known company using Apache Kafka is Netflix. They use it as the de facto standard for their eventing, messaging, and stream processing needs.

Netflix collects over 2 million events per second(!) and queries over 1.5 trillion rows of data to get insights into how viewers are experiencing the service. These are incredible numbers. And why does Netflix collect all this viewing and interaction data? On one side, to provide real-time recommendations; on the other, to make mid- and long-term decisions on movies and series. Just bear in mind that Netflix spent something like $15 billion on original content in 2019… those sums must be based on very clear insights generated by the data they gather and analyze.


LinkedIn is an obvious flagship example. Kafka is used extensively, with a total number of messages exceeding 7 trillion per day! They perform activity tracking, message exchanges, metrics gathering, etc., collecting all user interactions, and then adapt the user feeds, recommendations or push-based advertising in real time.

Data Processing

While Kafka ensures that all messages reach their destination applications with outstanding levels of reliability, performance and scalability, the real value-added operations are done by the downstream applications. This means that all ingestion and message brokering is what we would call plumbing; extremely sophisticated, but plumbing nonetheless.

Data Processing is the next main block of the Data Platform. The name doesn’t do justice to the value this block provides, as this is truly where the magic happens. Data processing encompasses transformation, enrichment, common data models, machine learning and more.

I’ll start with the basics. Applications at this level are divided between batch and stream processing. Batch involves processing a large group of transactions in a single run, including multiple operations, and handling heavy data loads. An example would be to run a report or aggregate data in a data warehouse. Stream processing, on the other hand, deals with transformations that must be executed extremely fast and usually involves less data.
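The difference between the two styles can be sketched with a simple average. This is only an illustration in plain Python with made-up sensor readings; the function and class names are hypothetical and no particular processing framework is implied. The batch version needs the whole data set before it runs, while the streaming version updates its result as each event arrives.

```python
from statistics import mean


def batch_average(readings):
    """Batch style: all data is available, process it in a single run."""
    return mean(readings)


class StreamingAverage:
    """Stream style: update the result incrementally as each event arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count


temperatures = [70.0, 72.0, 71.0, 75.0]  # made-up sensor readings

running = StreamingAverage()
stream_results = [running.update(t) for t in temperatures]  # one result per event
batch_result = batch_average(temperatures)                  # one result at the end
```

Both approaches agree once all the data has been seen; the streaming version simply has an answer available after every single reading, using only a small amount of state.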

We’ve seen examples from LinkedIn, Netflix and Uber, which illustrate the importance of real-time stream processing. But what about its use in Manufacturing? Although very few manufacturers use it today, real-time stream processing is of critical importance.

To explain the importance to manufacturers, I usually use this curve:

Benefit / Latency Curve

What we have here is a set of latencies between the time when an event occurs and when we actually react to it. The time between the occurrence of the event and when the knowledge about the event becomes available is called insight latency. The interval after that and until the analysis is completed is called the analysis latency. After the analysis is done, the time it takes for the consequent measure to be approved is called decision latency. And finally, the action latency is the interval between the measure being approved and it actually taking effect.

As we add those latencies, the curve shows that the value or benefit of the adaptation reduces significantly. What a high-performance streaming solution aims to do is shorten these latencies as much as possible, to get the most benefit or value in the shortest amount of time.
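As a rough worked example of how the four latencies add up, consider the sketch below. All numbers are made up, and the shape of the value curve is an assumption: I use an exponential decay with a hypothetical one-hour half-life purely for illustration, since the real curve depends on the process and the event.

```python
from dataclasses import dataclass


@dataclass
class ReactionLatency:
    """The four latencies from the curve, all in seconds."""
    insight: float
    analysis: float
    decision: float
    action: float

    def total(self) -> float:
        return self.insight + self.analysis + self.decision + self.action


def remaining_value(latency_s: float, half_life_s: float = 3600.0) -> float:
    """ASSUMPTION: value halves every `half_life_s` seconds. Illustrative only."""
    return 0.5 ** (latency_s / half_life_s)


# Made-up numbers: a manual, report-driven reaction vs. an automated streaming one.
manual = ReactionLatency(insight=1800, analysis=3600, decision=1200, action=600)
streaming = ReactionLatency(insight=2, analysis=1, decision=1, action=56)

manual_total = manual.total()        # 7200 seconds, i.e. two hours
streaming_total = streaming.total()  # 60 seconds

print(f"manual:    {manual_total:>6.0f} s, value left {remaining_value(manual_total):.0%}")
print(f"streaming: {streaming_total:>6.0f} s, value left {remaining_value(streaming_total):.0%}")
```

Under these assumptions the manual path retains a quarter of the value, while the streaming path retains almost all of it. The exact figures are not the point; the point is that every stage contributes to the total and each one is worth shortening.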

Within the stream processing side of the Data Processing block, we divide applications into four main areas:

  • The first is a generic block called Real-time Transformation, handling any time-based stream processing transformation. These can be simple transformations of data points so that they can later be sent into visualization dashboards. Or, they can be more complex transformations integrating advanced processing libraries. The following are special types of real-time transformations.
  • If This Then That – the application at this level has simple business rules and reacts when the values received reach certain thresholds. The name is inspired by a well-known application used for home automation, IFTTT. It derives its name from the programming conditional statement “if this, then that” and it provides a platform that connects apps, devices and services from different providers to trigger one or more actions. In manufacturing, there are numerous use cases for such technology, such as monitoring production variables (e.g. data from a sensor; yield below a certain threshold; machine stopped for more than x minutes;…) and creating automatic actions (alarms; e-mail notifications; stop of production equipment; hold material in the line;…).
  • Predictive Models – We’ll discuss this in detail in a later post, but basically this is related to Machine Learning models that have been generated using batch processing applications, modeling large training sets using common learning algorithms such as classification, regression or clustering.
  • MES Data Enrichment – This is one of the most value-added blocks and one of the key differentiators of a manufacturing-oriented data platform, when compared to generic solutions. This will also be discussed in-depth in one of our next posts.
Main Areas of the Stream Processing Block
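The “If This Then That” area is the easiest one to sketch in code. The example below is a minimal, hypothetical rule engine in plain Python: the field names, thresholds and action labels are all invented for illustration, and a real implementation would trigger alarms or equipment commands instead of returning strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict], bool]  # the "if this" part
    action: str                        # the "then that" part (just a label here)


# Hypothetical thresholds and action names, for illustration only.
RULES: List[Rule] = [
    Rule("low_yield", lambda e: e.get("yield_pct", 100.0) < 90.0, "hold_material"),
    Rule("long_stop", lambda e: e.get("stopped_minutes", 0) > 10, "send_email"),
    Rule("overheat", lambda e: e.get("temperature_c", 0.0) > 85.0, "raise_alarm"),
]


def evaluate(event: Dict, rules: List[Rule] = RULES) -> List[str]:
    """Return the actions triggered by one incoming event."""
    return [rule.action for rule in rules if rule.condition(event)]


events = [
    {"machine": "M1", "yield_pct": 97.2, "temperature_c": 60.0},
    {"machine": "M2", "yield_pct": 84.0, "stopped_minutes": 15},
]
for event in events:
    print(event["machine"], evaluate(event))
```

Keeping the rules as data rather than hard-coded branches means new thresholds can be added without touching the evaluation logic, which matters when rules are maintained by process engineers rather than developers.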

Apache Spark

To perform all of the Data Processing tasks described above, one of the most important solutions is Apache Spark.

Apache Spark is a unified analytics engine for large-scale data processing. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Spark deals with both batch and stream processing. It can quickly perform processing tasks on very large data sets, and it provides robust fault tolerance by distributing data processing tasks across multiple computers.

Moreover, Spark has the ability to connect to a variety of sources such as text files, SQL databases or data lakes, but also to Kafka topics. Two aspects are very important in Spark streaming: first, the ability to combine streaming data with static data sets and interactive queries; second, the native integration with the advanced processing libraries that are used for batch processing, such as SQL, machine learning and graph processing.
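Combining a Kafka stream with a static data set could look roughly like the PySpark sketch below. It uses Spark’s Structured Streaming Kafka source, which requires the separate `spark-sql-kafka` connector package on the cluster. Everything specific is an assumption made for illustration: the broker address, the topic name, the JSON payload with `machine_id` and `temperature` fields, and a CSV file of equipment data that also has a `machine_id` column.

```python
def kafka_source_options(bootstrap_servers: str, topic: str) -> dict:
    """Options for Spark's Kafka source (broker and topic are placeholders)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def build_enriched_stream(spark, options: dict, equipment_csv: str):
    """Join a stream of sensor readings with a static equipment table."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    # ASSUMPTION: each Kafka message is a JSON object with these two fields.
    schema = StructType([
        StructField("machine_id", StringType()),
        StructField("temperature", DoubleType()),
    ])

    # Streaming side: Kafka delivers the payload as bytes in the `value` column.
    readings = (
        spark.readStream.format("kafka").options(**options).load()
        .select(from_json(col("value").cast("string"), schema).alias("r"))
        .select("r.*")
    )

    # Static side: a regular DataFrame read once from a file.
    equipment = spark.read.option("header", True).csv(equipment_csv)

    # Stream-static inner join on the shared key.
    return readings.join(equipment, on="machine_id", how="inner")
```

To actually run it, one would create a `SparkSession`, call `build_enriched_stream(spark, kafka_source_options("localhost:9092", "sensor-readings"), "equipment.csv")`, and start the result with something like `.writeStream.format("console").start()`. The same DataFrame operations used here would work unchanged on a batch data set, which is the practical meaning of Spark unifying batch and stream processing.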

This powerful framework was created by Matei Zaharia while he was doing his PhD at UC Berkeley, and the result was open-sourced in 2010. Matei is currently the CTO at Databricks, which he founded with others from the Spark project. Databricks develops a web-based platform for working with Spark, providing automated cluster management and other solutions.

Databricks has been extremely successful, first through a collaboration with IBM that integrated Databricks with IBM ML solutions, and later by teaming with Microsoft to provide Azure Databricks, an Apache Spark-based analytics platform optimized for Microsoft Azure.

Over the next few posts we’ll explore topics such as Data Enrichment, Batch Processing and Machine Learning applied to manufacturing, so stay tuned! Should you want to continue reading this and other Industry 4.0 posts, subscribe here and receive notifications of new posts by email.