According to Merriam-Webster, the word “paradox” is defined as a statement that is seemingly contradictory or opposed to common sense. In today’s enterprise world, IT professionals might consider machine data a paradox. On one hand, machine data is pure gold; it holds information that, when correlated and analyzed, can provide valuable insights that help IT organizations optimize applications, find security breaches or proactively prevent problems before they occur.
On the other hand, machine data is one of the biggest sources of pain. The volume of data, types of information, formats and sources have become so unwieldy that the data is difficult, if not impossible, to parse. To ensure that everyone is on the same page, I’m defining machine data as metrics, events, log data and traces (MELT). If you’re unsure of the difference or what these are, Sysdig’s Apurva Dave does a great job of explaining in this post.
Machine data comes in many shapes and sizes
To understand the problem, consider how machine data is currently handled. Some log data is pulled off servers and stored in one or more index systems for fast search. Security logs get sent to SIEMs for correlation and threat hunting. Metrics go through a different process and get captured in a time series database for analysis. The massive amount of trace data likely gets dumped into big data lakes, theoretically for future processing. I say theoretically because the information in data lakes is often unusable due to its unstructured nature. The net result is lots of data silos, which lead to incomplete analysis.
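As a rough, purely illustrative sketch (the shipper functions and field names below are invented for this article, not drawn from any specific agent or product), the status quo looks like several disconnected paths, each with its own format and destination:

```python
# Illustrative only: each telemetry type has its own shipper, schema and backend,
# so no single system ever sees the whole picture.

def ship_app_log(line: str) -> dict:
    # Raw text pushed to an index system for fast search.
    return {"dest": "log_index", "payload": {"raw": line}}

def ship_security_log(event: dict) -> dict:
    # Same infrastructure, different agent and schema, sent to the SIEM.
    return {"dest": "siem", "payload": {"src_ip": event["ip"], "action": event["action"]}}

def ship_metric(name: str, value: float, ts: int) -> dict:
    # Numeric samples captured in a time series database.
    return {"dest": "tsdb", "payload": {"metric": name, "value": value, "ts": ts}}

def ship_trace(span: dict) -> dict:
    # Trace spans dumped, largely unprocessed, into a data lake.
    return {"dest": "data_lake", "payload": span}
```

Four formats, four destinations and no common layer where the data could be correlated: that is the silo problem in miniature.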
In data science, there’s an axiom that states “Good data leads to good insights.” The inverse is true as well: Bad data leads to bad insights, and siloed data leads to siloed insights.
Also, many of the analytic tools are very expensive and do not work well with unstructured data. I’ve talked to companies that have spent tens of millions of dollars on log analytics. These tools can be helpful, but often they aren’t, because the volume of data is so large and contains so much noise that the output isn’t as useful as it could be. And the volume of data is only rising, so this problem won’t be solved any time soon with traditional tools.
Analytic and security tools have their own agents that add to the problem
Another issue is that each of the tools used to analyze machine data comes with its own agent, which often collects the same data from different endpoints in its own unique format, adding to the data clutter that IT departments need to sort through. This adds a lot of management overhead and increases resource utilization without adding much value. Hence the paradox: The insights are hidden in the data, but the overhead required to find those “a-ha” moments is often more complicated than the original problem.
A new approach to managing the data pipeline is required
What’s required is a new approach to managing machine data so the various tools can be used effectively. A good analogy for what’s needed is the network packet broker. The network industry has a similar problem with tool sprawl, because the number of network management and security tools has exploded over the past decade. There is no cost-effective way to send all network data to all tools, yet, as with machine data, network managers often send everything to everything anyway, which is expensive and limits the effectiveness of the tools. Sound familiar? In networking, along came the network packet broker, which collects, normalizes and correlates data and then directs only the relevant information to each specific tool.
Key attributes of machine data management
There’s no similar product for machine data, but in an ideal world, the data would flow through some kind of pipeline that could address the following (a rough sketch follows the list):
- Gather one set of data that acts as a single source of truth.
- Pre-process information so analytic tools only process the data they require instead of everything. This would include suppression of duplicate information, removal of null events and dynamic sampling of the stream.
- Normalize the data so it’s consistent and in a format that’s usable by all the tools.
- Optimize data flows for performance and cost.
- Direct only the data required to the specific tools. There’s no point in having a tool process data only to drop it.
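To make those attributes concrete, here is a minimal sketch in Python of what a single pipeline stage could look like. Everything in it (the normalize_event schema, the TOOL_ROUTES table, the sampling rates) is a hypothetical illustration, not a reference to any existing product; a real pipeline would use a bounded, time-limited cache for duplicate suppression and far richer routing rules.

```python
import hashlib
import random

# Hypothetical routing table: which tools care about which event types.
TOOL_ROUTES = {
    "security_log": ["siem"],
    "app_log": ["log_index"],
    "metric": ["tsdb"],
    "trace": ["data_lake"],
}

# Illustrative sampling: keep only 10% of high-volume trace data.
SAMPLE_RATES = {"trace": 0.1}

# Naive duplicate suppression for the sake of the example.
seen_digests = set()


def normalize_event(raw: dict) -> dict:
    """Map assorted source fields onto one consistent schema."""
    return {
        "timestamp": raw.get("ts") or raw.get("time"),
        "type": raw.get("type", "app_log"),
        "source": raw.get("host") or raw.get("source", "unknown"),
        "body": raw.get("message") or raw.get("value"),
    }


def process(raw: dict, sinks: dict) -> None:
    """Normalize, filter and route one event to only the tools that need it."""
    event = normalize_event(raw)

    # Remove null events: nothing useful for any tool to analyze.
    if event["body"] in (None, "", {}):
        return

    # Suppress duplicates so consumption-priced tools don't ingest the same event twice.
    digest = hashlib.sha256(repr(sorted(event.items())).encode()).hexdigest()
    if digest in seen_digests:
        return
    seen_digests.add(digest)

    # Dynamically sample high-volume streams such as traces.
    if random.random() > SAMPLE_RATES.get(event["type"], 1.0):
        return

    # Direct the event only to the tools registered for its type.
    for tool in TOOL_ROUTES.get(event["type"], []):
        sinks[tool].append(event)
```

Feeding every event through a stage like this before it reaches the index, the SIEM, the time series database or the data lake is what gives each tool a smaller, cleaner and consistent stream to work with.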
This kind of machine data pipeline would dramatically reduce costs, particularly with consumption-based tools that charge based on the volume of data analyzed. For example, companies incur a lot of expense ingesting data into Splunk that they never actually consume in the tool. That might seem crazy, but unfortunately, that’s the norm. One solution could be to build a unique pipeline per tool, but that might cost more than just sending everything to everything.
Current tools offer only partial solutions
I don’t want to make it seem like nothing has been done to improve machine data management. There are a few open source projects, such as Apache NiFi and Fluentd, but they only address part of the problem. Also, Splunk has a product called Data Stream Processor that does something close to what I outlined, but in typical Splunk style, it only works well with Splunk. The company would be smart to broaden its use to other tools.
There is an old saying that every business is a technology business, but I think that narrative has run its course. Instead, every business is a data-driven business, and competitive advantage comes from finding those key insights in the data.
The problem is that the volume of machine data has grown so much that the ecosystem of tools IT organizations use to analyze it can’t keep pace. CIOs and IT leaders should look to invest in data-processing tools that optimize what the organization has already spent on analytics. This will help maximize the return on investment on existing tool spend and delay having to spend even more.
Zeus Kerravala is an eWEEK regular contributor and the founder and principal analyst with ZK Research. He spent 10 years at Yankee Group and prior to that held a number of corporate IT positions.