4 Ways to Ensure Your Data Lake Doesn’t Become a Data Swamp

April 6, 2022
To make data lakes work for time-series data, it’s important to add good metadata, ensure the ability to connect your analytics platform to your data lake, and take steps to improve data lake performance for data analysis.

If your company is involved in a full-fledged Smart Manufacturing, Industry 4.0, or other digital transformation initiative, you’ve likely encountered the term “data lake.” A data lake is essentially a place to store all the data collected from your operations. The data stored there can be structured or unstructured, and no prior processing is required before data can be stored in a data lake.

Because all kinds of data can be stored in a data lake, these data storage sites hold high potential to provide guidance on matters you might not yet consider to be important. According to Amazon Web Services, having different data types stored in a central repository means that you can apply numerous types of analytics, such as SQL queries, Big Data analytics, full text search, real-time analytics, and machine learning to uncover new insights.

But just as the junk drawer in your house, meant to store needed items that don’t quite fit elsewhere, can easily become a catch-all depository for things you should have thrown away long ago, a data lake can become a data swamp.

Data lakes can also become data swamps when users need special development or technical skills to access and use the data, says Niki Driessen, chief architect at TrendMiner, a supplier of data analytics technology for the processing industries. “Currently, data lakes are becoming increasingly important to process industries that capture and store immense amounts of sensor-generated time-series data,” he explains. “To make data lakes work for time-series data, it is important to understand that [these kinds of] data cannot just be dumped into the lake with the expectation of extracting its value.”

To avoid having your data lake become a data swamp that obscures the value of your time-series data, Driessen advises taking the following steps:

Provide the required metadata. “There is no standard data lake tool or single platform that an organization can use to magically solve data lake issues such as data mapping and correlating,” says Driessen. “To ease data ingestion (for eventual analysis of time-series data), organizations must provide the required metadata—which includes data lineage, data structure, data age, and other metadata that provides common attributes or properties that link the data together.”
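As a sketch of the categories Driessen names, each ingested file could be registered with a small descriptor carrying lineage, structure, and age. The field names and schema below are illustrative assumptions, not a standard:

```python
import json
from datetime import datetime, timezone

def build_metadata(source_system, columns, raw_path):
    """Assemble a minimal metadata record for one ingested file.

    The top-level keys (lineage, structure, age) mirror the categories
    in the article; the exact schema is a sketch, not a standard.
    """
    return {
        "lineage": {"source_system": source_system, "raw_path": raw_path},
        "structure": {"columns": columns},
        "age": {"ingested_at": datetime.now(timezone.utc).isoformat()},
    }

meta = build_metadata(
    source_system="plant-1-historian",          # hypothetical source name
    columns=["timestamp", "tag", "value"],
    raw_path="s3://lake/raw/plant-1/sensors.csv",  # hypothetical path
)
print(json.dumps(meta, indent=2))
```

Stored alongside the data (or in a catalog), a record like this is what lets later consumers map and correlate files without guessing at their origin or layout.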

Connecting analytics to the data lake. Though no single standard exists to solve the data lake issues Driessen notes in the point above, there are common aspects of data storage packages from many different vendors that can help. One of these is a query abstraction layer. “This is a tool or component in an organization’s data lake that allows for writing standard SQL language queries against the data,” Driessen notes. “It also means that any tool that has support for standard ODBC or JDBC connectivity can be used to connect to the data lake.”
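The practical payoff is that any SQL-speaking client works unchanged. A rough sketch using Python’s DB-API, with in-memory SQLite standing in for the lake’s SQL endpoint (the real connection string and driver depend on the vendor’s abstraction layer):

```python
import sqlite3

# sqlite3 stands in for an ODBC/JDBC connection to the lake's SQL layer;
# with a real query abstraction layer, only the connect() call changes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_readings (tag TEXT, ts TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?, ?)",
    [("TI-101", "2022-04-06T00:00:00", 71.3),
     ("TI-101", "2022-04-06T00:01:00", 72.8),
     ("PI-205", "2022-04-06T00:00:00", 4.1)],
)

# Standard SQL, regardless of how the lake stores the data underneath.
rows = conn.execute(
    "SELECT tag, AVG(value) FROM sensor_readings GROUP BY tag ORDER BY tag"
).fetchall()
print(rows)  # [('PI-205', 4.1), ('TI-101', 72.05)]
```

The tag names and table schema are invented for illustration; the point is that the analytics tool issues plain SQL and never needs lake-specific client code.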

Data lake performance. Because data lakes typically use inexpensive block storage with a massive storage capacity, fast access to stored data is not guaranteed. This is a problem when working with advanced industrial analytics, as users expect the data to be where they need it and to be able to access it as fast as possible. It can be problematic for all an organization’s data to be “sitting in one huge file in the data lake, as this structure is highly inefficient for extracting data,” Driessen says. The good news is that such issues can be corrected with the use of columnar file formats, which allow users to read only the columns needed for a specific case. “Since the entire file would not have to be read, less data is loaded, resulting in faster response times,” he adds.
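The column-pruning benefit shows up even in a toy layout. The sketch below stores each column in its own file, so a query that needs one column reads only that file; real columnar formats such as Parquet or ORC are far more sophisticated, but the access pattern is the same:

```python
import json
import os
import tempfile

rows = [
    {"ts": "2022-04-06T00:00:00", "tag": "TI-101", "value": 71.3},
    {"ts": "2022-04-06T00:01:00", "tag": "TI-101", "value": 72.8},
]

lake = tempfile.mkdtemp()

# "Columnar" write: one file per column instead of one file holding
# whole rows. (An illustrative stand-in for Parquet/ORC.)
for col in rows[0]:
    with open(os.path.join(lake, col + ".json"), "w") as f:
        json.dump([r[col] for r in rows], f)

# A query that needs only `value` opens only value.json, leaving the
# ts and tag columns on disk untouched.
with open(os.path.join(lake, "value.json")) as f:
    values = json.load(f)
print(sum(values) / len(values))  # 72.05
```

With wide industrial tables (hundreds of sensor tags), skipping unneeded columns is where most of the speedup comes from.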

Data partitioning. Another practice recommended by Driessen to improve data lake performance is partitioning. Here, data is arranged in folder-like structures by key properties, time, or a combination of the two. Driessen says this practice splits all available data into much smaller files, allowing users to drill down to specific data sets without having to transfer as much data. This translates into less time required to process the data or query against it.   
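Partitioning can be sketched the same way: a folder per key and date, so a time-bounded query touches only one directory. The Hive-style `key=value` path layout below is one common convention, assumed here for illustration:

```python
import os
import tempfile

lake = tempfile.mkdtemp()

readings = [
    ("TI-101", "2022-04-05", 70.1),
    ("TI-101", "2022-04-06", 71.3),
    ("PI-205", "2022-04-06", 4.1),
]

# Hive-style partition folders: tag=<tag>/date=<date>/part.csv
for tag, date, value in readings:
    part_dir = os.path.join(lake, f"tag={tag}", f"date={date}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "part.csv"), "a") as f:
        f.write(f"{value}\n")

# A query for TI-101 on 2022-04-06 reads exactly one small file,
# skipping every other partition entirely.
target = os.path.join(lake, "tag=TI-101", "date=2022-04-06", "part.csv")
with open(target) as f:
    values = [float(line) for line in f]
print(values)  # [71.3]
```

Combined with columnar files inside each partition, this is what lets a query engine prune both rows (by folder) and columns (by file) before any heavy reading starts.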
