4 Ways to Ensure Your Data Lake Doesn’t Become a Data Swamp

April 6, 2022
To make data lakes work for time-series data, it’s important to add good metadata, ensure the ability to connect your analytics platform to your data lake, and take steps to improve data lake performance for data analysis.

If your company is involved in a full-fledged Smart Manufacturing, Industry 4.0, or other digital transformation initiative, you’ve likely encountered the term “data lake.” A data lake is essentially a place to store all the data collected from your operations. The data stored there can be structured or unstructured, and no prior processing is required before it is stored.

Because all kinds of data can be stored in a data lake, these data storage sites hold high potential to provide guidance on matters you might not yet consider to be important. According to Amazon Web Services, having different data types stored in a central repository means that you can apply numerous types of analytics, such as SQL queries, Big Data analytics, full text search, real-time analytics, and machine learning to uncover new insights.

But just as the junk drawer in your house, meant to store needed items that don’t quite fit elsewhere, can easily become a catch-all depository for things you should have already thrown away, a data lake can become a data swamp.

Data lakes can also become data swamps when users need special development or technical skills to access and use the data, says Niki Driessen, chief architect at TrendMiner, a supplier of data analytics technology for the processing industries. “Currently, data lakes are becoming increasingly important to process industries that capture and store immense amounts of sensor-generated time-series data,” he explains. “To make data lakes work for time-series data, it is important to understand that [these kinds of] data cannot just be dumped into the lake with the expectation of extracting its value.”

To avoid having your data lake become a data swamp that obscures the value of your time-series data, Driessen advises taking the following steps:

Provide the required metadata. “There is no standard data lake tool or single platform that an organization can use to magically solve data lake issues such as data mapping and correlating,” says Driessen. “To ease data ingestion (for eventual analysis of time-series data), organizations must provide the required metadata—which includes data lineage, data structure, data age, and other metadata that provides common attributes or properties that link the data together.”
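As an illustrative sketch only (not a specific product’s API), the metadata Driessen describes can be captured as a small catalog record written alongside each file as it lands in the lake. All field and path names here are assumptions chosen for the example:

```python
import json
from datetime import datetime, timezone

def make_catalog_entry(path, source_system, columns, linked_asset):
    """Build a minimal metadata record for a file landing in the lake.

    Captures lineage (where the data came from), structure (column
    names and types), age (ingestion timestamp), and a linking
    attribute (e.g., the plant asset the sensors belong to)."""
    return {
        "path": path,
        "lineage": {"source_system": source_system},
        "structure": {"columns": columns},
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "links": {"asset": linked_asset},
    }

entry = make_catalog_entry(
    path="lake/raw/reactor7/2022-04-06.csv",
    source_system="historian-A",
    columns={"timestamp": "datetime", "temperature_c": "float"},
    linked_asset="reactor-7",
)
print(json.dumps(entry, indent=2))
```

A record like this is what later lets analytics tools map and correlate files that would otherwise sit in the lake as anonymous blobs.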

Connect analytics to the data lake. Though no single standard exists to solve the data lake issues Driessen notes in the point above, data storage packages from many different vendors share common features that can help. One of these is a query abstraction layer. “This is a tool or component in an organization’s data lake that allows for writing standard SQL language queries against the data,” Driessen notes. “It also means that any tool that has support for standard ODBC or JDBC connectivity can be used to connect to the data lake.”
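To illustrate why a SQL abstraction layer matters, the sketch below uses Python’s built-in sqlite3 module as a stand-in for the lake’s SQL endpoint; in practice the connection would go through the vendor’s ODBC or JDBC driver instead, but the query itself would be the same standard SQL. The table and tag names are invented for the example:

```python
import sqlite3

# In-memory database standing in for the lake's query abstraction layer.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sensor_readings (ts TEXT, tag TEXT, value REAL)")
con.executemany(
    "INSERT INTO sensor_readings VALUES (?, ?, ?)",
    [
        ("2022-04-06T10:00:00", "TI-101", 85.2),
        ("2022-04-06T10:01:00", "TI-101", 85.9),
        ("2022-04-06T10:00:00", "TI-102", 61.4),
    ],
)

# Standard SQL works unchanged no matter what engine sits behind it.
rows = con.execute(
    "SELECT tag, ROUND(AVG(value), 2) FROM sensor_readings "
    "GROUP BY tag ORDER BY tag"
).fetchall()
print(rows)
```

Because the query is plain SQL, any ODBC/JDBC-capable analytics tool could issue it without special development skills, which is exactly the swamp-avoidance point Driessen makes.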

Improve data lake performance. Because data lakes typically use inexpensive block storage with massive capacity, fast access to stored data is not guaranteed. This is a problem when working with advanced industrial analytics, as users expect the data to be where they need it and to be accessible as fast as possible. It can be problematic for all of an organization’s data to be “sitting in one huge file in the data lake, as this structure is highly inefficient for extracting data,” Driessen says. The good news is that such issues can be corrected with the use of columnar file formats, which allow users to read only the data columns needed for a specific case. “Since the entire file would not have to be read, less data is loaded, resulting in faster response times,” he adds.
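The column-pruning idea behind columnar formats such as Parquet or ORC can be shown with a deliberately simplified, stdlib-only sketch: each column lives in its own file, so an analysis that touches one column never loads the others. The file names and layout are illustrative assumptions, not a real format:

```python
import json
from pathlib import Path

lake = Path("toy_lake")
lake.mkdir(exist_ok=True)

# A row-oriented layout would hold every column in one big file.
rows = [
    {"ts": i, "temperature": 80.0 + i, "pressure": 1.0, "flow": 5.0}
    for i in range(1000)
]

# Toy columnar layout: each column is stored separately.
for col in rows[0]:
    (lake / f"{col}.json").write_text(json.dumps([r[col] for r in rows]))

# An analysis needing only temperature reads just that one file;
# pressure and flow are never loaded from storage.
temperatures = json.loads((lake / "temperature.json").read_text())
print(max(temperatures))
```

Real columnar formats add compression, statistics, and row groups on top of this idea, but the payoff is the same: less data loaded per query, faster response times.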

Partition the data. Another practice Driessen recommends to improve data lake performance is partitioning, in which data is arranged in folder-like structures by key properties, time, or a combination of the two. This practice splits the available data into much smaller files, allowing users to drill down to specific data sets without having to transfer as much data, which translates into less time required to process the data or query against it.
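A minimal sketch of the folder-like partitioning described above, using only the standard library. The layout (sensor tag first, then date) and all names are assumptions for the example; production lakes typically use the same pattern with columnar files instead of JSON:

```python
import json
from pathlib import Path

root = Path("lake/sensor_data")

# Write each reading into a key/time partition:
#   lake/sensor_data/tag=<id>/date=<day>/part.json
readings = [
    {"tag": "TI-101", "date": "2022-04-05", "value": 84.1},
    {"tag": "TI-101", "date": "2022-04-06", "value": 85.2},
    {"tag": "TI-102", "date": "2022-04-06", "value": 61.4},
]
for r in readings:
    part_dir = root / f"tag={r['tag']}" / f"date={r['date']}"
    part_dir.mkdir(parents=True, exist_ok=True)
    with (part_dir / "part.json").open("a") as f:
        f.write(json.dumps(r) + "\n")

# A query for one tag on one day touches only that single partition;
# every other folder is skipped entirely.
target = root / "tag=TI-101" / "date=2022-04-06" / "part.json"
hits = [json.loads(line) for line in target.open()]
print(hits)
```

Because the partition keys appear in the path, a query engine can prune whole folders before reading a single byte of data, which is where the speedup comes from.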

About the Author

David Greenfield, Editor in Chief

David Greenfield joined Automation World in June 2011. Bringing a wealth of industry knowledge and media experience to his position, David’s contributions can be found in AW’s print and online editions and custom projects. Earlier in his career, David was Editorial Director of Design News at UBM Electronics, and prior to joining UBM, he was Editorial Director of Control Engineering at Reed Business Information, where he also worked on Manufacturing Business Technology as Publisher. 
