4 Ways to Ensure Your Data Lake Doesn’t Become a Data Swamp

April 6, 2022
To make data lakes work for time-series data, it’s important to add good metadata, ensure the ability to connect your analytics platform to your data lake, and take steps to improve data lake performance for data analysis.

If your company is involved in a full-fledged Smart Manufacturing, Industry 4.0, or other digital transformation initiative, you’ve likely encountered the term “data lake.” A data lake is essentially a place to store all the data collected from your operations. The data stored in a data lake can be structured or unstructured, and no prior processing is required before it is stored.

Because all kinds of data can be stored in a data lake, these repositories hold high potential to provide guidance on matters you might not yet consider important. According to Amazon Web Services, having different data types stored in a central repository means you can apply numerous types of analytics, such as SQL queries, Big Data analytics, full-text search, real-time analytics, and machine learning, to uncover new insights.

But just as the junk drawer in your house, meant to store needed items that don’t quite fit elsewhere, can easily become a catch-all depository for things you should have already thrown away, a data lake can become a data swamp.

Data lakes can also become data swamps when users need special development or technical skills to access and use the data, says Niki Driessen, chief architect at TrendMiner, a supplier of data analytics technology for the processing industries. “Currently, data lakes are becoming increasingly important to process industries that capture and store immense amounts of sensor-generated time-series data,” he explains. “To make data lakes work for time-series data, it is important to understand that [these kinds of] data cannot just be dumped into the lake with the expectation of extracting its value.”

To avoid having your data lake become a data swamp that obscures the value of your time-series data, Driessen advises taking the following steps:

Provide the required metadata. “There is no standard data lake tool or single platform that an organization can use to magically solve data lake issues such as data mapping and correlating,” says Driessen. “To ease data ingestion (for eventual analysis of time-series data), organizations must provide the required metadata—which includes data lineage, data structure, data age, and other metadata that provides common attributes or properties that link the data together.”
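
To make this concrete, here is a minimal sketch of attaching such metadata at ingestion time using Python with pandas. The field names (source_system, schema_version, ingested_at), tag names, and file paths are illustrative assumptions rather than any standard, and writing to Parquet requires the pyarrow package.

```python
# A minimal sketch of attaching metadata at ingestion, assuming pandas and
# pyarrow are installed. Field names and paths are illustrative, not a standard.
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd

# Raw sensor readings as they might arrive from a historian or gateway.
readings = pd.DataFrame(
    {
        "tag": ["TI-101", "TI-101", "PI-205"],
        "timestamp": pd.to_datetime(
            ["2022-04-06 08:00", "2022-04-06 08:01", "2022-04-06 08:00"]
        ),
        "value": [71.4, 71.9, 3.02],
    }
)

# Metadata that links the data together: lineage, structure, and age.
readings["source_system"] = "plant-a-historian"       # data lineage
readings["schema_version"] = "v1"                     # data structure
readings["ingested_at"] = datetime.now(timezone.utc)  # data age

Path("lake/raw").mkdir(parents=True, exist_ok=True)
readings.to_parquet("lake/raw/sensor_readings.parquet", index=False)
```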

Connect analytics to the data lake. Though no single standard exists to solve the data lake issues Driessen notes in the point above, there are common aspects of data storage packages from many different vendors that can help. One of these is a query abstraction layer. “This is a tool or component in an organization’s data lake that allows for writing standard SQL language queries against the data,” Driessen notes. “It also means that any tool that has support for standard ODBC or JDBC connectivity can be used to connect to the data lake.”
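
As a rough illustration of that ODBC path, the sketch below issues a standard SQL query against a data lake through a query abstraction layer using Python’s pyodbc package. The DSN name and table are hypothetical placeholders; the actual engine behind the layer (Trino, Athena, Drill, and so on) is whatever your lake provides.

```python
# A minimal sketch of querying the lake through a query abstraction layer
# over ODBC. "DSN=my_data_lake" and the sensor_readings table are hypothetical;
# configure a DSN for whatever SQL engine fronts your lake.
import pyodbc

conn = pyodbc.connect("DSN=my_data_lake")
cursor = conn.cursor()

# Standard SQL with qmark-style parameters, exactly as any ODBC tool would send.
cursor.execute(
    """
    SELECT tag, timestamp, value
    FROM sensor_readings
    WHERE tag = ? AND timestamp >= ?
    """,
    ("TI-101", "2022-04-06 00:00:00"),
)

for tag, ts, value in cursor.fetchall():
    print(tag, ts, value)

conn.close()
```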

Improve data lake performance. Because data lakes typically use inexpensive block storage with massive capacity, fast access to stored data is not guaranteed. This is a problem when working with advanced industrial analytics, as users expect the data to be where they need it and to be accessible as fast as possible. It can be problematic for all of an organization’s data to be “sitting in one huge file in the data lake, as this structure is highly inefficient for extracting data,” Driessen says. The good news is that such issues can be corrected with the use of columnar file formats, which allow users to read only the data columns needed for a specific case. “Since the entire file would not have to be read, less data is loaded, resulting in faster response times,” he adds.
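
The column-pruning benefit is easy to see with a columnar format such as Parquet. In the sketch below, which reuses the hypothetical file from the ingestion example, the reader loads only the two columns an analysis needs and leaves the rest of the file on disk.

```python
# A minimal sketch of column pruning with Parquet (pyarrow required).
# Only the listed columns are read; other columns in the file stay on disk.
import pandas as pd

df = pd.read_parquet(
    "lake/raw/sensor_readings.parquet",   # file from the ingestion sketch
    columns=["timestamp", "value"],
)
print(df.head())
```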

Partition the data. Another practice recommended by Driessen to improve data lake performance is partitioning. Here, data is arranged in folder-like structures by key properties, time, or a combination of the two. Driessen says this practice splits all available data into much smaller files, allowing users to drill down to specific data sets without having to transfer as much data. This translates into less time required to process the data or query against it.
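
A minimal sketch of that folder-like layout, again assuming the hypothetical sensor-readings file from the earlier examples: pandas (with pyarrow underneath) can write one partition per tag and day, and a later query can target a single partition instead of scanning everything.

```python
# A minimal sketch of partitioning with pandas/pyarrow. The tag and date
# partition keys are illustrative; choose keys that match how you query.
import pandas as pd

df = pd.read_parquet("lake/raw/sensor_readings.parquet")
df["date"] = df["timestamp"].dt.date.astype(str)

# Writes folder-like partitions, e.g. lake/partitioned/tag=TI-101/date=2022-04-06/
df.to_parquet("lake/partitioned", partition_cols=["tag", "date"])

# A query for one tag and day now touches only that small partition.
subset = pd.read_parquet(
    "lake/partitioned",
    filters=[("tag", "=", "TI-101"), ("date", "=", "2022-04-06")],
)
print(len(subset))
```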
