
Gartner: Beware of the Data Lake Fallacy

31 Jul 14

The growing hype surrounding data lakes is causing substantial confusion in the information management space, according to Gartner.

Several vendors are marketing data lakes as an essential component to capitalise on Big Data opportunities, but there is little alignment between vendors about what comprises a data lake, or how to get value from it.

"In broad terms, data lakes are marketed as enterprisewide data management platforms for analyzing disparate sources of data in its native format," says Nick Heudecker, research director, Gartner.

"The idea is simple: instead of placing data in a purpose-built data store, you move it into a data lake in its original format.

"This eliminates the upfront costs of data ingestion, like transformation. Once data is placed into the lake, it's available for analysis by everyone in the organisation."

However, while the marketing hype suggests that audiences throughout an enterprise will use data lakes, this positioning assumes all of those audiences are highly skilled at data manipulation and analysis, because data lakes provide no semantic consistency or governed metadata to help them.

"The need for increased agility and accessibility for data analysis is the primary driver for data lakes," adds Andrew White, vice president and distinguished analyst, Gartner.

"Nevertheless, while it is certainly true that data lakes can provide value to various parts of the organization, the proposition of enterprise-wide data management has yet to be realised."

Data lakes focus on storing disparate data and ignore how or why data is used, governed, defined and secured. The data lake concept hopes to solve two problems, one old and one new. The old problem it tries to solve is information silos.

Rather than having dozens of independently managed collections of data, you can combine these sources in the unmanaged data lake. The consolidation theoretically results in increased information use and sharing, while cutting costs through server and license reduction.

The new problem data lakes conceptually tackle pertains to Big Data initiatives. Big Data projects require a large amount of varied information.

The information is so varied that it is not clear what it is when it arrives, and forcing it into something as structured as a data warehouse or relational database management system (RDBMS) constrains future analysis.

"Addressing both of these issues with a data lake certainly benefits IT in the short term in that IT no longer has to spend time understanding how information is used — data is simply dumped into the data lake," White adds.

"However, getting value out of the data remains the responsibility of the business end user.

"Of course, technology could be applied or added to the lake to do this, but without at least some semblance of information governance, the lake will end up being a collection of disconnected data pools or information silos all in one place."

Data lakes therefore carry substantial risks. The most important is the inability to determine data quality, or to trace the lineage of findings by other analysts or users who have previously found value in the same data in the lake.

By its definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of data means analysts start from scratch.
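The "mechanism to maintain" metadata need not be elaborate to keep the lake from becoming a swamp; even recording provenance at landing time preserves lineage for the next analyst. A minimal sketch, assuming a flat JSON-lines catalogue (all names here are hypothetical):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical catalogue: one JSON line of descriptive metadata per dataset.
CATALOG = Path("datalake/catalog.jsonl")

def register(dataset: Path, source: str, owner: str, description: str) -> None:
    """Record provenance when data lands so later users don't start from scratch."""
    entry = {
        "path": str(dataset),
        "source": source,
        "owner": owner,
        "description": description,
        # Content hash lets a later analyst verify the data hasn't changed.
        "sha256": hashlib.sha256(dataset.read_bytes()).hexdigest(),
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    CATALOG.parent.mkdir(parents=True, exist_ok=True)
    with CATALOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
```

Even this small step answers the questions Gartner says a bare lake cannot: what a dataset is, where it came from, and who is responsible for it.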

Another risk is security and access control. Data can be placed into the data lake with no oversight of the contents.

Many data lakes are being used for data whose privacy and regulatory requirements are likely to represent risk exposure. The security capabilities of central data lake technologies are still embryonic. These issues will not be addressed if left to non-IT personnel.

Finally, performance aspects should not be overlooked. Tools and data interfaces simply cannot perform at the same level against a general-purpose store as they can against optimized and purpose-built infrastructure.

For these reasons, Gartner recommends that organisations focus on semantic consistency and performance in upstream applications and data stores instead of information consolidation in a data lake.

"Data lakes typically begin as ungoverned data stores," Heudecker adds.

"Meeting the needs of wider audiences require curated repositories with governance, semantic consistency and access controls — elements already found in a data warehouse.

"The fundamental issue with the data lake is that it makes certain assumptions about the users of information.

"It assumes that users recognize or understand the contextual bias of how data is captured, that they know how to merge and reconcile different data sources without 'a priori knowledge' and that they understand the incomplete nature of datasets, regardless of structure."

While these assumptions may be true for users working with data, such as data scientists, the majority of business users lack this level of sophistication or support from operational information governance routines.

Developing or acquiring these skills, or obtaining such support on an individual basis, is time-consuming and expensive, if not impossible.

"There is always value to be found in data but the question your organisation has to address is this — do we allow or even encourage one-off, independent analysis of information in silos or a data lake, bringing said data together, or do we formalise to a degree that effort, and try to sustain the value-generating skills we develop?" White adds.

"If the option is the former, it is quite likely that a data lake will appeal.

"If the decision tends toward the latter, it is beneficial to move beyond a data lake concept quite quickly in order to develop a more robust logical data warehouse strategy."
