Lakehouse architecture

#Lakehouse architecture driver#

A data lakehouse combines the structured data management and processing ability of a data warehouse with the inexpensive storage capacity of a data lake. For example, storage and compute resources are separate in the structure of the data lakehouse, allowing for greater scalability, and it typically uses standardized storage formats. The data lakehouse also supports strong schema enforcement and governance, allows for concurrent data reading and writing, uses end-to-end streaming, and is compatible with multiple data types: structured, semi-structured, and unstructured.

Before we go any further, it's critical to quickly illustrate the key differences between the two terms from which "data lakehouse" is derived:

Data lake: A collection of raw data that can be structured, semi-structured, or unstructured, with a flat architecture. The facilitation of low-cost, long-term storage for eventual use in analytics applications is arguably the key benefit of the data lake, along with flexibility. However, governance and security are often lacking.

Data warehouse: This data architecture stores structured data using hierarchical tables and dimensions. Traditional data warehouses are often noted as being ideal for complex queries and offer considerable security and governance. That said, in cases where IT teams are the main driver behind data warehouses' implementation, there can be issues with agility.

The two concepts can and should be used simultaneously, as they both serve valuable business functions. Data warehouses allow your business users to see the data the way they need to, while data lakes are excellent for the staging and processing layers. In theory, the data lakehouse gives enterprises all of the advantages of the data warehouse and data lake and very few of the drawbacks. The pros of the data warehouse architecture offset any limitations of the data lake, and vice versa.

Why and how might you implement a data lakehouse architecture?

Because the data lakehouse is still, in many ways, more of a theoretical concept than an in-practice system, experts vary in their assessments of how data lakehouses would be put together. In an early concept of how the architecture might work, from October 2017, the data warehouse is within the data lake. Data ingestion processes ranging from extract, transform, and load (ETL) to stream processing funnel data from many sources into the data lake. Then, it passes through raw and refined data zones, as well as analytics sandboxes, before being integrated and processed in the data warehouse. Finally, the data is transferred to data access and preparation tools that provide security and governance, before ending up with the applications that need it.

Microsoft data and intelligence strategist Pradeep Menon proposed a similar but much more multifaceted model in 2021. Structured, semi-structured, and unstructured data enters the data lake through ingestion services, then goes back and forth between various zones within the architecture and a data processing service as it is cleansed, joined, and properly formatted. Then it can go through various analytics environments and processes, such as sandboxes and artificial intelligence (AI) or machine learning (ML) frameworks, and eventually emerge from the data lake for downstream consumption. The data warehouse portion of the architecture that Menon described is outside of the data lake. It can meet any structured processing needs that data teams may have, accounting for more demanding SLAs and enabling self-service for end users. Cataloging and security tools run parallel to the data lake to ensure proper data governance and protection.

Advantages of a data lakehouse

Among proponents of the data lakehouse concept, one of its major advantages is that it serves as the repository for all data, including anything that must be warehoused. Because of this, the data lakehouse architecture could potentially mitigate administrative and governance challenges that a data lake would face on its own.

Enterprises' business needs are becoming more complex and more data-dependent than ever before, particularly with the continued emergence of advanced ML and disciplines like decision intelligence (DI). Also, decoupling storage and compute, another common thread among theoretical data lakehouses, permits far greater flexibility and scalability.
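
As a rough illustration of what separating storage from compute can look like, here is a minimal Python sketch. The names ObjectStore and ComputeEngine are hypothetical and invented for this example, and JSON Lines stands in for a standardized storage format. It does not reflect the API of any real lakehouse product.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Hypothetical shared storage layer: holds files, runs no queries."""
    objects: dict = field(default_factory=dict)

    def put(self, key, rows):
        # JSON Lines is used here only as a stand-in for an open storage format.
        self.objects[key] = "\n".join(json.dumps(row) for row in rows)

    def get(self, key):
        return [json.loads(line) for line in self.objects[key].splitlines()]


@dataclass
class ComputeEngine:
    """Hypothetical compute node: keeps no data of its own, only a store handle."""
    name: str
    store: ObjectStore

    def total(self, key, column):
        return sum(row[column] for row in self.store.get(key))

    def count_where(self, key, column, value):
        return sum(1 for row in self.store.get(key) if row[column] == value)


store = ObjectStore()
store.put("sales/2024.jsonl", [
    {"region": "EU", "amount": 120.0},
    {"region": "US", "amount": 80.0},
])

# Two engines are added independently of storage; neither copies the data.
reporting = ComputeEngine("reporting", store)
data_science = ComputeEngine("data-science", store)

print(reporting.total("sales/2024.jsonl", "amount"))                  # 200.0
print(data_science.count_where("sales/2024.jsonl", "region", "EU"))   # 1
```

In this sketch, adding capacity for queries means creating another ComputeEngine, while adding data means writing another object to the store. Neither change requires touching the other layer, which is the flexibility the decoupled design is meant to provide.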