Data Lakehouses
In the previous article, we discussed why the data warehouse architecture came to prominence. We also saw how it was ill-suited to unstructured data and to the sheer volumes of data inherent in Big Data. We briefly mentioned that we could combat some of these issues by first writing the raw data to a file system known as a data lake instead of writing it directly to the relational database management system that backs the data warehouse. In this manner, we obtained the best of both worlds: the data lake supported data science access patterns, and the data warehouse supported analytical and business intelligence reporting. Put another way, data scientists could access unstructured data (e.g. images) directly from the data lake, while business analysts could build their dashboards using the SQL they know and love.
Over time, this dual-system architecture became untenable, as it required continuous effort to extract, transform, and load the data from the lake into the warehouse. Every new ETL step risked failures or bugs that degraded data quality, so keeping the data lake and the warehouse consistent became difficult and costly.
Eventually, people began to ask whether there was a way to avoid copying data from the file system to a separate DBMS. This is where the concept of the data lakehouse comes into play. A data lakehouse is the marriage of the data lake and data warehouse architectures. In essence, a data lakehouse is a data warehouse that uses a data lake as its underlying file system. In other words, unlike a data warehouse, a data lakehouse does not store structured data in a traditional DBMS; rather, it uses technologies like Apache Iceberg and Delta Lake to create tables on top of the raw files, which can then be queried using compute engines such as Spark or Trino.
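To make this concrete, here is a minimal sketch of what "tables on top of raw files" looks like in practice. It assumes a Spark session with the Delta Lake extensions installed (the delta-spark package) and write access to an object store; the table name and the s3a:// path are purely illustrative.

```python
from pyspark.sql import SparkSession

# A Spark session with Delta Lake's SQL extensions and catalog registered.
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Create a table directly on top of files in the lake; no separate DBMS involved.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id BIGINT,
        amount   DOUBLE,
        order_ts TIMESTAMP
    )
    USING DELTA
    LOCATION 's3a://my-data-lake/tables/sales'
""")

# Business analysts can query it with plain SQL...
spark.sql("SELECT COUNT(*) AS orders FROM sales").show()

# ...while data scientists read the same files in the lake as a DataFrame.
df = spark.read.format("delta").load("s3a://my-data-lake/tables/sales")
```

The key point is that the SQL query and the DataFrame read both hit the same files sitting in the lake; nothing is copied into a separate warehouse.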
You may be asking yourself why we need Apache Iceberg or Delta Lake at all. Spark can, after all, read from and write to raw Parquet files using SQL. The problem with using plain Spark is that the writes are not ACID compliant. Suppose you ran a Spark job that writes to a given table and it failed midway through. If you examined the file system, you'd find that the records written in the first half of the job had persisted. If you were fortunate enough to have a timestamp associated with the records, you might be able to get back to the previous state by manually deleting them; otherwise, it would be pretty much impossible. And if, after resolving the issue, you reran the same job without first rolling back, you'd end up with duplicate records. Delta Lake solves this problem by borrowing the concept of the write-ahead log from DBMSs, while Apache Iceberg solves it with metadata files that track table snapshots.
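As a rough illustration of what the transaction log buys us, the sketch below appends a batch to the hypothetical sales table from the previous snippet and then rolls the table back. It assumes the same Spark session and Delta Lake setup; the data and version numbers are made up.

```python
from datetime import datetime

# A hypothetical batch of new orders.
new_orders = spark.createDataFrame(
    [(1001, 49.99, datetime(2024, 1, 15, 12, 30))],
    ["order_id", "amount", "order_ts"],
)

# An append to a Delta table is atomic: if the job dies midway, no commit is
# written to the transaction log, so readers never see half-written records.
new_orders.write.format("delta").mode("append").save("s3a://my-data-lake/tables/sales")

# If a bad batch *does* get committed, the log lets us inspect and undo it.
spark.sql("DESCRIBE HISTORY sales").show()           # one row per commit (version)
spark.sql("RESTORE TABLE sales TO VERSION AS OF 0")  # roll back to an earlier version

# Time travel: read the table as it looked at an earlier version.
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("s3a://my-data-lake/tables/sales"))
```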
Conclusion
Data lakes are better suited to machine learning systems that need to process large datasets with non-SQL code, whereas data warehouses excel at generating reports and building dashboards. To get the benefits of both, we copied a subset of the data from the data lake to a downstream data warehouse. In practice, however, this pipeline proved difficult to manage. Fortunately, the industry leaders came up with a way to have our cake and eat it too: we could gain all the benefits of the data warehouse without the data ever having to leave the data lake. They coined this new architecture the data lakehouse. To combat the issues inherent in querying raw files on a data lake, additional logic was needed. This took the form of Apache Iceberg, an open table format released by Netflix, and Delta Lake, an open-source storage layer developed by Databricks, both of which provide critical features such as atomic transactions, time travel, concurrent writes, and schema evolution through plain SQL.
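For completeness, here is a comparable sketch on the Apache Iceberg side, showing schema evolution and time travel. It assumes a Spark session (3.3+) configured with an Iceberg catalog, named lake here purely for illustration; the database, table, and timestamp are hypothetical.

```python
# Create an Iceberg table with plain SQL; the files live in the lake.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.sales (
        order_id BIGINT,
        amount   DOUBLE
    )
    USING iceberg
""")

# Schema evolution: adding a column is a metadata-only change.
spark.sql("ALTER TABLE lake.analytics.sales ADD COLUMNS (order_ts TIMESTAMP)")

# Every committed write produces a snapshot we can inspect...
spark.sql("SELECT snapshot_id, committed_at FROM lake.analytics.sales.snapshots").show()

# ...and time-travel back to, by snapshot id or by timestamp.
spark.sql("""
    SELECT * FROM lake.analytics.sales
    TIMESTAMP AS OF '2024-01-15 00:00:00'
""").show()
```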