seek
Apr 18, 2022

Can someone explain to a 5yo the difference between a data lake, a data warehouse, and a data lakehouse?

There seem to be a lot of "sponsored" articles on each, but it would be great to get a practical understanding of the differences and of how one would decide which to use.

5 Replies

Don't forget to add Data Mesh & Data Fabric to the buzzword bingo.

I can discuss the differences with my peers, but for a 5-year-old, now that's some challenge.

Data Lake = we take all your lego blocks and throw them in a big pile. When you want to build something, feel free to sort through the pile and find the lego blocks you need. If we see lego blocks in the house that we don't have yet, don't worry, we will make sure we throw them in the pile.

Data Warehouse = we take your lego blocks and spend time sorting them, categorising them and making them easy to find. But if you want some lego blocks that we haven't already sorted, you will need to wait until we have done that before you can use them.

Data Lakehouse = We will do all the sorting we normally do for you, but we will also let you have access to the big pile of unsorted lego blocks whenever you want as well.

AI Data Lakehouse = We will invent a machine that automagically sorts the lego blocks for you without a human being involved. And the AI machine won't move a copy of a lego block from the unsorted pile to the sorted pile; instead it will create a new thing called a virtual lego block.
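
For anyone older than 5, here is a rough sketch of the same idea in Python. It is only an illustration under my own assumptions: the sample records, the `orders` table, and the function names are made up, and SQLite stands in for a real warehouse. The "lake" reads the raw pile and makes sense of it at query time; the "warehouse" sorts and types the data first and only then lets you query it.

```python
import json
import sqlite3

# Hypothetical raw events: the "big pile", with inconsistent shapes.
RAW_EVENTS = [
    '{"customer": "ann", "amount": "12.50", "note": "first order"}',
    '{"customer": "bob", "amount": 7}',
    '{"customer": "cat"}',  # no amount at all
]


def lake_total(raw_lines):
    """Lake style: interpret the raw records only when the question is asked."""
    total = 0.0
    for line in raw_lines:
        record = json.loads(line)
        total += float(record.get("amount", 0))
    return total


def build_warehouse(raw_lines):
    """Warehouse style: clean and type the data before anyone queries it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    for line in raw_lines:
        record = json.loads(line)
        if "amount" in record:  # records that do not fit the schema are left out
            conn.execute(
                "INSERT INTO orders (customer, amount) VALUES (?, ?)",
                (record["customer"], float(record["amount"])),
            )
    conn.commit()
    return conn


def warehouse_total(conn):
    """Query the already-sorted table with plain SQL."""
    return conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]


conn = build_warehouse(RAW_EVENTS)
print(lake_total(RAW_EVENTS))
print(warehouse_total(conn))
```

In this picture a lakehouse roughly means keeping both doors open: the sorted `orders` table for everyday questions, plus direct access to `RAW_EVENTS` for anything that has not been sorted yet.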
- Edited

20 days ago

Hey Ayushi 👋

A bit of a shameless plug from my side, but I think it should help with the understanding. You can take a look at this article that I wrote a few months back: https://www.mighty.digital/blog/differences-between-data-lakes-and-data-warehouses. It should give a better picture of the DWH vs DL situation. As for the Data Lakehouse, as Shane mentioned, it is something in between those two.

Tl;dr: one is for relatively clean and structured data (DWH), another is for everything (DL), and the third is something in between (DLH).

Hope that helps! 🎉

12 days ago

Data Warehouse: Central repository of data on which you can create business
- When to use it: when you have structured data only and you want to build dashboards, reports, etc. on top of it.
- When not to use it: when you have unstructured data (there are many other reasons, but I'm keeping this short and crisp).

Data Lake: Term coined around 2010, after Hadoop had become a top-level Apache open-source project in 2008.
- Repository for every kind of data: structured, semi-structured, and unstructured.
- Bring all data under one roof (i.e. the data lake) and then process it onward depending on the use case. All the cleaning and organization happens there, with the help of different tools and architectures.
- Problems:
-- Data updates and deletes are not easy tasks.
-- Immature data governance and data security.
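
To make the "updates and deletes are not easy" point concrete, here is a minimal sketch, again under my own assumptions: a JSON-lines file stands in for a lake file, SQLite stands in for a warehouse, and the function names are hypothetical. A plain file has no UPDATE statement, so changing one record means reading and rewriting the whole file, while a database table can change a single row inside a transaction.

```python
import json
import os
import sqlite3
import tempfile


def update_in_lake_file(path, customer, new_amount):
    """Plain files have no UPDATE: read everything, then rewrite everything."""
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    for record in records:
        if record["customer"] == customer:
            record["amount"] = new_amount
    # Write a new file and swap it in, so readers never see a half-written file.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    os.replace(tmp_path, path)


def update_in_warehouse(conn, customer, new_amount):
    """A table supports a targeted update inside a transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE orders SET amount = ? WHERE customer = ?",
            (new_amount, customer),
        )


# Lake side: one small file with two hypothetical orders.
workdir = tempfile.mkdtemp()
lake_path = os.path.join(workdir, "orders.jsonl")
with open(lake_path, "w") as f:
    f.write('{"customer": "ann", "amount": 12.5}\n')
    f.write('{"customer": "bob", "amount": 7.0}\n')
update_in_lake_file(lake_path, "bob", 9.0)

# Warehouse side: the same two orders in a table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [("ann", 12.5), ("bob", 7.0)])
update_in_warehouse(conn, "bob", 9.0)
```

With two records the rewrite is trivial; with very large files and many concurrent readers and writers it becomes the hard part, which is the pain point described above.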

Data Lakehouse: Data Lake + Data Warehouse = Data Lakehouse
- It aims to address the shortcomings of both platforms.
- It handles unstructured and semi-structured data as well.
- It provides ACID transactions as well (deletes and updates).
- It uses standard storage formats, which are fast and effective.

Now, the data lakehouse also has some shortcomings, which Data Mesh and Data Fabric aim to solve.
If you are interested, you can check out my post on LinkedIn for that.
- Edited

3 days ago