May 25, 2022

What does the ideal MDS toolset look like for a startup?

3 Replies

I think a startup should focus on validating its product-market fit quickly and economically. In other words, fail fast and fail cheap. For the first few rounds of iteration, the data platform should not be set in stone. Instead, startups should pick a tech stack that is friendlier and more cost-effective for them in the long term. Modern cloud-native managed SaaS products reduce the infrastructure burden and shorten both time to market and time to insights.

Always start with a data warehouse. BigQuery, Snowflake, Redshift, and Databricks are prominent cloud-native options in the space. Then use a hosted ETL/ELT solution, such as Fivetran or Airbyte, to bring operational data into the warehouse. A transformation tool like dbt (the "T" in ELT) lets you transform the data once it is in the warehouse. The last mile of the data platform is about reasoning over the refined data. That includes a whole heap of exploratory data analytics and BI tools. There are many to choose from, but I will mention a few well-known ones: Python (Pandas), Jupyter Notebooks, Apache Superset, Power BI, Looker, Mode, etc.

The key goal here is to pick a tech stack you are comfortable with and run fast experiments on how well it helps you build your data analytics workloads. If there are failures, find the root cause and iterate again with potential replacements. Rinse and repeat until you meet your data SLAs.
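To make the warehouse / load / transform / analyze flow above a bit more concrete, here is a minimal sketch in Python. It is only an illustration under loud assumptions: an in-memory SQLite database stands in for the cloud warehouse, a hard-coded list stands in for what a hosted tool like Fivetran or Airbyte would load, and a plain SQL statement stands in for a dbt model. The table names, columns, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical operational records, standing in for what a hosted
# ELT tool would pull from a product database.
RAW_ORDERS = [
    (1, "alice", "2022-05-01", 1200),
    (2, "bob", "2022-05-01", 800),
    (3, "alice", "2022-05-02", 500),
]


def load_raw(conn, rows):
    """Extract + Load: land the source rows unmodified in a raw table."""
    conn.execute(
        "CREATE TABLE raw_orders ("
        "id INTEGER, customer TEXT, order_date TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)


def transform(conn):
    """Transform: build a refined table with SQL inside the warehouse.
    This is the step a transformation tool would own in a real stack."""
    conn.execute(
        "CREATE TABLE daily_revenue AS "
        "SELECT order_date, SUM(amount_cents) / 100.0 AS revenue "
        "FROM raw_orders GROUP BY order_date"
    )


def daily_revenue(conn):
    """Last mile: query the refined table, as a BI tool or notebook would."""
    return conn.execute(
        "SELECT order_date, revenue FROM daily_revenue ORDER BY order_date"
    ).fetchall()


def run_pipeline(rows):
    # In-memory SQLite is an assumption here, a stand-in for the warehouse.
    conn = sqlite3.connect(":memory:")
    try:
        load_raw(conn, rows)
        transform(conn)
        return daily_revenue(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    for order_date, revenue in run_pipeline(RAW_ORDERS):
        print(order_date, revenue)
```

The point of the shape is that raw data lands untouched first and all the reshaping happens afterwards in SQL, so swapping out the loader or the BI tool during an experiment does not force you to rewrite the transformations.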

2 years ago

I've now been part of the Data Engineering team at two pre-IPO startups. In both companies, the discussion of building out a true Data Platform generally came up when the product was fairly mature and using data / machine learning for product features became the next gate to open to drive growth and adoption.

At the first company, early Software Engineers initially stood up a BigQuery instance, which was not managed by a single team. The main use case for BigQuery at the time was monitoring and observability of the product's cloud infrastructure, and it was mainly used by engineers. As the company grew, the Growth Marketing team (who used SQL fairly frequently) used Stitch to add their marketing platform data to BigQuery for additional customer funnel analysis (again, this was not managed by a specific team, as there was no central data team yet). Finally, product leadership identified the need to surface time series data back into the product for customers, as well as a growing need to mature the company's machine learning models at scale. This drove the decision to form a centralized data engineering team to support the growing data needs of the organization. The company then invested in a number of MDS tools such as Databricks, Amundsen, Tableau and Fivetran, to name a few.

At the second (current) company, when I joined, an early engineering team that used aggregations of multiple data sets for billing and enrollment purposes had set up a standalone PostgreSQL instance that used PostgreSQL's native logical replication feature to ingest data from the various product databases (since they were all PostgreSQL instances). They also used a standard Airflow setup to ingest external data sources such as Salesforce and Mixpanel. This was just about enough to support an early data analytics function.

However, as the Analytics team grew (along with their requests for insights), queries became significantly slower, to the point that many jobs failed and teams did not get their insights in a timely manner. This trickled up to product leadership, who then identified the need to form a data engineering team to support the current data infra, as well as build out a scalable data platform to support the growing data needs of the business. It also helped that the company acquired a computer vision company to be embedded into the product :). Since the data engineering team formed, we have acquired Databricks, Fivetran and Tableau to build out our MDS and are actively running proofs of concept for further additions to it.

Based on these experiences, it definitely feels as though an early-stage startup (Seed - Series C) can generally get by with some sort of basic analytics platform (e.g. RDBMS, unmanaged OLAP) for simple analytics and insights. As the company scales (Series D - Series F), it will likely have growing data volumes and more mature data needs (e.g. ML), at which point a distributed data platform (e.g. Snowflake or Databricks), a scalable / managed ETL tool (e.g. Fivetran, Airflow), and a low-code visualization platform (e.g. Tableau) become necessary. As the company grows even more (post Series F), the additional components of the MDS (transformations, data catalogs, data observability, etc.) become increasingly important.

2 years ago

I think the "right" answer to this question is rapidly evolving as the data ecosystem matures. (Not to mention that the answer also depends on the startup and its goals.) That said, in 2022 some of the best tools are available with pay-as-you-go pricing models. Gone are the days when you needed to "talk to sales" and pay a $20k platform fee to get started with a good stack. My default advice to startups these days is to use Snowflake and Airbyte. Then adopt dbt Cloud when it's clear they need more operational rigor around analytics development, and pick one of the many capable BI tools when they outgrow Snowsight (Snowflake's built-in UI). I wrote up a slightly more detailed explanation on my blog:

2 years ago