Folks - This site is a great repository for understanding the various categories in the modern data stack for structured data. Has anyone come across a modern data stack for unstructured data (images, satellite imagery such as vector/raster, video, audio, speech)? In other words, the best ML pipeline for each type of dataset, and a comparison. Also, do you think the current set of tools in the modern data stack can handle unstructured data? And how can the MDS integrate or work together with an MLOps stack?
Glad to join the community. This is Jove, from timeplus.com. We are building a cool service to help you quickly build real-time applications, mainly with SQL. It's publicly available on timeplus.cloud: you can sign up for a free account, connect to your streaming data, and make sense of it with SQL and real-time charts.
Just to give a few example use cases:
* streaming ETL: filter or aggregate your data in Kafka/Confluent/Pulsar topics, route/transform the data, and send it to another message bus or to databases such as Snowflake
* build real-time charts: for example, the most active GitHub repos in the past 5-10 minutes, or hot tweets on Twitter
* build a real-time feature store for machine learning
* build real-time alerts, for example when a user signs up on your site or takes a certain action
Besides the SaaS/PaaS offering, we also provide on-prem/BYOC deployment. Compared to Flink/Spark, it has much lower infra cost (no JVM) and high performance.
Looking forward to your feedback if you get a chance to try it.
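For readers new to streaming, the "most active repos in the past few minutes" use case in the post above comes down to a windowed aggregation. Below is a minimal plain-Python sketch of a tumbling-window count. It is not Timeplus SQL or its engine; the function names, the `(timestamp, key)` event shape, and the 300-second window are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def tumbling_window_counts(events, window_seconds=300):
    """Group (timestamp_seconds, key) events into fixed, non-overlapping windows.

    Returns a dict mapping each window's start time to a Counter of keys.
    The event shape is hypothetical, chosen only for this sketch.
    """
    windows = defaultdict(Counter)
    for ts, key in events:
        # Align the timestamp down to the start of its window.
        window_start = int(ts // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return dict(windows)


def most_active(events, window_seconds=300, top_n=3):
    """Return the top_n most frequent keys per window, ordered by window start."""
    counts = tumbling_window_counts(events, window_seconds)
    return {start: c.most_common(top_n) for start, c in sorted(counts.items())}


if __name__ == "__main__":
    sample = [(0, "repo-a"), (10, "repo-a"), (20, "repo-b"), (310, "repo-b")]
    print(most_active(sample))
```

A real streaming engine would do this incrementally over an unbounded stream and handle late or out-of-order events; this batch version only shows the grouping logic.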
I did a deep dive into how the MDS is taking shape in blockchain data. The article contains an overview of the space and a list of tools for indexing, querying, storing and transforming blockchain data. You can see it here: https://mondaymunday.substack.com/p/the-decentralised-data-stack
Reading through submissions to the Community section of MDS, I find them difficult to comprehend.
For readability, line lengths should be less than 75 chars - you can fix this.
Users should be encouraged to use line breaks; with more than 2-3 lines in a paragraph, peeps drift - you can advise keeping paras short and succinct.
Looking for software that can facilitate extraction from and loading to a diverse range of endpoints like on-prem REST APIs, CDWs, cloud services like Salesforce, files, etc. It does not have to transform data, it merely has to move data from point A to B. Scheduling, API endpoints, scalability, and high availability are all big desires. Any idea what is in the MDS space for this kind of application? Thanks.
Hey folks! Background: we're building VDP https://github.com/instill-ai/vdp, an open-source visual data ETL tool to streamline the end-to-end visual data processing pipeline. Recently, I built a prototype to analyse livestock in a drone video of a cattle farm. First, I built an object detection ETL pipeline with our tool VDP to analyse the video, and stored the analysis results in our PostgreSQL database. Then, based on the data in the database I created a "Cow Counter" Dashboard using Metabase that tracks every time a cow 🐄 appears in the video footage. Check out the step-by-step tutorial https://www.youtube.com/watch?v=0Rdv8oqqxfw
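The "Cow Counter" step in the post above (detections stored in a database, then counted for a dashboard) can be sketched roughly as below. Assumptions: the `detections` table and its columns are hypothetical and not VDP's actual output schema, the 0.5 confidence threshold is arbitrary, and an in-memory SQLite database stands in for PostgreSQL so the snippet runs anywhere.

```python
import sqlite3


def cows_per_frame(rows, min_confidence=0.5):
    """Count 'cow' detections per video frame.

    rows: iterable of (frame, label, confidence) tuples (hypothetical schema).
    Returns a list of (frame, cow_count) tuples ordered by frame.
    """
    conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
    try:
        conn.execute(
            "CREATE TABLE detections (frame INTEGER, label TEXT, confidence REAL)"
        )
        conn.executemany("INSERT INTO detections VALUES (?, ?, ?)", rows)
        # The kind of query a dashboard tool could run against the table.
        query = (
            "SELECT frame, COUNT(*) FROM detections "
            "WHERE label = 'cow' AND confidence >= ? "
            "GROUP BY frame ORDER BY frame"
        )
        return conn.execute(query, (min_confidence,)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [
        (1, "cow", 0.9),
        (1, "cow", 0.4),
        (1, "person", 0.95),
        (2, "cow", 0.8),
        (2, "cow", 0.7),
    ]
    print(cows_per_frame(sample))
```

Frames with no qualifying detections simply do not appear in the result; a dashboard would typically treat those as zero.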
Hi, I'm a newbie here and I've got a question for you. If you have a tool that does not have an existing connector but exposes APIs, what methodology do you use to quickly extract the data and make it available in your data stack? I mean, do you have a standard way to do that, or does it depend on the source?
Hey folks - I'm trying to get up to speed ASAP to understand the modern data stack. I come from a non-technical background, so much of what I've been reading is a bit hard to understand...
My goal is to try to understand the modern data stack (esp. at tech companies) from a first-principles approach.
-> What are the best resources (besides this website 😃) for someone like myself to gain a comprehensive understanding of the modern data stack?
Here are some of the more important questions I would like to be able to answer (eventually).
- How does the stack/tooling change over time as a company goes from a startup to a larger enterprise? In particular, what does this look like at the critical point when you have disparate tools (e.g. CRM, Google Analytics, Product Telemetry, etc.) and you want to consolidate them into one place?
- How do teams choose how to architect their stack (pros, cons, and tradeoffs of different approaches)?
- How are the problems that data teams face at smaller companies different from those at bigger companies?
- How do teams effectively address the problems around messy, undefined, redundant data? I know there are tools out there that can help, but I have a feeling that it's not necessarily a software problem.
Hi everyone! We're building VDP (https://github.com/instill-ai/vdp), an open-source ETL tool for unstructured visual data.
When people say they are data-driven, most of the time it means they are driven by structured data. I will cut the part where we cite reports claiming that 80% of data are unstructured. The reality is that unstructured data are more difficult to analyse, and not a lot of companies know how, or have the resources, to deal with them. That's why we decided to build VDP, an open-source, general and modularised ETL infrastructure for unstructured visual data, for a broader community.
VDP is built from a data-driven perspective. Although the computer vision model is the most critical component in a visual data ETL pipeline, the ultimate goal of VDP is to streamline the end-to-end visual data flow, with the transform component being able to flexibly import computer vision models from different sources.
Today, the early version of VDP supports 2 sources and all Airbyte destination connectors, and it can import computer vision models from various sources including Local, GitHub, DVC, ArtiVC and Hugging Face.
VDP can run locally with Docker Compose. We're working on integrating with Kubernetes and on a fully managed version in Instill Cloud. Setting up a VDP pipeline is fairly easy via its low-code API and no-code Console. Please take a look at the tutorial: https://www.instill.tech/docs/tutorials/build-an-async-det-pipeline
We aim to build VDP as the single point of visual data integration, so users can sync visual data from anywhere into centralised warehouses or applications and focus on gaining insights across all data sources, just like how the modern data stack handles structured data.
Thanks for reading. We are first-time open-source project maintainers. There is definitely a lot to learn! Let us know what you think.
Data cleansing, modeling, and transformation are very important to data analysis and data science. But they require SQL, Python, and other programming skills, and sometimes manual work. We built QuickTable so that users can process data as they view it, with no code.
Hi everyone. I'm curious to learn what tools people are using to run technical assessments for data analysts and analytics engineers. (Either the live sort or take-home sort.) I've been hiring data analysts for half a decade, and haven't yet found a truly smooth way to do this. Any thoughts?
There is a lot more you can do with your data lake once you can analyze realtime data. We wrote a guide on how to ingest realtime data and avoid the common pitfalls of old approaches by leveraging Iceberg, Kinesis hosted by Apache Flink to read Kafka streams. To help you get a demo up and running, we use hosted Trino via Starburst Galaxy as an example: https://www.starburst.io/blog/near-real-time-ingestion-for-trino/
Hi, we are working on the modernization of our data architecture. So we must go through data integration, ELT, setting up the platform (GCP) and producing the output.
Question: even though the requirements are easy to define on paper, they are usually very difficult to execute. How can I communicate this part effectively? I always find it difficult to explain to management. Even though I have illustrated the effort in the project timeline, the effort is still not that visible to them.
What can I do to improve my communication?
You've got plenty of internal data in your MDS to handle day-to-day activities, but then your users want data about a new domain (weather, retail traffic, stock market data) to help them solve problems - how do you find and source that data?
With my team at @Deepnote we're *big* ClickHouse fans, so we’ve built a ClickHouse integration into our notebooks. This lets you run SQL queries against your CH instance in a notebook environment and 100x your performance against traditional databases. The one thing that has been most helpful to me is the interoperability of SQL with Python, which means I get to save the results of my CH queries as Python variables and switch back and forth in one space + do a bunch of other things like create quick visualizations and build out dashboards on top of my notebooks for ad hoc reporting. If you’re a ClickHouse and/or notebooks user, I’d love to hear from you and see how we can make this even more useful: https://deepnote.com/blog/clickhouse-cl4zs29ikocet0blrv08ishlj
Of late, there have been many posts/articles on dashboards being dead - deathofdashboards.com and https://go.thoughtspot.com/e-book-dashboards-are-dead.html. But many of us in the data industry believe this is far from the truth. So what would new-age dashboards look like?
Hey all 👋 So typically, you would build out your event tracking stack as follows:
Web/Mobile App -> CDP -> Warehouse -> Enrichment/Transformation (dbt) -> Reverse ETL -> [Any destination]
But recently, I've seen a lot of posts arguing that the CDP is no longer relevant:
* https://hightouch.io/blog/cdps-are-dead/
* https://towardsdatascience.com/cdps-are-not-the-future-f8a3f56114b6
And here lies the question: how do you collect data? The first article mentions Snowplow for that. Are there any other options?
Let me know your thoughts!
Stay strong! 🇺🇦
Vlad
SkyPoint’s mission is to bring people and data together. We are the industry's first Modern Data Stack Platform with built-in data lakehouse, customer 360, data privacy vault, privacy compliance automation, data governance, analytics and managed services for organizations in several industries including healthcare, life sciences, senior living, retail, hospitality, business services and financial services. Industry leaders and over 10 million end users currently use SkyPoint.
We're seeing a lot of buzz around MDS in the industry, but how would one decide if it's the right thing for them? When would you switch from plain old scripts for moving data around to a more sophisticated set of tools like the ones in MDS?
There seem to be a lot of "sponsored" articles on each, but it would be great to get a practical understanding of the differences and of how one would decide which one to use.