Overview of Workflow Orchestration
The need to manage and automate workflows has been around for a long time; originally, "automation" consisted entirely of scheduling on regular intervals, and cron was the obvious tool of choice. However, as enterprises began embracing the power of data, the diversity of jobs and the dependencies between them became too much for a simple scheduler to manage. It is not surprising, then, that the emergence of "workflow orchestration" coincided with the rise of data warehouses, Hadoop, and other "big data" technology. Moreover, as data science gained popularity, workflows increased in complexity: job volume skyrocketed, workflows needed to be triggered by external events instead of pre-defined schedules, and tasks needed the ability to share state more efficiently. This has led to a new generation of workflow orchestration tooling for the modern data stack.
So what is workflow orchestration?
A workflow refers to any repeated software process; these processes may be defined in code or be entirely manual. Workflow orchestration, then, is the act of managing and coordinating the configuration and state of such processes, for example:
- Scheduling and triggering
- Dependency resolution between steps and between workflows
- Monitoring, alerting and retrying on failure
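These responsibilities can be made concrete with a minimal sketch in plain Python. All names here are hypothetical and not any particular tool's API; a real orchestrator layers scheduling, persistence, and alerting on top of this core loop.

```python
class Task:
    """A unit of work with upstream dependencies and a retry budget."""
    def __init__(self, name, fn, upstream=(), max_retries=2):
        self.name = name
        self.fn = fn
        self.upstream = list(upstream)
        self.max_retries = max_retries

def run_workflow(tasks):
    """Run tasks in dependency order, retrying each failure up to its budget."""
    done, order = set(), []

    def visit(task):
        if task.name in done:
            return
        for dep in task.upstream:        # dependency resolution: upstreams run first
            visit(dep)
        for attempt in range(task.max_retries + 1):
            try:
                task.fn()
                break
            except Exception:
                if attempt == task.max_retries:
                    raise                # retries exhausted; alerting would hook in here
        done.add(task.name)
        order.append(task.name)

    for t in tasks:
        visit(t)
    return order

# Usage: a tiny extract -> transform -> load chain.
extract = Task("extract", lambda: None)
transform = Task("transform", lambda: None, upstream=[extract])
load = Task("load", lambda: None, upstream=[transform])
print(run_workflow([load]))  # ['extract', 'transform', 'load']
```

Note that declaring only the final task is enough: dependency resolution walks the upstream chain and executes everything in order.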
There are many different types of workflow orchestration scenarios; instead of attempting to enumerate various categories, it is more informative to consider the dimensions along which workflow frameworks vary:
- Run creation: some tools are strictly batch schedulers; others allow for combinations of both scheduled workflows and event-triggered workflows.
- Performance: different workflow tools have different limitations when it comes to how fast workflows and the individual jobs composing a workflow can be run.
- State sharing: as mentioned in the introduction, many modern workflows require tasks and jobs to share various aspects of state. This allows the tool to provide features such as caching, passing data between tasks, and better support for distributed environments.
- Job definitions: this is perhaps the highest variance aspect of workflow orchestration tooling.
- GUI vs. code: some tools are entirely GUI based, meaning jobs must be specified using a finite library of options. Other tools offer pure code definitions in various languages. Some tools take a hybrid approach and allow certain aspects of jobs to be configured via GUI and others via code.
- Framework cruft: every tool requires jobs to fit within certain guardrails, and tools therefore vary greatly in how expressive they allow the user to be. Relatedly, frameworks differ in how "job-aware" they are: the more job-aware a tool is, the more the line begins to blur between orchestration and actual execution.
- DAGs: all workflow execution ultimately takes the form of a Directed Acyclic Graph (DAG), but not all tools require users to write their workflow definition using a static DAG representation. For example, many use cases require the dynamic creation of jobs at runtime, and others require rich control flow, neither of which are easy to express as static DAGs.
- What is being orchestrated: every domain comes with its own unique set of requirements and consequently some frameworks have chosen to focus on specific classes of workflows and the jobs therein. There are specialized tools devoted to microservices, CI/CD jobs, ETL processes, machine learning model builds, and many more.
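To make the static-DAG limitation above concrete, here is a hypothetical sketch in which the fan-out of tasks is only known at runtime; the shape of this workflow cannot be drawn ahead of time as a fixed graph. The function names are illustrative, not drawn from any real framework.

```python
def list_files():
    # Stand-in for discovering inputs at runtime (e.g. files landing in a bucket).
    return ["a.csv", "b.csv", "c.csv"]

def process(path):
    # Stand-in for a per-file job.
    return f"processed {path}"

def dynamic_workflow():
    # Dynamic task creation: the number of process() tasks is decided here,
    # at run time, so a static DAG definition cannot express it directly.
    results = [process(p) for p in list_files()]
    # Rich control flow: branch on runtime state rather than a predeclared edge.
    return results if results else ["no-op: nothing to process"]

print(dynamic_workflow())  # ['processed a.csv', 'processed b.csv', 'processed c.csv']
```

A tool that demands a static DAG forces this pattern into workarounds, such as pre-declaring a fixed maximum fan-out or generating the graph definition with an external script.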
Consequently, use cases are also highly varied depending on the team, for example:
- Automating an ETL process to keep a data warehouse up-to-date: this is the classic workflow orchestration problem and consists of scheduling the movement of data from a production database to a data warehouse for future analysis.
- Automating the scheduling of dependent API calls: many large enterprises rely on an assortment of different specialized tools that need to be “glued” together to form an automated workflow that can be monitored and triggered from a centralized control plane.
- Instrumenting observability into a highly distributed data science pipeline: an increasingly popular use case involves orchestrating the lifecycle of a data science / machine learning model build pipeline. This typically involves triggering a model build / experiment via an API call, resulting in infrastructure creation (and eventual teardown), fine-grained observability into pipeline status, and large amounts of data movement and state sharing.
- Automating report generation and dashboard updates: another classic workflow typically used by analysts for keeping their reporting dashboards up to date while hopefully detecting data issues early on. Interestingly, such workflows often have upstream dependencies on ETL workflows resulting in a powerful workflow-of-workflows pattern.
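The workflow-of-workflows pattern from the last bullet can be sketched as follows. This is a hedged illustration; the function names and dict-based run state are assumptions for the example, not any particular tool's API.

```python
def etl_workflow(source_rows):
    # Upstream workflow: a classic extract -> transform -> load into a "warehouse".
    warehouse = [row.strip().lower() for row in source_rows]
    return {"status": "success", "warehouse": warehouse}

def report_workflow(etl_run):
    # Downstream workflow: declares a dependency on the ETL run's state,
    # refusing to refresh a dashboard from a failed or stale load.
    if etl_run["status"] != "success":
        raise RuntimeError("upstream ETL failed; skipping report")
    return {"row_count": len(etl_run["warehouse"])}

run = etl_workflow(["  Alice ", "BOB"])
print(report_workflow(run))  # {'row_count': 2}
```

The key point is that the reporting workflow consumes the *state* of the ETL workflow, not just its schedule, which is what lets data issues be caught before they reach a dashboard.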
What are the traditional solutions to the orchestration problem?
Traditional solutions take many forms, including:
- Low Code / No Code: these solutions allow users to create, modify and monitor various jobs through a purpose-built GUI; many of the older enterprise orchestration tools such as ActiveBatch fall into this category, but there are modern examples as well such as AWS Step Functions.
- Open Source: these solutions allow users to self-host the full orchestration stack and tend to be more code-driven; being code-first allows for more customizable job definitions. Popular traditional examples in this category include Luigi, Airflow, and Azkaban.
- Homegrown: the orchestration problem has two properties that make it a prime candidate for custom in-house builds. First, it always arises after the fact, and teams rightfully try to avoid redefining their processes to fit an orchestration tool. Second, orchestration problems typically emerge iteratively and can initially seem "easy"; for example, the scheduling of a single script is an orchestration problem. Interestingly, many of the traditional open source orchestration tools began their life as a custom in-house solution, and many custom in-house solutions are DSLs on top of open source tools.
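To illustrate why the problem initially seems "easy", here is a hypothetical sketch of how many homegrown orchestrators begin: a loop that runs one job on a fixed interval. Dependencies, retries, backfills, and alerting then get bolted on piece by piece until the script has grown into an in-house framework.

```python
import time

def run_on_interval(job, interval_seconds, iterations):
    """The whole 'orchestrator', version one: a single job on a fixed interval."""
    results = []
    for _ in range(iterations):
        results.append(job())          # no retries, no dependencies, no alerting (yet)
        time.sleep(interval_seconds)
    return results

print(run_on_interval(lambda: "synced", 0, 3))  # ['synced', 'synced', 'synced']
```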
Why do these solutions fall short in the modern data stack?
Most of the solutions above were built with a very narrow focus; for example, as mentioned in the introduction, many of the open source orchestration tools were built specifically for coordinating Hadoop jobs. Because of this, they tend to fall short along a few dimensions:
- Performance at scale
- Handling of heterogeneous environments within one workflow
- A heavy bias toward batch jobs
- The ability to serve multiple personas
- Limited state and data sharing capabilities
- Integration with testing and CI/CD frameworks
In addition, developer experience is a big component of what defines the modern data stack, and many traditional tools require wrangling your already-defined processes into their niche paradigms.
What are the main considerations when choosing an orchestration tool?
In addition to the various dimensions discussed above, when evaluating a new orchestration tool it is important to ask yourself:
- How seamlessly will it fit in your current process and tech stack? Workflow orchestration problems always emerge as a byproduct of value-additive work, and as such should be minimally invasive to your current setup.
- Who are the personas that need to be able to write and configure your workflows? This will determine where on the no-code to code-first spectrum you fall, as well as what languages are supported.
- What are your major security and access considerations? For example, most enterprises will need some form of access and role enforcement. If using a workflow-of-workflows pattern, can different authors combine their workflows without risking security?
- What is the product vision for the tool? Having a sense of the direction the product will take can be just as important as its current feature set, as you need to ensure it can easily adapt to your changing business and tech stack needs.
- How much will your team need to maintain? It's important to make sure that the maintenance cost of the tool is accounted for; orchestration ends up being a critical piece of any tech stack, and ideally your team stays focused on delivering business value.