DataFusion
From charlesreid1
Contents
- 1 About DataFusion
- 1.1 Types of Systems and Examples
- 1.1.1 Specialized Analytical Databases
- 1.1.2 Distributed SQL Query Engines & Big Data Systems
- 1.1.3 Query Language Engines & Accelerators
- 1.1.4 SQL Support for Existing Libraries & Frameworks
- 1.1.5 Streaming Data Platforms
- 1.1.6 Data Integration & ETL Tools
- 1.1.7 Data Exploration & Utility Tools
- 1.1.8 Observability Platforms
- 1.1.9 Semantic Layer Platforms
- 1.1.10 Machine Learning & AI Infrastructure
- 1.1.11 Replacements & Enhancements for Existing Systems
- 1.1.12 Research Platforms
About DataFusion
Apache DataFusion serves as a powerful and flexible query engine that developers use as a foundation to build a wide variety of data-centric systems. Instead of building a query processing and optimization layer from scratch, projects leverage DataFusion's capabilities.
The sections below list examples of systems that have been built using Apache DataFusion.
The common thread across these examples is that DataFusion provides the core query processing capabilities (SQL parsing, logical and physical planning, optimization, and execution against data formats such as Parquet, CSV, JSON, and Avro), allowing developers to focus on the unique features and domain-specific logic of their applications. Its Rust foundation offers high performance and memory safety, while its Apache Arrow integration provides an efficient columnar in-memory data representation.
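The plan, optimize, and execute stages named above can be illustrated with a toy model. This is only a conceptual sketch in plain Python: the `Scan`, `Filter`, and `Project` classes, the `push_down_projection` rule, and the `metrics` table are all hypothetical names invented for illustration, and none of them are DataFusion's actual types or API. The SQL parsing step is skipped, and the plan is built by hand.

```python
from dataclasses import dataclass
from typing import Optional

# Toy logical plan nodes (hypothetical names, not DataFusion types).
@dataclass(frozen=True)
class Scan:
    table: str
    columns: Optional[tuple] = None  # None means "read every column"

@dataclass(frozen=True)
class Filter:
    input: object
    column: str
    min_value: int  # keeps rows where row[column] > min_value

@dataclass(frozen=True)
class Project:
    input: object
    columns: tuple

def push_down_projection(plan, needed=None):
    """Optimizer rule: tell the scan which columns the plan actually uses."""
    if isinstance(plan, Project):
        child = push_down_projection(plan.input, set(plan.columns))
        return Project(child, plan.columns)
    if isinstance(plan, Filter):
        # The filter column must survive until the filter has run.
        child_needed = None if needed is None else needed | {plan.column}
        child = push_down_projection(plan.input, child_needed)
        return Filter(child, plan.column, plan.min_value)
    if isinstance(plan, Scan):
        columns = plan.columns if needed is None else tuple(sorted(needed))
        return Scan(plan.table, columns)
    raise TypeError(f"unknown plan node: {plan!r}")

def execute(plan, tables):
    """Execution: evaluate the plan bottom-up over in-memory rows."""
    if isinstance(plan, Scan):
        rows = tables[plan.table]
        if plan.columns is None:
            return [dict(row) for row in rows]
        return [{c: row[c] for c in plan.columns} for row in rows]
    if isinstance(plan, Filter):
        rows = execute(plan.input, tables)
        return [row for row in rows if row[plan.column] > plan.min_value]
    if isinstance(plan, Project):
        rows = execute(plan.input, tables)
        return [{c: row[c] for c in plan.columns} for row in rows]
    raise TypeError(f"unknown plan node: {plan!r}")

# Hypothetical sample data standing in for a Parquet or CSV file.
TABLES = {
    "metrics": [
        {"host": "a", "cpu": 91, "mem": 40},
        {"host": "b", "cpu": 35, "mem": 72},
        {"host": "c", "cpu": 77, "mem": 55},
    ]
}

# Hand-built logical plan for: SELECT host FROM metrics WHERE cpu > 50
plan = Project(Filter(Scan("metrics"), "cpu", 50), ("host",))
optimized = push_down_projection(plan)
result = execute(optimized, TABLES)
print(optimized)
print(result)
```

In this sketch the optimized scan reads only the `cpu` and `host` columns, which mirrors in miniature why columnar formats and projection pushdown work well together. A real engine applies many such rules and then lowers the logical plan to a physical one before execution.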
Types of Systems and Examples
Specialized Analytical Databases
DataFusion's extensibility makes it suitable for building database systems tailored to specific analytical needs, particularly time-series workloads.
- InfluxDB 3.0: A widely-used time-series database that leverages DataFusion for its query engine.
- GreptimeDB, HoraeDB (formerly CeresDB), CnosDB: Open-source time-series databases built using DataFusion.
- Seafowl: A CDN-friendly analytical database.
- ParadeDB: PostgreSQL for search and analytics.
Distributed SQL Query Engines & Big Data Systems
DataFusion can be used to build systems that distribute query processing across multiple nodes, similar to Apache Spark.
- Ballista: A distributed SQL query engine built on Apache Arrow and DataFusion, designed to compete with systems like Spark.
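One common way to distribute a query across nodes is two-stage aggregation: each worker computes a partial result over its own partition, and a coordinator merges the partials. The sketch below illustrates that idea only; it is not Ballista's implementation or API. Threads stand in for worker nodes, and the function names, the `readings` query, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def partial_aggregate(partition):
    """Stage 1 (per worker): compute (sum, count) per key for one partition."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in partition:
        acc[key][0] += value
        acc[key][1] += 1
    return {key: tuple(state) for key, state in acc.items()}

def merge_partials(partials):
    """Stage 2 (coordinator): combine partial states, then finalize AVG."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (part_sum, part_count) in partial.items():
            total[key][0] += part_sum
            total[key][1] += part_count
    return {key: s / n for key, (s, n) in total.items()}

# Hypothetical query: SELECT sensor, AVG(reading) FROM readings GROUP BY sensor
# Each inner list plays the role of the data held by one worker node.
partitions = [
    [("s1", 10.0), ("s2", 4.0)],
    [("s1", 20.0), ("s2", 6.0), ("s1", 30.0)],
]

# Threads stand in for separate nodes in this sketch.
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(partial_aggregate, partitions))

averages = merge_partials(partials)
print(averages)
```

AVG is carried as a (sum, count) pair because averages of averages are wrong when partitions differ in size; only the final stage divides.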
Query Language Engines & Accelerators
DataFusion can power new query languages or accelerate existing ones.
- Comet (originally developed at Apple, now Apache DataFusion Comet): An accelerator for Apache Spark that runs supported query operators natively on DataFusion for improved performance.
- VegaFusion: Provides server-side acceleration for the Vega visualization grammar.
- PRQL-query: An engine for the PRQL (Pipelined Relational Query Language).
SQL Support for Existing Libraries & Frameworks
DataFusion can add SQL querying capabilities to existing data tools and libraries.
- Dask SQL: Integrates SQL query capabilities into the Dask parallel computing library in Python.
Streaming Data Platforms
DataFusion's architecture is also suitable for building systems that process continuous streams of data.
- Synnada: A streaming-first framework for data products.
- Arroyo: A distributed stream processing engine written in Rust.
- Kamu: A planet-scale streaming data pipeline.
Data Integration & ETL Tools
DataFusion's ability to read various formats and execute SQL makes it a good fit for Extract, Transform, Load (ETL) pipelines.
- No specific product is listed here, but DataFusion's core capabilities (reading multiple file formats and running SQL over them) are well suited to building custom ETL solutions.
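The extract, transform, and load steps can be sketched with the Python standard library alone. Note the substitution: this example uses Python's built-in `sqlite3` module as a stand-in SQL engine so that it runs without extra dependencies; it does not use DataFusion. In a DataFusion-based pipeline the engine would fill the role `sqlite3` plays here. The `requests` table, its columns, and the sample rows are hypothetical.

```python
import csv
import io
import json
import sqlite3

# Hypothetical input; a real pipeline would read a file or object store.
RAW_CSV = """host,status,latency_ms
web-1,200,12
web-1,500,340
web-2,200,18
web-2,200,22
"""

def run_etl(raw_csv):
    # Extract: parse the CSV text into a list of dictionaries.
    rows = list(csv.DictReader(io.StringIO(raw_csv)))

    # Stage the rows in the stand-in SQL engine (sqlite3, NOT DataFusion).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests (host TEXT, status INTEGER, latency_ms INTEGER)"
    )
    conn.executemany(
        "INSERT INTO requests VALUES (:host, :status, :latency_ms)", rows
    )

    # Transform: express the business logic as SQL.
    cursor = conn.execute(
        "SELECT host, COUNT(*) AS ok_requests, AVG(latency_ms) AS avg_latency_ms "
        "FROM requests WHERE status = 200 GROUP BY host ORDER BY host"
    )
    names = [column[0] for column in cursor.description]

    # Load: emit one JSON document per output row (JSON Lines style).
    lines = [json.dumps(dict(zip(names, row))) for row in cursor]
    conn.close()
    return lines

json_lines = run_etl(RAW_CSV)
for line in json_lines:
    print(line)
```

Keeping the transformation in SQL means the same query text could, in principle, be pointed at a different engine or input format with little change to the surrounding code.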
Data Exploration & Utility Tools
Simple tools for quick data inspection and manipulation.
- qv: A command-line tool for quickly viewing and transcoding data in formats like Parquet, CSV, Avro, and JSON.
Observability Platforms
Systems for collecting, storing, and querying telemetry data like logs and metrics.
- OpenObserve (formerly ZincObserve), Parseable: Cloud-native observability platforms.
Semantic Layer Platforms
Tools that provide a unified business view of data.
- Cube Store: The storage and caching layer of Cube's universal semantic layer platform, built on DataFusion.
Machine Learning & AI Infrastructure
Platforms that support ML workflows, often involving large-scale data processing and querying.
- LanceDB: A vector database for AI/ML that uses DataFusion to support SQL queries over multimodal data.
- Spice.ai: Develops building blocks for data-driven AI applications, using DataFusion for SQL interfaces.
Replacements & Enhancements for Existing Systems
DataFusion can be used to enhance or replace components of existing data systems for better performance or new features.
- Blaze (blaze-rs): A project aimed at providing a faster Spark runtime replacement using DataFusion.
Research Platforms
DataFusion's modularity makes it a good base for experimenting with new database technologies.
- Flock: A research platform for new database systems.