Big Data ecosystem turning to Rust: an overview

Java is synonymous with the last generation of Big Data tools and technologies. But a lot has changed since the 2000s. Recent advances in CPUs, memory, networking and programming languages, combined with a huge uptick in raw data, have led to a new wave of big data tools in modern languages like Go and Rust.

In this post we'll look at why Rust holds promise for the modern Big Data era, and go over a few of the new products written in Rust that are already out there.

Why Rust

For systems programmers, Rust seems to be the ideal language - it has the performance and control of C/C++, with the memory and concurrency safety of Haskell. With its support for the functional paradigm, Rust is even more attractive for building reliable, performant systems.

Here are some specific reasons why Rust is a language of choice for building a new data platform in 2022.

  • Memory model: Traditional data platforms almost always needed a JVM to run. Configuring and optimising JVM memory parameters in containers is one of the biggest operational and cost challenges. Even without the JVM, garbage-collected languages, and hence the systems built with them, suffer unpredictable slowdowns when the GC kicks in. With Rust's ownership model and no GC, things become orders of magnitude simpler. Memory availability and usage are transparent in systems written in Rust, which forms a strong base for a scalable platform that is easy to manage and run.
  • Cloud-native approach: Rust has its quirks around static vs dynamic linking - but Rust dependencies are statically linked into the binary by default, and even when there are external (native) C/C++ dependencies, there is a simple approach to build fully static binaries, for example by targeting musl on Linux. You may be wondering what is cloud-native about static binaries - well, static binaries mean no external runtime dependencies. This makes for simpler, easy-to-build containers. Static binaries are key to the success of Go, and Rust has done well to follow a similar approach.
  • Low level access: While Rust is a high level language, it offers a great deal of control over low level constructs like memory allocation (heap vs stack), multi-threading and concurrency. This helps build systems that have better control over the host machine and scale better.
  • Ecosystem: The Rust ecosystem is thriving, with many new projects now written in Rust. The Awesome Rust repo is a great resource for exploring the length and breadth of open source Rust projects. With these new projects and the communities of developers and users around them, Rust is no longer on the fringes; it is a well-established, mainstream programming language.
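
To make the memory model and low level access points a bit more concrete, here is a minimal sketch using only the standard library. The `Buffer` type and the `parallel_sum` and `process_batch` functions are hypothetical names invented for this illustration; they are not taken from any of the projects discussed in this post. The sketch assumes a Rust toolchain recent enough to provide `std::thread::scope` (1.63 or later).

```rust
use std::thread;

/// A heap-backed buffer that logs when it is freed, to make deallocation visible.
struct Buffer {
    name: &'static str,
    data: Vec<u64>, // Vec stores its elements on the heap
}

impl Drop for Buffer {
    fn drop(&mut self) {
        // Runs exactly when the owner goes out of scope; no collector is involved.
        println!("freeing {} ({} values)", self.name, self.data.len());
    }
}

/// Sums `values` by splitting the slice across up to `workers` scoped threads.
/// The threads only borrow the data; the compiler checks that the borrows are safe.
fn parallel_sum(values: &[u64], workers: usize) -> u64 {
    let workers = workers.max(1);
    // Ceiling division, with a minimum of 1 because `chunks(0)` would panic.
    let chunk_len = ((values.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        // Spawn one thread per chunk first, then join them all.
        let handles: Vec<_> = values
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}

/// Builds a batch on the heap, sums it, and frees it before returning.
fn process_batch() -> u64 {
    let batch = Buffer { name: "batch", data: (1..=1_000).collect() };
    parallel_sum(&batch.data, 4)
    // `batch` is dropped here, so its heap memory is released immediately.
}

fn main() {
    let on_stack = [10u64, 20, 30]; // fixed-size array: lives on the stack
    println!("stack sum = {}", on_stack.iter().sum::<u64>());
    println!("batch sum = {}", process_batch());
}
```

When this runs, the "freeing batch" line is printed before the batch sum, because the buffer is released at the end of `process_batch` rather than at some later point chosen by a collector. The scoped threads share the buffer by reference without any copying or locking, which is the kind of control the points above refer to.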

Big Data Tools in Rust

The future looks bright for Rust, and several modern data tools are already built in it. Here's a look at some of the most popular ones:

  • Arrow: Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. The Rust implementation is one of the most complete Arrow libraries out there. Arrow already serves as the underlying technology for InfluxDB IOx, Ballista, DataFusion and others. With features like zero-copy data transfer, Arrow has become a leading in-memory data format.
  • DataFusion: DataFusion is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format. It is used to build modern, fast and efficient data pipelines, ETL processes and database systems that need the performance of Rust and Apache Arrow and want to offer their users the convenience of a SQL interface or a DataFrame API.
  • Polars: Polars is a blazingly fast DataFrame library implemented in Rust, using the Apache Arrow format as its memory model.
  • Meilisearch: Meilisearch is a fast, highly relevant search engine that offers a RESTful search API.
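
Arrow's in-memory format is columnar: the values of one column sit next to each other in a contiguous buffer, with a separate marker for missing values. The sketch below imitates that idea in plain standard-library Rust. `F64Column` and its methods are hypothetical names made up for this example and are not the API of the Arrow crate, which uses compact bitmaps and its own array types; a `Vec<bool>` is used here only to keep the illustration short.

```rust
/// A hypothetical nullable column of f64 values, stored contiguously.
/// This mimics the idea behind columnar formats; it is not the Arrow API.
struct F64Column {
    values: Vec<f64>,    // one contiguous buffer holding every slot
    validity: Vec<bool>, // true where the slot holds a real value
}

impl F64Column {
    /// Builds a column from optional values; nulls get a placeholder of 0.0.
    fn from_options(items: &[Option<f64>]) -> Self {
        let values = items.iter().map(|v| v.unwrap_or(0.0)).collect();
        let validity = items.iter().map(|v| v.is_some()).collect();
        Self { values, validity }
    }

    /// Sums the non-null values in a single pass over the two buffers.
    fn sum(&self) -> f64 {
        self.values
            .iter()
            .zip(&self.validity)
            .filter(|(_, ok)| **ok)
            .map(|(v, _)| *v)
            .sum()
    }

    /// Counts the slots marked as null.
    fn null_count(&self) -> usize {
        self.validity.iter().filter(|ok| !**ok).count()
    }
}

fn main() {
    // Four readings, one of them missing.
    let temps = F64Column::from_options(&[Some(21.5), None, Some(18.0), Some(24.5)]);
    println!("sum = {}, nulls = {}", temps.sum(), temps.null_count());
}
```

Because an aggregate like `sum` touches only the buffers of the column it needs, a scan stays cache friendly, which is part of why engines such as DataFusion and Polars build on a columnar format.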

Summary

Rust has emerged as a leading ecosystem for building fast, reliable tools and technologies. In this post, we saw why Rust is considered one of the best choices for building modern data platforms.

The new wave of data platforms, pipelines and query engines is going to be built in Rust.