Push Down Filter for CSV files in Spark 3.0

Source: DataBricks

Spark 3.0 is out, and there are ton of improvements! But there are a nice improvement that is not yet highlighted in the announcement post: Push down filter for CSV file.

Prior to Spark 3.0, when you load a CSV file, the CSV file is read to memory then apply filter, which is a waste of CPU cycle and bandwidth. Now, the data can be filtered as the files are read. This is similar to push down filter in Parquet but now for CSV files.

Here is a quick example: I load a CSV file (flights dataset from Kaggle), then filter by ORIGIN_AIRPORT, then print out the execution plan.

val df = spark
.read.option("header", true)
.filter($"ORIGIN_IRPORT" === "UA")

With Spark 3.0, you would see as below screenshot. The main highlight is the new DataFilters in the plan describing about the filters.

I also try a quick benchmark and do see performance improvement.

Spark 2.4: Time to count a 7gb CSV file with filter: 37240.61ms
Spark 3.0: Time to count a 7gb CSV file with filter: 31579.83ms

The difference is ~5,661ms, almost 15% improvement.

If I remove the filter, both yield almost the same time of 13209.25ms.

And the nicest thing is you don’t need to change anything, this optimization will just work when you switch to Spark 3.0!




I write, so I learn.

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

EV Charging Station Management Best Practices

Mobile App That Speaks with DeepMind WaveNet Google Cloud Text To Speech ML API and Flutter

The Important Lesson I Learned Fixing My Fridge’s Door

Docker Networking — Containers Communication

How To Build A Simple Render Engine From Scratch

Launching an EC2 instance in AWS & attaching EBS volume to it

Creating matrices with List comprehension — Identity as normal

Rediscovering Engineering

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Bao Nguyen

Bao Nguyen

I write, so I learn.

More from Medium

Configure Spark Application

Pyspark Series: How to Delete and Install Pyspark on Ubuntu

colab with pyspark

Activate Spark Authentication (more like simple authentication)