Push Down Filter for CSV files in Spark 3.0

Source: Databricks

Spark 3.0 is out, and there are a ton of improvements! But there is one nice improvement that is not yet highlighted in the announcement post: filter pushdown for CSV files.

Prior to Spark 3.0, when you loaded a CSV file, the file was read into memory and the filter was applied afterwards, which wasted CPU cycles and bandwidth. Now the data can be filtered as the files are read. This is similar to filter pushdown in Parquet, but now for CSV files.
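To make the difference concrete, here is a minimal sketch in plain Scala collections. It is not how Spark implements the optimization internally; the object name, the sample rows, and the Flight case class are all made up for illustration. It only contrasts "parse everything, then filter" with "check the filter column while reading and parse only the matches".

```scala
// Conceptual sketch only: plain Scala collections, not Spark internals.
object PushdownSketch {
  // A tiny in-memory "CSV file"; the rows are invented for illustration.
  val lines: Seq[String] = Seq(
    "ORIGIN_AIRPORT,DEPARTURE_DELAY",
    "SEA,5",
    "LAX,12",
    "SEA,-3"
  )

  final case class Flight(origin: String, delay: Int)

  // Fully converts one CSV line into a typed record.
  def parse(line: String): Flight = {
    val cols = line.split(",")
    Flight(cols(0), cols(1).toInt)
  }

  // Without pushdown: convert every row into a record, then filter.
  def readThenFilter(origin: String): List[Flight] =
    lines.tail.map(parse).filter(_.origin == origin).toList

  // With pushdown: test the filter column first, fully parse only the matches.
  def filterWhileReading(origin: String): List[Flight] =
    lines.tail.iterator
      .filter(line => line.takeWhile(_ != ',') == origin)
      .map(parse)
      .toList
}
```

Both functions return the same rows; the second one simply skips the full conversion for rows that the filter would discard anyway.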

Here is a quick example: I load a CSV file (the flights dataset from Kaggle), filter by ORIGIN_AIRPORT, and print out the execution plan.

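// Run in spark-shell, where `spark` and the `$"..."` column syntax are already available.
val df = spark
  .read.option("header", true)
  .csv("flights.csv") // placeholder path: point this at your copy of the dataset
  .filter($"ORIGIN_AIRPORT" === "UA")
df.explain() // print the execution plan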

With Spark 3.0, the plan looks like the screenshot below. The main highlight is the new DataFilters entry in the plan, which describes the filters.

I also ran a quick benchmark and did see a performance improvement.

Spark 2.4: Time to count a 7 GB CSV file with filter: 37240.61 ms
Spark 3.0: Time to count a 7 GB CSV file with filter: 31579.83 ms

The difference is about 5,661 ms, an improvement of roughly 15%.
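For anyone who wants to check the arithmetic, here is a small sketch that recomputes the difference and the relative improvement from the two timings above. The object and value names are my own; only the two numbers come from the benchmark.

```scala
object BenchmarkMath {
  val spark24Ms = 37240.61 // Spark 2.4 timing reported above
  val spark30Ms = 31579.83 // Spark 3.0 timing reported above

  // Absolute time saved, in milliseconds.
  val savedMs: Double = spark24Ms - spark30Ms

  // Improvement relative to the Spark 2.4 timing, in percent.
  val improvementPct: Double = savedMs / spark24Ms * 100
}
```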

If I remove the filter, both versions take almost the same time, about 13209.25 ms.
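If you want to try a similar measurement yourself, a helper along these lines could work. This is only a sketch of one way to time a block of code in Scala; the Timing object and the timed function are hypothetical names, and it is not necessarily how the numbers above were produced.

```scala
object Timing {
  // Runs `block` once and returns its result together with the
  // elapsed wall-clock time in milliseconds.
  def timed[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }
}
```

In spark-shell you could then wrap the action, for example Timing.timed(df.count()), and run it a few times, since a single run can be noisy.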

And the nicest thing is that you don't need to change anything. This optimization just works when you switch to Spark 3.0!
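If you ever want to compare timings with and without the optimization on the same Spark 3.0 installation, there is a session setting for it. The snippet below is a sketch that assumes an existing SparkSession named spark, as in spark-shell; to the best of my knowledge the key is spark.sql.csv.filterPushdown.enabled and it defaults to true, but check the configuration docs for your exact version.

```scala
// Assumes an existing SparkSession named `spark`, as in spark-shell.
// Turn CSV filter pushdown off for the current session, e.g. to compare timings.
spark.conf.set("spark.sql.csv.filterPushdown.enabled", "false")

// Restore the Spark 3.0 default.
spark.conf.set("spark.sql.csv.filterPushdown.enabled", "true")
```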




I write, so I learn.

Bao Nguyen
