Push Down Filter for CSV files in Spark 3.0

Source: Databricks

Spark 3.0 is out, and there are a ton of improvements! But there is one nice improvement that is not yet highlighted in the announcement post: push down filter for CSV files.

Prior to Spark 3.0, when you loaded a CSV file and filtered it, the file was first read into memory and the filter was applied afterwards, which wastes CPU cycles and bandwidth. Now the data can be filtered as the files are read. This is similar to the push down filter for Parquet, but now for CSV files.
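To make the difference concrete, here is a small plain-Python sketch of the two approaches. It is only an illustration of the idea, not how Spark implements it internally; the sample rows and function names are made up for this example, and only the ORIGIN_AIRPORT column name comes from the flights dataset.

```python
import csv
import io

# Hypothetical sample data; only the ORIGIN_AIRPORT column name comes from the article.
SAMPLE = """ORIGIN_AIRPORT,DESTINATION_AIRPORT,DISTANCE
LAX,JFK,2475
SFO,SEA,679
LAX,ORD,1744
"""


def read_then_filter(text, origin):
    """Materialize every row first, then filter (the pre-3.0 pattern)."""
    rows = list(csv.DictReader(io.StringIO(text)))  # all rows are held in memory
    kept = [r for r in rows if r["ORIGIN_AIRPORT"] == origin]
    return kept, len(rows)


def filter_while_reading(text, origin):
    """Apply the predicate as each row is parsed and keep only the matches."""
    kept = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["ORIGIN_AIRPORT"] == origin:
            kept.append(row)
    return kept, len(kept)


# Second value in each pair: how many rows had to be held in memory.
kept_a, held_a = read_then_filter(SAMPLE, "LAX")
kept_b, held_b = filter_while_reading(SAMPLE, "LAX")
print(held_a, held_b)
```

Both functions return the same rows, but the second one never holds the rows that fail the filter.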

Here is a quick example: I load a CSV file (flights dataset from Kaggle), then filter by ORIGIN_AIRPORT, then print out the execution plan.
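In PySpark, those steps could look roughly like the sketch below. The file path, the "LAX" value, the app name, and the function name are placeholders I picked for illustration, and it assumes pyspark is installed and the CSV has a header row.

```python
def show_filtered_plan(csv_path, origin="LAX"):
    """Load a CSV, filter by ORIGIN_AIRPORT, and print the physical plan."""
    # Imported lazily so the function can be defined without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-filter-pushdown").getOrCreate()

    # header=True uses the first line as column names;
    # inferSchema=True asks Spark to guess the column types.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    filtered = df.filter(df["ORIGIN_AIRPORT"] == origin)
    filtered.explain()  # prints the physical plan to stdout
    return filtered
```

Calling `show_filtered_plan("flights.csv")` (a hypothetical path) prints the plan for the filtered DataFrame.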

With Spark 3.0, the plan looks like the screenshot below. The main highlight is the new DataFilters entry in the plan, which describes the filters.

I also ran a quick benchmark and did see a performance improvement.

Spark 2.4: Time to count a 7 GB CSV file with filter: 37240.61 ms
Spark 3.0: Time to count a 7 GB CSV file with filter: 31579.83 ms

The difference is ~5,661 ms, an improvement of about 15%.
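For anyone who wants to check the arithmetic, the numbers work out like this (the variable names are mine; the two timings are the ones measured above):

```python
# Timings from the benchmark above, in milliseconds.
spark_24_ms = 37240.61
spark_30_ms = 31579.83

saved_ms = spark_24_ms - spark_30_ms
improvement_pct = saved_ms / spark_24_ms * 100  # relative to the Spark 2.4 time

print(f"{saved_ms:.2f} ms saved, {improvement_pct:.1f}% faster")
```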

If I remove the filter, both versions take almost the same time: 13209.25 ms.

And the nicest thing is that you don't need to change anything. This optimization just works when you switch to Spark 3.0!
