Predicate Pushdown for Apache Spark with Google BigQuery

Does predicate pushdown for Databricks on Google Cloud with BigQuery work? It does! And here is how to verify it.

Frank Munz
Geek Culture

When I tested the features of the recently released Databricks on Google Cloud platform, I checked out the BigQuery integration. Databricks uses a fork of the open-source Google Spark Connector for BigQuery, so I wondered how to verify whether a given query predicate is actually pushed down to BigQuery (or not). It turns out this is easy!

Let’s take the natality public data set in Google BigQuery. The code in the notebook cell below uses the BigQuery Storage API to load a table from BigQuery into a dataframe, pushing the filter() predicate down to BigQuery.
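A minimal sketch of such a cell, assuming the standard spark-bigquery connector source name "bigquery" and matching the pushed filters shown further below:

# Load the public natality table from BigQuery into a Spark dataframe.
# The filter() predicates are candidates for pushdown to BigQuery.
df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.natality")
    .load()
    .filter("state = 'CA' AND weight_pounds > 11.0")
)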

You can get the execution plan of the query, including its optimizations, with the explain() method. For more verbose output, use explain("extended").

df.explain()

The mini example above is available as part of a notebook I used for further Databricks on GCP BigQuery tests. The full output of explain() contains the execution plan and lists all the applied optimizations. Look for the part describing the predicates pushed down to BigQuery, marked with the keyword PushedFilters:

PushedFilters: [*IsNotNull(state), *IsNotNull(weight_pounds), *EqualTo(state,CA), *GreaterThan(weight_pounds,11.0)]

The output above shows that the predicates pushed down to BigQuery are exactly the conditions of the Spark query.
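If you want to check this programmatically rather than by reading the plan output, one option is to search the plan string for the PushedFilters keyword. The sketch below relies on PySpark's internal _jdf handle, which is not a public API and may change between Spark versions:

# Render the physical plan as a string and check for pushed predicates.
# Note: _jdf and queryExecution() are internal Spark interfaces.
plan = df._jdf.queryExecution().executedPlan().toString()
assert "PushedFilters" in plan, "no predicates were pushed down"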

Databricks Spark on GCP optimizes for:

  • nested filter pushdown and nested column pruning (illustrated in the sketch after this list)
  • array pushdown
  • expression pushdown
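To illustrate nested filter pushdown and nested column pruning, here is a sketch against a hypothetical table (the table name and the payload struct fields are made up): filtering on a nested field lets the connector push the predicate down to BigQuery, and selecting a single sub-field prunes the rest of the struct.

# Hypothetical table with a nested STRUCT column named payload.
events = (
    spark.read.format("bigquery")
    .option("table", "my_project.my_dataset.events")  # hypothetical table
    .load()
)
# Nested filter pushdown: the predicate on payload.country can be pushed
# down to BigQuery; selecting only payload.user_id prunes the rest of
# the struct (nested column pruning).
events.filter("payload.country = 'DE'").select("payload.user_id").explain()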

A notebook containing all the Spark query optimization examples above can be found in the Databricks documentation.

Where to go from here?

[1] Databricks Integration with BigQuery blog post
[2] Databricks on Google Cloud free trial
[3] BigQuery Sample Notebook

Please clap for this article if you enjoyed reading it as much as I enjoyed writing it. I spend way too much time on Twitter, so feel free to connect: @frankmunz.

Responses (2)


I see in the explain plan from one of the Git links that the number of rows shown is equal to the total number of rows in the table. Is that what it shows, rather than the actual number of rows read?
bigquery-public-data.sam…

How do you know this is actually working, though? The pushdown filter may not actually be acting on BigQuery. Have you done any testing, reading the same data with and without this method to compare the input sizes?