Predicate Pushdown for Apache Spark with Google BigQuery

Does predicate pushdown for Databricks on Google Cloud with BigQuery work? It does! And here is how to verify it.

Frank Munz
Geek Culture

When I tested the features of the recently released Databricks on Google Cloud platform, I checked out the BigQuery integration. Databricks uses a fork of the open-source Google Spark Connector for BigQuery, so I wondered how to verify whether a given query predicate is actually pushed down to BigQuery (or not). It turns out this is easy!

Let’s take the natality public data set in Google BigQuery. The code in the notebook cell below uses the BigQuery Storage API to load a table from BigQuery into a dataframe, pushing the filter() predicate down to BigQuery.
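A minimal sketch of such a cell, assuming the standard spark-bigquery connector source name "bigquery" and matching the pushed filters shown further below:

# Load the public natality table from BigQuery into a Spark dataframe.
# The filter() predicates are candidates for pushdown to BigQuery.
df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.samples.natality")
    .load()
    .filter("state = 'CA' AND weight_pounds > 11.0")
)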

You can get the execution plan of the query, including its optimizations, with the explain() method. For more verbose output, use explain("extended").

df.explain()

The mini example above is available as part of a notebook I used for further Databricks on GCP BigQuery tests. The full output of explain() contains the execution plan and lists all the applied optimizations. Look for the part describing the predicates pushed down to BigQuery, marked with the keyword PushedFilters:

PushedFilters: [*IsNotNull(state), *IsNotNull(weight_pounds), *EqualTo(state,CA), *GreaterThan(weight_pounds,11.0)]

The output above shows that the predicates pushed down to BigQuery are exactly the conditions of the Spark query.
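If you want to check this programmatically rather than by reading the plan output, one option is to search the plan string for the PushedFilters keyword. The sketch below relies on PySpark's internal _jdf handle, which is not a public API and may change between Spark versions:

# Render the physical plan as a string and check for pushed predicates.
# Note: _jdf and queryExecution() are internal Spark interfaces.
plan = df._jdf.queryExecution().executedPlan().toString()
assert "PushedFilters" in plan, "no predicates were pushed down"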

Databricks Spark on GCP optimizes for:

  • nested filter pushdown and nested column pruning (illustrated in the sketch after this list)
  • array pushdown
  • expression pushdown
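To illustrate nested filter pushdown and nested column pruning, here is a sketch against a hypothetical table (the table name and the payload struct fields are made up): filtering on a nested field lets the connector push the predicate down to BigQuery, and selecting a single sub-field prunes the rest of the struct.

# Hypothetical table with a nested STRUCT column named payload.
events = (
    spark.read.format("bigquery")
    .option("table", "my_project.my_dataset.events")  # hypothetical table
    .load()
)
# Nested filter pushdown: the predicate on payload.country can be pushed
# down to BigQuery; selecting only payload.user_id prunes the rest of
# the struct (nested column pruning).
events.filter("payload.country = 'DE'").select("payload.user_id").explain()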

A notebook containing all the Spark query optimization examples above can be found in the Databricks documentation.

Where to go from here?

[1] Databricks Integration with BigQuery blog post
[2] Databricks on Google Cloud free trial
[3] BigQuery Sample Notebook

Please clap for this article if you enjoyed reading it as much as I enjoyed writing it. I spend way too much time on Twitter, so feel free to connect: @frankmunz.

Responses (2)


I see in the explain plan from one of the Git links that the number of rows shown is equal to the total number of rows in the table. Is that what it shows, rather than the actual number of rows read?
bigquery-public-data.sam…

How do you know this is actually working, though? The pushdown filter may not actually be acting on BigQuery. Have you done any testing, reading the same data with and without this method to compare the input sizes?