Highlights from the Data and AI Summit (formerly the Apache Spark Summit) for the Busy IT Professional

Right now, I am actually waiting for the keynote of day 2 to kick off. So let me summarize the announcements from day 1 of the Data and AI Summit (DAIS) for you.

Delta.io

Delta Lake has grown up :-). Delta.io, the open-source project owned by the Linux Foundation that brings reliability to your data lake, has now matured to version 1.0. Happy Birthday, Delta! Key features of Delta 1.0:

  • Rust implementation…


How to Connect Google Looker to Databricks Delta Lake

Databricks, with its lakehouse architecture, is now available on the Google Cloud Platform (GCP). That means you can use your data science notebooks, the optimized Apache Spark engine, SQL Analytics, and Delta Lake with its open formats on all major clouds: you can now do your analytics where your data is.

Databricks also tightly integrates with GCP cloud services. For those with a BI or analytics background, the combination of Google’s Looker and Databricks lakehouse is particularly interesting. Looker is Google’s cloud-based enterprise platform for BI that lets you easily create stunning dashboards.

With Databricks on GCP, you can now…


Does predicate pushdown for Databricks on Google Cloud with BigQuery work? It does! And here is how to verify it.

When I tested the features of the recently released Databricks on the Google Cloud Platform, I checked out the BigQuery integration. Databricks is using a fork of the open-source Google Spark Connector for BigQuery. So I wondered how to check whether a certain predicate of a query is indeed pushed down to BigQuery (or not). It turns out it is easy!

Let’s take the natality public data set in Google BigQuery. The code from the notebook cell below uses the BigQuery Storage API to load a table from BigQuery into a DataFrame, pushing down the filter() predicate to BigQuery.
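
A minimal sketch of what such a read could look like, not the exact notebook cell. It assumes the connector is registered under the data source name bigquery, that the public table is bigquery-public-data.samples.natality with a year column, and that a Spark session is passed in; the helper name load_natality and the year threshold are made up for illustration. Depending on the workspace setup, additional options such as a billing project may be required.

```python
# Sketch only: assumes a Databricks notebook (or any Spark session) where the
# BigQuery connector is available under the data source name "bigquery".

NATALITY_TABLE = "bigquery-public-data.samples.natality"  # assumed table id


def load_natality(spark, min_year):
    """Return a DataFrame of natality rows with year >= min_year.

    The filter is applied to the DataFrame right after load(), so Spark can
    offer it to the connector as a pushed-down predicate instead of
    filtering the rows locally.
    """
    return (
        spark.read.format("bigquery")
        .option("table", NATALITY_TABLE)
        .load()
        .filter(f"year >= {int(min_year)}")
    )


# In a notebook, where `spark` is predefined:
#   df = load_natality(spark, 2005)
#   df.explain()  # inspect the physical plan for the year condition
```

One way to check the pushdown is to look at the physical plan printed by explain(): when the predicate is handed to the data source, the year condition should show up on the scan node rather than only as a separate Spark-side filter.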


AWS: What is New and What you Need to Know in 2020

Let’s face it! AWS’s biggest conference, re:Invent in Las Vegas, can be a bit overwhelming, with more than 65,000 attendees and more than 3,000 conference sessions.

I am part of the AWS technical evangelism team and, as every year, we are running dozens of technical re:Invent recap sessions all over the world. The picture below was taken at the recap session in Munich, which almost 150 people attended before Christmas.

If you want to browse the recap content on your own, you can find the slides below…


You missed out on AWS re:Invent 2019? You went to Vegas but you were too busy with other topics? Don’t panic! What happened in Vegas will stay in Vegas. The exception is all the tech announced at re:Invent. This is my (very) subjective must-see list of AWS container sessions from re:Invent 2019.

AWS made some major announcements in the container space this year. Yet here is the challenge: to catch up with all of the sessions in detail, you will be busy for at least half a day. And your head might be spinning after binge-watching all of them.

Here are some suggestions to catch up quickly and easily with all announcements:


Without writing Dockerfiles or Kubernetes YAML files.

This article describes a quick and easy way to go from writing a lightweight Java, Kotlin, or Scala application to a running Kubernetes service.

Introduction

Kubernetes (K8s) is an open-source container-orchestration system driven by a large, enthusiastic group of developers. Companies such as Amadeus, Bose, CERN, …, Zalando, and thousands of others build on top of Kubernetes. And yes, it’s free!

Despite Kubernetes being popular, stable, and well documented, some developers new to Kubernetes get lost in the technical details. How long should it take the average developer to go from an empty directory to a RESTful Java application running as…


In this article, I would like to list some resources that you might find useful when starting with Apache Kafka on AWS. They apply to Amazon MSK, but also to any other Kafka project.

  • The recording is online now:
  • kafkacat


Photo by Bing Han

When I prepared some demos about Amazon Managed Streaming for Apache Kafka (Amazon MSK), I realised that one of the CLI tools that I like to use was not available on my EC2 instance: the useful open-source kafkacat tool.

There seems to be no binary available for Amazon Linux via yum install, but you can quickly build it yourself. Doing so requires UNIX make, as well as a C and C++ compiler (so the whole task mentally throws you back to the time before “write once, run anywhere” became a thing).

kafkacat Installation on Amazon Linux 2

To build kafkacat yourself, just run the following commands on…


About a year ago, I started looking into service meshes, in particular Istio in combination with the Envoy proxy. Having spent a decade on and off in service bus projects, I wondered how a service mesh could replace the classical service bus that we have been using for more than 10 years now. In this article, I have summarised my talk from the CODE One 2018 conference in San Francisco and added the presentation (video and slides) as well as further references.

Frank Munz

Cloudy things, large scale data & compute. Developer Relations @Databricks EMEA. Twitter @frankmunz. Former Tech Evangelist @awscloud. vi/vim.
