Databricks on Google Cloud

How this exciting development will simplify your modern data platform across cloud vendors


Databricks, the modern cloud data platform, is coming to Google Cloud. So why is this such a big deal and how will this change benefit your organization?

In the last few years, Databricks has been making waves in the world of data and Artificial Intelligence (AI) by providing a platform for solving innovative data challenges. These challenges range from massive Hadoop migrations (moving expensive, on-premises legacy systems into a more agile and cost-effective cloud framework) to building highly dynamic and scalable data science environments, all while allowing teams to collaborate in order to drive business outcomes and value.

Databricks has been a focal point of many modern data architectures implemented in the cloud today. The demand for Databricks is only growing, leading to an exciting announcement about the debut of Databricks on Google Cloud.

How will this benefit organizations?

Databricks provides a single-platform experience within organizations, giving many different types of users an avenue to leverage scalable compute, efficient data storage, and collaborative development and machine learning experimentation.

It provides a workspace for data engineers to transform and store unlimited amounts of data in cloud storage through batch-based processing or streaming Extract, Transform, Load (ETL) jobs. Data analysts can query that data with user-friendly interfaces for finding quick insights, exploring data or connecting their own Business Intelligence (BI) tools for driving reporting outcomes. Data scientists can use the platform to explore data, create and experiment with machine learning models, and build iterative MLOps processes for driving change and innovation.
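As a minimal sketch of the data engineering workflow described above, a batch ETL job in a Databricks notebook could look like the following. The bucket paths, column names, and filter logic here are hypothetical placeholders, and the `spark` session is assumed to be provided by the Databricks runtime:

```python
# Batch ETL sketch for a Databricks notebook on Google Cloud.
# All paths and column names are hypothetical; `spark` is the
# SparkSession that Databricks injects into every notebook.
from pyspark.sql import functions as F

# Extract: read raw JSON events from a Google Cloud Storage bucket
raw = spark.read.json("gs://example-raw-bucket/events/")

# Transform: keep completed orders and derive a revenue column
orders = (
    raw.filter(F.col("status") == "completed")
       .withColumn("revenue", F.col("quantity") * F.col("unit_price"))
)

# Load: persist as a Delta table so analysts and BI tools can query it
(orders.write
       .format("delta")
       .mode("overwrite")
       .save("gs://example-curated-bucket/orders/"))
```

The same pattern extends to streaming by swapping `spark.read` for `spark.readStream` against a source such as Pub/Sub or cloud files.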

The fact that these services are now offered in Google Cloud, as a fully managed, integrated Software as a Service (SaaS) solution, provides a first-class experience for organizations to innovate with data and AI. Let's evaluate a few different scenarios where Databricks on Google Cloud can enable that further.

  • Scalable compute: Running on Google Kubernetes Engine (GKE), Databricks clusters can be provisioned in minutes. Databricks executes workloads utilizing open-source Spark, with in-memory processing for fast and efficient data pipelines. Built-in options, such as auto-scaling clusters, allow compute to scale with workload demand while reducing costs when the cluster is not being heavily utilized.
  • Improved productivity and collaboration: Databricks provides an integrated view for developing, collaborating, deploying and sharing progress between data engineers, data scientists and data analysts. It provides these capabilities through support for co-developing in the same notebooks, sharing SQL queries, versioning changes, storing Machine Learning (ML) models and publishing dashboards.
  • Flexibility: Databricks allows users to develop with a host of languages for Spark. This includes SQL, Python, R, Scala and Java.
  • Faster insights: Databricks provides a one-stop shop for querying data, either in the platform itself, or through an Integrated Development Environment (IDE) or BI tool of the users’ choice.
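To make the auto-scaling point concrete, scaling behavior is declared on the cluster itself. A cluster definition submitted through the Databricks Clusters API might look roughly like this; the runtime version and node type are workspace-specific, so they are left as placeholders:

```json
{
  "cluster_name": "example-autoscaling-cluster",
  "spark_version": "<databricks-runtime-version>",
  "node_type_id": "<gcp-node-type>",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "autotermination_minutes": 30
}
```

With this configuration, Databricks adds workers up to the maximum as load grows, removes them as load falls, and terminates the cluster entirely after 30 idle minutes.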

Google Cloud integrations

Coming out of this announcement, Databricks on Google Cloud supports several integrations within the Google Cloud Platform (GCP) ecosystem. Databricks provides seamless integration with GCP storage services such as Google Cloud Storage, Google Cloud SQL, Google Pub/Sub and Google BigQuery. Additional integrations for end-to-end analytics and ML include Looker and the Google AI Platform.
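As a sketch of what the BigQuery integration looks like in practice, a Databricks notebook can pull a BigQuery table straight into a Spark DataFrame via the BigQuery connector. The project, dataset, and table names below are hypothetical:

```python
# Read a BigQuery table into a Spark DataFrame (Databricks notebook sketch).
# The fully qualified table name is a placeholder; `spark` is provided
# by the Databricks runtime.
df = (
    spark.read.format("bigquery")
         .option("table", "example-project.example_dataset.page_views")
         .load()
)

# Aggregate in Spark, then land the result in Google Cloud Storage
daily = df.groupBy("view_date").count()
(daily.write
      .format("delta")
      .mode("overwrite")
      .save("gs://example-bucket/daily_page_views/"))
```

The connector pushes the read down to BigQuery's storage API, so the heavy lifting of the scan happens on the Google Cloud side before Spark takes over.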

Integration with Google Cloud Identity for Single Sign-On (SSO) and credential passthrough helps you easily onboard your user base within your existing infrastructure. These native integrations free users to spend their time driving value, whether embedding the platform alongside existing GCP services or making it a component of a future architecture that leverages them.

Bringing the Lakehouse to your Google Cloud

With these integrations in place within Google Cloud, the next step for organizations is to take advantage of the next evolution in open data architectures: the Lakehouse, a new data architecture that combines some of the best elements of scalable data lakes and the more familiar business-driven data warehouses.

The capabilities of the architecture allow organizations to redesign their data warehouses for the modern world, or to break down the data silos and hurdles that exist across business groups and data science teams. The goal is to make an organization agile in its data ingestion, storage, processing and analysis, so it can drive quick insights and business value.


Here are a few key drivers we at Insight see for the Lakehouse concept — and how Databricks on Google Cloud works to address these problems.

  • Separate compute and storage: Using Google Cloud Storage provides the ability to scale storage for raw and transformed data coming from today’s diverse IT systems. Using scalable compute from Databricks provides the ability to scale compute vertically and horizontally over that data to allow concurrent users to access and transform data as needed.
  • Support for transactions: Many users and groups may be reading and writing data concurrently, something that data lakes often struggle with. Support for Atomicity, Consistency, Isolation, Durability (ACID) transactions is essential in order to make sure there are no conflicts among different parties when reading and writing this data concurrently. Using Delta Lake, an open-source format with Databricks, provides this ACID support natively.
  • Support for structured, semi-structured and unstructured data: Something that data warehouses often struggle with — being able to efficiently store, transform, analyze and query a wide array of data types — is essential to the Lakehouse. This can range for anything from text, images and video to JavaScript Object Notation (JSON), Extensible Markup Language (XML) and any other type of data needed to support additional data and AI efforts. Storing all this data inside Google Cloud Storage, and being able to access it natively and in parallel with Databricks, opens up potential for many use cases that were difficult to process before because of the lack of flexibility on data types.
  • Build scalable compute depending on the business case: The Lakehouse supports use cases like streaming Internet of Things (IoT) or sensor data, daily ETL workloads, and machine learning ranging from simple regression and classification models to complex deep learning with Graphics Processing Units (GPUs). Mixing and matching workloads — or spinning up separate compute for the streaming cluster vs. the machine learning cluster utilizing GPUs — provides flexibility on optimizing performance and cost.
  • Support for BI tools: Connecting BI tools, such as Looker, to analyze data allows business analysts to query data at all points of the lifecycle, using Databricks compute to do much of the heavy lifting. Using a BI tool to look at raw, transformed and finalized data gives regular business users the flexibility to work in the tools they know and love.
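To make the ACID transaction point above concrete, Delta Lake supports transactional upserts through `MERGE INTO`. This is a hedged sketch, assuming hypothetical `customers` and `customer_updates` Delta tables already exist in the workspace:

```python
# Transactional upsert into a Delta table (Databricks notebook sketch).
# Both table names are hypothetical; `spark` is provided by the runtime.
# The entire MERGE commits as a single ACID transaction, so concurrent
# readers never observe a half-applied update.
spark.sql("""
    MERGE INTO customers AS target
    USING customer_updates AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

If two writers attempt conflicting commits, Delta Lake's optimistic concurrency control rejects one rather than corrupting the table, which is exactly the guarantee plain file-based data lakes lack.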

There are other benefits to the Lakehouse concept as well, but in my experience, these are the ones that resonate most when combining the best of the data lake and data warehouse architectures.

Multicloud? No problem.

One of the highlights of this announcement is that Databricks is now offered on Amazon Web Services (AWS), Azure and GCP as a fully managed solution. The impacts of this are tremendous in a world where many organizations are opting for multicloud approaches.

Some organizations are building analytics capabilities specifically to avoid vendor lock-in: choosing technologies and platforms built on open source, or that scale across different clouds, for flexibility and agility when developing complex solutions. With Databricks offering the same experience across three of the largest cloud platforms, it just got easier for customers to migrate from one cloud to another, or to run their big data and data science workloads in the cloud of their choice.

With integrations supported across all three clouds' services, business units have the flexibility to choose where they want to work. A common example we're starting to see is data hosted in one cloud while the data science team prefers to work in another, such as GCP, processing all of their ML models in the cloud of their choosing. Migrating from an existing Databricks workspace in one cloud to a Databricks workspace on GCP is a much easier task, since the experience is the same and doesn't require re-platforming or recoding much of the existing work.

Final thoughts

There's a lot to consider with how organizations can choose to utilize Databricks, and this post barely scratches the surface.

There's a reason Databricks has gained so much momentum and become a large player in the data and AI space. The ability to work across the three major clouds provides even more flexibility and ease when it comes to solving these challenges, which is how Databricks became popular in the first place.

About the Author:


DJ Maley

Former National Cloud Data & AI Architect (2019–2021), Insight

DJ was a national cloud data & AI architect for Insight’s Digital Innovation team. He is certified in Azure, AWS and GCP, and is committed to helping guide clients through data platform modernization and transformation.