Understanding Azure Databricks and resource groups

Azure Databricks, an analytics platform built on the open-source Apache Spark engine and optimized for the Microsoft Azure cloud, is a highly effective tool. However, it automatically creates resource groups and workspaces and protects them with a system-level lock, all of which can be confusing and frustrating unless you understand how and why.

The Databricks platform provides an interactive workspace that streamlines collaboration between data scientists, data engineers and business analysts. The Spark analytics engine supports machine learning and large-scale distributed data processing, combining many aspects of big data analysis all in one process.

Spark processes large volumes of data in either batch mode (data at rest) or streaming mode (live data). The live processing capability is how Databricks/Spark differs from Hadoop, whose MapReduce engine processes only batch data.

Resource groups are key to managing the resources bound to Databricks. Typically, you specify the resource group in which your resources are created. This changes slightly when you create an Azure Databricks service instance and specify a new or existing resource group. Say, for example, we are creating a new resource group. Azure will create the group and place a workspace within it. That workspace is an instance of the Azure Databricks service.

Along with the resource group you specified, Azure will also create a second resource group. This is called a “managed resource group,” and its name starts with the word “databricks.” This Azure-managed group of resources allows Azure to provide Databricks as a managed service. Initially this managed resource group will contain only a few workspace resources (a virtual network, a network security group and a storage account). Later, when you create a cluster, the associated resources for that cluster will be created in this managed resource group.

The “databricks-xxx” resource group is locked when it is created, since the resources in this group provide the Databricks service to the user. You cannot directly delete the locked group, nor can you directly delete the system-owned lock on it. The only option is to delete the service, which in turn deletes the infrastructure lock.

With respect to Azure tagging, the lock placed on the Databricks managed resource group prevents you from adding custom tags, deleting any of the resources, or performing any write operation on a resource in the managed resource group.

Example Deployment

Let’s take a look at what happens to resources and resource groups when you create an instance of the Azure Databricks service:

Steps

  1. Create an instance of the Azure Databricks service
  2. Specify the name of the workspace (here we used nwoekcmdbworkspace)
  3. Specify to create a new resource group (here we used nwoekcmdbrg) or choose an existing one
  4. Hit Create

Results

  1. Creates nwoekcmdbrg resource group
  2. Automatically creates nwoekcmdbworkspace, which is the Azure Databricks Service. This is contained within the nwoekcmdbrg resource group.
  3. Automatically creates databricks-rg-nwoekcmdbworkspace-c3krtklkhw7km resource group. This contains a single storage account, a network security group and a virtual network.
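
The managed resource group name in the results above appears to follow the shape databricks-rg-<workspace name>-<random suffix>. The following is a minimal Python sketch that recognizes such names and pulls out the workspace name. The pattern is inferred only from the single example in this article and is an assumption, not a documented naming guarantee; the function name parse_managed_resource_group is hypothetical.

```python
import re
from typing import Optional

# Assumption: the pattern is inferred from the one example name in this article
# ("databricks-rg-<workspace>-<suffix>"); Azure may use other shapes.
MANAGED_RG_PATTERN = re.compile(
    r"^databricks-rg-(?P<workspace>.+)-(?P<suffix>[a-z0-9]+)$"
)


def parse_managed_resource_group(name: str) -> Optional[dict]:
    """Return the workspace name and suffix if `name` looks like a
    Databricks managed resource group, otherwise None."""
    match = MANAGED_RG_PATTERN.match(name)
    if match is None:
        return None
    return {
        "workspace": match.group("workspace"),
        "suffix": match.group("suffix"),
    }


if __name__ == "__main__":
    # The managed group from the example deployment.
    print(parse_managed_resource_group(
        "databricks-rg-nwoekcmdbworkspace-c3krtklkhw7km"
    ))
    # The user-specified group does not match the pattern.
    print(parse_managed_resource_group("nwoekcmdbrg"))
```

A helper like this could be useful when scanning a subscription to tell user-created groups apart from the ones Azure created on your behalf.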

Click on the workspace (Azure Databricks service), and it brings up the workspace with a “Launch Workspace” button.

Launching the workspace uses Azure Active Directory (AAD) to sign you into the Azure Databricks service. This is where you can create a Databricks cluster, run queries, import data, create a table, or create a notebook to start querying, visualizing and modifying your data. Here, we create a new cluster to demonstrate where the resources for the service land.

After the cluster was created, a number of new resources appeared in the Azure Databricks managed resource group databricks-rg-nwoekcmdbworkspace-c3krtklkhw7km. Instead of merely containing a single VNet, NSG and storage account as it did initially, it now contains multiple VMs, disks, network interfaces, and public IP addresses.
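
To make the before-and-after difference concrete, here is a small Python sketch that compares two inventories of the managed resource group and reports which resource types were added. The resource type strings are standard Azure type names, but the two inventories and their counts are made-up sample data for illustration, not the actual contents of the deployment described here. The helper names summarize_by_type and added_types are hypothetical.

```python
from collections import Counter


def summarize_by_type(resource_types):
    """Count how many resources of each Azure resource type are present."""
    return Counter(resource_types)


def added_types(before, after):
    """Return {resource_type: count} for resources present in `after`
    beyond what was already in `before`."""
    # Counter subtraction keeps only positive differences.
    return dict(summarize_by_type(after) - summarize_by_type(before))


# Hypothetical inventory right after the workspace is created.
before_cluster = [
    "Microsoft.Network/virtualNetworks",
    "Microsoft.Network/networkSecurityGroups",
    "Microsoft.Storage/storageAccounts",
]

# Hypothetical inventory after a small cluster is created
# (the counts are illustrative only).
after_cluster = before_cluster + [
    "Microsoft.Compute/virtualMachines",
    "Microsoft.Compute/virtualMachines",
    "Microsoft.Compute/disks",
    "Microsoft.Compute/disks",
    "Microsoft.Network/networkInterfaces",
    "Microsoft.Network/networkInterfaces",
    "Microsoft.Network/publicIPAddresses",
    "Microsoft.Network/publicIPAddresses",
]

if __name__ == "__main__":
    for resource_type, count in sorted(added_types(before_cluster, after_cluster).items()):
        print(f"{resource_type}: +{count}")
```

In practice the two lists would come from whatever tool you use to list the resources in the group; only the comparison logic is shown here.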

The workspace nwoekcmdbworkspace and the original resource group nwoekcmdbrg both remain unchanged, as all changes are made in the managed resource group databricks-rg-nwoekcmdbworkspace-c3krtklkhw7km. If you click on “Locks” for the managed resource group, you can see that a read-only lock has been placed on it to prevent deletion. Clicking the “Delete” button for the lock yields an error saying the lock could not be deleted. If you change the tags on the original resource group, the changes will be reflected in the “databricks-xxx” resource group. But you cannot change tag values in the databricks-xxx resource group directly.
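
The tag behavior just described can be pictured with a toy Python model. This is only an in-memory illustration of the rules as stated in this article (direct writes to the managed group fail, while tag changes on the parent group flow down); it does not call Azure, and every class and method name in it is hypothetical.

```python
class LockedResourceGroupError(Exception):
    """Raised when a write is attempted on the locked managed group."""


class ManagedResourceGroup:
    """Toy stand-in for the locked "databricks-xxx" resource group."""

    def __init__(self, name):
        self.name = name
        self._tags = {}

    @property
    def tags(self):
        # Return a copy so callers cannot modify tags behind the lock.
        return dict(self._tags)

    def set_tag(self, key, value):
        # Direct writes are rejected, mirroring the system-level lock.
        raise LockedResourceGroupError(
            f"{self.name} is locked; change tags on the parent group instead"
        )

    def _sync_from_parent(self, parent_tags):
        # Only the parent group is allowed to push tags down.
        self._tags = dict(parent_tags)


class ParentResourceGroup:
    """Toy stand-in for the resource group you specified at creation."""

    def __init__(self, name, managed):
        self.name = name
        self.tags = {}
        self.managed = managed

    def set_tag(self, key, value):
        self.tags[key] = value
        self.managed._sync_from_parent(self.tags)


if __name__ == "__main__":
    managed = ManagedResourceGroup("databricks-rg-nwoekcmdbworkspace-c3krtklkhw7km")
    parent = ParentResourceGroup("nwoekcmdbrg", managed)

    parent.set_tag("environment", "dev")  # allowed, and flows down
    print(managed.tags)

    try:
        managed.set_tag("environment", "prod")  # rejected by the lock
    except LockedResourceGroupError as err:
        print(err)
```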

Summary

When using Azure Databricks, it can be confusing when a new workspace and managed resource group just appear. Azure automatically creates a Databricks workspace, as well as a managed resource group containing all the resources needed to run the cluster. The managed resource group is protected by a system-level lock to prevent deletions and modifications. The only way to remove the lock is to delete the service. This can be a tremendous limitation if changes need to be made to tags in the managed resource group. However, tag changes made to the parent resource group are propagated to the managed resource group.


Mike McKeown is an Azure solutions architect for DXC Technology and is a member of the DXC Azure Center of Excellence team. Mike spent 20+ years with Microsoft and has been working with Azure since 2011. He has published a book on Azure Automation (MS Press), developed four Azure courses with Pluralsight, written a number of whitepapers and articles for MSDN, and spoken at many conferences about Azure. Connect with Mike on LinkedIn.
