Learn about and configure the Git server proxy for Git folders, which enables you to proxy Git commands from Databricks Git folders to your private Git repositories that cannot be accessed from the internet.
What is Git server proxy for Databricks Git folders?
Databricks Git server proxy for Git folders is a feature that allows you to proxy Git commands from your Azure Databricks workspace to a private Git server. A Git server is private if it cannot be accessed on the internet.
Databricks Git folders (formerly Repos) represent your connected Git repositories as folders. The contents of these folders are version-controlled by syncing them to the connected Git repository. By default, Git folders can synchronize only with Git providers that are accessible on the internet. If you host your own private Git server (such as GitHub Enterprise Server, Bitbucket Server, or GitLab self-managed), or if your Git server is behind a firewall, you must use Git server proxy with Git folders to give Databricks access to your Git server. The Git server must also be accessible from your Azure Databricks compute plane.
How does Git Server Proxy for Databricks Git folders work?
Git server proxy for Databricks Git folders proxies Git commands from the Databricks control plane to a proxy cluster running in your Databricks workspace's compute plane. In this context, the proxy cluster is a cluster configured to run a proxy service for Git commands from Databricks Git folders to your self-hosted Git repository. Proxying does not affect the security architecture of your Databricks control plane. This proxy service receives Git commands from the Databricks control plane and forwards them to your Git server instance.
The diagram below illustrates the overall system architecture:
Important
Databricks provides an enablement notebook you can run to configure your Git server instance to proxy commands for Databricks Git folders. Get the enablement notebook on GitHub. The Databricks Git server proxy is specifically designed to work with the version of the Databricks Runtime included in the configuration notebook. Do not update the Databricks Runtime version of the proxy cluster.
How do I set up Git Server Proxy for Databricks Git folders?
This section describes how to prepare your Git server instance for the Git server proxy, create the proxy, and validate your configuration.
Before you begin
Before enabling the proxy, ensure that:
- Your workspace has the Databricks Git folders feature enabled.
- Your Git server instance is accessible from your Azure Databricks workspace's compute plane VPC, and has both HTTPS and personal access tokens (PATs) enabled.
Note
Git server proxy for Databricks works in all regions supported by your VPC.
Step 1: Prepare your Git server instance
Important
To create a compute resource and complete this task, you must be a workspace admin with access rights.
To configure your Git server instance:
Give the proxy cluster's driver node access to your Git server.
Your enterprise Git server can have an allowlist of IP addresses from which access is permitted.
- Associate a static outbound IP address for traffic that originates from your proxy cluster. You can do this by using Azure Firewall or an egress appliance.
- Add the IP address from the previous step to your Git server's allowlist.
- Set your Git server instance to allow HTTPS transport.
- For GitHub Enterprise, see Which remote URL should I use in the GitHub Enterprise help.
- For Bitbucket, go to the Bitbucket server administration page and select server settings. In the HTTP(S) SCM hosting section, enable the HTTP(S) enabled checkbox.
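Before moving on, it can help to confirm that the Git server's HTTPS port is reachable from the compute plane. The following is a minimal sketch, not part of the enablement notebook: the helper name `can_reach` is hypothetical, and it only checks that a TCP connection can be opened, not that Git or TLS works end to end.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the hostname and opens the socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example call (placeholder hostname, replace with your Git server):
# can_reach("git.company.com", 443)
```

Run it from a notebook attached to compute in the same network as the proxy cluster. A `False` result usually points at the allowlist, firewall rules, or DNS rather than at the proxy itself.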
Step 2: Run the enablement notebook
To enable the proxy:
Log into your Azure Databricks workspace as a workspace admin with access rights to create a cluster.
Import this notebook, which chooses the smallest instance type available from your cloud provider to run the Git proxy.
Click Run All to run the notebook, which performs the following tasks:
- Creates a single node compute resource named “Databricks Git Proxy”, which does not auto-terminate. This is the Git proxy service that will process and forward Git commands from your Azure Databricks workspace to your private Git server.
- Enables a feature flag that controls whether Git requests in Databricks Git folders are proxied via the compute instance.
As a best practice, consider creating a simple job that runs on the Git proxy compute resource. This can be a simple notebook that prints or logs status, such as “The Git proxy service is running.” Schedule the job to run at regular intervals to ensure the Git proxy service is always available for your users.
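One way to set up such a job is through the Databricks Jobs API. The sketch below only builds a request body in the shape expected by the Jobs API 2.1 `jobs/create` endpoint. The function name, job name, task key, cluster ID, notebook path, and the 15-minute schedule are all illustrative assumptions, so check the payload against the Jobs API reference before using it.

```python
import json


def build_keepalive_job(cluster_id: str, notebook_path: str,
                        cron: str = "0 0/15 * * * ?",
                        timezone: str = "UTC") -> dict:
    """Build a jobs/create request body for a scheduled status notebook."""
    return {
        "name": "Git proxy keep-alive",  # hypothetical job name
        "schedule": {
            # Quartz cron syntax; the default fires every 15 minutes.
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
        },
        "tasks": [
            {
                "task_key": "check_git_proxy",  # hypothetical task key
                # Run on the existing proxy cluster instead of a new job cluster.
                "existing_cluster_id": cluster_id,
                "notebook_task": {"notebook_path": notebook_path},
            }
        ],
    }


# Placeholder values, replace with your proxy cluster ID and notebook path.
payload = build_keepalive_job("<proxy-cluster-id>", "/Shared/git-proxy-status")
print(json.dumps(payload, indent=2))
```

The notebook at `notebook_path` can be a single cell such as `print("The Git proxy service is running.")`.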
Note
Running an additional long-running compute resource to host the proxy software incurs extra DBUs. To minimize costs, the notebook configures the proxy to use a single-node compute resource with an inexpensive node type. However, you might want to modify the compute options to suit your needs. For more information on compute instance pricing, see the Databricks pricing calculator.
Step 3: Validate your Git server configuration
To validate your Git server configuration, try to clone a repository hosted on your private Git server via the proxy cluster. A successful clone means that you have successfully enabled the Git server proxy for your workspace.
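If you prefer to validate from a script rather than the UI, a test clone can be requested through the Repos API (`POST /api/2.0/repos`). This is a hedged sketch that only prepares the request body. The helper name, repository URL, and workspace path are placeholders, and the provider value shown is one of the identifiers the Repos API accepts for self-hosted servers.

```python
from urllib.parse import urlparse


def build_clone_request(repo_url: str, provider: str, workspace_path: str) -> dict:
    """Build a Repos API create-repo body for a test clone through the proxy."""
    # Git folders support HTTPS transport only, so reject other schemes early.
    if urlparse(repo_url).scheme != "https":
        raise ValueError(f"Expected an https:// repository URL, got: {repo_url}")
    return {"url": repo_url, "provider": provider, "path": workspace_path}


# Placeholder values for illustration only.
body = build_clone_request(
    "https://git.company.com/org/repo-name.git",
    "gitHubEnterprise",
    "/Repos/someone@example.com/repo-name",
)
print(body)
```

Sending this body with a workspace access token should produce the same result as cloning in the UI: a success response indicates that the request reached your Git server through the proxy.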
Step 4: Create proxy-enabled Git repositories
After users configure their Git credentials, no further steps are required to create or synchronize your repos. To configure credentials and access the repositories for your Git folders programmatically, see Configure Git credentials & connect a remote repo to Azure Databricks.
Remove global CAN_ATTACH_TO permissions
A Git server proxy does not require CAN_ATTACH_TO permission for any user. To prevent users from running arbitrary workloads on the proxy cluster and causing reliability issues, admins should restrict cluster ACL permissions on the proxy server:
Select Compute from the sidebar, and then click the kebab menu next to the Compute entry for the Git Server Proxy you're running:
From the dialog, remove the Can Attach To entry for All Users:
Troubleshooting
Did you encounter an error while configuring Git server proxy for Databricks Git folders? Here are some common issues and ways to diagnose them more effectively.
Checklist for common problems
Before you start diagnosing an error, confirm that you've completed the following steps:
- Confirm that your proxy cluster is running with this Git proxy server debug notebook.
- Confirm that you are a workspace administrator.
- Run the rest of the debug notebook and capture the results. If you are unable to debug the issue, or do not see any failures reported from the debug notebook, Databricks support can review the results. You can export and send the debug notebook as a DBC archive, if requested.
Change your Git proxy configuration
If your Git proxy service is not working with the default configuration, you can set specific environment variables to better support your network infrastructure.
Use the following environment variables to update the configuration for your Git proxy service:
Environment variable | Format | Description |
---|---|---|
GIT_PROXY_ENABLE_SSL_VERIFICATION | true/false | Set this to false if you are using a self-signed certificate for your private Git server. |
GIT_PROXY_CA_CERT_PATH | File path (string) | Set this to the path to a CA certificate file used for SSL verification. Example: /FileStore/myCA.pem |
GIT_PROXY_HTTP_PROXY | https://<hostname>:<port #> | Set this to the HTTPS URL for your network's firewall proxy for HTTP traffic. |
GIT_PROXY_CUSTOM_HTTP_PORT | Port number (integer) | Set this to the port number assigned to your Git server's HTTP port. |
To set these environment variables, go to the Compute tab in your Azure Databricks workspace and select the compute configuration for your Git proxy service. At the bottom of the Configuration pane, expand Advanced and select the Spark tab under it. Set one or more of these environment variables by adding them to the Environment variables text area.
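The Environment variables text area takes one `NAME=value` pair per line. As an illustration, the sketch below renders a settings dictionary into that form and rejects names other than the four variables described in this section. The helper `render_env` and the example values are assumptions, not part of the product.

```python
KNOWN_VARS = {
    "GIT_PROXY_ENABLE_SSL_VERIFICATION",
    "GIT_PROXY_CA_CERT_PATH",
    "GIT_PROXY_HTTP_PROXY",
    "GIT_PROXY_CUSTOM_HTTP_PORT",
}


def render_env(settings: dict) -> str:
    """Render settings as NAME=value lines for the Environment variables box."""
    lines = []
    for name, value in settings.items():
        if name not in KNOWN_VARS:
            raise ValueError(f"Unknown Git proxy variable: {name}")
        # Booleans are written in lowercase, matching the true/false format.
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{name}={value}")
    return "\n".join(lines)


# Example values only: a self-hosted server on a non-default port with a custom CA.
print(render_env({
    "GIT_PROXY_CA_CERT_PATH": "/FileStore/myCA.pem",
    "GIT_PROXY_CUSTOM_HTTP_PORT": 8443,
}))
```

Paste the printed lines into the text area, then restart the proxy cluster so the service picks up the new values.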
Inspect logs on the proxy cluster
The file at /databricks/git-proxy/git-proxy.log on the proxy cluster contains logs that are useful for debugging purposes.
The log file should start with the line Data-plane proxy server binding to ('', 8000)…. If it does not, this means that the proxy server did not start properly. Try restarting the cluster, or delete the cluster you created and run the enablement notebook again.
If the log file starts with this line, review the log statements that follow that line for each Git request initiated by a Git operation in Databricks Git folders.
For example:
do_GET: https://server-address/path/to/repo/info/refs?service=git-upload-pack 10.139.0.25 - - [09/Jun/2021 06:53:02] /
"GET /server-address/path/to/repo/info/refs?service=git-upload-pack HTTP/1.1" 200
Error logs written to this file can be useful to help you or Databricks Support debug issues.
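To scan a long log quickly, a small script can check the startup line and list the HTTP status of each proxied request. This is a rough sketch that assumes request lines follow the format in the example above. The function name and the returned dictionary layout are hypothetical, and real log files may contain other line shapes that this pattern does not match.

```python
import re

# Only the stable prefix of the startup line is compared.
STARTUP_PREFIX = "Data-plane proxy server binding to"

# Matches fragments such as: "GET /server-address/path HTTP/1.1" 200
REQUEST_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/\d\.\d" (?P<status>\d{3})'
)


def summarize_log(text: str) -> dict:
    """Report whether the proxy started and which requests returned errors."""
    lines = text.splitlines()
    started = bool(lines) and lines[0].startswith(STARTUP_PREFIX)
    requests = [
        (m["method"], m["path"], int(m["status"]))
        for m in REQUEST_RE.finditer(text)
    ]
    # Treat any 4xx or 5xx status as a failure worth inspecting.
    failures = [r for r in requests if r[2] >= 400]
    return {"started": started, "requests": requests, "failures": failures}


# Usage on the proxy cluster's driver node:
# with open("/databricks/git-proxy/git-proxy.log") as f:
#     print(summarize_log(f.read()))
```

An empty `failures` list with `started` set to `True` suggests the proxy itself is healthy, so the problem is more likely in credentials or in the Git server configuration.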
Common error messages and their resolution
Secure connection could not be established because of SSL problems
You might see the following error:
https://git.consult-prodigy.com/Prodigy/databricks_test: Secure connection to https://git.consult-prodigy.com/Prodigy/databricks_test could not be established because of SSL problems
Often, this means that you are using a repository that requires special SSL certificates. Check the content of the /databricks/git-proxy/git-proxy.log file on the proxy cluster. If it says that certificate validation failed, then you must add the certificate authority (CA) certificate to the system certificate chain. First, extract the root certificate (using the browser or another method) and upload it to DBFS. Then, edit the Git proxy cluster to use the GIT_PROXY_CA_CERT_PATH environment variable to point to the root certificate file. For more information about editing cluster environment variables, see Environment variables. After you have completed that step, restart the cluster.
Frequently asked questions
How do I find out if the Git proxy server is running in my workspace?
Import and run the Git proxy debug notebook. The results of the notebook run show if there are issues with the Git proxy service.
Can multiple workspaces share one proxy cluster? Can a workspace have multiple proxy clusters?
You need one proxy cluster per Azure Databricks workspace and cannot share one across multiple workspaces. You can only have one Git proxy server cluster per workspace.
Could the proxy cluster route only parts of Git traffic?
All Databricks Git folders-related Git traffic is routed through the proxy cluster, even for public Git repositories. Your Azure Databricks workspace does not differentiate between proxied and non-proxied repositories.
Does the Git proxy feature work with other Git enterprise server providers?
Databricks Git folders support GitHub Enterprise, Bitbucket Server, Azure DevOps Server, and GitLab self-managed. Other enterprise Git server providers should work as well if they conform to common Git specifications.
Do Databricks Git folders support GPG signing of commits?
No.
Do Databricks Git folders support SSH transport for Git operations?
No. Only HTTPS is supported.
Is the use of a non-default HTTPS port on the Git server supported?
Currently, the enablement notebook assumes that your Git server uses the default HTTPS port 443. You can set the environment variable GIT_PROXY_CUSTOM_HTTP_PORT to override the port value with a preferred one.
Can Databricks hide Git server URLs that are proxied? Could users enter the original Git server URLs rather than proxied URLs?
Yes to both questions. Users do not need to adjust their behavior for the proxy. With the current proxy implementation, all Git traffic for Databricks Git folders is routed through the proxy. Users enter the normal Git repository URL, such as https://git.company.com/org/repo-name.git.
Does the feature transparently proxy authentication data to the Git server?
Yes, the proxy uses the user account's Git credential to authenticate to the Git server. Access is restricted by the permissions specified in the user's Git credential.