This article describes the basic configuration for Lakeflow Declarative Pipelines using the workspace UI.
Databricks recommends developing new pipelines using serverless. For configuration instructions for serverless pipelines, see Configure a serverless pipeline.
The configuration instructions in this article use Unity Catalog. For instructions for configuring pipelines with legacy Hive metastore, see Use Lakeflow Declarative Pipelines with legacy Hive metastore.
This article discusses functionality for the current default publishing mode for pipelines. Pipelines created before February 5, 2025, might use the legacy publishing mode and LIVE virtual schema. See LIVE schema (legacy).
Note
The UI has an option to display and edit settings in JSON. You can configure most settings with either the UI or a JSON specification. Some advanced options are only available using the JSON configuration.
JSON configuration files are also helpful when deploying pipelines to new environments or using the CLI or REST API.
For a complete reference to the Lakeflow Declarative Pipelines JSON configuration settings, see Lakeflow Declarative Pipelines configurations.
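As an illustrative sketch, a minimal JSON specification for a Unity Catalog pipeline might look like the following. The name, catalog, schema, and notebook path are placeholders; see the configuration reference for the full field list:

```json
{
  "name": "example-pipeline",
  "catalog": "main",
  "target": "example_schema",
  "continuous": false,
  "channel": "CURRENT",
  "libraries": [
    { "notebook": { "path": "/Users/<user-name>@databricks.com/example-notebook" } }
  ]
}
```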
Configure a new pipeline
To configure a new pipeline, do the following:
- In your workspace, click Jobs & Pipelines in the sidebar.
- Under New, click ETL Pipeline.
- Provide a unique Pipeline name.
- (Optional) Use the file picker to configure notebooks and workspace files as Source code.
- If you don't add any source code, a new notebook is created for the pipeline in a new directory in your user directory. After you create the pipeline, a link to this notebook appears under the Source code field in the Pipeline details pane.
- Use the Add source code button to add additional source code assets.
- Select Unity Catalog under Storage options.
- Select a Catalog. This setting controls the default catalog and the storage location for pipeline metadata.
- Select a Schema in the catalog. By default, streaming tables and materialized views defined in the pipeline are created in this schema.
- In the Compute section, check the box next to Use Photon Acceleration. For additional compute configuration considerations, see Compute configuration options.
- Click Create.
These recommended configurations create a new pipeline configured to run in Triggered mode and use the Current channel. This configuration is recommended for many use cases, including development and testing, and is well-suited to production workloads that should run on a schedule. For details on scheduling pipelines, see Pipeline task for jobs.
Compute configuration options
Databricks recommends always using Enhanced autoscaling. Default values for other compute configurations work well for many pipelines.
Serverless pipelines remove compute configuration options. For configuration instructions for serverless pipelines, see Configure a serverless pipeline.
Use the following settings to customize compute configurations:
- Workspace admins can configure a Cluster policy. Compute policies allow admins to control what compute options are available to users. See Select a compute policy.
- You can optionally configure Cluster mode to run with Fixed size or Legacy autoscaling. See Optimize the cluster utilization of Lakeflow Declarative Pipelines with Autoscaling.
- For workloads with autoscaling enabled, set Min workers and Max workers to set limits for scaling behaviors. See Configure classic compute for Lakeflow Declarative Pipelines.
- You can optionally turn off Photon acceleration. See What is Photon?.
- Use Cluster tags to help monitor costs associated with Lakeflow Declarative Pipelines. See Configure compute tags.
- Configure Instance types to specify the type of virtual machines used to run your pipeline. See Select instance types to run a pipeline.
- Select a Worker type optimized for the workloads configured in your pipeline.
- You can optionally select a Driver type that differs from your worker type. This can be useful for reducing costs in pipelines with large worker types and low driver compute utilization or for choosing a larger driver type to avoid out-of-memory issues in workloads with many small workers.
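The compute options above map to a clusters entry in the JSON specification. The following fragment is a sketch: the node type names and tag values are illustrative, and available instance types vary by cloud and region:

```json
{
  "clusters": [
    {
      "label": "default",
      "node_type_id": "Standard_D4ds_v5",
      "driver_node_type_id": "Standard_D8ds_v5",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      },
      "custom_tags": { "cost-center": "data-eng" }
    }
  ]
}
```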
Set the run-as user
Run-as user lets you change the identity that a pipeline runs as and the ownership of the tables it creates or updates. This is useful when the user who created the pipeline has been deactivated, for example, after leaving the company. In that situation, the pipeline can stop working, and the tables it published can become inaccessible to others. By updating the pipeline to run as a different identity, such as a service principal, and reassigning ownership of the published tables, you can restore access and keep the pipeline running. Running pipelines as service principals is a best practice because service principals are not tied to individual users, which makes them more secure, stable, and reliable for automated workloads.
Required permissions
For the user making the change:
- CAN_MANAGE permissions on the pipeline
- CAN_USE role on the service principal (if setting run-as to a service principal)
For the run-as user or service principal:
Workspace Access:
- Workspace access permission to operate within the workspace
- Can use permission on cluster policies used by the pipeline
- Compute creation permission in the workspace
Source Code Access:
- Can read permission on all notebooks included in the pipeline source code
- Can read permission on workspace files if the pipeline uses them
Unity Catalog Permissions (for pipelines using Unity Catalog):
- USE CATALOG on the target catalog
- USE SCHEMA and CREATE TABLE on the target schema
- MODIFY permission on existing tables that the pipeline updates
- CREATE SCHEMA permission if the pipeline creates new schemas
Legacy Hive metastore Permissions (for pipelines using Hive metastore):
- SELECT and MODIFY permissions on target databases and tables
Additional Cloud Storage Access (if applicable):
- Permissions to read from source storage locations
- Permissions to write to target storage locations
How to set the run-as user
- In the pipeline details page, click Edit next to Run as.
- In the edit widget, select one of the following options:
- Your own user account
- A service principal for which you have CAN_USE permission
- Click Save to apply the changes.
When you successfully update the run-as user:
- The pipeline identity changes to use the new user or service principal for all future runs
- In Unity Catalog pipelines, the owner of tables published by the pipeline is updated to match the new run-as identity
- Future pipeline updates will use the permissions and credentials of the new run-as identity
- Continuous pipelines automatically restart with the new identity. Triggered pipelines do not automatically restart, and the run-as change can interrupt an active update
Note
If the update of run-as fails, you receive an error message explaining the reason for the failure. Common issues include insufficient permissions on the service principal.
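If you manage pipelines through the JSON specification or the API rather than the UI, the run-as identity can be expressed as a run_as object. The fragment below is a sketch; the application ID is a placeholder:

```json
{
  "run_as": {
    "service_principal_name": "<service-principal-application-id>"
  }
}
```

To run the pipeline as a user account instead, specify a user_name key in place of service_principal_name.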
Other configuration considerations
The following configuration options are also available for pipelines:
- The Advanced product edition gives you access to all Lakeflow Declarative Pipelines features. You can optionally run pipelines using the Pro or Core product editions. See Choose a product edition.
- You might choose to use the Continuous pipeline mode when running pipelines in production. See Triggered vs. continuous pipeline mode.
- If your workspace is not configured for Unity Catalog or your workload needs to use legacy Hive metastore, see Use Lakeflow Declarative Pipelines with legacy Hive metastore.
- Add Notifications for email updates based on success or failure conditions. See Add email notifications for pipeline events.
- Use the Configuration field to set key-value pairs for the pipeline. These configurations serve two purposes:
- Set arbitrary parameters you can reference in your source code. See Use parameters with Lakeflow Declarative Pipelines.
- Configure pipeline settings and Spark configurations. See Lakeflow Declarative Pipelines properties reference.
- Configure Tags. Tags are key-value pairs for the pipeline that are visible in the Workflows list. Pipeline tags are not associated with billing.
- Use the Preview channel to test your pipeline against pending Lakeflow Declarative Pipelines runtime changes and trial new features.
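Several of these options also appear in the JSON specification. The following fragment is illustrative only: the configuration keys, storage path, email address, and alert values are examples, and the supported settings are listed in the properties reference:

```json
{
  "configuration": {
    "mypipeline.source_path": "abfss://<container>@<account>.dfs.core.windows.net/data"
  },
  "notifications": [
    {
      "email_recipients": ["<user-name>@databricks.com"],
      "alerts": ["on-update-failure", "on-update-fatal-failure"]
    }
  ],
  "channel": "PREVIEW"
}
```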
Choose a product edition
Select the Lakeflow Declarative Pipelines product edition with the best features for your pipeline requirements. The following product editions are available:
- Core to run streaming ingest workloads. Select the Core edition if your pipeline doesn't require advanced features such as change data capture (CDC) or Lakeflow Declarative Pipelines expectations.
- Pro to run streaming ingest and CDC workloads. The Pro product edition supports all of the Core features, plus support for workloads that require updating tables based on changes in source data.
- Advanced to run streaming ingest workloads, CDC workloads, and workloads that require expectations. The Advanced product edition supports the features of the Core and Pro editions and includes data quality constraints with Lakeflow Declarative Pipelines expectations.
You can select the product edition when you create or edit a pipeline. You can choose a different edition for each pipeline. See the Lakeflow Declarative Pipelines product page.
Note: If your pipeline includes features not supported by the selected product edition, such as expectations, you receive an error message explaining the mismatch. You can then edit the pipeline to select an edition that supports those features.
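In the JSON specification, the product edition is set with the edition field; for example:

```json
{
  "edition": "ADVANCED"
}
```

The Core and Pro editions are selected the same way, by supplying the corresponding edition value.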
Configure source code
You can use the file selector in the Lakeflow Declarative Pipelines UI to configure the source code defining your pipeline. Pipeline source code is defined in Databricks notebooks or in SQL or Python scripts stored in workspace files. When you create or edit your pipeline, you can add one or more notebooks, workspace files, or a combination of the two.
Because Lakeflow Declarative Pipelines automatically analyzes dataset dependencies to construct the processing graph for your pipeline, you can add source code assets in any order.
You can modify the JSON file to include Lakeflow Declarative Pipelines source code defined in SQL and Python scripts stored in workspace files. The following example includes notebooks and workspace files:
{
"name": "Example pipeline 3",
"storage": "dbfs:/pipeline-examples/storage-location/example3",
"libraries": [
{ "notebook": { "path": "/example-notebook_1" } },
{ "notebook": { "path": "/example-notebook_2" } },
{ "file": { "path": "/Workspace/Users/<user-name>@databricks.com/Apply_Changes_Into/apply_changes_into.sql" } },
{ "file": { "path": "/Workspace/Users/<user-name>@databricks.com/Apply_Changes_Into/apply_changes_into.py" } }
]
}
Manage external dependencies for pipelines that use Python
Lakeflow Declarative Pipelines supports using external dependencies in your pipelines, such as Python packages and libraries. To learn about options and recommendations for using dependencies, see Manage Python dependencies for Lakeflow Declarative Pipelines.
Use Python modules stored in your Azure Databricks workspace
In addition to implementing your Python code in Databricks notebooks, you can use Databricks Git Folders or workspace files to store your code as Python modules. Storing your code as Python modules is especially useful when you have common functionality you want to use in multiple pipelines or notebooks in the same pipeline. To learn how to use Python modules with your pipelines, see Import Python modules from Git folders or workspace files.