Managing connections efficiently in Dataflow is essential for building robust and secure data workflows. Connections allow you to link your workspace to external databases and services seamlessly, ensuring your pipelines, notebooks, and applications can interact with data sources without hardcoding credentials.
What are Connections?
A connection is a secure configuration that stores authentication details for external databases or services. For example, you can create connections for:
PostgreSQL
MySQL
HTTP
SMTP
SQLite
IMAP
AWS S3
Python Package Index
and more
Once created, these connections can be used across your notebooks, pipelines, and apps, ensuring consistency and security in your workflows.
Creating a Connection
Follow these steps to create a new connection:
Navigate to the Connections section from the Studio sidebar.
Click “New Connection”.
Select the database type from the options (PostgreSQL, MySQL, etc.).
Fill in the required fields:
Connection Name (unique) - this becomes the conn_id you reference in code
Database Type
Host
Port
Username (Login)
Password
Extra settings (if applicable, such as SSL options)
Click “Save” to create the connection.
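The Extra field accepts provider-specific options as JSON. As an illustration, a PostgreSQL connection that must use SSL could set an Extra value like the following (the exact keys depend on the database type; sslmode is a common PostgreSQL option):

```json
{
  "sslmode": "require"
}
```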
Updating a Connection
To modify an existing connection:
Go to the Connections page.
Click Edit next to the desired connection.
Update any fields as required.
Click “Save” to apply the changes.
Note: Updating a connection does not interrupt running workflows unless the credentials change.
Deleting a Connection
To remove a connection:
Click “Delete” next to the connection you wish to remove.
Confirm deletion in the dialog prompt.
Caution: Deleting a connection will cause any workflows using it to fail until a replacement connection with the same name is configured.
Best Practices for Connections
Use unique and descriptive names for each connection to avoid confusion in large projects.
Store sensitive credentials only within connections to ensure security and avoid hardcoding in scripts.
When deleting connections, verify which workflows depend on them to prevent unexpected failures.
Using Connections in Python Scripts
Connections created in Dataflow can be accessed in your Python code using Airflow hooks. Each connection is identified by the conn_id you specified when creating it.
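Before using a provider-specific hook, you can fetch a connection's stored details generically with Airflow's BaseHook. This is a minimal sketch; the connection name "my_postgres_conn" is a placeholder for whatever conn_id you chose in Studio:

```python
from airflow.hooks.base import BaseHook

# Look up the connection by the conn_id chosen when it was created.
# "my_postgres_conn" is an illustrative name, not a built-in.
conn = BaseHook.get_connection("my_postgres_conn")

# Host, port, and login come from the fields filled in when the
# connection was saved; the password is available as conn.password.
print(conn.host, conn.port, conn.login)
```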
Example: AWS S3 Connection
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def list_buckets():
    hook = S3Hook(aws_conn_id="my_aws_conn")
    client = hook.get_conn()
    response = client.list_buckets()
    print([b["Name"] for b in response["Buckets"]])

list_buckets()
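Example: PostgreSQL Connection

A PostgreSQL connection works the same way through PostgresHook. This is a minimal sketch assuming a connection named "my_postgres_conn" and a users table, both of which are illustrative:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

def fetch_users():
    # postgres_conn_id is the Connection Name (conn_id) from Studio;
    # "my_postgres_conn" and the users table are placeholders.
    hook = PostgresHook(postgres_conn_id="my_postgres_conn")
    rows = hook.get_records("SELECT user_id, username FROM users LIMIT 10;")
    for row in rows:
        print(row)

fetch_users()
```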
Using Connections in Jupyter Notebooks with JupySQL
Dataflow supports JupySQL magic commands in Jupyter notebooks, allowing you to execute SQL queries directly using your configured connections with a simple and intuitive syntax.
Basic Usage
# Connect to your database using the connection ID
%sql conn_id
# Execute SQL queries directly
%sql SELECT * FROM users LIMIT 10;
# Multi-line queries
%%sql
SELECT
user_id,
username,
created_at
FROM users
WHERE status = 'active'
ORDER BY created_at DESC;
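Query results can also be captured into a Python variable for further processing. The << syntax below is standard JupySQL; the table and column names are illustrative:

```
# Store the result set in a variable
%sql result << SELECT user_id, username FROM users WHERE status = 'active';

# Convert the result set to a pandas DataFrame
df = result.DataFrame()
```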
Important Notes
Kernel Restart Required: If you add a new connection in Dataflow Studio while a notebook kernel is already running, you need to restart the kernel to use the new connection with %sql conn_id.
Alternative Without Restart: To avoid restarting the kernel, you can use the --section parameter:
%sql --section conn_id
This will work immediately without requiring a kernel restart.