Chris Chiosa on 06 Dec 2024 16:39:55
Issue:
Currently, the Copy Data activity inside a Fabric data pipeline writes null data to the destination when the source data connection fails.
Issue Example:
I have a Copy Data activity which copies data from my on-prem SQL Server to a delta table in my Fabric lakehouse.
The Copy Data activity uses the overwrite option on the sink side.
If the on-prem SQL Server cannot be reached while the copy runs, the sink delta table is truncated ('overwritten' with null data).
Solution:
If the Copy Data activity fails to connect to the source, the activity is aborted and any connection to the sink is closed without performing any writes.
Solution Example:
I have a Copy Data activity which copies data from my on-prem SQL Server to a delta table in my Fabric lakehouse.
The Copy Data activity uses the overwrite option on the sink side.
If the on-prem SQL Server cannot be reached while the copy runs, the sink delta table is left unchanged; no data is written or updated.
Why Should This Be Done:
I do not see a use case where you would ever want to default to writing or overwriting data in a sink when the source connection has failed.
If the connection fails and you want to perform additional actions, you should use the success/failure pipeline routes on the Copy Data activity that failed.
The current state means I need to be proactive to protect my data: I have to validate that the source is reachable in a separate pipeline activity before the Copy Data activity, which adds complexity.
The proposed state means I do not need to be proactive to protect my data (if a connection fails, the state of my data is preserved). If I want to take action on connection failures, I can use the existing activity-failure status to drive a pipeline flow.
The current state is unintuitive, a poor user experience, and defies the industry best practice of data preservation.
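To illustrate the proposed behavior, here is a minimal sketch in Python. The function and variable names (safe_overwrite_copy, read_source, write_sink) are hypothetical stand-ins, not actual Fabric or pipeline APIs; the point is simply that the sink is only overwritten after the source read has succeeded, so a source connection failure leaves existing data intact.

```python
def safe_overwrite_copy(read_source, write_sink):
    """Overwrite the sink only after the source read succeeds.

    If read_source raises (e.g. the source is unreachable), the
    exception propagates and write_sink is never called, so the
    sink's existing data is preserved.
    """
    rows = read_source()   # may raise on connection failure
    write_sink(rows)       # overwrite happens only on success
    return len(rows)


# Demo: the sink stands in for the delta table's current contents.
sink = ["existing-row"]

def failing_source():
    raise ConnectionError("on-prem SQL Server unreachable")

def overwrite_sink(rows):
    sink.clear()
    sink.extend(rows)

try:
    safe_overwrite_copy(failing_source, overwrite_sink)
except ConnectionError:
    pass  # failure is surfaced to the pipeline; sink untouched

print(sink)  # → ['existing-row']  (sink data preserved)
```

Under the current behavior, by contrast, the overwrite effectively happens regardless of whether the source read succeeded.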
- Comments (1)
RE: Copy Data Activity Should Not Write Data On Source Connection Error
I believe the current functionality also goes against MS best practices, but there is no way for a user to opt out of it. Best practices: https://learn.microsoft.com/en-us/azure/databricks/lakehouse-architecture/reliability/best-practices