🌱 PoC ETL from Azure Storage to CosmosDB
December 12, 2022•320 words
PoC to transfer a CSV file from Azure Storage to Azure CosmosDB.
TL;DR
- Gist containing artifacts: https://gist.github.com/myreli/fea928cf46d328838697833fd354eb23
- Simple ETL implementation to transfer a file from storage to a database (a "modern" implementation of the file transfer integration style)
- Total cost: R$0.69 on Azure Data Factory (Storage and CosmosDB fall under the always-free services)
Concept
Azure Storage Blob → Azure Data Factory → CosmosDB
Deployment
Create a resource group and deploy the template:
az group create --name poc-datafactory --location "East US"
az deployment group create \
--resource-group poc-datafactory \
--template-file poc.bicep \
--parameters dataFactoryName=etl
There are four available parameters, all optional:
- location: defaults to the resource group location
- dataFactoryName: defaults to an auto-generated string
- storageAccountName: defaults to an auto-generated string
- databaseAccountName: defaults to an auto-generated string
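Several parameters can be overridden in one deployment by passing space-separated `key=value` pairs to `--parameters`. A minimal sketch, wrapped in a function so nothing is deployed until it is called. The function name `deploy_poc` and the values `etlpocstorage` and `etl-poc-cosmos` are placeholder assumptions, not names from the gist; storage and database account names generally need to be globally unique, so pick your own.

```shell
# Deploy poc.bicep overriding all four optional parameters.
# All values below are placeholder assumptions; replace them with your own.
deploy_poc() {
  az deployment group create \
    --resource-group poc-datafactory \
    --template-file poc.bicep \
    --parameters \
      location=eastus \
      dataFactoryName=etl \
      storageAccountName=etlpocstorage \
      databaseAccountName=etl-poc-cosmos
}
```

Calling `deploy_poc` after the resource group exists runs the deployment; any parameter left out falls back to its default.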
All other entities are named after their type, e.g.:
resource databaseContainer 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers@2022-05-15' = {
parent: database
name: 'databasecontainer'
Testing
- Upload a CSV file to the storage container
  - Must be delimited by ;
  - Must contain a header row
  - Must contain the name, protein and rating fields
- Manually trigger the pipeline inside Data Factory
- Check the output inside the Cosmos Database
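For a quick test, a small file that meets the CSV requirements above can be generated locally. This is only a sketch: the file name `sample.csv` and its rows are made-up illustration data (not the dataset from the gist), and `check_csv` is a hypothetical helper that only inspects the header row.

```shell
# Write a small semicolon-delimited CSV with the required header row.
# File name and rows are illustrative assumptions, not the gist's dataset.
cat > sample.csv <<'EOF'
name;protein;rating
Sample Cereal A;4;68.4
Sample Cereal B;3;33.9
EOF

# Hypothetical helper: succeeds only if the header is delimited by ';'
# and contains the name, protein and rating fields.
check_csv() {
  header=$(head -n 1 "$1" | tr -d '\r')
  for field in name protein rating; do
    case ";$header;" in
      *";$field;"*) ;;
      *) echo "missing field: $field" >&2; return 1 ;;
    esac
  done
}

check_csv sample.csv && echo "sample.csv looks valid"
```

A comma-delimited file fails the check, which is a cheap way to catch the most likely mistake before uploading.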
Clean up
Delete the entire resource group to prevent waste:
az group delete --name poc-datafactory
Next
- [ ] Automate a trigger based on uploaded files events
- [ ] Expose mapping and translators as parameters at the pipeline execution level
- [ ] Stress test transferring to multiple outputs
Resources
- How to deploy resources with Bicep and Azure CLI
- Quickstart: Create an Azure Data Factory using Bicep
- PoC Template and dataset
- File Transfer architecture integration style
🌱 Seedlings are ideas I have just had that still need cultivating; they have not been reviewed or refined. What is this?