Azure Data Factory – eat{Code}live

June 16, 2016May 29, 2018

Running an Azure Data Factory Pipeline on a Weekday Schedule Using an Azure Function

Introduction

I have written a few posts about different aspects of Azure Data Factory. I use it as the main workhorse of my data integration and ETL projects. One major drawback I have found with Azure Data Factory is the scheduling system, it’s not as flexible as I and many others would like it to be. With that being said there are certainly ways to adapt and get more control of an Azure Data Factory pipeline execution. In my post Starting an Azure Data Factory Pipeline from C# .Net, I outline the need to kick off a pipeline after a local job has completed and how this can be attained by utilizing the SDK to programmatically set the pipelines Start/End dates. You may not have that requirement specifically, but let’s say you want to only run a pipeline during the weekday or another specific schedule, this can be accomplished by utilizing the same code from my prior post and scheduling a local console app. However, I thought it would be more fun to utilize Azure Functions to kick off a pipeline on a weekday schedule to provide a fully cloud based solution. Continue reading “Running an Azure Data Factory Pipeline on a Weekday Schedule Using an Azure Function”

June 3, 2016June 3, 2016

Azure Data Factory Copy Activity Storage Failure Error

I was recently tasked with migrating local integration jobs into the cloud. The first job I began to tackle was an integration job between Salesforce and a local financial system. The integration workflow is to first download the newest entries of a specific object from salesforce, then to push that into an on premise sql server staging table. Once the data is copied locally another job within the financial system will process the staged data.

In order to accomplish this task, I decided to use Azure Data Factory, with the on premise gateway to connect to the local sql database. If you have seen some of my other posts I have used Azure Data Lake Store to land my data, with this pipeline I decided to use Azure Blob Storage. Since the Azure Blob Storage api has the ability to store Append Blobs, I was able to follow a similar pattern as I followed using the Azure Data Lake Store and append the Salesforce data to a blob as I fetched the data.

Once I had some sample Append Blobs in my container my next step was to setup the Azure Data Factory copy activity to get that data transferred to the on premise sql server staging tables. This was where I began to run into issues. After a lot of verification and testing, it turns out Append Blobs are not supported in Azure Data Factory.

Here are a few things to look out for to rule out this issue:

In the blob container blade, it will show the BlobType, check the type of the blobs you are trying to work with in Azure Data Factory.
I also ran into an issue where the data set which was pointing to the AppendBlob would not validate.
When running the Azure Data Factory copy activity against an Append Blob you will see the following error:
Copy activity met storage operation failure at ‘Source’ side. Error message from storage execution: Requested value ‘AppendBlob’ was not found.
This can be a bit misleading if you are not aware AppendBlobs are not supported.

March 13, 2016March 13, 2016

Azure Data Factory Pipeline Not Running (Pending Execution)

When starting out learning about Azure Data Factory, the first example usually completed is a basic copy activity. A great tutorial on moving data from an azure blob to an azure sql table can be found here. I think one of the key pieces of the data movement tutorial that gets missed is setting the external property on the input blob json definition to true. Setting this to true lets azure data factory know that the data is being externally generated and not from another pipeline in the factory. When I began trying my hand at creating pipelines I completely missed this note and was scratching my head wondering why the pipeline would never execute. With that being said I wanted to just walk through a few screens from the azure portal that are very helpful when trying to understand what is happening with your pipeline.

Suppose we have a simple copy example as follows:

Looking at the SampleCopyPipeline blade you will see both datasets:

Selecting the InputBlob, to open that blade you will notice the first problem. In the monitoring section of the InputBlob, there are no slices ready to be worked on. The output of the copy activity is directly the dependent on there being an input that needs to be processed.

Now selecting the OutputTable, to open that blade you will see the common Pending Execution status for the output slice.

As I mentioned before the output slice is directly dependent on the input slice being ready for processing, in order for this slice to begin processing the corresponding input slice needs to be ready. We can see the culprit once we select the output slice to open the output slice blade, which shows the dependent upstream slices it is waiting for.

The input slice is there, however the status is None. Now the solution as I stated above is to set the external property of the input blob to true. After that is complete we can now see that the input blob blade shows a slice Ready for processing.

We can also see the corresponding output slice blade no longer shows it is waiting on the input slice.

Conclusion

When first learning about Azure Data Factory, it may be difficult to follow exactly what is happening with the pipeline. From the azure portal and the new Azure Data Factory Monitoring App you can get a lot of information about the status of your pipeline. Hopefully this example gave you a simple way to start using the portal to better understand what is happening and troubleshoot any possible issues.

February 24, 2016May 29, 2018

Starting an Azure Data Factory Pipeline from C# .Net

Introduction

Azure Data Factory (ADF) does an amazing job orchestrating data movement and transformation activities between cloud sources with ease. Sometimes you may also need to reach into your on-premises systems to gather data, which is also possible with ADF through data management gateways. However, you may run into a situation where you already have local processes running or you cannot run a specific process in the cloud, but you still want to have a ADF pipeline dependent on the data being processed locally. For example you may have an ETL process that begins with a locally run process that stores data in Azure Data Lake. Once that process is completed you want the ADF pipeline to being processing that data and any other activities or pipelines to follow. The key is starting the ADF pipeline only after the local process has completed. This post will highlight how to accomplish this through the use of the Data Factory Management API.

Prerequisites

You will need an Azure Subscription. Free Trial
Visual Studio. Visual Studio Community 2015
A Data Factory you want to manually start from .Net

Continue reading “Starting an Azure Data Factory Pipeline from C# .Net”

February 6, 2016May 29, 2018

Accessing Azure Data Lake Store from an Azure Data Factory Custom .Net Activity

04/05/2016 Update: If you are looking to use the latest version of the Azure Data Lake Store SDK (Microsoft.Azure.Management.DataLake.Store 0.10.1-preview) please see my post Upgrading to Microsoft.Azure.Management.DataLake.Store 0.10.1-preview to Access Azure Data Lake Store Using C# for what needs to be done to update the DataLakeHelper class.

Introduction

When working with Azure Data Factory (ADF), having the ability to take advantage of Custom .Net Activities greatly expands the ADF use case. One particular example where a Custom .Net Activity is necessary would be when you need to pull data from an API on a regular basis. For example you may want to pull sales leads from the Salesforce API on a daily basis or possibly some search query against the Twitter API every hour. Instead of having a console application scheduled on some VM or local machine, this can be accomplished with ADF and a Custom .Net Activity.

With the data extraction portion complete the next question is where would the raw data land for continued processing? Azure Data Lake Store of course! Utilizing the Azure Data Lake Store (ADLS) SDK, we can land the raw data into ADLS allowing for continued processing down the pipeline. This post will focus on an end to end solution doing just that, using Azure Data Factory and a Custom .Net Activity to pull data from the Salesforce API then landing it into ADLS for further processing. The end to end solution will run inside a Custom .Net Activity but the steps here to connect to ADLS from .net are universal and can be used for any .net application.

Prerequisites

You will need an Azure Subscription. Free Trial
Visual Studio. Visual Studio Community 2015
(ADLS)Azure Data Lake Store

Continue reading “Accessing Azure Data Lake Store from an Azure Data Factory Custom .Net Activity”

January 18, 2016May 29, 2018

Deploy Azure Data Factory using PowerShell

Azure Data Factory is a cloud based, scalable orchestration service. Currently I am using Azure Data Factory (ADF) to coordinate and schedule a large scale ETL process. Building a data factory is a pretty easy process, consisting of various JSON definition files, representing linked services, data sets and pipelines connected together to perform an action. For simple processes it is not very difficult to create the JSON definition files directly in the Azure Portal. However, when you are dealing with very large complex processes that you are trying to orchestrate together, it can become a bit unruly to have to manage and deploy all of those files manually.

Visual Studio

Thankfully you can install Azure Data Factory Tools for visual studio. Installing this extension gives you a data factory project template and the ability to publish directly to the portal.

With that being said I have occasionally had issues publishing from Visual Studio, thus the reason for writing this post. Just like everything else we can use some simple PowerShell scripts to deploy ADF JSON definition files. These are very basic scripts to push updates to the data factory, they will not validate the JSON script or do any reference validation that the data factory tools will perform. I mainly use the data set portion of the script because that is where the largest number of files exist.

I still use visual studio and an ADF project to do the authoring of my JSON scripts, the script files are organized into LinkedService, Pipeline and Table folders.

One drawback I have with the ADF project template is that the actual file location does not mirror this same folder structure visible in Visual Studio. When looking in the actual project folder on disk you will see all the JSON scripts mixed together. With all scripts landing in a single folder I have found that following a verbose naming convention is very helpful.

I use a naming convention as follows:

LinkedService_[Type of Service Abbreviation]_[Service Name]
- For example if I have an Azure Data Lake Storage account called SampleDataLake the file name would be “LinkedService_ADLS_SampleDataLake”
DataSet_[Type of Data Set]_[DataSet Name]
- For example if I have an Azure Sql Data Warehouse with an Employee Dimenision table named Employee_D the file name would be “DataSet_ASDW_Employee_D”
Pipeline_[Description]
- For example if I had a pipeline which created my employee dimension the file would be named “Pipeline_Employee_Dimension_ETL”

As you can see this is a very explicit naming convention to easily convey what that JSON script is meant to do. This also is very helpful when writing the PowerShell that will loop over specific types of files to deploy.

PowerShell Scripts

The first script is to get your PowerShell session authenticated with the Azure account you would like to deploy your data factory to.

Add-AzureRmAccount

After you are authenticated, you can now create your actual data factory.

New-AzureRmDataFactory -ResourceGroupName "ADF" -Name "SampleFactory" -Location "WestUS"

Once you have created the data factory you can now use that same resource group name and factory name to deploy the rest of the files. Please remember to do them in the following order, Linked Services then Data Sets and last Pipelines. By following the naming convention above you can see the scripts below taking advantage of some simple filtering on the name of the file to deploy the JSON files based on the type they define. The following 3 loops beginning with the Linked Services, deploy each file to the newly created data factory from the project folder.

foreach($file in Get-ChildItem "[ProjectFolder]" -filter "LinkedService*")
{
  New-AzureRmDataFactoryLinkedService -ResourceGroupName "ADF" -DataFactoryName "SampleFactory" -Name $file.BaseName -File $file.FullName -Force | Format-List
}
foreach($file in Get-ChildItem "[ProjectFolder]" -filter "DataSet*")
{
 New-AzureRmDataFactoryDataset -ResourceGroupName "ADF" -DataFactoryName "SampleFactory" -Name $file.BaseName -File $file.FullName -Force | Format-List
}
foreach($file in Get-ChildItem "[ProjectFolder]" -filter "Pipeline*")
{
 New-AzureRmDataFactoryPipeline -ResourceGroupName "ADF" -DataFactoryName "SampleFactory" -Name $file.BaseName -File $file.FullName -Force | Format-List
}

Conclusion

Azure Data Factory is a great service for orchestrating large scale jobs. I would absolutely recommend getting the ADF tooling for visual studio it is definitely a step up from trying to do all the authoring within the Azure Portal. The goal of this post is not to dissuade the use of the tooling, but to provide a quick solution to deploying the scripts to the Azure Portal.

January 7, 2016January 7, 2016

Resolving Syntax Error Executing U-SQL on Azure Data Lake Analytics from Azure Data Factory

Problem

While working with Azure Data Lake Analytics(ADLA) and Azure Data Factory(ADF) I ran into a strange syntax error when trying to run a DataLakeAnalyticsU-SQL activity in ADF. Before integrating the script with ADF I always test the script first by submitting the job directly to ADLA. From Visual Studio I created a new U-SQL project, wrote the desired script and then submitted the job to ADLA. The script ran correctly, at this point I was ready to run the script with ADF (Run U-SQL script on ADLA from ADF).

The syntax error I received looked something like this:

Solution

Fortunately I have been in contact with some of the ADF and ADL team members, who have been excellent in helping to resolve any issues I run into. Continue reading “Resolving Syntax Error Executing U-SQL on Azure Data Lake Analytics from Azure Data Factory”