Accessing Azure Data Lake Store from an Azure Data Factory Custom .Net Activity

04/05/2016 Update: If you are looking to use the latest version of the Azure Data Lake Store SDK (Microsoft.Azure.Management.DataLake.Store 0.10.1-preview) please see my post Upgrading to Microsoft.Azure.Management.DataLake.Store 0.10.1-preview to Access Azure Data Lake Store Using C# for what needs to be done to update the DataLakeHelper class.

Introduction

When working with Azure Data Factory (ADF), having the ability to take advantage of Custom .Net Activities greatly expands the ADF use case. One particular example where a Custom .Net Activity is necessary would be when you need to pull data from an API on a regular basis. For example you may want to pull sales leads from the Salesforce API on a daily basis or possibly some search query against the Twitter API every hour. Instead of having a console application scheduled on some VM or local machine, this can be accomplished with ADF and a Custom .Net Activity.

With the data extraction portion complete the next question is where would the raw data land for continued processing? Azure Data Lake Store of course! Utilizing the Azure Data Lake Store (ADLS) SDK, we can land the raw data into ADLS allowing for continued processing down the pipeline. This post will focus on an end to end solution doing just that, using Azure Data Factory and a Custom .Net Activity to pull data from the Salesforce API then landing it into ADLS for further processing.  The end to end solution will run inside a Custom .Net Activity but the steps here to connect to ADLS from .net are universal and can be used for any .net application.

Prerequisites

Continue reading “Accessing Azure Data Lake Store from an Azure Data Factory Custom .Net Activity”

The Lesser Known Resolution to the Unexpected Number of Columns Error Executing U-SQL on Azure Data Lake Analytics

Problem

Azure Data Lake Store (ADLS) gives you the ability to store all raw data in one location readily accessible for analysis. In my particular case I am pulling data from Salesforce and using the ADLS .net SDK to store the results in the data lake. If anyone has worked with Salesforce they know that it is possible to have an object with hundreds of custom fields, this leads to a file being stored in ADLS with hundreds of columns. One of the first translations I wanted to use U-SQL and Azure Data Lake Analytics (ADLA) for was to only evaluate a subset of the data by querying a few columns from the hundreds that might exist in a file.

An example script might look like this:

@tasks_raw =
    EXTRACT AccountId string,
            Id string,            
            OwnerId string,            
            Status string,
            Subject string,
	     ....
	     .... More Fields ....,
	     ....
            WhatCount int?,
            WhatId string,
            WhoCount int?,
            WhoId string
    FROM "/RawData/Salesforce/Task/2016/01/25/Task.csv"
    USING Extractors.Csv();



@tasks_subset =
    SELECT AccountId,
           ActivityDate,          
           CreatedById,
           CreatedDate,
           Id,           
           OwnerId,
           Status
    FROM @tasks_raw;



OUTPUT @tasks_subset
TO "/Subset/Salesforce/Task/2016/01/25/Task.csv"
USING Outputters.Csv();

The first step is to impose a schema on top of the current file by using the “Extract” syntax, there by storing that into a variable. Then from the newly created variable I can select only a subset of columns which I need for later processing to store. The common problem I have run into is with the Extract portion of this script. I have frequently received the Unexpected Number of Columns error as seen in the screen shot below.

image

Solution

The most common cause of this error, especially when the input file is a CSV, is a “comma” in one of the data cells. After I completed all possible sanitization of the data before writing it to ADLS using the .net SDK I still received this error.

I began a dialog with the Azure Data Lake team, who informed me that using the ADLS .net SDK to create and append data to a file in ADLS could be causing a column alignment issue. If you are using the .net SDK to create a file and then append data to it, be aware there is a 4 MB limit on the request. This means that if you send a request with N number of rows and that batch is over 4 MB, the request will terminate in the middle of the record and not in a record boundary delimiter. The solution here is simple, I needed to add some logic to batch the rows to be stored in 4 MB or less chunks, ensuring that the final row would end on a record boundary.

Some sample code would look like this:

public void StoreData(string path, List rows, bool append)
{
	var buffer = new MemoryStream();
	var sw = new StreamWriter(buffer);

	foreach (var row in rows)
	{
		if (buffer.Length + Encoding.UTF8.GetByteCount(row) > FOURMB)
		{
			buffer.Position = 0;
			if (append)
			{
				execute_append(path, buffer);
			}
			else
			{
				execute_create(path, buffer);
				append = true;
			}

			buffer = new MemoryStream();
			sw = new StreamWriter(buffer);
		}
		sw.Write(row);
		sw.Flush();
	}

	if (buffer.Length <= 0) return;

	buffer.Position = 0;
	if (append)
	{
		execute_append(path, buffer);
	}
	else
	{
		execute_create(path, buffer);
	}	

}


private AzureOperationResponse execute_create(string path, MemoryStream ms)
{
	var beginCreateResponse = inner_client.FileSystem.BeginCreate(path, DataLakeAppConfig.DataLakeAccountName, new FileCreateParameters());
	var createResponse = inner_client.FileSystem.Create(beginCreateResponse.Location, ms);
	return createResponse;
}

private AzureOperationResponse execute_append(string path, MemoryStream ms)
{

	var beginAppendResponse = inner_client.FileSystem.BeginAppend(path, DataLakeAppConfig.DataLakeAccountName, null);
	var appendResponse = inner_client.FileSystem.Append(beginAppendResponse.Location, ms);
	return appendResponse;
}


Conclusion

While working with any type of delimited file you no doubt have run into this unexpected number of columns error. This problem is exacerbated by the fact that you may have hundreds of columns in the file which makes it very difficult to track down. If you are using the ADLS SDK to store files in the data lake and you feel you have thoroughly gone through all other possible solutions, give this a shot. Either way it might be worth changing your storing pattern to avoid this problem on a future data file.

Deploy Azure Data Factory using PowerShell

Azure Data Factory is a cloud based, scalable orchestration service.  Currently I am using Azure Data Factory (ADF) to coordinate and schedule a large scale ETL process.  Building a data factory is a pretty easy process, consisting of various JSON definition files, representing linked services, data sets and pipelines connected together to perform an action.  For simple processes it is not very difficult to create the JSON definition files directly in the Azure Portal.  However, when you are dealing with very large complex processes that you are trying to orchestrate together, it can become a bit unruly to have to manage and deploy all of those files manually.

Visual Studio

Thankfully you can install Azure Data Factory Tools for visual studio.  Installing this extension gives you a data factory project template and the ability to publish directly to the portal.

image

With that being said I have occasionally had issues publishing from Visual Studio, thus the reason for writing this post.  Just like everything else we can use some simple PowerShell scripts to deploy ADF JSON definition files.  These are very basic scripts to push updates to the data factory, they will not validate the JSON script or do any reference validation that the data factory tools will perform.  I mainly use the data set portion of the script because that is where the largest number of files exist.

I still use visual studio and an ADF project to do the authoring of my JSON scripts, the script files are organized into LinkedService, Pipeline and Table folders.

image

One drawback I have with the ADF project template is that the actual file location does not mirror this same folder structure visible in Visual Studio.  When looking in the actual project folder on disk you will see all the JSON scripts mixed together.  With all scripts landing in a single folder I have found that following a verbose naming convention is very helpful.

I use a naming convention as follows:

  • LinkedService_[Type of Service Abbreviation]_[Service Name]
    • For example if I have an Azure Data Lake Storage account called SampleDataLake the file name would be “LinkedService_ADLS_SampleDataLake
  • DataSet_[Type of Data Set]_[DataSet Name]
    • For example if I have an Azure Sql Data Warehouse with an Employee Dimenision table named Employee_D the file name would be “DataSet_ASDW_Employee_D
  • Pipeline_[Description]
    • For example if I had a pipeline which created my employee dimension the file would be named “Pipeline_Employee_Dimension_ETL

As you can see this is a very explicit naming convention to easily convey what that JSON script is meant to do.  This also is very helpful when writing the PowerShell that will loop over specific types of files to deploy.

PowerShell Scripts

The first script is to get your PowerShell session authenticated with the Azure account you would like to deploy your data factory to.

Add-AzureRmAccount

After you are authenticated, you can now create your actual data factory.

New-AzureRmDataFactory -ResourceGroupName "ADF" -Name "SampleFactory" -Location "WestUS"

Once you have created the data factory you can now use that same resource group name and factory name to deploy the rest of the files. Please remember to do them in the following order, Linked Services then Data Sets and last Pipelines.  By following the naming convention above you can see the scripts below taking advantage of some simple filtering on the name of the file to deploy the JSON files based on the type they define.  The following 3 loops beginning with the Linked Services, deploy each file to the newly created data factory from the project folder.

foreach($file in Get-ChildItem "[ProjectFolder]" -filter "LinkedService*")
{
  New-AzureRmDataFactoryLinkedService -ResourceGroupName "ADF" -DataFactoryName "SampleFactory" -Name $file.BaseName -File $file.FullName -Force | Format-List
}
foreach($file in Get-ChildItem "[ProjectFolder]" -filter "DataSet*")
{
 New-AzureRmDataFactoryDataset -ResourceGroupName "ADF" -DataFactoryName "SampleFactory" -Name $file.BaseName -File $file.FullName -Force | Format-List
}
foreach($file in Get-ChildItem "[ProjectFolder]" -filter "Pipeline*")
{
 New-AzureRmDataFactoryPipeline -ResourceGroupName "ADF" -DataFactoryName "SampleFactory" -Name $file.BaseName -File $file.FullName -Force | Format-List
}

Conclusion

Azure Data Factory is a great service for orchestrating large scale jobs.  I would absolutely recommend getting the ADF tooling for visual studio it is definitely a step up from trying to do all the authoring within the Azure Portal.  The goal of this post is not to dissuade the use of the tooling, but to provide a quick solution to deploying the scripts to the Azure Portal.

Resolving Syntax Error Executing U-SQL on Azure Data Lake Analytics from Azure Data Factory

Problem

While working with Azure Data Lake Analytics(ADLA) and Azure Data Factory(ADF) I ran into a strange syntax error when trying to run a DataLakeAnalyticsU-SQL activity in ADF. Before integrating the script with ADF I always test the script first by submitting the job directly to ADLA. From Visual Studio I created a new U-SQL project, wrote the desired script and then submitted the job to ADLA. The script ran correctly, at this point I was ready to run the script with ADF (Run U-SQL script on ADLA from ADF).

The syntax error I received looked something like this:

 image

Solution

Fortunately I have been in contact with some of the ADF and ADL team members, who have been excellent in helping to resolve any issues I run into. Continue reading “Resolving Syntax Error Executing U-SQL on Azure Data Lake Analytics from Azure Data Factory”

Querying Azure SQL Database from an Azure Data Lake Analytics U-SQL Script

Introduction

With the public preview of Azure Data Lake Analytics and U-SQL scripting language becoming available, doing big data processing has become simpler and less daunting for the non-hadooper.  Besides the simplicity, the ability to query external data sources such as Azure SQL Database, Azure SQL Data Warehouse and SQL running on Azure VM, from within a U-SQL script, creates a very compelling case to jump on board.  The focus of this post is my experience with getting that external connection up and running with Azure SQL Database serving as my external source.

Prerequisites

Continue reading “Querying Azure SQL Database from an Azure Data Lake Analytics U-SQL Script”

PowerShell Validation Script for Azure RemoteApp Image

Azure RemoteApp allows you to host windows apps on any device.  I recently had to setup a proof of concept using Azure RemoteApp and an internal client server application.  The actual process is pretty straight forward if you are familiar with the process of capturing an image of an Azure VM or you already have an image you would like to use.

If you are starting this process from scratch make sure you use the Windows Server 2012 Remote Desktop Session Host image, this image is a base image valid for use with Azure RemoteApp.  This image also contains the RemoteApp PowerShell Validation Script that is a required step before Sysprepping the VM and capturing the image.  Also note that this image seems to only be available if you are creating the Azure VM from the Old Azure Portal.

Continue reading “PowerShell Validation Script for Azure RemoteApp Image”