Connect Teradata Vantage to Azure Data Factory Using Custom Activity Feature
Run scripts to move and transform data leveraging Vantage
Many Teradata customers are interested in integrating Teradata Vantage with Microsoft Azure First Party Services. This Getting Started Guide will help you connect Teradata Vantage using the Azure Data Factory Custom Activity feature. Although this approach has been implemented and tested internally, it is offered on an as-is basis. Microsoft does not provide validation of Teradata Vantage with the Azure Data Factory Custom Activity feature.
Disclaimer: This guide includes content from both Microsoft and Teradata product documentation.
Overview
We will use the Teradata Access Module for Azure LOAD script for high-speed data movement from Azure Blob Storage to Teradata Vantage, followed by a BTEQ script for further in-database transformation, using the Azure Data Factory Custom Activity feature. Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
There are two types of activities that you can use in an Azure Data Factory pipeline.
- Data movement to move data between supported source and sink data stores.
- Data transformation to transform data using compute services such as Azure HDInsight, Azure Batch, and Azure Machine Learning.
Some customers have large data movement requirements where ODBC may not perform adequately, or they need to transform or process data in a way that isn't supported by Data Factory. Fortunately, you can create a Custom Activity with your own data movement or transformation logic and use the activity in a pipeline in Data Factory. The custom activity runs your customized code logic on an Azure Batch pool of virtual machines.
This is a diagram of the workflow. You will need to install the Vantage client software on a compute node.
Prerequisites
You are expected to be familiar with Azure services and Teradata. You will need the following accounts, objects, and systems. Links have been included with setup instructions.
- A Microsoft Azure subscription. You can start with a free account.
- An Azure Blob Storage account and container.
- An Azure Batch Service account and virtual machine (VM) pool for compute nodes. You will need to install the Vantage client software on the compute nodes.
- An Azure Data Factory instance and a pipeline with custom activity.
- Teradata Vantage with a user and password. The user must have permission and space to create a table. Vantage must be accessible from Azure services.
- Sample data file, which will be uploaded to your storage account. It is available at https://github.com/teradatadownload/azure/raw/master/dataset.
Getting Started
For Azure Data Factory Custom Activity to run scripts against Vantage, you need an Azure Blob Storage account and container, an Azure Batch Service account and pool of compute nodes (i.e., Windows or Linux VMs), Vantage client software on a virtual machine, and scripts in an Azure Blob Storage account. Finally, we will create an Azure Data Factory pipeline job using the Custom Activity feature to execute your scripts on the compute node VM against Vantage.
If you are new to Azure Data Factory, see Introduction to Azure Data Factory.
Create an Azure Blob Storage account and container
We need an Azure Blob Storage account to store the following:
- A dataset to load into a Vantage table.
- The load and transformation scripts, which the Custom Activity will access and execute.
We suggest creating the storage account with two (2) containers: one for the dataset and one for the scripts. This document will use the names data and files for these containers. You can call them something entirely different; however, we will refer to the containers as data and files throughout, so you will need to adjust the directions for the names that you actually used.
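If you prefer scripting the setup over using the portal, here is a minimal sketch using the Azure CLI; it assumes you are already logged in with az login, and the account name, resource group, and region are placeholders.

    az storage account create --name <storageaccountname> --resource-group <resourcegroup> --location <region> --sku Standard_LRS
    az storage container create --name data --account-name <storageaccountname>
    az storage container create --name files --account-name <storageaccountname>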
Create an Azure Batch Service
Azure Batch creates and manages a pool of virtual machines, installs the applications you want to run, and schedules jobs to run on the virtual machines.
1. Log on to the Azure portal and create a Batch account.
2. After deployment, click on Go to resource to view the Batch account in preparation for the next step.
Create a pool of compute nodes
We need to create a pool of virtual machines, which can be either Windows- or Linux-based. We will use Windows. For simplicity, we will only create a pool with a single virtual machine. In production, you would create more virtual machines for scaling and redundancy.
1. In the Batch account, click on Pools in the left pane under Features and then +Add. Enter a Pool ID name.
2. For the Operating System, choose the following.
3. For the Node Size, choose the following.
4. For the Scale, choose the following.
5. Leave the other options at their default values. Click the Save button.
Batch will create the pool immediately, but it will take a few minutes to allocate and start the virtual machines.
6. Once the Allocation state shows as Steady, click on your Pool ID.
You may need to click on Refresh to update the Allocation state.
7. Click on Nodes in the left pane and then click on your specific node.
8. Create a user on the virtual machine by clicking on Connect. Select the Specify your own option. Enter a Username and Password, set Is administrator to true, and set an appropriate Expiry Time. Click on Add user account. Click on Download RDP file. (Alternatively, you can copy the IP address and port and manually connect with the Remote Desktop application.)
Note that the password must be sufficiently complex. An error message will remind you of what characters and length are required.
We will use the remote connection to the server to install software in the next section.
Install Vantage client software
We use the Teradata Access Module for Azure as well as BTEQ scripts in this document. However, you can use other client software tools and scripting languages, such as Teradata Parallel Transporter or the Teradata Vantage Python package for Advanced Analytics, as an alternative.
1. Log into the Windows virtual machine created in the previous step, if you have not already done so. (You can use the downloaded RDP file as a shortcut). You will need the username and password you created in the previous section.
2. Open Server Manager. Select Local Server in the left pane.
3. Set IE Enhanced Security to Off.
While this is normally a security concern, it is acceptable for our purposes. You can always turn it back on.
4. Open Internet Explorer and go to the Teradata Downloads site. If you do not have a login, you will need to register. Click on TTU Windows Install and then on the TTU 16.20.xx.yy Windows – Base link. You will be asked to log in and agree to the licensing provisions. Choose to save the file.
The client software is periodically updated. The version will change. The latest version is acceptable.
5. Once the download completes, open the downloaded file. It appears as a folder because it is a compressed file. Right-click on the file and select Extract all. A new Explorer window will appear. Open the folder TeradataToolsAndUtilitiesBase. Double-click to run the setup program. Accept any defaults. Select BTEQ, Teradata Parallel Transporter Base, Teradata Parallel Transporter Stream, and Teradata Access Module for Azure and click the Install button.
You can verify that the software installed by opening a command prompt or PowerShell window and running BTEQ. The BTEQ program will start. Enter “.quit” to exit.
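For example, a quick check from PowerShell looks roughly like this (the banner text will vary with your TTU version):

    PS C:\> bteq
    Teradata BTEQ 16.20.00.xx for WIN32.
    Enter your logon or BTEQ command:
    .quit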
Create Teradata Access Module for Azure and BTEQ scripts
Azure Data Factory Custom Activity requires the scripts to be in an Azure Blob Storage account container. We will put them in the previously created container, files. We will then load a data file from an Azure Blob Storage container, called data, into Vantage.
For our access module load script to access the Azure Blob Storage account, we need to create a credentials file. For more information and an example of a credentials file, see Teradata Tools and Utilities Access Module Reference.
1. You may place the credentials file wherever you want, as you can use the -ConfigDir parameter to specify its path. We will create a directory, azureaxsmod, in the Teradata directory. (That is, c:\users\public\azureaxsmod.) The file name is credentials.
Here is a sample credentials file. The StorageAccountKey below is abbreviated. You will need to ensure that your full key is in your file. Note that the file name has no extension.
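The sketch below assumes the [default] section layout described in the Access Module Reference; the account name is a placeholder, and the key shown is a made-up, shortened value.

    [default]
    StorageAccountName = <your storage account name>
    StorageAccountKey = aBcDeFg1234...XyZ==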
Replace any value within the angled brackets with actual values first.
You can find the StorageAccountName and StorageAccountKey values in the Azure portal by selecting the storage account's resource group, then the Storage account, and finally Access keys.
2. Create a batch file, tbuild_load.bat, for the Custom Activity. This can be placed in the azureaxsmod directory.
Note that the full path name to the script and variable files is included. You will need to adjust these paths for your environment.
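A minimal sketch of such a batch file, assuming TTU 16.20 in its default installation path and the scripts in c:\users\public\azureaxsmod (the job name at the end is arbitrary):

    @echo off
    rem Run the Teradata Parallel Transporter job: -f names the job script, -v names the job variables file.
    "C:\Program Files\Teradata\client\16.20\bin\tbuild.exe" -f "c:\users\public\azureaxsmod\tdaxs4az(load).txt" -v "c:\users\public\azureaxsmod\jobvars(load).txt" azure_load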
3. Create an access module load script, tdaxs4az(load).txt, which will be run by the Custom Activity. This can be placed in the azureaxsmod directory. Replace any value within the angled brackets with actual values first.
The script will first drop any previous staging and error tables and then create our Vantage staging table. Next, the script will perform a load to the staging table using an existing data file, dataset, in our Azure Blob Storage container data.
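Your actual script will differ, but a rough sketch of its shape, written in the TPT operator-template style ($DDL, $LOAD, and $FILE_READER are the standard templates; everything in angled brackets is a placeholder), looks like the following. Verify the syntax against the Teradata Parallel Transporter documentation.

    DEFINE JOB LOAD_FROM_AZURE
    DESCRIPTION 'Load the Azure Blob Storage data file into a Vantage staging table'
    (
      STEP Setup_Tables
      (
        /* Drop the staging table and its error tables from any prior run, then re-create it. */
        APPLY
          ('DROP TABLE <staging_table>;'),
          ('DROP TABLE <staging_table>_ET;'),
          ('DROP TABLE <staging_table>_UV;'),
          ('CREATE TABLE <staging_table> ( <column definitions> );')
        TO OPERATOR ($DDL);
      );
      STEP Load_Data
      (
        /* $FILE_READER reads the blob through the Azure access module,
           configured by the job variables file created in the next step. */
        APPLY $INSERT TO OPERATOR ($LOAD)
        SELECT * FROM OPERATOR ($FILE_READER());
      );
    );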
4. Create a job variable (jobvars) file, jobvars(load).txt, which will supply directions to the access module script. This can be placed in the azureaxsmod directory. Replace any value within the angled brackets with actual values first.
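As a rough sketch only, a job variables file pairing with the template-style script sketched above might look like the following. The variable names follow the TPT operator-template naming conventions (Target* for $LOAD, FileReader* for $FILE_READER); if your load script defines its own operators, use whatever variable names it references. The access module file name and the init-string options beyond -ConfigDir are assumptions to verify against the Access Module Reference, and DDLErrorList = '3807' simply lets the DROP statements succeed when the tables do not yet exist.

    TargetTdpId                    = '<vantage host name or IP>'
    ,TargetUserName                = '<database user>'
    ,TargetUserPassword            = '<password>'
    ,TargetTable                   = '<staging_table>'
    ,DDLErrorList                  = '3807'
    ,FileReaderFormat              = 'Delimited'
    ,FileReaderTextDelimiter       = ','
    ,FileReaderOpenMode            = 'Read'
    ,FileReaderFileName            = 'dataset'
    ,FileReaderAccessModuleName    = 'libazureaxsmod.dll'
    ,FileReaderAccessModuleInitStr = '-ConfigDir c:\users\public\azureaxsmod <container and object options>'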
5. Create a batch file, bteq_transform.bat, for the Custom Activity to use when invoking a BTEQ script. This can be placed in the azureaxsmod directory.
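A minimal sketch, assuming BTEQ is on the PATH of the compute node; output.txt is the log file referred to later in the troubleshooting section.

    @echo off
    rem Feed the BTEQ transformation script to bteq and capture all output for troubleshooting.
    bteq < "c:\users\public\azureaxsmod\update_inssel.txt" > output.txt 2>&1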
6. Create a BTEQ script, update_inssel.txt, which will be run by the Custom Activity. This can be placed in the azureaxsmod directory. Replace any value within the angled brackets with actual values first.
The script will drop and then create our target table, perform an UPDATE statement on the recently loaded rows in the staging table, and then perform an INSERT/SELECT into our target table. This can be made much more complex in a production scenario, but it shows how additional processing can be performed.
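A rough sketch of the shape of such a script follows; the logon values, table names, columns, and the actual transformation logic are all placeholders to replace with your own, and the .SET ERRORLEVEL line keeps the first run from failing when the target table does not exist yet.

    .LOGON <vantage host name or IP>/<database user>,<password>
    .SET ERRORLEVEL 3807 SEVERITY 0
    DROP TABLE <target_table>;
    CREATE TABLE <target_table> ( <column definitions> );
    /* Adjust the recently loaded rows in the staging table, then copy them to the target. */
    UPDATE <staging_table> SET <column> = <expression> WHERE <condition>;
    INSERT INTO <target_table> SELECT * FROM <staging_table>;
    .LOGOFF
    .QUIT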
Upload the files to Azure Blob Storage
1. Upload the previously created files to the container, files, in Azure Blob Storage.
- tbuild_load.bat
- bteq_transform.bat
- tdaxs4az(load).txt
- jobvars(load).txt
- update_inssel.txt
We created these files on the Windows virtual machine. You will need to copy them to your local system or whichever system that you are using to access the Azure portal.
Alternatively, you can use the Azure Storage Explorer. We do not cover how to configure or use the Azure Storage Explorer but it is fairly intuitive.
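If you would rather script the upload, here is a minimal sketch with the Azure CLI, run from the directory holding the five files (the account name is a placeholder):

    az storage blob upload-batch --destination files --source . --account-name <storageaccountname>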
Create an Azure Data Factory pipeline job with a Custom Activity
We need an Azure Data Factory instance to create our Custom Activity pipeline job.
1. Create a data factory. De-select Enable GIT as we are not using it.
2. After deployment succeeds, open the Data Factory instance. Select the Author & Monitor tile.
This starts the Azure Data Factory user interface (UI) application on a separate tab. You may need to (re‑)select your Azure account as this is opening a new browser tab.
3. Click on Create pipeline.
4. Open the Batch Service folder under Activities in the left pane. Drag Custom into your pipeline area on the right.
5. Drag another instance of Custom into the pipeline area. Drag the green connector from the first Custom instance to the second Custom instance.
6. Select the first Custom instance. In the General section below, rename it from Custom1 to Load to Vantage.
Select the second Custom instance and rename it from Custom2 to Transform in Vantage.
7. In the other browser tab, which should still be open to the Azure portal, look up the Batch account. You will need information about the Batch account that you previously created. Click on Keys in the left pane under Settings for the URL and access key. Click on Pools in the left pane under Features for the pool ID.
8. Select the Load to Vantage object. Click the Azure Batch tab below. Click +New to add a new linked service.
9. In the New linked service (Azure Batch) dialog, enter the Batch account name into Account name, the Batch access key into Access key, the Batch account URL into Batch URL, and the Batch account pool ID into Pool name.
10. Select the Storage linked service name dropdown and then click on +New. Select your Azure subscription, your Storage account name, and then click Create. Once the dialog box is dismissed, click Create again.
11. Select the Transform in Vantage object. Click the Azure Batch tab below. In the Azure Batch linked service, select the Azure Batch linked service you just created (e.g. AzureBatch1).
12. Select the Load to Vantage object. Click the Settings tab below. In the Command text box, enter “cmd /c tbuild_load.bat”. This will run the access module load script. Set Resource linked service to the Azure Blob Storage linked service connection you just created (e.g. AzureBlobStorage1). Set the Folder path to our container, files.
13. Select the Transform in Vantage object. Click the Settings tab below. In the Command text box, enter “cmd /c bteq_transform.bat”. This will run the BTEQ script. Set Resource linked service to the Azure Blob Storage linked service connection you just created (e.g. AzureBlobStorage1). Set the Folder path to our container, files.
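Behind the authoring UI, each Custom activity is stored as JSON in the pipeline definition. As a point of reference, the Load to Vantage activity ends up looking roughly like the sketch below, which follows the Data Factory v2 Custom activity schema and uses the example linked service names from the steps above; compare it against your own published pipeline rather than treating it as exact.

    {
      "name": "Load to Vantage",
      "type": "Custom",
      "linkedServiceName": {
        "referenceName": "AzureBatch1",
        "type": "LinkedServiceReference"
      },
      "typeProperties": {
        "command": "cmd /c tbuild_load.bat",
        "resourceLinkedService": {
          "referenceName": "AzureBlobStorage1",
          "type": "LinkedServiceReference"
        },
        "folderPath": "files"
      }
    }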
14. Click on Publish all. In the dialog box, click on Publish.
15. Once the pipeline has been published, click on Debug to run the job.
The job status will appear in the bottom pane.
If the job ran successfully, Status will display Succeeded. You can check the target table on Vantage.
Troubleshooting a job failure
A job failure will show as Failed in the status pane. If you look in the drawing pane, you should see a red X that will give you an indication of where the failure occurred.
The first step is to ensure that the pipeline can make successful connections to the pool and storage. Select the Load to Vantage object and then the Azure Batch tab below. Click on Test Connection. Fix any errors encountered.
Click on the Settings tab and, again, click on Test Connection. Fix any errors encountered.
Repeat this for the Transform in Vantage object.
If the job still fails, log into the Windows virtual machine. Open a PowerShell (or command prompt) window and run the batch job, tbuild_load.bat, manually. Fix any errors encountered. Then run the batch job, bteq_transform.bat, manually. Fix any errors encountered.
Run the Data Factory job again.
If there are still errors, you can investigate from the Azure portal.
Open Batch Service resource group > Batch account > Pools > [Pool ID] > Nodes > Name. In your node folders, click on workitems > adfv2-[Pool ID] > job-[n] > [unique job identifier], where the stderr.txt and stdout.txt results of your job can be found.
To view our load process, open the stdout.txt file under the unique job identifier.
To view our transform process, open the unique job identifier and change to the wd directory, where you will find the output.txt file we specified in bteq_transform.bat.
Similarly, you can find the job run results on the Windows VM under D:\batch\tasks\workitems\adfv2-[Pool ID]\job-[n]\[unique job identifier].
Summary
You can use Azure Data Factory with the Custom Activity feature to load data into Teradata Vantage and transform it. We used the Azure Batch Service, created a Windows virtual machine compute pool, and installed Vantage client software to run our scripts against Vantage. We also used an Azure Blob Storage container to store the scripts for the Custom Activity pipeline job.