Azure Data Factory is a service for moving data; it also provides data integration capabilities. In this article I am going to use Azure Data Factory to copy (not move) data from an SFTP server to an Azure Data Lake Store. I will not use the data integration functions, only copy files.
The Azure Data Lake Store provides unlimited storage capacity for big data. It exposes an HDFS-compatible interface, so products like HDInsight can connect to it for data analysis.
Needed for this article: an Azure subscription, an SFTP server reachable with key-based authentication, and an existing Azure Data Lake Store account.
In the Azure portal we will create the Azure Data Factory. It is only available in a small number of Azure regions at the time of writing.
Once the Azure Data Factory is created, click on the Copy Data button.
This will open the Azure Data Factory editor with the Copy Wizard.
The first step is to enter a name for the copy job (a job is called a pipeline in Data Factory).
The next step is to choose between running the pipeline once or on a recurring schedule. I will select the recurring schedule.
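Behind the scenes, the Copy Wizard records choices like this in JSON definitions. As a rough sketch (the field names follow the Data Factory v1 JSON schema as I understand it; the exact values here are placeholders, not what the wizard will produce for you), an hourly schedule looks something like this:

```python
import json

# Hypothetical sketch of the scheduling section the Copy Wizard generates.
# In Data Factory v1, a dataset's "availability" block drives how often
# the pipeline runs; an hourly cadence would look roughly like this.
schedule = {
    "availability": {
        "frequency": "Hour",  # unit of the recurrence
        "interval": 1,        # run every 1 hour
    }
}

print(json.dumps(schedule, indent=2))
```

Selecting "run once" in the wizard skips this recurrence entirely.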
On the next page we will connect to a data source. This is the location to copy the files from. There is an option to connect to a new source, or to select an existing source if one was created earlier. I will go for a new one and select SFTP.
Once SFTP is selected, the wizard will ask for information about the SFTP server. Enter a name for the connection, the port (if other than 22), the IP address or hostname (called Service Host in Data Factory) and the user name to connect to the SFTP server. Click on Browse to upload your key file (the key file will be stored in Azure); it is used instead of a password to authenticate to the SFTP server.
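The wizard turns this form into an SFTP linked service definition. A minimal sketch of that JSON, assuming the documented "Sftp" linked service schema (the host, user name and key content below are placeholder values, not real ones):

```python
import json

# Hedged sketch of the SFTP linked service the Copy Wizard creates.
# Property names follow the "Sftp" linked service JSON schema; all
# values here are placeholders for illustration only.
sftp_linked_service = {
    "name": "SftpSource",
    "properties": {
        "type": "Sftp",
        "typeProperties": {
            "host": "sftp.example.com",          # Service Host in the wizard
            "port": 22,                          # default SFTP port
            "authenticationType": "SshPublicKey",  # key file instead of password
            "userName": "copyuser",
            "privateKeyContent": "<contents of the uploaded key file>",
        },
    },
}

print(json.dumps(sftp_linked_service, indent=2))
```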
There are some advanced security options, like a passphrase or SSH host key validation, which I will not use.
Press Next and the screen will turn grey while the wizard connects to the SFTP server with the information entered on the page.
Once the connection is established, it displays the folder structure of the SFTP server. Choose a folder by clicking it once and selecting Choose, or double-click to open the folder and select a single file.
Once you press Choose, the page displays a few options for what to do with the files. Since we only want to copy the files and not perform any data integration, check Copy Files Recursively and Binary Copy. I made these options bold because they are important: they disable the data integration options so that files are only copied.
By enabling these two options, Data Factory will simply copy the folder and its subfolders from the SFTP server to the destination that we will specify in the next steps. If needed, enable Encryption.
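In the generated pipeline, these two checkboxes map onto the copy activity: Copy Files Recursively becomes a "recursive" flag on the source, and Binary Copy means the datasets carry no format or schema, so files are copied byte-for-byte. A sketch, assuming the Data Factory copy activity JSON schema (names are approximate):

```python
import json

# Hedged sketch of the copy activity the wizard generates for a plain
# file copy. "recursive" includes subfolders; because this is a binary
# copy, no serialization format or column mapping is present.
copy_activity = {
    "type": "Copy",
    "typeProperties": {
        "source": {
            "type": "FileSystemSource",
            "recursive": True,  # Copy Files Recursively
        },
        "sink": {
            "type": "AzureDataLakeStoreSink",
            "copyBehavior": "PreserveHierarchy",  # keep the folder structure
        },
    },
}

print(json.dumps(copy_activity, indent=2))
```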
As destinations for data from SFTP, the wizard offers Azure Blob Storage, Azure Data Lake Store and a file system. I want to use the Azure Data Lake Store.
The Azure Data Lake Store needs to exist. Once Azure Data Lake Store is selected and the Next button is pressed, Azure will display a page where the Data Lake Store can be selected. For convenience I use OAuth to connect to the store.
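This step, too, produces a linked service definition. A sketch of what the Azure Data Lake Store linked service might look like when using OAuth (the URI is a placeholder, and the authorization fields are filled in by the portal during the interactive sign-in, so I only hint at them here):

```python
import json

# Hedged sketch of the Azure Data Lake Store linked service. With OAuth
# the portal obtains an authorization interactively; the token-related
# values below are placeholders that the wizard fills in itself.
adls_linked_service = {
    "name": "DataLakeDestination",
    "properties": {
        "type": "AzureDataLakeStore",
        "typeProperties": {
            "dataLakeStoreUri": "https://mystore.azuredatalakestore.net/webhdfs/v1",
            "authorization": "<OAuth authorization, filled by the portal>",
            "sessionId": "<session id, filled by the portal>",
        },
    },
}

print(json.dumps(adls_linked_service, indent=2))
```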
Once the Data Lake Store is connected, the destination folder path can be specified. Click on Browse and select the folder where the files should be placed.
The only option on this page is to enable encryption.
On the next page, two settings can be configured: Cloud Units and Parallel Copies. I will leave both set to Auto.
The last step displays a summary of all our configuration options. If everything checks out, press Next and Data Factory will configure the pipeline.
If all the checks are green, click on Monitor Copy Pipeline to see how the data is processed.
In the next article we will dive deeper into monitoring Azure Data Factory.