BlogOctober 1, 2018
Integrating FTP Into Your Azure ETL
In Azure, Data Factory is the ETL tool of choice, but have you ever tried to use Data Factory to pull data from an FTP server where you can't just move or remove the files after processing? With Data Factory alone, it can be a little challenging. If the naming convention is anything less than perfectly consistent or if the timing is unpredictable, how do you keep track of what you already collected versus what is new? This is a quick tip to help you get what you need from an FTP or SFTP server without any custom code. Just create a logic app!
Logic App has a trigger for new files on an FTP server. You can use this to identify new files and then move the content into a block blob or data lake store for further processing using PolyBase or Data Factory or whatever you choose.
There are three quick caveats… First, as of the time of this writing, the added or modified file trigger is currently limited to 50 MB of content. (https://feedback.azure.com/forums/287593-logic-apps/suggestions/17229566-increase-ftp-connector-limit-above-50mb). Second, the trigger utilizes the modification timestamp on the file rather than the timestamp of when the file hits the FTP server. Third, the FTP trigger does not allow filtering based on file name.
Fortunately, all of these caveats have a work-around solution. First, if you expect to exceed 50 MB, you can utilize the FTP file "properties only" trigger. With this trigger, you will need a second step to import the content of the file based on the path. This action does have a size limit, but it is much larger than 50 MB.
The timestamp issue can be a bigger problem. In the event that the source system creates the file before moving it to the FTP location, you need to make sure that your time window is long enough to account for the possible delay. (for example, if you poll the server every minute and the file arrives with a timestamp of now minus 5 minutes, you won't pick it up.) In most FTP scenarios, real-time data is not a top priority so aim for long enough time windows to catch all new files. Missing data sets can be difficult to troubleshoot.
The third caveat is the lack of filtering within the FTP folder you are pulling from. The screen shot below illustrates a quick solution. Logic App has a condition action that you can put in prior to collecting the file contents on new files. The condition action has a simple string filter with true and false actions. Obviously, you can pull the files you want into storage in the "if true" branch. Using the "if false" sequence, you may want to send an email alert for somebody to review unexpected files and their contents.
Utilizing a Logic App can be a helpful way to supplement Data Factory for what can potentially be lacking FTP functionality. Just remember to account for it in your ETL deployment processes.