December 12, 2018
Extracting Data from Azure Data Lake Store Using Python: Part 1 (The Extracting Part)
A file-based data lake is a principal component of a modern data architecture. As such, data professionals may find themselves needing to retrieve data stored in files on a data lake, manipulate it in some fashion, and potentially feed the result into a target data store or service. Data lakes can be built on a variety of platforms, and one of the most prominent cloud-based ones today is Microsoft Azure Data Lake Store (ADLS).
Whether you need to get ADLS data on-premises for your data flow, or need a pipeline built entirely in the cloud, you can write an app for that… and, if you're a data engineer with less of a programming background, you'll be glad to know you don't have to learn C# or Java: you can use Python!
In typical Python fashion, it's fairly straightforward to get data flowing. On the Azure side, just a few configuration steps are needed to allow connections to a Data Lake Store from an external application. In this blog, I'll coach you through writing a quick Python script locally that pulls some data from an Azure Data Lake Store Gen 1.
What do you need?
- Microsoft Azure subscription (free 30-day trials are available) with an Azure Data Lake Store Gen 1 provisioned and populated with at least one file
- Local Python installation with the azure-datalake-store library (ADLS SDK)
- Python IDE (even if it's just a text editor)
Let's configure stuff on Azure!
Assuming you already have an Azure Data Lake Store populated with data, the first thing you'll need to do is create an Azure Active Directory App Registration (which is analogous to a service account) in order to allow other applications/services to access your Data Lake Store. Every application needs three pieces of information for authentication:
- Application ID (a.k.a. Client ID)
- Application Key (a.k.a. Client Secret)
- Directory ID (a.k.a. Tenant ID)
Register the Application
- Get started by visiting the Azure Active Directory page and clicking the "New application registration" button.
On this screen, give your app registration a name and a URL, and select "Web app / API" as the Application Type (as we will be using service-to-service authentication, as opposed to end-user authentication).
Once the resource is created, the Overview blade for the app will appear. On it you'll find the first piece of information needed for authenticating as this app: the Application ID (a.k.a. the Client ID). One down, two to go!
The next value necessary for authentication is found under the Settings tab, followed by the Keys option.
Here, you'll need to generate a new Application Key (a.k.a. Client Secret) by typing in the Description box, choosing an expiration date, and hitting Enter.
Save the key somewhere safe, as you'll never be able to view it again (although you could always generate a new one if needed). Two down!
The last bit of authentication information is found back on the Azure Active Directory page, this time in the Properties menu. Copy the Directory ID (a.k.a. the Tenant ID) for safekeeping.
Grant the Application Access
Now that you have an application registration, you'll need to grant it access to your Data Lake Store. (Note: The way permissions behave and the options available for ADLS are similar for other Azure resources.)
- Navigate to the Data Lake Store, click Data Explorer, and then click the Access tab.
Choose Add, locate/search for the name of the application registration you just set up, and click the Select button.
On this blade, you have several options:
- The first deals with the type of permissions you want to grant: Read, Write, and/or Execute. For our purposes, you need read-only access to the ADLS for your Python application. Therefore, one might assume only Read access is needed, which is true if your application will directly open a specific file (i.e., you already know the exact location and name of the file). However, if you need to traverse files on the ADLS for any reason (such as wanting to pull all files with a particular extension), the application also needs Execute access. Check the Read and Execute permission boxes for this use case.
- The second option determines the scope of the permissions selected. Choose the "This folder and all children" radio button; otherwise, only items in the root directory will be available to applications using this registration.
- The last option asks if you want to treat this permission setting as a one-off action, as just a default for all future files/folders created in this ADLS, or both. For this scenario, as in most cases, choosing both is the most prudent way to go.
With that, you're all set from the Azure side!
Bored with Configuration? Enter Python.
Get the SDK
To access the ADLS from Python, you'll need the ADLS SDK package for Python. Through the magic of the pip installer, it's very simple to obtain. In any console/terminal (such as Git Bash or PowerShell for Windows), type the following command to install the SDK. (Note that this assumes you have added your Python installation to the PATH system environment variable, or are using an activated Python virtual environment.)
pip install azure-datalake-store
Commence the Coding
In any Python IDE (Jupyter Notebooks are used below), you can now put this package to use.
- Start by importing two specific components from the ADLS SDK and calling the appropriate function (the lib.py file's auth function) to obtain a credentials object for your Azure Active Directory, using the Tenant ID, Client ID, and Client Secret copied earlier from Azure. (Note that the auth function grants access to Azure, not to your ADLS or any specific resource.)
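A minimal sketch of the authentication step, assuming the three values are kept in environment variables. The variable names (AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET) and both helper functions are hypothetical and chosen only for illustration; the lib.auth call with tenant_id, client_id, and client_secret keyword arguments is the service-to-service form offered by the azure-datalake-store package. The SDK import sits inside the function so the helpers can be defined even where the package is not installed.

```python
import os


def load_auth_settings(env=os.environ):
    """Collect the three app-registration values from a mapping.

    The variable names are hypothetical; use whatever your setup provides.
    """
    names = {
        "tenant_id": "AZURE_TENANT_ID",
        "client_id": "AZURE_CLIENT_ID",
        "client_secret": "AZURE_CLIENT_SECRET",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing settings: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


def get_credentials(env=os.environ):
    """Authenticate against Azure Active Directory and return a credentials object."""
    # Imported lazily so this module loads even without the SDK installed.
    from azure.datalake.store import lib

    return lib.auth(**load_auth_settings(env))
```

Keeping the Client Secret out of the script itself is a design choice rather than a requirement; hard-coded values would work the same way but are easier to leak.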
To authenticate to your ADLS to perform file system operations, instantiate an object of the core.py file's AzureDLFileSystem class, passing in your credentials object and the name of your ADLS. This will return a connection object that can then be used in conjunction with other functions in the azure-datalake-store library for interaction with the ADLS files. (Note that authenticating in this manner only permits file system operations; it does not allow for account/access management of the ADLS, which requires a different series of commands and credentials not covered in this tutorial.)
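As a rough sketch of the connection step, the helper below wraps the AzureDLFileSystem constructor, passing the credentials object and the store name through its store_name keyword. The function name connect_to_store is hypothetical, and the store name you pass would be the name of your own Data Lake Store.

```python
def connect_to_store(credentials, store_name):
    """Return a file system client for the named Data Lake Store (Gen 1)."""
    # Imported lazily so the function can be defined without the SDK installed.
    from azure.datalake.store import core

    return core.AzureDLFileSystem(credentials, store_name=store_name)
```

The returned object is what the remaining steps call listdir and open on.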
Though you can download an ADLS file to your local hard drive through a simple call to the multithread.py file's ADLDownloader class, you'll instead be reading a file into memory for instant manipulation in Python. Start with the AzureDLFileSystem class's listdir method, which works much like the standard os library's listdir function. When run without arguments, it returns a list of folders and files at the root level of the ADLS. When passed a specific folder path, it returns the list of folders and files at that path.
Of course, the above only works if you want a specific known folder's contents. Let's say instead that you want all files at the level below the root. To this end, you can traverse the files within the root folders using a for loop. If no files exist at the path passed to the listdir method, an empty list is returned, but there is no need for an if statement: extending a list with an empty list via the list class's extend method adds nothing, so empty folders simply drop out when the file lists returned by listdir are combined.
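The traversal described above might look something like the following sketch. It assumes adl is an AzureDLFileSystem connection (or any object with a compatible listdir method) and that the root level holds only folders. The function name and the optional suffix filter are additions for illustration, not part of the SDK.

```python
def files_one_level_down(adl, suffix=None):
    """Collect every entry found directly inside the root-level folders."""
    found = []
    for folder in adl.listdir():  # entries at the root of the store
        # An empty folder yields [], and extending by [] adds nothing,
        # so no explicit emptiness check is needed.
        found.extend(adl.listdir(folder))
    if suffix is not None:
        # Optionally keep only paths with a particular extension.
        found = [path for path in found if path.endswith(suffix)]
    return found
```

Calling files_one_level_down(adl, suffix=".json") would then return only the JSON files one level below the root.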
Now that you have your file name, you can read the file into memory using the open method of the AzureDLFileSystem class, just as you would with the Python built-in function open. You can even call it within a context manager to automatically close the file after you obtain its contents. The difference here is that you are limited to reading the file as a bytes object, rather than text/string, as you can see after calling the opened file's read method and then the built-in type function. Printing the actual file contents shows that it appears to be in JSON format; however, you can't manipulate it like JSON until you convert it.
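A short sketch of the read step, again assuming adl is the connection object created earlier and path is one of the paths returned by listdir. The helper name read_file_bytes is hypothetical.

```python
def read_file_bytes(adl, path):
    """Read a whole file from the store into memory as a bytes object."""
    # "rb" is binary read mode; the context manager closes the handle on exit.
    with adl.open(path, "rb") as handle:
        return handle.read()
```

Passing the result to the built-in type function should report bytes rather than str, which is why a decoding step is still needed before the contents can be treated as text.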
Depending on the characters in your file, going directly from bytes to JSON can throw an exception, so you should first convert to string and then to JSON. To accomplish the former, call the decode method of the bytes object and provide an encoding type (utf-8 will usually do). Next, call the json library's loads function, which takes the string as input and returns a dictionary, as shown by the results below. (Note that you need to import the json library in order to use it.)
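The two conversions can be sketched as follows. The sample bytes are made up for illustration and stand in for whatever the read call returned from your own file; the helper name bytes_to_json is likewise hypothetical.

```python
import json


def bytes_to_json(raw, encoding="utf-8"):
    """Decode bytes to a string, then parse that string as JSON."""
    text = raw.decode(encoding)  # bytes -> str
    return json.loads(text)      # str -> dict, for a JSON object


# Hypothetical file contents, standing in for the bytes read from the ADLS.
sample = b'{"title": "Example", "authors": [{"name": "A. Writer"}]}'
parsed = bytes_to_json(sample)
print(type(parsed))
```

If the file was written with a different encoding, pass that encoding instead of utf-8; a mismatch typically surfaces as a UnicodeDecodeError from decode.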
With the file contents accessible in JSON/dictionary format, you can provide any key you want and get back its value. For example, here is how you would get the first author's name from the file contents.
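Since the layout of the file is not reproduced here, the sketch below assumes a hypothetical structure: a top-level "authors" key holding a list of objects that each have a "name" key. Adjust the keys to match your own file.

```python
def first_author_name(document):
    """Return the first author's name, or None if no authors are listed.

    Assumes a hypothetical layout: {"authors": [{"name": ...}, ...]}.
    """
    authors = document.get("authors", [])
    return authors[0]["name"] if authors else None


# Hypothetical parsed contents, for illustration only.
example = {"authors": [{"name": "A. Writer"}, {"name": "B. Editor"}]}
print(first_author_name(example))
```

Using get with a default avoids a KeyError when the key is absent, which can be handy when files in the lake do not all share the same shape.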
At this point, you can do just about anything with this data, including transforming it (whether using pandas DataFrames or native Python classes) and/or outputting it to a file or database.
Now that you have the basics of the configuration and coding down, you can start to explore aspects a developer should consider when designing a full-fledged ETL/ELT pipeline with Azure Data Lake Store and Python. Stay tuned to CapTechConsulting.com for Part 2 of our foray into this subject coming soon!