March 13, 2019
Extracting Data from Azure Data Lake Store Using Python: Part 2 (The Design Part)
In Part 1, we covered the technical basics of pulling data from Azure Data Lake Store (ADLS) using Python. We learned a little about how ADLS handles access for external applications, set up the credentials required for a Python application to connect to ADLS, and wrote some Python code to read from files stored on the ADLS. In Part 2, we expand our view to consider several important design decisions for building a production data pipeline between ADLS and another data store.
Small files are not your friends.
In the Part 1 code, we were dealing with one very small file, but what changes when we have much larger files or a high number of small files?
As is well documented about big data systems, Azure Data Lake Store exhibits the “small files” problem--performance suffers from having to read many small files, as opposed to only a few large files. This is because there is a certain amount of overhead incurred with opening each file in memory; therefore, more files to open means more time spent trying to get to their data, no matter their file size. Microsoft prescribes an average file size on ADLS for optimal performance--as of this writing, 256 MB to 2 GB each. We can take Microsoft’s guidelines at their word, but let’s explore this a bit to see the impact of file size and number first-hand.
In our exercise, we’ll start with a single 25 MB file. Putting it through the same “open-convert-read” code we had back in Step 6 of the Part 1 blog, except with an additional step of reading each data row into a list object post-conversion, results in a cumulative runtime of 6.61 seconds.
If the single 25 MB file were split into 100 separate 250 KB pieces and read in the same way as the original file, we see a real difference in execution time--33.8 seconds. This is about 5 times more expensive an operation than with just a single file.
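If you want to reproduce a comparison like this yourself, the sketch below times reading one large file versus many small files from ADLS. It is only an illustration: the tenant/client credentials, paths, and folder layout are placeholders, and it assumes the same JSON file structure ("data" rows) used later in this post.

import json
import time

from azure.datalake.store import core, lib

# Authenticate to ADLS as in Part 1 (tenant, client, and secret values are placeholders)
token = lib.auth(tenant_id='<tenant_id>',
                 client_id='<client_id>',
                 client_secret='<client_secret>')
adls_connection = core.AzureDLFileSystem(token, store_name='<store_name>')


def read_json_rows(path):
    """Open a JSON file on the ADLS, convert it, and read its data rows into a list."""
    with adls_connection.open(path, 'rb') as f:
        file_json = json.loads(f.read().decode('utf-8'))
    return [row for row in file_json['data']]


# Time reading one large file (placeholder path)
start = time.perf_counter()
rows = read_json_rows('/data/large_file.json')
print('Single file:', time.perf_counter() - start, 'seconds')

# Time reading many small files from a folder (placeholder path)
start = time.perf_counter()
rows = []
for path in adls_connection.listdir('/data/small_files'):
    rows.extend(read_json_rows(path))
print('Many small files:', time.perf_counter() - start, 'seconds')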
Not only does the “small files” problem pose an issue for file access, it also impacts storage on the ADLS. Every file takes up a minimum amount of space by default (256 KB at the time of this writing), so, if a lot of files are smaller than this limit, a potentially nontrivial amount of extra space on the ADLS will be used up, leading to higher costs. Should this extra space consumption become significant enough, writing a consolidation program that combines small files together may be a good approach to lower the non-value-add storage.
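If that extra storage does become a concern, a consolidation program can be as simple as the sketch below, which reuses the authenticated adls_connection from the previous sketch and appends each small file into one combined file. The paths are placeholders, and it assumes the source files are plain text or CSV that can safely be concatenated.

# Hypothetical sketch: combine many small ADLS files into one larger file
source_folder = '/data/small_files'            # placeholder path
combined_path = '/data/combined/combined.csv'  # placeholder path

with adls_connection.open(combined_path, 'wb') as combined_file:
    for path in adls_connection.listdir(source_folder):
        with adls_connection.open(path, 'rb') as small_file:
            combined_file.write(small_file.read())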
Whenever possible in your data environment, you will want to maintain a smaller number of files to limit both storage footprint on ADLS and execution time in Python.
What if I need to transform the data while in Python?
For source data that is in JSON, CSV, and many other file formats, the Pandas library offers the extremely useful DataFrame class for manipulating data in an in-memory, table-like object. Operations such as filtering, joining, stacking, and aggregating are all possible with DataFrames. See the Pandas documentation for more information.
Alternatively, if you only need light transformations or just a sub-selection of your input data, standard Python object types, such as lists, tuples, and dictionaries, may suffice.
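As a small illustration, the sketch below loads the rows and column names from a JSON structure shaped like the sample file used later in this post into a DataFrame, then applies a filter and an aggregation. The column names and filter condition are made up.

import pandas as pd

# Hypothetical data in the same shape as the sample file used later in this post
file_json = {
    'meta': {'view': {'columns': [{'name': 'region'}, {'name': 'amount'}]}},
    'data': [
        ['East', 100],
        ['West', 250],
        ['East', 75]
    ]
}

# Build a DataFrame from the data rows and column names
column_names = [column['name'] for column in file_json['meta']['view']['columns']]
df = pd.DataFrame(file_json['data'], columns=column_names)

# Example transformations: filter rows, then aggregate
east_only = df[df['region'] == 'East']
totals_by_region = df.groupby('region', as_index=False)['amount'].sum()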
What are my options for landing Azure data in an on-premises database?
If your need is to send data from ADLS to an on-premises database, you have several options. The two main ones are to go directly from Python to database (all in-memory) or to first export to a file, followed by executing a bulk load.
- In-Memory
  - Non-parameterized SQL INSERT statements
  - Parameterized SQL INSERT statements
  - SQLAlchemy
- To Disk First
  - Bulk load SQL statement
The decision to go in-memory or to-disk should be influenced by the speed required, whether or not the target database has a bulk load capability, and whether or not the data needs to be transformed before landing in the target database.
For workloads requiring speed, you may want to consider the to-disk + bulk load option, as that kind of load will generally outperform executing individual INSERT statements.

When speed is not a huge concern or the data is limited, the in-memory SQL INSERT statement option works well. This involves constructing a SQL INSERT statement for each row of data to be loaded and executing those statements on the database from Python using a compatible Python database package. Within this option, you can elect to create and run either parameterized INSERT statements or non-parameterized INSERT statements. The former is usually the more prudent choice, especially if you have a mixture of data types across columns or the possibility of adding more columns to the INSERTs in the future. This is because the parameterized option automatically handles wrapping or not wrapping values in single quotes depending on the data type of each column. (It even handles nuances in individual row values, such as a NULL in a string column, in which case no single quotes are used.) Note that, if you encounter errors attempting parameterized statements, it may be due to an out-of-date ODBC driver that does not support them. For SQL Server, only the “SQL Server Native Client…” or “ODBC Driver for SQL Server…” drivers allow parameterized inserts. An example of parameterized insert code using pyodbc is shown below:
import pyodbc

# Sample JSON file contents (i.e., data to be loaded)
file_json = {
    'meta': {'view': {'columns': [
        {'name': 'col1'}, {'name': 'col2'}, {'name': 'col3'}, {'name': 'col4'}, {'name': 'col5'}
    ]}},
    'data': [
        ['r1v1', 'r1v2', 'r1v3', 'r1v4', 'r1v5'],
        ['r2v1', 'r2v2', 'r2v3', 'r2v4', 'r2v5']
    ]
}

# Database information
database_schema_name = "dbo"
database_table_name = "SampleTable"
columns = ["Col1", "Col2", "Col3", "Col4", "Col5"]

# SQL INSERT statement shell (with placeholders)
database_insert_sql = "INSERT INTO {}.{} ({}) VALUES ({})"

# Populate the placeholders with the schema and table name, a comma-delimited string of column names,
# and a comma-delimited string of question marks for the values (one parameter marker per column)
insert_sql = database_insert_sql.format(
    database_schema_name,
    database_table_name,
    ",".join(column for column in columns),
    ",".join("?" for column in columns)
)

# Connect to the database using the correct connection string
conn = pyodbc.connect("<SQL CONNECTION STRING>")
cursor = conn.cursor()

# Execute the INSERT statement for each row of data (the statement is the first argument and the
# row of values substituted into the parameter markers is the second argument)
for row in file_json["data"]:
    cursor.execute(insert_sql, row)

# Commit the inserts and close the connection
conn.commit()
conn.close()
If the target database platform allows it (as SQL Server does), you could create and execute multi-row INSERT statements in lieu of standard single-row INSERTs. Keep in mind that there is a limit to the number of rows you can insert in one statement, so be sure to architect looping logic if a given run of your data flow could ever exceed that row count.
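As a rough illustration of that looping logic, the sketch below batches rows into multi-row, parameterized INSERT statements against a SQL Server target. The data, batch size, and table names are placeholders; the batch size is kept small because SQL Server limits a single VALUES clause to 1,000 rows and a statement to roughly 2,100 parameters.

import pyodbc

# Hypothetical sketch: batch rows into multi-row, parameterized INSERT statements
BATCH_SIZE = 250  # made-up value, chosen to stay under SQL Server's row and parameter limits

rows = [['r1v1', 'r1v2'], ['r2v1', 'r2v2'], ['r3v1', 'r3v2']]  # placeholder data
columns = ["Col1", "Col2"]
schema_name, table_name = "dbo", "SampleTable"

conn = pyodbc.connect("<SQL CONNECTION STRING>")
cursor = conn.cursor()

for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    # One "(?,?,...)" group per row in this batch
    placeholders = ",".join("(" + ",".join("?" for _ in columns) + ")" for _ in batch)
    insert_sql = "INSERT INTO {}.{} ({}) VALUES {}".format(
        schema_name, table_name, ",".join(columns), placeholders
    )
    # Flatten the batch into one parameter list matching the placeholders
    params = [value for row in batch for value in row]
    cursor.execute(insert_sql, params)

conn.commit()
conn.close()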
When you need the functionality offered by Pandas DataFrames, you may want to consider using SQLAlchemy or the file-to-bulk-load method. SQLAlchemy is a Python library that provides an object-oriented way to interact with a database, effectively translating Python code into SQL statements compatible with the specified database and executing them on that database. Its functionality can be accessed via Pandas DataFrames; for writing a DataFrame to a database table, use the to_sql method. If the performance is not acceptable, there are ways to speed it up, including dividing your DataFrame into smaller sub-DataFrames and running to_sql on each, or directing to_sql to use pyodbc's fast_executemany option (see the pyodbc documentation for more information). If you want to dump to a file rather than use SQLAlchemy, Pandas DataFrames feature the to_csv method, which writes a CSV from your DataFrame in one line. Once the file is created, you can execute a bulk load SQL command (still within your Python program), feeding it the path to the file you just exported. That way, you take advantage of the speed offered by bulk loading.
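A minimal sketch of both approaches is shown below, assuming a SQL Server target reachable through pyodbc and SQLAlchemy. The connection strings, file path, table names, and sample data are placeholders, and the bulk load statement uses SQL Server syntax; other platforms have their own bulk load commands.

import pandas as pd
import pyodbc
import sqlalchemy

# Hypothetical DataFrame to be loaded (placeholder data)
df = pd.DataFrame({"Col1": ["a", "b"], "Col2": [1, 2]})

# Option 1: DataFrame straight to the database via SQLAlchemy,
# with pyodbc's fast_executemany enabled for speed (requires a recent SQLAlchemy
# and a Microsoft ODBC driver)
engine = sqlalchemy.create_engine(
    "mssql+pyodbc://<user>:<password>@<dsn>",  # placeholder connection URL
    fast_executemany=True
)
df.to_sql("SampleTable", engine, schema="dbo", if_exists="append", index=False)

# Option 2: dump the DataFrame to a CSV file, then bulk load it
csv_path = "C:/exports/sample_table.csv"  # placeholder path visible to the database server
df.to_csv(csv_path, index=False)

conn = pyodbc.connect("<SQL CONNECTION STRING>", autocommit=True)
cursor = conn.cursor()
# BULK INSERT with FORMAT = 'CSV' is SQL Server 2017+ syntax; FIRSTROW = 2 skips the header row
cursor.execute(
    "BULK INSERT dbo.SampleTable FROM '{}' "
    "WITH (FORMAT = 'CSV', FIRSTROW = 2)".format(csv_path)
)
conn.close()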
Whether you go the in-memory route or the to-disk route, you will need to use a Python package that enables connection to your target database for the statement(s) that actually load the data. Certain databases have their own connection packages (such as MySQLdb for MySQL or psycopg2 for PostgreSQL), but any database that can use ODBC as the connection mechanism will work with the pyodbc package, provided you have the appropriate ODBC driver installed.
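For reference, a pyodbc connection to SQL Server typically looks like the following; the server, database, and credential values are placeholders, and the driver name must match an ODBC driver actually installed on the machine.

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server_name>;"
    "DATABASE=<database_name>;"
    "UID=<username>;"
    "PWD=<password>"
)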
Make your code modular.
When organizing your application code, you will want to consider the execution method, reusability, modularization, and the skills of those who will be supporting the application.
From the command line, you can execute the application either directly as a script or as a module (with python -m). The latter requires a particular organization of your files, but it enables the import and use of your application code in other Python scripts and applications.
Another key decision point is choosing a programming paradigm--object-oriented or procedural. Unlike many other programming languages, such as C++, Java, or C#, Python allows you to use either one or a blend of both. To establish a baseline understanding of these terms: object-oriented code divides attributes and behaviors into separate classes that exist independently from one another but can be used collaboratively to form a complete program. Procedural code, on the other hand, contains all the attributes and behaviors of its object-oriented equivalent but is written in a much flatter way, with no concept of class hierarchy, inheritance, or encapsulation--instead of being tied to particular classes, attributes are just variables, and behaviors are just functions that can be used anywhere in the program. Each paradigm has its merits, depending on the situation. For software development, object-oriented is usually the way to go, but for data integration programs, consider modularizing your code (whether or not that means explicit object orientation) while avoiding excessive hierarchies and abstraction, especially if those responsible for supporting the application have more of a procedural background. If the development/support team has an object-oriented bent, then feel free to lean more in that direction--with Python, you have that flexibility!
In the ADLS-to-database use case, a solid approach would be to create a class for code dealing with connecting and authenticating to the ADLS and another class for code dealing with database interaction, leaving the main code file(s) reserved for the nuances of your particular application. In this way, your ADLS class and your database class can be reused across any number of Python programs and also more easily reused within your main program. If you will be building multiple similar applications that pull from different ADLS sources and/or load to different targets, think about making a generic parent class to contain the common functionality and a child class to contain the specific functionality for each application.
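As a rough illustration of that structure, the skeleton below (a sketch, not this article's original code) separates ADLS access from database access; the class names, method names, and constructor parameters are assumptions. A main script can then instantiate these classes and focus purely on the application-specific flow between them.

import pyodbc
from azure.datalake.store import core, lib


class AdlsClient:
    """Hypothetical class wrapping ADLS authentication and file access."""

    def __init__(self, tenant_id, client_id, client_secret, store_name):
        token = lib.auth(tenant_id=tenant_id,
                         client_id=client_id,
                         client_secret=client_secret)
        self.adls_connection = core.AzureDLFileSystem(token, store_name=store_name)

    def read_file(self, path):
        # Return the raw bytes of a file stored on the ADLS
        with self.adls_connection.open(path, 'rb') as f:
            return f.read()


class DatabaseClient:
    """Hypothetical class wrapping connection and loading for the target database."""

    def __init__(self, connection_string):
        self.connection = pyodbc.connect(connection_string)

    def insert_rows(self, insert_sql, rows):
        # Execute a parameterized INSERT for each row and commit once at the end
        cursor = self.connection.cursor()
        for row in rows:
            cursor.execute(insert_sql, row)
        self.connection.commit()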
Coding is cool, but what about testing?
Just like any other piece of Python code, you can write unit tests for your methods that call Azure SDK code; however, each time a test is run against Azure, a varying amount of expense may be incurred depending on the actions being performed. Although these costs are fairly minuscule, especially if you keep your test data small, you might want to minimize them. To this end, you can use the Python unittest library's mocking capabilities to intercept any calls the method you're testing makes to Azure and directly provide an expected return value. This allows for testing your method's logic only, as opposed to testing an already-tested ADLS SDK method or requiring test data to actually exist on your ADLS. Furthermore, it prevents costs from piling up due to repeatedly running your unit tests, while still fully testing the intent of your code.
Let's assume you have written a method (get_files) that gets only the CSV files contained within the parent directories on an ADLS and returns them as a list. To obtain a list of files and folders on the ADLS, the get_files method calls the listdir method of the ADLS SDK's AzureDLFileSystem class--first at the top level and again at the second level--to find only those files that exist at the second level. It then adds only the CSV files to a list object (by using the endswith method of the string object) and returns that list.
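The original code is not reproduced in this post, so the following is an approximation of what get_files might look like, based on the description above. Assume it lives on a class like the AdlsClient sketched earlier, defined in a hypothetical module named adls_module, with the authenticated connection stored as self.adls_connection.

def get_files(self):
    """Return only the CSV files found one level below the top-level ADLS folders."""
    csv_files = []
    for folder in self.adls_connection.listdir("/"):       # top-level folders
        for item in self.adls_connection.listdir(folder):  # second-level contents
            if item.endswith(".csv"):                       # keep CSV files only
                csv_files.append(item)
    return csv_files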
Below is an example of a Python unit test for this method.
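(The original test code is likewise not reproduced here; the sketch below is an approximation built from the explanation that follows, assuming get_files lives on the AdlsClient class in the hypothetical adls_module.)

import unittest
from unittest import mock

import adls_module  # hypothetical module containing the AdlsClient class


class GetFilesTest(unittest.TestCase):

    @mock.patch('adls_module.core.AzureDLFileSystem')
    @mock.patch('adls_module.lib.auth')
    def test_get_files_returns_only_second_level_csv_files(self, mock_auth, mock_adls):
        # lib.auth is not used directly by get_files, so its return value does not matter
        mock_auth.return_value = ''
        # Force listdir to return made-up results: first call = top level, second call = second level
        mock_adls.return_value.listdir.side_effect = [
            ['Folder1'],
            ['Folder1/File1.csv', 'Folder1/File2', 'Folder1/EmptyFolder1']
        ]

        client = adls_module.AdlsClient('<tenant_id>', '<client_id>', '<client_secret>', '<store_name>')
        result = client.get_files()

        # Only the one CSV file at the second level should be returned
        self.assertEqual(result, ['Folder1/File1.csv'])


if __name__ == '__main__':
    unittest.main()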
The @mock.patch decorators above the test method definition override, or mock, the behavior of the ADLS authentication. When calling mock.patch as a decorator, the class, function, or method being mocked in the decorator can be added as an argument to the test (i.e., mock_auth and mock_adls in our example). Each mocked object can then be given a hard-coded return value when the object is called during test execution. In this case, we are mocking both the lib.auth method and the core.AzureDLFileSystem class, which come from the ADLS SDK. The first one's return value does not matter since it is not called directly in get_files, so we just set it to an empty string. The second mocked object represents the ADLS connection ("adls_connection" in the get_files method above), so it has a listdir method. In the test, this method can be forced to return made-up results in the format it would return normally if not mocked. Recall that the get_files method calls listdir twice--once for the top level of the ADLS and once for the second level. As a result, we set two return values in the test--"Folder1" for the top level and "Folder1/File1.csv," "Folder1/File2," and "Folder1/EmptyFolder1" for the second level. Because get_files is only concerned with CSV files at the second level, it will ignore the top-level "Folder1" folder, the second-level non-CSV file, and the second-level empty folder and return just the one CSV file, which matches the expected result of the test.
By mocking the parts of the code that access ADLS, you prevent unit tests from ever hitting Azure and, thereby, your next billing statement. As long as ADLS code is mocked with return values that match the format of their actual, un-mocked return values, these tests prove useful in validating the logic of your code.
In Closing...
Using Python for an Azure Data Lake Store ETL solution can be a perfect fit, but it takes some careful planning to avoid headache and development churn. Remember to keep in mind the overhead involved with opening many small files from ADLS; factor in speed, bulk load capability, and data transformations when determining how to load to your target database; make your code modular, reusable, and easy to follow; and always be thinking about testing and costs.