BlogJune 4, 2019
Traversing Parent-Child Hierarchies Using Ab Initio
Parent-child hierarchies can be found in almost every type of back-end data structure. Common examples include an employee database where each employee (all the way up to the CEO) has another employee as their manager, as well as an organization database where each organization has a parent organization. In relational databases, these hierarchies are often manifested within a single dimension table. This is done by including a parent key column in the dimension table that is a foreign key referencing the primary key of the dimension table itself. For example, the DimEmployee table (which can be found in the AdventureWorksDW2017 database) has an EmployeeKey column which is the primary key, several columns containing employee information, and a ParentEmployeeKey column containing the EmployeeKey of the employee’s parent employee (or manager). The ParentEmployeeKey column is the foreign key that references the primary key of the table itself, as each parent employee is an employee themselves.
Structuring the data this way gives access to all of the parent’s information at the child’s level by utilizing self-joins in a SQL query. A query like this can be furthered to also extract the grandparent’s (the parent of the parent’s) information or any entity’s information that is in the child’s hierarchy. We use a query like this below on the DimEmployee table to extract the FirstName and LastName attributes of a specified employee, their parent employee, and their grandparent employee.
SELECT child.EmployeeKey, child.FirstName, child.LastName, parent.EmployeeKey AS ParentEmployeeKey, parent.FirstName AS ParentFirstName, parent.LastName AS ParentLastName, grandparent.EmployeeKey AS GrandparentEmployeeKey, grandparent.FirstName AS GrandparentFirstName, grandparent.LastName AS GrandparentLastName FROM [AdventureWorksDW2017].[dbo].[DimEmployee] child LEFT OUTER JOIN [AdventureWorksDW2017].[dbo].[DimEmployee] parent ON child.ParentEmployeeKey = parent.EmployeeKey LEFT OUTER JOIN [AdventureWorksDW2017].[dbo].[DimEmployee] grandparent ON parent.ParentEmployeeKey = grandparent.EmployeeKey WHERE child.EmployeeKey = 1;
Traversing Parent-Child Hierarchies
Traversing a parent-child hierarchy, or accessing each level of a certain child’s hierarchy in order, is sometimes needed for business processes when information is needed at each level of the hierarchy. Traversing through a hierarchy can easily be done using SQL statements to loop through each level until it reaches the top. The top of a hierarchy can be determined in a lot of different ways. For example, in the DimEmployee table when an employee has a null ParentEmployeeKey, the top of the hierarchy has been reached. Shown below are SQL statements that traverse through a certain employee’s hierarchy and prints out the EmployeeKey and ParentEmployeeKey at each level.
DECLARE @child int DECLARE @parent int SET @child = 1 SELECT @parent = ParentEmployeeKey FROM DimEmployee WHERE EmployeeKey = @child WHILE(@child is not null) BEGIN PRINT N'EmployeeKey: ' + CAST(@child AS VARCHAR) + N', ParentEmployeeKey: ' + COALESCE(CAST(@parent AS VARCHAR), 'NULL') SET @child = @parent SELECT @parent = ParentEmployeeKey FROM DimEmployee WHERE EmployeeKey = @child END
While the ideal scenario above can be handled through simple SQL statements, often the situation is much more complicated. On a recent ETL project of mine which relied heavily on hierarchical structures, some member data was located in an Oracle database while the rest was stored in an MS SQL Server database. Additionally, the core ETL process in development required users to be locked out of the front-end system while it ran. Processing speed was a top priority as users could not be locked out for an extended period of time. Part of this process needed to access certain member attributes as it traversed the parent-child hierarchical member data. However, some of these attributes were stored in the Oracle database, while others were stored in the MS SQL Server database. It became obvious quickly that using a SQL script that looped through the hierarchy was not a viable option.
Luckily, the ETL process in development was being built in Ab Initio. Ab Initio is a GUI based ETL tool that runs natively on a dedicated Linux server. Much like other ETL tools such as SSIS and Informatica, it allows the ability to drag and drop different components onto a graph that executes common ETL tasks such as aggregating, filtering, partitioning, and reformatting on a dataset. These components can be executed in different phases so that different components have to be completed before other components begin processing. An example of an Ab Initio graph containing various inputs, components, and outputs can be found below.
Ab Initio provides two major benefits for traversing hierarchies. For one, it allows you to extract data from multiple servers and store them in data files written to disk on the ETL server. This means that in our case, the ETL processing engine would have access to both sources of member data, as it would be stored locally, needed for the process. Secondly, Ab Initio allows for high levels of data parallelism, where data can be partitioned so it can be processed simultaneously. This allows for processing to take substantially less time.
Using Ab Initio, traversing the hierarchy and accessing both sources of data was very straightforward. In the first phase of our Ab Initio graph, we simply extracted the two sets of member data into sorted lookup files. In the second phase of the graph, we prepared the data to be processed. The data was transactional with a member associated with each record. Before traversing the hierarchy, we made use of Ab Initio’s data parallelism and split the data into eight partitions. With the data able to be processed in all eight partitions simultaneously, Ab Initio would perform much faster than other solutions. Additionally, this phase applied some filters and aggregations to the data in accordance with business rules to prepare for the third phase of processing.
The actual traversing of the hierarchy occurred during the graph’s third phase. Ab Initio’s native programming language has constructs similar to T-SQL. So, we were able to implement a looping lookup, very similar to the SQL code above, in order to traverse the member hierarchy of the member associated with each row. However, instead of performing SQL queries to find the member attributes we needed, we used a key to find the attributes in the local lookup files. Since this data was written to disk on the ETL server and sorted by a predetermined lookup key, we could quickly access the data we needed using the key. This method of accessing the data is much faster than performing a SQL query that searches the entire database each time. This was all done inside of a reformat component, which allows a flow of data to be reformatted or transformed into the output you specify. Finally, in the fourth and last phase of the graph, an output file was loaded that contains the desired output.
Ab Initio’s functionality allowed us to combine the best of both approaches: multiple data sources and local files of an ETL tool, and the straightforward logical programming of a SQL script. Additionally, it was able to complete this part of the process on over 10 million rows of transactional data in under one minute. Needless to say, while traversing a parent-child hierarchy can be done using SQL statements, Ab Initio provides an easier and faster solution in more complicated real-world situations.