A team of CapTech data engineers was engaged to collaborate with the restaurant on redesigning its ServiceNow Extract, Transform, Load (ETL) pipeline. The team began by decoupling the extraction steps from the transformation and load steps, which reduced the runtime of the daily incremental job by more than two hours for most tables.
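As a rough illustration of that decoupling, the sketch below separates the ServiceNow pull from the downstream transform/load hand-off as two independent Airflow tasks, so a slow transform no longer blocks the next extract. The instance URL, bucket, Glue job name, and table are hypothetical stand-ins, not the client's actual code.

```python
# Illustrative sketch: extraction decoupled from transform/load as two
# separate Airflow tasks that hand off by S3 location, not by payload.
import json
from datetime import datetime

import boto3
import requests
from airflow.decorators import dag, task

SN_INSTANCE = "https://example.service-now.com"  # hypothetical instance
RAW_BUCKET = "my-raw-landing-bucket"             # hypothetical bucket


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def incident_incremental():
    @task
    def extract(data_interval_start=None) -> str:
        """Pull only records updated since the last run via the Table API."""
        since = data_interval_start.strftime("%Y-%m-%d %H:%M:%S")
        resp = requests.get(
            f"{SN_INSTANCE}/api/now/table/incident",
            params={"sysparm_query": f"sys_updated_on>{since}"},
            auth=("user", "password"),  # use a secrets backend in practice
            timeout=300,
        )
        resp.raise_for_status()
        key = f"raw/incident/{data_interval_start.strftime('%Y%m%dT%H%M%S')}.json"
        boto3.client("s3").put_object(
            Bucket=RAW_BUCKET,
            Key=key,
            Body=json.dumps(resp.json()["result"]),
        )
        return key  # downstream task receives the location, not the data

    @task
    def transform_and_load(raw_key: str) -> None:
        """Kick off the Glue job that transforms and writes to the lake."""
        boto3.client("glue").start_job_run(
            JobName="incident_transform",  # hypothetical Glue job name
            Arguments={"--raw_key": raw_key},
        )

    transform_and_load(extract())


incident_incremental()
```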
Next, the pipeline was standardized and refactored. Previously, extraction for the incremental job ran entirely in Python within the Airflow environment, while the transformation and load steps ran in an EMR job. The weekly full refreshes, in contrast, performed the entire ETL process in batch jobs, one per table. CapTech’s team standardized the process and refactored the code so that both processes use batch jobs for extraction and AWS Glue for transformation and writing to the data lake. The incremental job was also split into individual jobs at the table level, allowing greater flexibility in error handling and scheduling, since some of the smaller tables needed to be updated more often than once a day.
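A hedged sketch of what that table-level separation can look like in Airflow: one generated DAG per table, each with its own schedule, so a failure or retry on one table never blocks the others. The table list, cadences, and AWS Batch/Glue job names below are assumptions for illustration, not the client's configuration.

```python
# Illustrative sketch: one Airflow DAG per table, generated from a config
# map, so each table can be scheduled and error-handled independently.
from datetime import datetime

import boto3
from airflow.decorators import dag, task

TABLES = {
    "incident": "@hourly",       # small, fast-changing table
    "change_request": "@daily",  # larger table, daily is enough
}


def build_dag(table: str, schedule: str):
    @dag(
        dag_id=f"servicenow_{table}_incremental",
        schedule=schedule,
        start_date=datetime(2024, 1, 1),
        catchup=False,
    )
    def table_pipeline():
        @task
        def run_extract_batch() -> str:
            """Submit the per-table extraction as an AWS Batch job."""
            resp = boto3.client("batch").submit_job(
                jobName=f"extract_{table}",
                jobQueue="etl-queue",                # hypothetical queue
                jobDefinition="servicenow-extract",  # hypothetical definition
                parameters={"table": table},
            )
            return resp["jobId"]

        @task
        def run_glue_transform(batch_job_id: str) -> None:
            """Transform and write to the data lake with a per-table Glue job."""
            # In practice a sensor would wait for the Batch job to finish first.
            boto3.client("glue").start_job_run(
                JobName=f"transform_{table}",
                Arguments={"--upstream_batch_job": batch_job_id},
            )

        run_glue_transform(run_extract_batch())

    return table_pipeline()


# Register one DAG per configured table.
for table, schedule in TABLES.items():
    globals()[f"dag_{table}"] = build_dag(table, schedule)
```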
In the current stage of the project, the team has been migrating the data sets to take advantage of the system’s new “upsert” functionality, which allows records to be overwritten at the primary-key level rather than requiring the entire table or partition to be rewritten at once. Once the migration is complete, the weekly full refreshes will no longer be needed, since deleted records can be identified and removed incrementally.
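The article does not name the table format that provides this upsert capability; Apache Hudi is one common choice on an AWS Glue data lake, so the sketch below assumes it. `sys_id` is ServiceNow's record key and `sys_updated_on` its update timestamp; the S3 paths and table name are hypothetical.

```python
# Illustrative sketch of primary-key upserts, assuming an Apache Hudi table.
# Requires the hudi-spark bundle on the classpath (supported natively in
# AWS Glue 4.0); treat the option values as one plausible setup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("incident_upsert")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Hypothetical landing path written by the extraction batch job.
updates = spark.read.json("s3://my-raw-landing-bucket/raw/incident/")

hudi_options = {
    "hoodie.table.name": "incident",
    "hoodie.datasource.write.recordkey.field": "sys_id",          # primary key
    "hoodie.datasource.write.precombine.field": "sys_updated_on", # latest wins
    "hoodie.datasource.write.operation": "upsert",                # key-level overwrite
}

# Only the changed records are rewritten; untouched keys are left alone,
# which is what removes the need for weekly full-table refreshes.
(
    updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-data-lake/servicenow/incident")
)
```

Hudi also exposes a delete operation (`hoodie.datasource.write.operation: delete`), which is one way deleted source records could be removed from the lake without a full refresh.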