The PGA currently uses AWS for its computing environment and opted to continue with AWS to build the consumer data platform. The first step was to prioritize the data sources that would ultimately populate the data warehouse. These sources were then ingested in raw form into Amazon S3 (Simple Storage Service). S3 serves as a data lake, a place to store raw, unstructured data from all ingested source systems. To bring these varying data sources together, CapTech implemented different ingestion patterns to suit each use case, using Amazon AppFlow, webhooks, AWS Lambda functions, and JDBC connections through AWS Glue.
From there, the data went through validation and initial cleaning, a minimal process that standardized email and phone number formats and dropped null values. Data that did not pass the validation requirements was sent to “quarantine,” where it was stored but not included in the data warehouse.
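The validation step can be pictured with a minimal Python sketch. This is an illustration only, not the PGA's actual Glue code: the function names, the `email` and `phone` field names, the simple email pattern, the US-style phone format, and the rule that a record missing either field is quarantined are all assumptions made for the example.

```python
import re

# Deliberately simple email pattern (assumption): something@something.tld
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize_email(value):
    """Trim and lowercase an email; return None if missing or malformed."""
    if value is None:
        return None
    email = value.strip().lower()
    return email if EMAIL_RE.match(email) else None


def normalize_phone(value, country_code="1"):
    """Reduce a phone number to +<country><10 digits>; return None if it does not fit."""
    if value is None:
        return None
    digits = re.sub(r"\D", "", value)  # strip spaces, dashes, parentheses
    if len(digits) == 10:
        digits = country_code + digits
    if len(digits) == 11 and digits.startswith(country_code):
        return "+" + digits
    return None


def validate(records):
    """Split records into (trusted, quarantined) lists.

    Hypothetical rule: a record is trusted only if both fields normalize.
    Quarantined records are kept unchanged so they can be inspected later.
    """
    trusted, quarantined = [], []
    for record in records:
        email = normalize_email(record.get("email"))
        phone = normalize_phone(record.get("phone"))
        if email is None or phone is None:
            quarantined.append(record)
        else:
            trusted.append({**record, "email": email, "phone": phone})
    return trusted, quarantined
```

Keeping quarantined records in their original form mirrors the idea that rejected data is still stored, just kept out of the warehouse.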
Once trusted data was identified, it was transformed extensively using AWS Glue before being loaded into the data warehouse. Because of the number of sources, the data had to be joined, split, and reshaped to match the tables and data types of the target system for the load to succeed. This process was based on an Entity Relationship Diagram (ERD), which is a means of visualizing the schema of a relational database. End users can then use the ERD to understand the tables, fields, and data types within the data warehouse and write SQL queries for their analytical efforts.
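As a rough illustration of matching source data to a target table, the sketch below joins two sources on a shared key and casts the result to the target's columns and types. It uses plain Python rather than AWS Glue, and the table layout, column names, and helper functions are hypothetical, not taken from the PGA's ERD.

```python
from datetime import date

# Hypothetical target table: column name -> function that casts to the target type
TARGET_COLUMNS = {
    "consumer_id": int,
    "email": str,
    "signup_date": date.fromisoformat,
}


def join_on(left, right, key):
    """Inner-join two lists of dicts on a shared key (right side assumed unique per key)."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def conform(row, column_types):
    """Keep only the target columns and cast each value to the target type."""
    return {name: cast(row[name]) for name, cast in column_types.items()}
```

For example, joining a CRM extract with a web sign-up extract on `email` and passing each joined row through `conform` would drop any extra source fields and yield rows whose types match the target table.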
The new PGA Consumer Data Mesh environment is built to scale to any number and format of endpoints. At its foundation, enriched consumer data is persisted in a structured data mart built on Amazon Redshift. The PGA’s technical and operations teams can access unstructured but validated and enriched data directly in S3 via Amazon Athena. Finally, CapTech implemented a direct file ingestion framework for integration into the PGA’s Marketing Technology stack.