Alfred - Your Data Butler

This post is part of a series about the open-source data ingestion engine Alfred. For an overview of Alfred, read this blog. You can also learn what Alfred means for data stewards, and how using this tool can save you time and money.


A great deal of a data scientist's time today is spent cleaning data, understanding the logic behind that data, and, increasingly, doing data engineering work. That work involves moving data from a source to a target where the data scientist can model it. Data movement is an important process, but it takes valuable time away from a data scientist's main tasks: building models, running experiments, and analyzing results to bring clarity to business operations.

However, our recently open-sourced tool, Alfred, can help data scientists reclaim their time by eliminating some pain points. Here are some of the benefits Alfred offers to data scientists:

1. Data scientists can import data into distributed system environments through the platform without having to worry about writing complicated ETL scripts. Without Alfred, the data scientist would have to write Spark ETL scripts in Scala or Python to move data to HDFS at timed intervals, create Hive tables on top of the data, and write unit tests to verify that the data was moved correctly. Alfred abstracts all of that away.

2. The metadata management platform makes sure that each data source entered into a data lake environment is recorded and that users understand its lineage, transformations, and source. This allows data scientists to go directly to the originating business users and get questions answered quickly. Alfred writes to Parquet-backed Hive tables and allows you to specify partitions. All the metadata entered into Alfred is translated into Hive DDL scripts that run without any action from the data scientist.

3. Alfred allows data scientists to conduct the important work of synthesizing disparate data sources by easily exploring other sources of data through the Alfred UI. This makes the black box of a distributed data lake environment on Unix easily searchable and transparent. Instead of having to constantly grep HDFS and navigate through Linux folders, data scientists can see what data is available without touching the command line.
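
To make the second point more concrete, here is a minimal sketch of what translating metadata into a Hive DDL script for a partitioned, Parquet-backed table could look like. It is an illustration only and not Alfred's actual code: the `TableMetadata` class, the `to_hive_ddl` function, and the example database, table, column names, and path are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TableMetadata:
    """Hypothetical metadata record; this is not Alfred's actual schema."""
    database: str
    table: str
    columns: List[Tuple[str, str]]  # (column name, Hive type)
    partitions: List[Tuple[str, str]] = field(default_factory=list)
    location: Optional[str] = None  # HDFS directory holding the Parquet files


def to_hive_ddl(meta: TableMetadata) -> str:
    """Render a CREATE EXTERNAL TABLE statement for a Parquet-backed Hive table."""
    cols = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in meta.columns)
    ddl = (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{meta.database}`.`{meta.table}` (\n"
        f"{cols}\n)"
    )
    if meta.partitions:
        parts = ", ".join(f"`{name}` {hive_type}" for name, hive_type in meta.partitions)
        ddl += f"\nPARTITIONED BY ({parts})"
    ddl += "\nSTORED AS PARQUET"
    if meta.location:
        ddl += f"\nLOCATION '{meta.location}'"
    return ddl + ";"


# Example metadata (made up for illustration).
example_meta = TableMetadata(
    database="sales",
    table="orders",
    columns=[("order_id", "BIGINT"), ("amount", "DOUBLE")],
    partitions=[("load_date", "STRING")],
    location="/data/sales/orders",
)
example_ddl = to_hive_ddl(example_meta)

if __name__ == "__main__":
    print(example_ddl)
```

The idea is that a data scientist only describes the table (columns, partitions, location), and a generator like this produces the DDL, so nobody has to hand-write or maintain the Hive scripts.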

Ultimately, Alfred takes the pain out of the important work of setting up data correctly for modeling, allowing data scientists to focus on making sense of the data rather than sifting through it for what they need.