Getting delimited flat file data ingested into Hadoop and ready for use is a tedious task, especially when you want to take advantage of file compression, partitioning, and the performance gains that come from using the Avro and Parquet file formats.

In general, you have to go through the following steps to move data from a local file system to HDFS.

  1. Move data into HDFS. If you have a raw file, you can use the command line; if you're pulling from a relational source, I recommend using a tool like Apache Sqoop to easily land the data and automatically create a schema and Hive table.
  2. Describe and document your schema in an Avro-compatible JSON schema (a minimal example follows this list). If you ingested the data using Sqoop you're in luck, because the schema is already available in the Hive Metastore. If not, you need to create the schema definition by hand.
  3. Define the partitioning strategy.
  4. Write a program to convert your data to Avro or Parquet.
  5. Using the schema created in step 2 and the files created in step 4, you can now create a table in Hive and use HQL to view the data.
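To make step 2 concrete, a hand-written Avro schema is just a JSON file describing the record and its fields. The record and field names below are illustrative only, loosely modeled on a baseball Teams file:

cat > Teams.avsc <<'EOF'
{
  "type": "record",
  "name": "Teams",
  "fields": [
    {"name": "yearID", "type": "int"},
    {"name": "teamID", "type": "string"},
    {"name": "W", "type": ["null", "int"], "default": null}
  ]
}
EOF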

Going through those steps to ingest a large amount of new data can get time-consuming and very tedious. Fortunately, the Kite SDK and its associated command line interface (CLI) exist and make the process much easier.

I'm not a Java developer, so I opted to use the CLI to bulk load my data into HDFS and expose it via Hive. In this example, I used a comma-delimited set of 25 baseball statistics data files, with data dating back to 1893.

Here are the steps I went through to quickly ingest this data into HDFS using Kite.

  1. Download the Cloudera Quickstart VM (there is also a Hortonworks VM if you prefer that distribution).
  2. Install the Kite CLI by running the following commands:
[cloudera@quickstart ~]$ curl http://central.maven.org/maven2/org/kitesdk/kite-tools/0.17.0/kite-tools-0.17.0-binary.jar -o kite-dataset
[cloudera@quickstart ~]$ chmod +x kite-dataset
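If the jar downloaded correctly, the CLI's built-in help command should list the available subcommands, including csv-schema, create, and csv-import:
[cloudera@quickstart ~]$ ./kite-dataset help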
  3. Create a folder called ingest, then download and unzip the baseball statistics data into it
mkdir ingest
unzip lahman-csv_2014-02-14.zip -d ingest
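Before running the ingest script, it's worth confirming that the CSV files extracted into the path the script expects:
[cloudera@quickstart ~]$ ls /home/cloudera/ingest/*.csv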
  4. Create the following shell script and name it ingestHive.sh (a single-file walkthrough of the commands follows the script)
#!/bin/bash
# For each CSV in the ingest folder:
#   1. csv-schema infers an Avro schema from the CSV header
#   2. create registers a Hive-backed Kite dataset using that schema
#   3. csv-import loads the CSV rows into the dataset
FILES=/home/cloudera/ingest/*.csv
for f in $FILES
do
  name=`echo $f | cut -f1 -d'.'`   # full path minus the .csv extension
  table=`basename $name`           # bare file name, used as the dataset/table name
  echo "************* Start Processing $name ********************"
  echo "./kite-dataset csv-schema $name.csv --class $table -o $name.avsc"
  ./kite-dataset csv-schema $name.csv --class $table -o $name.avsc
  echo "./kite-dataset create $table --schema $name.avsc"
  ./kite-dataset create $table --schema $name.avsc
  echo "./kite-dataset csv-import $name.csv $table"
  ./kite-dataset csv-import $name.csv $table
  echo "************* End Processing $name ********************"
done
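To see what the loop does for a single file, here are the three commands it runs for a hypothetical Teams.csv sitting in the ingest folder: csv-schema infers an Avro schema from the CSV header, create makes a Hive-backed dataset with that schema, and csv-import loads the rows into it.

./kite-dataset csv-schema /home/cloudera/ingest/Teams.csv --class Teams -o /home/cloudera/ingest/Teams.avsc
./kite-dataset create Teams --schema /home/cloudera/ingest/Teams.avsc
./kite-dataset csv-import /home/cloudera/ingest/Teams.csv Teams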
  5. Make the script executable and run it
chmod +x ingestHive.sh
./ingestHive.sh
  6. All data is now ingested into HDFS in compressed Avro format and the corresponding tables are created in Hive
  7. We can confirm that the tables exist in Hive by running
[cloudera@quickstart ~]$ hive -e "show tables;"
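If the tables are listed, a quick query against one of them shows the ingested data; the table name below is an assumption based on the Master.csv file in the download:
[cloudera@quickstart ~]$ hive -e "select * from master limit 5;"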

I can use the same technique to define a schema and load data into HDFS directly, without creating a Hive table. This is useful when the processing I want to do doesn't require the Hive Metastore. As an example, I modified the above script to create a set of Parquet files in HDFS:

#!/bin/bash
# Same loop as before, but each dataset is written as Parquet files
# under /user/cloudera/baseball in HDFS instead of as a Hive table.
FILES=/home/cloudera/ingest/*.csv
for f in $FILES
do
  name=`echo $f | cut -f1 -d'.'`   # full path minus the .csv extension
  table=`basename $name`           # bare file name, used as the dataset name
  echo "************* Start Processing $name ********************"
  echo "./kite-dataset csv-schema $name.csv --class $table -o $name.avsc"
  ./kite-dataset csv-schema $name.csv --class $table -o $name.avsc
  echo "./kite-dataset create dataset:hdfs:/user/cloudera/baseball/$table --schema $name.avsc --format parquet"
  ./kite-dataset create dataset:hdfs:/user/cloudera/baseball/$table --schema $name.avsc --format parquet
  echo "./kite-dataset csv-import $name.csv dataset:hdfs:/user/cloudera/baseball/$table"
  ./kite-dataset csv-import $name.csv dataset:hdfs:/user/cloudera/baseball/$table
  echo "************* End Processing $name ********************"
done
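The resulting Parquet datasets can be checked directly in HDFS, or with Kite's show command, which prints a few records from a dataset (the Teams name below is again just an example file):

hadoop fs -ls /user/cloudera/baseball
./kite-dataset show dataset:hdfs:/user/cloudera/baseball/Teams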

As you can see, using Kite makes the process of ingesting, converting, and publishing data to Hive easy. A fairly simple ingest engine could be built using the above techniques to monitor files landing on an edge node and, as they are received, automatically ingest, convert, partition, and publish the data to the Hive Metastore and HDFS.
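Here is a rough sketch of that idea, assuming the inotify-tools package is installed for inotifywait and using a hypothetical landing folder; it is a starting point rather than a production-ready engine:

#!/bin/bash
# Watch a landing folder and ingest each CSV as it finishes writing.
# Sketch only: no error handling, partitioning, or duplicate detection.
LANDING=/home/cloudera/landing
inotifywait -m -e close_write --format '%f' "$LANDING" | while read file
do
  case "$file" in
    *.csv)
      name="${file%.csv}"
      ./kite-dataset csv-schema "$LANDING/$file" --class "$name" -o "$LANDING/$name.avsc"
      ./kite-dataset create "$name" --schema "$LANDING/$name.avsc"
      ./kite-dataset csv-import "$LANDING/$file" "$name"
      ;;
  esac
done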