Case Study: Big Data ETL Automation

by Pete Carapetyan


Competitively Written Spaghetti Code as a Motivator

I had spent a year on a team of brilliant developers, all madly hand-writing copy-pasted spaghetti code in competition with each other on more than 15 ETL jobs. Few experiences are more frustrating for a developer trained in clean API design, modularity, and code generation for tasks that are merely repetitive.

After leaving this project, it occurred to me that a demonstration project was in order, so that other companies might see an alternative to hand-writing copy-pasted spaghetti code.

Starting With Apache Camel

The first thing I did was adopt a well-designed, mature, thoroughly tested, OSGi-based framework with commercial support. Hundreds of lines of code were replaced with approximately a page of code. This makes for a very peaceful alternative to the ego gratification of not-invented-here (NIH) coding practices, and a much more maintainable codebase.

Open source Apache Camel is commercially supported as 'Fuse' by Red Hat.

Standardizing on Avro External to Codebase

The second thing I did was standardize on Apache Avro as the serialization format. Though the first iteration is strictly Hadoop, the advantage this offers is that migration becomes possible to any of dozens of NoSQL persistence stores, from Hadoop to Cassandra to Couchbase, with minimal changes, all while keeping the schema for each piece of data within the data itself.
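To make the "schema within the data" point concrete: an Avro schema is itself a JSON document, and an Avro data file carries the writer's schema in its header. The sketch below only builds such a schema with the Python standard library. The record name, namespace, and fields are invented for illustration and are not taken from the project, and actually writing Avro data would need an Avro library, which is not shown here.

```python
import json

# Hypothetical field metadata for one ETL feed; names and types are illustrative only.
FIELDS = [("customer_id", "long"), ("email", "string"), ("signup_ts", "long")]


def build_avro_schema(record_name, namespace, fields):
    """Return an Avro record schema as a plain JSON-compatible dict."""
    return {
        "type": "record",
        "name": record_name,
        "namespace": namespace,
        "fields": [{"name": name, "type": avro_type} for name, avro_type in fields],
    }


schema = build_avro_schema("Customer", "com.example.etl", FIELDS)
schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

Because the schema is ordinary JSON that can live outside the codebase, the same definition can be handed to whichever store or reader comes next.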

A separate project was written to serialize data as Avro, and is open sourced on BitBucket for anyone to consume as an OSGi module. See here for link to same.

Separating Code From Metadata

Amazing things can happen when metadata isn't buried deep in the bowels of the code! Although a simple properties file would have been sufficient to provide the metadata, a web UI was created to add a little flourish to the data collection and validation. This UI was open sourced; see here for link to same. It is not production code, but neither does it have to be, as it is only used by and for a developer.
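As a rough illustration of the properties-file approach, here is a minimal sketch that reads key=value metadata for one job and checks that the required entries are present. The key names (job.name, source.path, target.table, fields) are assumptions made up for this example, not the project's actual metadata format.

```python
# Hypothetical set of keys every job description must supply.
REQUIRED_KEYS = ("job.name", "source.path", "target.table", "fields")


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


def validate(props):
    """Raise ValueError if any required key is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not props.get(key)]
    if missing:
        raise ValueError("missing metadata: " + ", ".join(missing))
    return props


SAMPLE = """\
# metadata for one hypothetical ETL job
job.name = customers
source.path = /data/incoming/customers
target.table = customers_raw
fields = customer_id:long, email:string
"""

job = validate(parse_properties(SAMPLE))
print(job["job.name"], job["fields"])
```

A UI like the one described adds collection and validation on top, but the output it needs to produce can be this small.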

Clean Design

Reviewing the hypothesis: time that would later be spent copy-pasting multiple ETL jobs can instead be spent cleaning up the design at the front of the project. That is what occurred. The design is not perfected, but it is clean and simple, thus fulfilling its role as a demo project.

When the templates are used with the metadata to generate this code, you can observe the resulting code for yourself. The code generation project is visible as part of the Carrie UI above.
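The general idea of combining templates with metadata can be sketched in a few lines. This is not the project's generator: the template text, the emitted Java-like class, and the two job descriptions are placeholders chosen for illustration, and the sketch uses Python's standard string.Template rather than whatever template engine the project uses.

```python
from string import Template

# Hypothetical template; the emitted class is a placeholder, not the project's real output.
JOB_TEMPLATE = Template("""\
// GENERATED from metadata
public class ${class_name}Job {
    public static final String SOURCE = "${source}";
    public static final String TARGET = "${target}";
}
""")

# Hypothetical per-job metadata, one dict per ETL job.
JOBS = [
    {"class_name": "Customers", "source": "/data/incoming/customers", "target": "customers_raw"},
    {"class_name": "Orders", "source": "/data/incoming/orders", "target": "orders_raw"},
]


def generate(jobs):
    """Render one source file per job; a missing key raises KeyError."""
    return {f"{job['class_name']}Job.java": JOB_TEMPLATE.substitute(job) for job in jobs}


generated = generate(JOBS)
for filename, source in generated.items():
    print(filename)
    print(source)
```

Adding a sixteenth job then means adding one more metadata entry rather than copying and editing another file by hand.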

Management By Exception

One of the tenets of devops is that everything reasonable is automated, freeing up the developer to focus his attention on the exceptions, not the rule.

The code generated by the above process follows that pattern, in that it is expected to be extended via hand fixes and additions. Not every ETL job fits the same paradigm, but neither does it have to. That which is consistent is code generated; that which is not is manually modified.
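One common way to keep hand-written exceptions apart from generated code is to put the shared behaviour in a generated base class and override only what differs. The sketch below illustrates that pattern under stated assumptions: the class and field names are hypothetical, and the project itself may simply edit the generated files directly, as described above.

```python
class GeneratedJobBase:
    """Stands in for generated code: the steps every job shares."""

    def extract(self, rows):
        return list(rows)

    def transform(self, row):
        # Default generated behaviour: pass the row through unchanged.
        return row

    def run(self, rows):
        return [self.transform(row) for row in self.extract(rows)]


class CustomersJob(GeneratedJobBase):
    """Nothing unusual here, so the generated behaviour is used as-is."""


class OrdersJob(GeneratedJobBase):
    """The exception: one hand-written override, everything else inherited."""

    def transform(self, row):
        return {**row, "amount_cents": round(row["amount"] * 100)}


customers = CustomersJob().run([{"customer_id": 1}])
orders = OrdersJob().run([{"order_id": 7, "amount": 12.5}])
print(customers)
print(orders)
```

The developer's attention goes to the one override in OrdersJob; the rest is the rule and stays generated.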

The Ops of Devops

The project was rounded out by fully automated VM creation, so that the entire process can be run in a self-generated VM, freeing the developer to focus on adding value rather than on setup.

Jeff created this setup using Chef and Vagrant, and offers the same kinds of services to others who might wish to avail themselves of this capability.

Available Online

Please consult this 20 minute video link for a demo of how this setup works, end to end.