Some days it seems that all we do is move data around. We take it from “here”, manipulate it, reformat it and send it over “there” for display or more analysis. ETL is short for Extract, Transform and Load. There are lots of tools, especially in data warehousing, that help us with these kinds of tasks. What you can accomplish with these tools using a drag-and-drop interface is truly amazing.
One of my favorite ETL tools is CloverETL. I love this tool. It has all the bells and whistles. It’s intuitive to use, and it’s really quick to get useful work accomplished with it. Some notable CloverETL features include:
- Fetching data over the web
- Multitude of database connection options
- Connectors for Excel, CSV, etc.
- Data validation
- Scheduling / Jobs
- Rapid Prototyping
- Custom Coding for Tasks
Purchasing a commercial flavor of CloverETL is an investment. The Community Edition is a way to keep costs low, but there are some limitations with this edition:
- The number of available readers, writers and transformers is limited
- This edition only supports 20 items per graph
- You can’t save a graph with a higher number of items
- A potential workaround is to:
  - Create multiple projects and organize major tasks around these projects
  - Create sub-graphs in these projects for sub-tasks
  - Note, though, that multiple projects and sub-graphs can be difficult to maintain
CloverETL Community Edition is basically the gateway into a world of effective data manipulation. It’s hard to resist purchasing a commercial edition once you have started using it. If you have the budget, I think an ETL tool is really worth the investment. The tool will pay for itself in time saved. However, if your budget does not allow for purchasing an ETL tool with all the fixings, and if the Community Edition is too restrictive, you can look at using a poor man’s solution.
A Poor Man's Solution
If the ETL tasks are relatively simple, another approach is to use the database and its functionality to achieve the desired transformations. For complicated ETL tasks this approach will probably not scale well, but for simple ad hoc data manipulations it is a viable solution.
In this post, I will describe an ETL task written using Bash, PostgreSQL and PL/pgSQL functions. This example was written using Postgres.app on OS X. The ETL task is very simple. It takes an input CSV and generates an output CSV. It’s a toy example. The interesting part of the example is where we use a PL/pgSQL function (see 2_process_raw_data.sql) to cycle through the raw data, making decisions about what the output should be. It illustrates the possibility of doing more complex ETL tasks with this kind of approach.
The Bash Script
The Bash script controls the execution flow of our ETL steps. It launches the psql client and executes SQL scripts in a certain order. Sequentially, the Bash script below does the following:
- Creates temporary storage (see 1_create_temp_storage.sql)
- Imports the contents of fruit.csv - our raw data - into this temporary storage
- Processes the raw data with a call to a PL/pgSQL function (see 2_process_raw_data.sql)
- Exports our processed data to a CSV file
- Cleans up after itself (see 3_drop_temp_items.sql)
```bash
#!/bin/bash
set -e

db_user=beanj
db_password=
db_host=localhost
db_name=testing
data_dir=/Users/beanj/dev/git/quick_n_dirty_etl

# 1 - Create temp storage
export PGPASSWORD="$db_password" && \
export PGOPTIONS='--client-min-messages=warning' && \
'/Applications/Postgres.app/Contents/Versions/9.4/bin'/psql -p5432 -U "$db_user" \
  -h "$db_host" \
  -d "$db_name" \
  -v ON_ERROR_STOP=1 \
  -a -f 1_create_temp_storage.sql

# 2 - Import Raw Data
export PGPASSWORD="$db_password" && \
export PGOPTIONS='--client-min-messages=warning' && \
'/Applications/Postgres.app/Contents/Versions/9.4/bin'/psql -p5432 -U "$db_user" \
  -h "$db_host" \
  -d "$db_name" \
  -v ON_ERROR_STOP=1 \
  -c "COPY tmp_fruit FROM STDIN with CSV HEADER ;" < "$data_dir/fruit.csv"

# 3 - Process Raw Data
export PGPASSWORD="$db_password" && \
export PGOPTIONS='--client-min-messages=notice' && \
'/Applications/Postgres.app/Contents/Versions/9.4/bin'/psql -p5432 -U "$db_user" \
  -h "$db_host" \
  -d "$db_name" \
  -v ON_ERROR_STOP=1 \
  -a -f 2_process_raw_data.sql

# 4 - Export Processed Data
export PGPASSWORD="$db_password" && \
export PGOPTIONS='--client-min-messages=warning' && \
'/Applications/Postgres.app/Contents/Versions/9.4/bin'/psql -p5432 -U "$db_user" \
  -h "$db_host" \
  -d "$db_name" \
  -v ON_ERROR_STOP=1 \
  -c "COPY (select * from tmp_processed) TO STDOUT WITH CSV HEADER;" > processed_fruit.csv

# 5 - Drop temporary tables and functions
export PGPASSWORD="$db_password" && \
export PGOPTIONS='--client-min-messages=notice' && \
'/Applications/Postgres.app/Contents/Versions/9.4/bin'/psql -p5432 -U "$db_user" \
  -h "$db_host" \
  -d "$db_name" \
  -v ON_ERROR_STOP=1 \
  -a -f 3_drop_temp_items.sql
```
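The five psql invocations above differ only in their message level and their payload (a script file or an inline command). If the repetition bothers you, the boilerplate can be pulled into a small helper function. This is just a sketch, not part of the original script; the `PSQL_BIN` variable and the `run_sql` function name are my own inventions:

```shell
# Sketch only: factor the repeated psql boilerplate into one function.
# PSQL_BIN and run_sql are names invented for this example.
PSQL_BIN='/Applications/Postgres.app/Contents/Versions/9.4/bin/psql'

run_sql() {  # usage: run_sql <client-min-messages level> <sql file>
  PGPASSWORD="$db_password" PGOPTIONS="--client-min-messages=$1" \
  "$PSQL_BIN" -p5432 -U "$db_user" -h "$db_host" -d "$db_name" \
    -v ON_ERROR_STOP=1 -a -f "$2"
}

# Steps 1, 3 and 5 then collapse to the following (commented out here,
# since they need a live database to actually run):
# run_sql warning 1_create_temp_storage.sql
# run_sql notice  2_process_raw_data.sql
# run_sql notice  3_drop_temp_items.sql
```

Setting PGPASSWORD and PGOPTIONS as per-command environment variables (rather than exporting them) also keeps them out of the rest of the shell session.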
Step 1 - The Temp Storage
We need a place to import data into. For this, we’ll create a table that we’ll eventually drop once we’re done with it. The contents of our 1_create_temp_storage.sql SQL script look like:
```sql
/*
 * Create temp table to store our incoming data - from the CSV import
 */
DROP TABLE IF EXISTS tmp_fruit ;

CREATE TABLE tmp_fruit
(
  name varchar(10),
  count int
) ;
```
Step 2 - The Input Data
The fruit.csv file - our raw data - gets imported in step 2 of our Bash script. The file contents look like:
```csv
name, count
Apple, 10
Banana, 21
Orange, 12
Lemon, 33
```
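A quick note on the `COPY tmp_fruit FROM STDIN with CSV HEADER` command in step 2: the HEADER option makes Postgres skip the first line and load the remaining comma-separated rows. Purely for illustration, the same parse can be sketched in plain Bash (the `rows` variable exists only in this sketch):

```shell
# Illustration only: mimic what COPY ... WITH CSV HEADER does to this
# file - skip the header line, then split each row on the comma.
input='name, count
Apple, 10
Banana, 21
Orange, 12
Lemon, 33'

rows=""
while IFS=, read -r name count; do
  # ${count# } trims the space left behind after the comma
  rows="${rows}${name}=${count# } "
done <<< "$(echo "$input" | tail -n +2)"

echo "$rows"   # Apple=10 Banana=21 Orange=12 Lemon=33
```

Postgres itself tolerates the space after the comma for the int column, since its integer parser skips leading whitespace.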
Step 3 - Processing the Raw Data
Processing the raw data is where it gets a little more interesting. This SQL script creates a function called tmp_process_fruit(varchar). The function cycles through our temporary fruit table, builds a string out of the fruit names, and returns that string. The function takes a varchar as an input parameter. As we cycle through the input data, if the name of the fruit we are looking at is the same as what we provided as an input parameter, the fruit’s name is tacked onto the accum variable with special characters and a counter around it. Otherwise, the name of the fruit is just tacked onto the accumulator as is. The 2_process_raw_data.sql SQL script looks like this:
```sql
/*
 * Create function to process the fruits table data.
 * Cycles through the tmp_fruit records and returns a string.
 */
CREATE OR REPLACE FUNCTION tmp_process_fruit(varchar) RETURNS VARCHAR AS $$
DECLARE
  c_tmp_fruit refcursor;
  r_tmp_fruit RECORD;
  counter int;
  accum varchar;
  query varchar;
BEGIN
  counter := 0;
  query := 'select * from tmp_fruit order by name desc';
  accum := '';

  OPEN c_tmp_fruit FOR EXECUTE query;
  FETCH FIRST FROM c_tmp_fruit INTO r_tmp_fruit;

  WHILE FOUND = TRUE LOOP
    -- If the fruit is what we passed in as a parameter the generated
    -- string element will look a little different when added to the accumulator
    IF (r_tmp_fruit.name = $1) THEN
      accum := '*_'||r_tmp_fruit.name||'_'||CAST(counter AS text)||'_'||accum||'*';
    -- Otherwise, the string element just gets added to the accumulator as is
    ELSE
      accum := r_tmp_fruit.name || ' ' || accum;
    END IF;

    RAISE NOTICE 'Processing Record : % ', counter;
    counter := counter + 1;
    FETCH NEXT FROM c_tmp_fruit INTO r_tmp_fruit;
  END LOOP;

  CLOSE c_tmp_fruit;
  RETURN accum;
END;
$$ LANGUAGE PLPGSQL ;

DROP TABLE IF EXISTS tmp_processed ;

/*
 * Create a table called tmp_processed with the result of the
 * tmp_process_fruit('Orange') function call. We will export this
 * in a CSV file in another step.
 */
CREATE TABLE tmp_processed AS SELECT tmp_process_fruit('Orange') ;
```
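Outside the database, the accumulator logic is easy to trace. Here is the same loop sketched in plain Bash, with the fruit names hard-coded in the descending order the query produces (the real function reads them from tmp_fruit with a cursor):

```shell
# Standalone trace of the tmp_process_fruit() accumulator logic.
# The fruit list is hard-coded here; the real function cursors over
# tmp_fruit ordered by name descending.
target="Orange"
fruits=(Orange Lemon Banana Apple)   # 'order by name desc'

accum=""
counter=0
for name in "${fruits[@]}"; do
  if [ "$name" = "$target" ]; then
    # Matching fruit gets wrapped in special characters plus a counter
    accum="*_${name}_${counter}_${accum}*"
  else
    # Every other fruit is simply prepended
    accum="${name} ${accum}"
  fi
  counter=$((counter + 1))
done

echo "$accum"   # Apple Banana Lemon *_Orange_0_*
```

Because each iteration prepends to accum, the descending scan comes back out in ascending order, with the matched fruit wrapped at the tail.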
One thing to note in this step is that we created a table called tmp_processed in the last few lines of the SQL script above. This table’s content is the result of the call to the tmp_process_fruit('Orange') function.
Step 4 - Outputting the Processed Data
This step takes the contents of the tmp_processed table we created in Step 3 and puts it in a CSV file. The content of the output CSV file looks like the following:
```csv
tmp_process_fruit
Apple Banana Lemon *_Orange_0_*
```
It’s a table with one row in it. The row contains the concatenated fruit names from our original raw data. Note that in Step 3, we called the tmp_process_fruit() function with Orange as the input parameter. The result is that our output contains the *_Orange_0_* element.
Step 5 - Cleaning Up
The clean-up script just removes any temporary tables and functions that we no longer need.
```sql
/*
 * Drop temp tables and functions.
 */
DROP FUNCTION tmp_process_fruit(varchar) ;
DROP TABLE IF EXISTS tmp_processed ;
DROP TABLE IF EXISTS tmp_fruit ;
```
What I like about this approach is that for simple tasks, revisiting and maintaining a series of data manipulations is relatively easy. There is no IDE to fight with. It’s clear what gets checked into SCM. The cost of tooling is very low. There are no additional tools or database drivers to install. The setup is very simple.
Developers write a lot of code in many different languages and environments. Having the ability to keep portions of it simple and easy to jump back into is a great advantage.
Source files for examples in this post : GitHub Repository