Valérian de Thézan de Gaussan · Data Engineering for process-heavy organizations

Stop using complicated technologies to solve simple problems.

Here's how I built a data pipeline for a simple extract-transform-load process in 5 minutes.


(And I didn’t use Airflow, Kafka, Snowflake or any of those very costly tools. Not even Python!)

You need a machine somewhere and a shell. That’s it. What is nice is that your client very probably already has one.

Yes, yes. It is better to have dedicated environments for the data engineering layer. I know. But you also want to consider cost. Do you want to run a VM just for this?

The image below shows a bash script orchestrated by a crontab. It runs every day at 6am, pulls data from a weather API, unzips it, aggregates the data to compute the average temperature, and pushes that result to another API as an HTTP POST call. A fictitious yet very standard ETL: you pull data from a source, perform a small transformation on it, and push the result somewhere else.
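Here is a minimal sketch of such a script, assuming the weather API serves a zipped CSV export with the temperature in the second column; the URLs and field names are placeholders for illustration.

```bash
#!/usr/bin/env bash
# Daily ETL: pull weather data, compute the average temperature, push the result.
set -euo pipefail

WORKDIR="$(mktemp -d)"
trap 'rm -rf "$WORKDIR"' EXIT

# 1. Extract: download the zipped export from the weather API (URL is illustrative).
curl -sSf -o "$WORKDIR/weather.zip" "https://api.example-weather.com/v1/export?day=$(date +%F)"
unzip -q "$WORKDIR/weather.zip" -d "$WORKDIR"

# 2. Transform: average the temperature column across all extracted CSV files,
#    skipping each file's header line.
AVG_TEMP="$(awk -F',' 'FNR > 1 { sum += $2; n++ } END { printf "%.2f", sum / n }' "$WORKDIR"/*.csv)"

# 3. Load: push the aggregate to the destination API as a JSON payload.
curl -sSf -X POST "https://api.example-dashboard.com/v1/metrics" \
  -H 'Content-Type: application/json' \
  -d "{\"date\": \"$(date +%F)\", \"avg_temperature\": $AVG_TEMP}"
```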

A few lines of bash and it’s done. All of these tools are available on any Linux-based server.
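The scheduling part is a single crontab entry; the script path and log location below are placeholders.

```bash
# Run the pipeline every day at 6am and keep a log of each run.
0 6 * * * /opt/etl/daily_weather_etl.sh >> /var/log/daily_weather_etl.log 2>&1
```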

Last but not least, document the process: where it lives and which machine orchestrates it, so that it is clear to everyone in the organisation.

How many little data pipelines like these are currently running in very complex environments, needing constant management from data engineers and ops, costing a fortune for questionable added business value?

Small bash pipeline