Valérian de Thézan de Gaussan · Data Engineering for process-heavy organizations

Advice for devs who want to build strong ETLs.

4 actionable best practices to start writing solid code.


It’s Tuesday morning. You have not even turned on your laptop when you get a call from the product team complaining that they do not have the newest data. That data should have come from an ETL script that runs every night.

That’s a script you developed three months ago. It used to work without any problems, but not anymore. The logs are unclear. You start diving into the garbage code. You do not remember how it works. You do not know where the problem is coming from. The headache is only starting…

Most data pipelines are built in a “fire & forget” way. They are considered little scripts to move data around.

You should start to consider them as pieces of living software that require the same level of engineering as any production code in your applications.

First, have some modularity

ETL: Extract, Transform, Load. If you cannot see these three modules, they are not clearly separated in the code, and you are going to spend way more time debugging and evolving it.

We rarely write an ETL for a one-shot process. Thus, having readable code where you can easily find your way around is crucial. Think about your future self!

For this, we split the code into three parts that each have their own responsibility.

The extraction part takes the data source as input, extracts the data and feeds it to the transformation part, which in turn feeds its output to the loading module. That is how clear the boundaries between your modules must be.

This structure should be visible from a simple overview of the code.
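To make this concrete, here is a minimal sketch of what that overview could look like in Python; every name in it is illustrative, not a prescription:

# A minimal sketch of an ETL entry point with three clearly separated parts.
# In a real project, extract/transform/load would each live in their own module.

def extract(source_url: str) -> list[dict]:
    """Fetch the raw records from the source (API, database, file...)."""
    ...

def transform(raw_records: list[dict]) -> list[dict]:
    """Turn raw records into the target shape, record by record."""
    ...

def load(records: list[dict], target_db: str) -> None:
    """Write the transformed records to the destination."""
    ...

def run(config: dict) -> None:
    raw = extract(config["source_url"])
    clean = transform(raw)
    load(clean, config["target_db"])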

If your ETL is a big one, you may even need more modules. The rule is about having one and only one responsibility for each module.

And please make it configurable! You should not have to change the code to change an API URL or a constant value. Use an external configuration file read by your code.
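For instance, a tiny JSON file read at startup does the job. This is only a sketch, and config.json, source_url and batch_size are names I am assuming here:

import json

def load_config(path: str = "config.json") -> dict:
    # config.json could contain, for example:
    # {"source_url": "https://api.example.com/products", "batch_size": 500}
    with open(path) as config_file:
        return json.load(config_file)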

Advice: by using TDD, I’ve found that the design that emerges is very modular by nature.

In a nutshell, at least the three basic modules must be clear and distinct in the code: Extract, Transform, Load

Secondly, log what is going on

Logs are here to explain to your future self what is going on. Keep that in mind while writing log statements.

When you look at the logs of your ETL, you have to quickly see when the process started, when it ended, whether there were errors and, if so, which entities were impacted. Without that, you cannot tell what really happened.

Now, what happens if you need to check what happened three months ago, but you no longer have the log file? You need to set up an appropriate log strategy according to the business rules and enterprise context you have. For example, your company might be legally required to be able to report on up to a year of transactions, so your log retention strategy must cover that. Conversely, you may have rules stating that the data must not be kept for too long. Either way, this calls for a proper, thought-out log strategy.

Another thing I want to emphasize is that you can use logging levels to quickly find errors in your logs. Logging libraries provide them as standard: trace, debug, info, warn, error. Then, when you want to find all the errors in a big log file, just search for “ERROR” and you will have them. It’s even better when you ship your logs to an ELK stack and use Kibana to display them on good-looking dashboards.
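With Python’s standard logging module, for instance, a minimal setup along those lines could look like this; the logger name and messages are only illustrative:

import logging

# Timestamp, level and message, in the same spirit as the example log below.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger("my_etl")

logger.info("Start")
logger.warning("Field 'category' missing on entity 42, keeping the raw value")
logger.error("Could not process entity id 1789 because of MappingException")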

Be careful not to generate too much data: after all, you do not want to log everything that happens. Log what will matter to your future self reading the logs.

Also, you will move sensitive information from time to time. Keep in mind that if you log sensitive information, then your log security policy must be at least as strong as the security policy of the data you’re handling.

A small example of good logging:

2022-04-22T06:00:00Z - INFO - Start
2022-04-22T06:00:02Z - INFO - Processed entities from 0 to 499
2022-04-22T06:00:04Z - INFO - Processed entities from 500 to 999
2022-04-22T06:00:06Z - INFO - Processed entities from 1000 to 1499
2022-04-22T06:00:07Z - ERROR - Could not process entity id 1789 because of MappingException : (--details of the error--)
2022-04-22T06:00:07Z - ERROR - Could not process entity id 1810 because of OtherException : (--details of the error--)
2022-04-22T06:00:08Z - INFO - Processed entities from 1500 to 1999
2022-04-22T06:00:09Z - INFO - Processed entities from 2000 to 2145
2022-04-22T06:00:09Z - INFO - End

In summary, log what, when, where and how. If you cannot tell what happened from your logs, there is not enough logging. If you struggle with terabytes of logs from a little ETL, it may be a bit too much.
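As an illustration of how logs in that shape could be produced, here is a sketch of a batched processing loop; the batch size, the entity structure and process_one are assumptions made for the example:

import logging

logger = logging.getLogger("my_etl")
BATCH_SIZE = 500  # assumed, to match the ranges in the example log above

def process_one(entity: dict) -> None:
    ...  # the actual per-entity transformation and load

def process_all(entities: list[dict]) -> None:
    logger.info("Start")
    for start in range(0, len(entities), BATCH_SIZE):
        batch = entities[start:start + BATCH_SIZE]
        for entity in batch:
            try:
                process_one(entity)
            except Exception as exc:
                # One bad entity should not hide the others: log its id and the cause.
                logger.error("Could not process entity id %s because of %s: %s",
                             entity.get("id"), type(exc).__name__, exc)
        logger.info("Processed entities from %s to %s", start, start + len(batch) - 1)
    logger.info("End")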

Now, let’s add some atomicity

As in the modularity part, you need to split your processes into small functional units of work.

In each module, each function or class must do one and only one thing. This is the single responsibility principle.

For this, I use the “Extract until you drop” principle, in other words “extract until you cannot extract anymore”.

With some experience, the majority of your functions will be one, two or three lines long. This does not mean that all of your functions must be that short. It’s just the way they turn out when they have only one responsibility.

With this, you end up with a lot of pure functions that hold no state and return the same output for the same input. You can also see them as utility functions, as each provides a little service. As a rule of thumb, your transformation process should contain many functions like this, each taking one row or document of your data and outputting its transformed equivalent.
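For example, the transformation module could be made of small pure functions like these; the field names are an assumption made for the example:

def normalize_name(raw_name: str) -> str:
    # Pure: same input, same output, no state touched.
    return raw_name.strip().title()

def euros_to_cents(price_in_euros: float) -> int:
    return round(price_in_euros * 100)

def transform_product(raw_product: dict) -> dict:
    # One row in, its transformed equivalent out.
    return {
        "name": normalize_name(raw_product["name"]),
        "price_cents": euros_to_cents(raw_product["price"]),
    }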

By applying these principles, you’ll find a proper level of abstraction and readability in your code almost automatically.

Again, when using TDD, you’ll get good atomicity from the “refactor” part of the red-green-refactor cycle.

All in all, even if you think this might be overkill, you will be thankful for having written clear and readable code as soon as you come back to an old piece of it.

Finally, have a plan for the errors

First, while transforming your data, you can prevent errors by correcting the data. By normalizing all the fields that might cause problems (in my experience, dates are often problematic because of all the different formats they come in), you will stop many of those errors before they reach you. But be really careful not to twist your data while doing so, as this can have consequences for the business.
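Here is a sketch of such a normalization for dates, assuming we have already identified the handful of formats the source actually uses:

from datetime import datetime

# Formats actually observed in the source (an assumption for this sketch).
KNOWN_DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%Y-%m-%dT%H:%M:%S"]

def normalize_date(raw_date: str) -> str:
    # Try each known format and return one canonical ISO date.
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(raw_date, fmt).date().isoformat()
        except ValueError:
            continue
    # Unknown format: better to fail loudly than to silently twist the data.
    raise ValueError(f"Unrecognized date format: {raw_date!r}")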

Next, you need to take care of all the important values that might be null or empty. When you transform an object, make sure to handle every case where a mandatory value is missing. Those cases should log an explicit error saying that the object with that ID has this field empty.
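A small guard function can express that. This is just a sketch, with assumed field names:

class MissingFieldError(Exception):
    pass

def require(entity: dict, field: str):
    # Fail with an explicit message instead of letting a None propagate silently.
    value = entity.get(field)
    if value is None or value == "":
        raise MissingFieldError(
            f"Entity {entity.get('id', '<unknown id>')} has no value for mandatory field '{field}'"
        )
    return value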

Finally, do not swallow errors. Follow the fail early, fail fast principle. Do not give default values to objects that must have that value in the first place. This is the perfect recipe for failure: you’ll find yourself not understanding what is happening three functions later than where the original problem really occurred. If the error is not important business-wise, still log a warning; you might appreciate having this information later.

Example (in Python, but this holds for any language)

# Let's say that my_product must have a "price" property,
# otherwise we have a business problem.
my_price = my_product.get("price", 0.0)
# By setting a default value of 0.0, we will never notice that the
# "price" property of my_product was missing or null, and that will
# have business consequences later on.
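A fail-fast version of the same access, sketched under the same assumptions, refuses to continue and names the faulty product instead:

# Fail fast instead: stop right where the problem is and say which product is at fault.
my_price = my_product.get("price")
if my_price is None:
    raise ValueError(f"Product {my_product.get('id', '<unknown id>')} has no price")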

Now, you have two possibilities when facing an error.

You can choose to make the entire process fail. This is justified if the next task in the pipeline depends on the data you are processing right now: you do not want to continue, because you know that an error while processing this product, for example, will make the processing of the orders that follow fail as well.

Or you can choose to stop only the processing of that line of data. Then you log a clear error explaining why that line could not be processed, to be reviewed later.

The choice will depend on the business rules that give your ETL its context. You have to think about this so that you do not end up with a whole script stopping on a small error or, even worse, a pipeline not stopping on a critical one. In all cases, log it!
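As a sketch of how that choice can be expressed in code, reusing the illustrative transform_product and logger from the earlier sketches, a single flag driven by your business rules can decide between the two behaviours:

FAIL_ON_ERROR = True  # assumed: set from configuration, depending on business rules

def transform_all(rows: list[dict]) -> list[dict]:
    transformed = []
    for row in rows:
        try:
            transformed.append(transform_product(row))
        except Exception as exc:
            # Always log the error, whatever the strategy.
            logger.error("Could not process row %s: %s", row.get("id"), exc)
            if FAIL_ON_ERROR:
                raise  # downstream tasks depend on this data: stop everything
            # otherwise: skip this row only and keep processing the rest
    return transformed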

In a nutshell: errors must be handled without sweeping them under the carpet, while keeping control of the execution flow.

Conclusion

You may have noticed that I did not talk about any language or framework.

That is because these principles hold whatever tool you use.

Now, take action! Take one of your existing ETLs and refactor it by applying these principles one at a time.

I have refactored a lot of data pipelines of all kinds, and every time it is the same story: just by applying these principles, I uncovered plenty of bugs that had gone unnoticed, I reduced maintenance time not only for me but for every engineer working on them, and I was able to give clients proper information and KPIs about what was happening in those pipelines.

Of course, one could always go further and apply many other principles (SOLID, DRY, YAGNI…) to your pipelines. The points I have made constitute a good base so that you do not pull your hair out the next time you have to build and maintain an ETL.