Tips for writing long running data processing scripts
Data Engineering is a broad field with a large number of complex tools, but sometimes you find yourself needing to write a quick script to do a one-off ETL job or fix some data: for example, an initial population of a system with historical data before setting up a streaming pipeline, or repairing some data that was corrupted by a bug.
Writing such scripts can feel like a trivial undertaking. After all, it’s easy enough to write something simple that transforms some data and moves it from one system to another. But if there’s a large amount of data to process, things can get messy quickly, from lurking data that doesn’t match your script’s assumptions to transient network issues reaching the systems to which you’re sending the data. You can quickly find yourself in a loop of running the script, seeing it break, fixing it… and repeating.
These are some tips I’ve picked up over the last few years for making this process much less tedious.
1. Skip over data you already processed
This seems obvious, but it’s easy to overlook if you’ve written something quickly, often while telling yourself “it’s not worth the effort, it’ll work fine this time” (it never does). The aforementioned run, break, fix loop is especially frustrating if the “seeing it break” stage occurs after the script has been running for an hour and, once you’ve fixed it, you have to run through all that data again.
Keep track of what you’ve processed so far in a database, or even just a local file. Then, if the script needs to restart due to an error, it can pick up from the point the previous run reached.
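As a rough sketch, one way to do this is to checkpoint each record’s ID to a local file as soon as it has been processed, and skip anything already in that file on the next run (the file name and the record/process_record shapes here are just assumptions for illustration):

```python
import os

CHECKPOINT_FILE = "processed_ids.txt"  # hypothetical local checkpoint file

def load_processed_ids():
    """Return the set of record IDs completed by previous runs, if any."""
    if not os.path.exists(CHECKPOINT_FILE):
        return set()
    with open(CHECKPOINT_FILE) as f:
        return {line.strip() for line in f if line.strip()}

def mark_processed(record_id):
    """Append a record ID to the checkpoint file as soon as it has been handled."""
    with open(CHECKPOINT_FILE, "a") as f:
        f.write(f"{record_id}\n")

def run(records, process_record):
    """Process records, skipping any that a previous run already completed."""
    processed = load_processed_ids()
    for record in records:
        if record["id"] in processed:
            continue  # already handled in a previous run
        process_record(record)
        mark_processed(record["id"])
```

Appending the ID only after the record has been fully processed means a crash mid-record simply causes that record to be retried, which is safe as long as the processing itself is idempotent (see tip 4).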
2. Log, but not too much
Logs are useful to show that your script is making progress through its task and also to show you where time is being spent. However, if you add too many log statements to your script, you’ll be presented with a stream of information that’s far too verbose to be useful. Be mindful of where you put your log statements, focusing on the core data processing or IO parts of your script. Think about what information you’ll want to know if your script doesn’t seem to be doing what you think it should.
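As a minimal sketch, one way to keep the output useful is to log progress every N records rather than once per record (the batch size and the process_record callable are arbitrary placeholders):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("backfill")

def process_all(records, process_record, log_every=1000):
    """Process records, logging a progress line only every `log_every` records."""
    count = 0
    for record in records:
        process_record(record)
        count += 1
        if count % log_every == 0:
            logger.info("Processed %d records so far", count)
    logger.info("Finished, %d records processed in total", count)
```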
3. Log with timestamps
You leave your script running, you come back two hours later, and it’s not made nearly as much progress as you hoped. You now want to find out where it’s been spending its time. If your log statements have timestamps, this is easy to see. Consider also logging the time taken for expensive operations so it’s visible at a glance; that makes it obvious where to focus your efforts to optimise the script.
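Python’s standard logging module can add the timestamps for you; the sketch below assumes a hypothetical load_fn and times it with time.monotonic():

```python
import logging
import time

# Include a timestamp in every log line so it's clear where the time went.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("backfill")

def load_batch(batch, load_fn):
    """Load one batch and log how long the operation took."""
    start = time.monotonic()
    load_fn(batch)
    logger.info(
        "Loaded batch of %d records in %.2fs", len(batch), time.monotonic() - start
    )
```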
4. Idempotency
If your script is idempotent, running it multiple times leaves the target system in the same state as running it once. In the context of a script loading data from one system into another, this means the data in the target system will end up in the same state no matter how many times you run the script. So if the script encounters issues and needs to be restarted, it won’t matter: the data will still end up in the correct state.
Your script may not be idempotent by default if it mutates data in a system. It’s worth adding checks for whether an update or insert has already been applied, so the script doesn’t apply the same change twice. You can then rerun the script as many times as you like when issues are encountered, without putting data into a bad state. Combined with tip 1, this makes it easy to fix and rerun a script as many times as necessary.
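As a minimal sketch using SQLite (the table, columns, and order shape are made up for illustration), one approach is to let a primary key enforce uniqueness so the insert is safe to repeat; most databases offer an equivalent, such as ON CONFLICT DO NOTHING in PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical target database
conn.execute(
    "CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, amount REAL)"
)

def upsert_order(order):
    """Insert an order only if it hasn't been inserted already.

    The primary key means repeated runs leave the table in the same state.
    """
    conn.execute(
        "INSERT OR IGNORE INTO orders (id, amount) VALUES (?, ?)",
        (order["id"], order["amount"]),
    )
    conn.commit()
```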
5. Parallelise, but not too much
If you need to process a large amount of data, you can get through it much faster by doing the processing in multiple parallel threads. However, be mindful of what other systems your script is interacting with. More parallelism won’t necessarily result in more speed: 4 threads may be faster than 1, but 10 threads may be slower than 4. If your script is interfacing with a database or a REST API, resource contention may end up slowing things down if you try to run too many threads at once. You also risk hitting rate limits on a SaaS product’s API, and backing off before retrying will slow things down even more.
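As a sketch, Python’s concurrent.futures makes it easy to cap the number of concurrent workers (the default of 4 here is arbitrary, and process_record is again a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_in_parallel(records, process_record, max_workers=4):
    """Process records using a small, fixed-size thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_record, record) for record in records]
        for future in as_completed(futures):
            future.result()  # re-raise any exception from a worker
```

Keeping the pool size as an explicit parameter makes it easy to experiment with different levels of parallelism between runs and find the point where the systems you’re calling stop keeping up.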
6. Run in the cloud
Running your script in your cloud infrastructure has two major benefits. Firstly, if the script is running for hours or days, you won’t need to keep your computer powered on or connected to the network for that entire time. Secondly, if your script needs to fetch data from other systems in your cloud infrastructure or write data to a SaaS product, it will run much faster in the cloud. Those systems are likely hosted on the same infrastructure, so network speeds will be significantly faster than if you were sending the data over your personal internet connection. You can use tools like screen or tmux to run your script in a shell session that you can return to even after closing your SSH connection to the remote server.
7. Is a script the right choice?
Writing a script can often feel like the default option when you need to do a one-off transform or move some data around. However, it’s worth considering whether a script is strictly necessary. Can you achieve the same result with some database queries that manipulate the data in place? Does the SaaS product you’re trying to integrate with have tools to load data directly from your data store? Is the script actually going to be needed multiple times in the future, and would it therefore be more valuable as a full-blown application? It’s always worth weighing up the available options before diving into writing a custom script.
Hopefully these tips will come in handy the next time you have some data to move around.