Sync folders with Python
Manually comparing and synchronizing two folders can be tedious. Add long, confusing and very similar filenames and it’s no fun at all.
We recently faced a similar situation at work. Besides cryptic names, there was also a fair share of twisted logic governing the sync scenarios. We had to get our hands dirty, since the standard tools were useless.
Because we don’t have to do the sync often, and the folders have always been of reasonable size, automating the process seemed like overkill. However, as our products grow, so do the folders needing sync. We recently figured that spending some time on a script would be a good investment for the future.
This post builds upon ideas I came across while writing the script that were ultimately left out or done differently. It concludes with a simple tool capable of syncing two folders. The code was written using Python 2.7.6.
Backing up and reverting
Let’s assume nothing is under version control. It would be cool if our script had a revert mechanism of sorts, in case something funky happens during sync. To keep it simple, maybe a directory snapshot is enough:
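A minimal sketch of such a snapshot, written for Python 3 rather than the 2.7.6 used for the original script. The REPO location, the BACKUP file name and the uuid-based archive name are assumptions, and tag handling is left out:

```python
import os
import shutil
import uuid

# Hypothetical defaults; the original script may use different names.
REPO = os.path.join(os.path.expanduser("~"), ".sync_repo")
BACKUP = "BACKUP"


def backup(directory, repo=REPO):
    """Archive `directory` into `repo` and record the archive path in it."""
    directory = os.path.abspath(directory)
    repo = os.path.abspath(repo)
    os.makedirs(repo, exist_ok=True)

    # Unique archive name: directory name plus a random suffix.
    base = "{}-{}".format(os.path.basename(directory), uuid.uuid4().hex)
    backup_path = shutil.make_archive(
        os.path.join(repo, base), "gztar", root_dir=directory
    )

    # The marker is written after archiving, so the snapshot itself still
    # carries whatever BACKUP file the directory had before.
    with open(os.path.join(directory, BACKUP), "w") as marker:
        marker.write(backup_path)
    return backup_path
```

Calling backup("/some/dir") would return the path of the new archive inside REPO.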
It seems dense, but it’s easy to explain. First, we create a repository folder, or REPO, if it doesn’t exist yet. It is where the backup snapshots live. We then save the archived directory under a unique name to REPO. Tag is optional, but if given, we’ll be able to reference the archived file with it. The name is written into backup_path and saved to directory. Let’s call it the BACKUP file. It’s there to help us revert:
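Here is one way the revert could look, as a Python 3 sketch. The BACKUP file name is an assumption, and shutil.unpack_archive stands in for whatever extraction the original script performs:

```python
import os
import shutil

BACKUP = "BACKUP"  # hypothetical name of the marker file


def get_archive_name(directory):
    """Return the archive path recorded in the directory's BACKUP file."""
    with open(os.path.join(directory, BACKUP)) as marker:
        return marker.read().strip()


def revert(directory):
    """Replace the contents of `directory` with its last recorded snapshot."""
    archive = get_archive_name(directory)
    if not os.path.isfile(archive):
        raise IOError("archive not found: " + archive)

    # Clear the directory only after confirming the archive exists.
    shutil.rmtree(directory)
    os.makedirs(directory)
    shutil.unpack_archive(archive, directory)
```

Checking for the archive before clearing the directory is a deliberate choice here: a missing archive then leaves the current content untouched.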
In order to revert, we need to know the archived file to revert to. This is written in the BACKUP file, which should be part of directory. get_archive_name helps us extract it:
Note that any backed-up state is reachable by a sequence of reverts, as long as REPO contains the appropriate archive:
We should also be able to revert using tags. Here’s one way to do it:
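A small Python 3 sketch of tag storage with sqlite3 from the standard library. The table layout and the function names are hypothetical; the only thing taken from the text is that a tag must lead us to an archive:

```python
import sqlite3


def open_tag_db(path=":memory:"):
    """Open the tag database, creating the (hypothetical) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tags ("
        " tag TEXT PRIMARY KEY,"
        " directory TEXT NOT NULL,"
        " archive TEXT NOT NULL)"
    )
    return conn


def save_tag(conn, tag, directory, archive):
    """Remember which directory and archive a tag points to."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO tags (tag, directory, archive)"
            " VALUES (?, ?, ?)",
            (tag, directory, archive),
        )


def lookup_tag(conn, tag):
    """Return (directory, archive) for `tag`, or None if the tag is unknown."""
    return conn.execute(
        "SELECT directory, archive FROM tags WHERE tag = ?", (tag,)
    ).fetchone()
```

Reverting by tag then comes down to a lookup followed by clearing the directory and extracting the archive into it.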
My initial thought was to serialize and de-serialize a dictionary, but performance would degrade quickly. Even with a bit of SQL, I’d argue the above is quite concise.
It’s also quite easy to show tag history of a directory:
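As a sketch (Python 3), and assuming tags live in a hypothetical SQLite table tags(tag, directory, archive), the history is a single query:

```python
import sqlite3  # `conn` below is expected to be a sqlite3 connection


def tag_history(conn, directory):
    """Return the tags recorded for `directory`, in insertion order."""
    rows = conn.execute(
        "SELECT tag FROM tags WHERE directory = ? ORDER BY rowid",
        (directory,),
    )
    return [tag for (tag,) in rows]
```

Ordering by rowid is an assumption that tags are inserted once and never rewritten; a timestamp column would be more robust.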
There are a few things to consider when reverting:
- we’re blindly extracting archives without prior inspection. It is possible that files are created outside of path, e.g. filenames starting with two dots. This could be a security hazard.
- someone could back up the root directory, then try to recover it and happily wipe out the hard drive, since we don’t worry about the directory we’re clearing.
- if directory is itself a symbolic link, shutil.rmtree(directory) will throw an OSError.
The issues are somewhat easily fixable and might be a good exercise to try out.
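For the first point, one possible approach (a Python 3 sketch, assuming tar archives; the function names are made up for illustration) is to inspect every member before extracting anything:

```python
import os
import tarfile


def is_within(base, target):
    """True if `target` resolves to `base` or to something beneath it."""
    base = os.path.realpath(base)
    target = os.path.realpath(target)
    return os.path.commonpath([base, target]) == base


def safe_extract(archive, directory):
    """Extract a tar archive only if every member stays inside `directory`."""
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            if member.issym() or member.islnk():
                raise ValueError("links are not allowed: " + member.name)
            if not is_within(directory, os.path.join(directory, member.name)):
                raise ValueError("unsafe path in archive: " + member.name)
        # Newer Python versions ship extraction filters; use one if present.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(directory, **kwargs)
```

Rejecting links outright is stricter than necessary, but it keeps the check easy to reason about.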
Finding and applying the differences
Finding the differences between two directories couldn’t be simpler:
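A minimal Python 3 sketch using filecmp.dircmp. The names in IGNORED are placeholders for the tool's own files, and only the top level is compared, so common subdirectories are not descended into:

```python
import filecmp
import os

IGNORED = ["SYNC", "BACKUP"]  # hypothetical names of the tool's own files


def diff(src, dst):
    """Return top-level names in `src` that are missing from or differ in `dst`."""
    if not os.path.exists(dst):
        return sorted(name for name in os.listdir(src) if name not in IGNORED)

    comparison = filecmp.dircmp(src, dst, ignore=IGNORED)
    # left_only: files or folders only in src; diff_files: common files that differ.
    return sorted(comparison.left_only + comparison.diff_files)
```

Keep in mind that dircmp compares files shallowly by default, so two files with the same size and modification time are treated as equal without reading their content.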
If dst does not exist, then the difference is the src directory content. Otherwise, there’s a handy module we can use: filecmp. It contains a class, dircmp, that does exactly what we need - it finds all the differences between two folders.
We’re interested in files or folders only in src, or common files that differ. We also don’t want to copy any of the config files, so we filter them out.
This is how to apply the differences:
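A sketch of the idea in Python 3. The function name, the names argument (a list of top-level entries to copy) and the backup callable are assumptions made to keep the example self-contained:

```python
import os
import shutil


def apply_diff(src, dst, names, backup=None):
    """Copy the entries `names` from `src` into `dst`, backing up `dst` first."""
    if backup is not None and os.path.isdir(dst):
        backup(dst)  # snapshot before anything is overwritten

    os.makedirs(dst, exist_ok=True)
    for name in names:
        source = os.path.join(src, name)
        target = os.path.join(dst, name)
        if os.path.isdir(source):
            # Assumes the folder exists only in src; copytree refuses
            # to write into an existing target.
            shutil.copytree(source, target)
        else:
            shutil.copy2(source, target)
```

Passing the backup step in as a callable keeps the copy logic testable without touching a real repository folder.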
The code speaks for itself. The point to note is the backup we perform before any copying is done. This enables us to revert if something goes sour.
Syncing the same folders over and over and over…
Sometimes, you know beforehand the folders you need to sync. For example, you know that folder A will always have to be synced with folders B and C. This is where the SYNC file comes into play. It contains one or more source folders, each listed on a separate line.
In the example above, folder A should contain the SYNC file with the following content:
Then, all we need to do is sync the directory containing the SYNC file:
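Roughly, in Python 3, and assuming the file is literally named SYNC and that copy_differences is some callable taking (src, dst), both of which are placeholders here:

```python
import os

SYNC = "SYNC"  # hypothetical file name


def read_sources(directory):
    """Return the source folders listed in the directory's SYNC file."""
    with open(os.path.join(directory, SYNC)) as sync_file:
        # One source per line; blank lines are skipped.
        return [line.strip() for line in sync_file if line.strip()]


def sync(directory, copy_differences):
    """Bring `directory` up to date with every source listed in its SYNC file."""
    for source in read_sources(directory):
        copy_differences(source, directory)
```

Sources are applied in the order they are listed, so a later source wins when two of them provide the same file.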
As you can see, it’s as straightforward as opening the SYNC file, reading the sources, and then applying the differences.
Of course, we should provide a means to generate such a file:
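A possible generator, sketched in Python 3. The SYNC file name and the choice to store absolute paths are assumptions:

```python
import os

SYNC = "SYNC"  # hypothetical file name


def make_sync(directory, sources):
    """Write a SYNC file into `directory`, one absolute source path per line."""
    path = os.path.join(directory, SYNC)
    with open(path, "w") as sync_file:
        for source in sources:
            sync_file.write(os.path.abspath(source) + "\n")
    return path
```

Storing absolute paths means the file keeps working no matter where the tool is started from.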
Adding a CLI
To wrap what we’ve done in a simple utility tool, we should create a command line interface, so the user can interact with it. The argparse module makes this simple. Before turning to code, here’s what the user should be able to do:
1. Copy different files from one directory to another
Example usage: cp -t sample_tag /source/path/dir /destination/path/dir
The command requires a src directory and a dst directory, where dst will be synced with src. Tag is optional.
2. Revert a directory or tag
Example usage: rv -t sample_tag or rv -d /random/path/dir
3. Sync a directory
Example usage: sync -t just_in_case /random/dir/path
/random/dir/path should contain the SYNC file. Tag is optional.
4. Create a SYNC file
Example usage: mksync dir/path/where/sync/is/created /fst/src /snd/src /trd/src
Create a SYNC file in the first specified directory. Any directory listed afterwards is added to the source list.
5. Show tag history
Example usage: th /random/dir/path
It shows the available revert tags.
Here’s the above in code:
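A sketch of such a parser with argparse subcommands (Python 3). The command names and the -t and -d flags come from the usage examples; the long option names, the destination names and build_parser itself are assumptions:

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Sync folders and revert snapshots.")
    commands = parser.add_subparsers(dest="command")

    cp = commands.add_parser("cp", help="copy differing files from src to dst")
    cp.add_argument("-t", "--tag")
    cp.add_argument("src")
    cp.add_argument("dst")

    rv = commands.add_parser("rv", help="revert a directory or a tag")
    target = rv.add_mutually_exclusive_group(required=True)
    target.add_argument("-t", "--tag")
    target.add_argument("-d", "--directory")

    sync = commands.add_parser("sync", help="sync a directory using its SYNC file")
    sync.add_argument("-t", "--tag")
    sync.add_argument("directory")

    mksync = commands.add_parser("mksync", help="create a SYNC file")
    mksync.add_argument("directory")
    mksync.add_argument("sources", nargs="+")

    th = commands.add_parser("th", help="show tag history")
    th.add_argument("directory")
    return parser
```

The parsed namespace carries the chosen subcommand in its command attribute, which a small dispatcher can map to the matching function.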
Perhaps the only interesting thing in our parser is the custom action:
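One way to write such an action (a Python 3 sketch; the class name WritableDir is made up, and it handles single-value arguments only):

```python
import argparse
import os


class WritableDir(argparse.Action):
    """Accept an argument only if it names an existing, writable directory."""

    def __call__(self, parser, namespace, values, option_string=None):
        if not os.path.isdir(values):
            parser.error("{} is not an existing directory".format(values))
        if not os.access(values, os.W_OK):
            parser.error("{} is not writable".format(values))
        # Store the absolute path so later steps don't depend on the cwd.
        setattr(namespace, self.dest, os.path.abspath(values))
```

It would be attached with something like parser.add_argument("directory", action=WritableDir); on a bad path, parser.error prints a usage message and exits.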
It ensures that any path we provide as an argument is an existing directory we can write to.
Fin
We can now sync two folders and revert if need be. The tool is very simple and only offers crude functionality, but it’s a good starting point to build upon. What always amazes me is the expressiveness of Python and what can be achieved with about 200 lines of code, half of which are paranoid asserts and param checks.
I’d also like to note that although I love Python, I don’t use it enough to consider myself a Pythonista. If you spot any piece of code that can be replaced with a more standardized idiom, please let me know!
As always, there’s a GitHub repo where you can find the complete script.