
4.1 million files and a big mistake


It's another Tidy Tuesday.

Today, you and I get to learn the same lesson together. I started moving ahead to the next stage of my large archive organization project after collecting everything over the last many weeks. A few days ago, I excitedly let Duplicate Cleaner Pro run through the entire collection, which took...a while, to say the least. This screenshot shows that it finished the initial scan (only counting everything) after about half an hour, but it then took another 12 hours to perform the actual duplicate content search:

After it finished, I ended up with a first-run count of over 1.3 million duplicates, ready to delete with a single click.

So I clicked.

BOOM. 💥 More than a million files, hundreds of gigabytes, gone. Hooray! ...right?

Even just deleting stuff took the better part of half an hour. I let it do its thing and didn't come back to the process until the next day. That's when I realized I'd made a mistake. (Fortunately, the copy-instead-of-move requirement of the triage process meant I just rolled my eyes at myself because I'd have to re-do a bit of copying again. Otherwise, I would have been something much closer to nauseated.)

Here's the problem: in some cases, having duplicate copies is exactly what you want. I'd simply neglected to pay attention to this...ahem...minor detail. Nor had I mentioned it to any of you. It's time to remedy that error.

Some of you may know that I'm a programmer by trade. If you've ever done any software development, you know that projects often bundle files which are exactly the same as those in many other projects: for example, a small set of files (a "library," "plugin," "extension," or "driver") that lets you easily draw a certain type of chart on a web page. All of these files need to stay intact within each project's structure in order for everything to work right.

A strict de-duplication process, then, will necessarily destroy the integrity of those projects. Yes, I still have exactly one copy of everything, but now the rest of my projects don't know where to find that one copy, because each of them expected to have its own copy right where I put it originally!
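To see why the tool can't tell the difference, it helps to know roughly how these scanners work: they group files by content, usually via a cryptographic hash, and anything appearing more than once is flagged as a duplicate. Here's a minimal sketch of that idea (my own illustration, not how Duplicate Cleaner Pro is actually implemented) -- note that it has no concept of *where* a file lives or *why* it might need to stay there:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(root):
    """Group every file under `root` by its content hash.

    Any group with more than one path is a set of byte-identical
    duplicates -- whether they're stray copies or a library that
    three different projects each depend on. The hash is blind
    to context; that's exactly the problem described above.
    """
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Reading whole files is fine for a sketch; a real tool
            # would stream large files and pre-filter by size first.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Deleting all but one path from each group is the "single click" I pressed, and that's precisely what breaks projects that legitimately carry their own copies.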

I knew as soon as I dug into it that I'd encountered this challenge a few years ago on a smaller scale, but I'd forgotten.

This "some duplicates required" issue is probably most prevalent in the programming world, but it can also show up in graphic design, 3D modeling, and other fields where final products are pieced together from smaller components.

Some tools exist to eliminate such duplicate requirements by having all projects internally refer to a single, central repository that holds each unique library in exactly one location. However, doing this well usually requires tight integration across your whole development process, with no room for deviation or for mixing in other tools or methods. In other words, it's great if you can pull it off, but it's often impractical.
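One low-tech version of the central-repository idea, on filesystems that support it, is to keep a single canonical copy of each unique file and replace the duplicates with hard links pointing back at it. Every project still finds a file at the path it expects, but the bytes exist only once on disk. Here's a hedged sketch (a toy illustration, not a recommendation for your real archive -- hard links require everything to live on one filesystem, and some backup and sync tools handle them poorly):

```python
import hashlib
import os
from pathlib import Path

def link_to_store(root, store):
    """Move each unique file under `root` into a central `store`
    keyed by content hash, then hard-link it back to every path
    where a copy used to be. Projects keep their expected file
    paths; the content is stored exactly once.
    """
    store = Path(store)
    store.mkdir(parents=True, exist_ok=True)
    # Materialize the listing first, since we modify files as we go.
    for path in list(Path(root).rglob("*")):
        if not path.is_file() or path.is_symlink():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        canonical = store / digest
        if not canonical.exists():
            path.replace(canonical)  # first copy becomes the canonical one
        else:
            path.unlink()            # later copies are redundant bytes
        os.link(canonical, path)     # hard link back to the original location
```

This is essentially what "replace duplicates with links" features in some dedup tools do; it sidesteps the broken-project problem, at the cost of making the archive harder to move between disks.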

So what's the solution?

Instead of a single-click deletion process, I need to dig a little deeper before I let automated software do its work. I need to "flatten out" the messy archive structure into at least a few pre-sorted subcategories so that I can configure the de-duplication process with some nuance: for example, separating project folders that must keep their internal duplicates from loose files that are safe to collapse.
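That pre-sorting step can be sketched in code. The idea below is my own hypothetical take, not the author's actual workflow or any tool's feature: treat any directory containing a recognizable project marker (the marker list here is an assumption -- extend it for your own data) as an off-limits subtree, and only feed the remaining loose files to the de-duplicator.

```python
from pathlib import Path

# Marker files suggesting a directory is a self-contained project
# whose internal duplicates must be preserved. This list is an
# assumption for illustration; tailor it to what's in your archive.
PROJECT_MARKERS = {".git", "package.json", "requirements.txt", "Makefile"}

def partition(root):
    """Split `root` into project subtrees (leave alone) and loose
    files (safe to de-duplicate). Returns (project_dirs, loose_files).
    """
    project_dirs, loose_files = [], []

    def walk(directory):
        entries = list(directory.iterdir())
        if any(e.name in PROJECT_MARKERS for e in entries):
            project_dirs.append(directory)  # whole subtree is off-limits
            return
        for entry in entries:
            if entry.is_dir():
                walk(entry)
            elif entry.is_file():
                loose_files.append(entry)

    walk(Path(root))
    return project_dirs, loose_files
```

With a split like this, the aggressive single-click deletion becomes safe to run on the loose pile, while the project folders wait for a more careful pass.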

Working through 4.1 million files in this way will take quite some time, but that's okay. It doesn't have to be done right away; I just want to keep making progress. I'll also end up with a much better idea of exactly what data I have in there, and I can delete more unimportant stuff along the way that I might have missed in my earlier high-level pass.

If you're sitting on a digital mess pile like I am, are you likely to encounter this "some duplicates required" problem? If you aren't sure, it's always better to play it safe and dig around a bit first unless you genuinely don't care. (Interestingly, the older the data gets without your needing to touch it, the more likely you are not to care about perfect archive integrity. Just procrastinating a decade or two can make the eventual effort practically zero! Ha!)

That's it for this week. I'll let you know how well I managed to get back on track next time.

Happy data-taming!