Speak of the devil

Hello, data-tamers!

Last week, I broached the subject of involving AI (generally speaking) into the data organization process. I've been mulling over that all week, and I even talked to a couple of people about it. Before I get into some additional thoughts on that topic, here's my weekly 2025 Consistency Challenge progress report:

I deleted 125,623 files comprising ~65 GB
I sorted 3,692 files comprising ~58 GB
Overall progress: 30% (from 29% last week)

About 120 GB of data processed...not too bad. Not as much as last week, but respectable. I deleted way more than I kept in terms of individual files, but the ones I kept were much larger on average, hence the lopsided numbers.

Defining a System for Data Organization

I am predictably intrigued by the prospect of using technology to make managing my data more efficient and less time-consuming. After last week's post where I discussed the challenges and concerns with using AI to streamline the process I'm currently going through for my consistency challenge (organizing 30 years' worth of family data), I reached out to a former colleague who has spent the last few years deep in the AI/LLM world. I figured he would be a good one to ask for a pointer or two. Among other things, he recommended that I play with Unsloth to help with fine-tuning a model with my own training data. (This is much more technical than probably most of you will care to see, but I mention it in case anyone is curious enough to click.)

After our brief conversation, my most helpful realization was that my ideas and goals were still too vague. Clearly articulating the process and outcome is a critical requirement before I have any hope of automating or streamlining it with AI or any other tools.

So, I spent some time considering exactly how I would describe the steps involved to another person. What exactly am I doing, and how am I doing it? I started from the Q2 2025 Consistency Challenge details I posted in this very newsletter about seven weeks ago:

📋 PROJECT: I want to clean and organize my old archives. My data is already centralized on a single triage hard drive.
🎯 GOAL: By the end of this quarter, I will have a file count of less than half of what I started with (2.3 million instead of 4.6 million).
📅 PROCESS: Every week, I will work through at least 5 high-level folders from one drive and either move the data into a category folder or, if possible, delete it. Once all files are in category folders, I will run automatic de-duplication on as much as I can.
📈 TRACKING: I'll note in a simple table how many files I organized or eliminated each time I work on my task.

This is good, but it doesn't really have detailed steps. Also, it covers (conceptually) more than what I'd want to hand over to an AI just yet. To keep things simple, I decided to focus on only the general categorization part--answering one single question: "What kind of data is this?"

Of course, you can answer this one question in a variety of ways for the same file or set of files:

File type? (Image, video, document, source code, note, etc.)
File size? (~1kB, ~100kB, ~1MB, ~10MB, ~100MB, etc.)
File age? (New, >1 week, >1 month, >1 year, >10 years, etc.)
File location? (In an "Archive" or "Backup" folder? In a known operating system folder?)

🤖 AI/LLMs REQUIRED BYOND THIS POINT 🤖

File contents? (Personal letter, email to colleague, picture of friends, etc.)
File context? (Research data for a work project, logo used in business website, source code library used in project, etc.)

The first four points are easy to answer without any fancy AI/LLM involvement. Simple programming will get the job done with pretty good accuracy because of how files are traditionally organized on a computer. And, indeed, apps exist already which do many of these things; WizTree does a good job with the first three, stopping short of analyzing the location for quickly separating OS/program files from others.

AI comes into play for the last two points: analyzing data more deeply and then relating different pieces together in helpful ways. The Dokkio app I mentioned last week is working toward this kind of goal. But, as I said, it's not something I'm likely to trust with all of my data even in the best case. And it isn't purpose-built to solve the problem I'm attacking in my consistency challenge project.

Instead, I envision something like a typical "File Explorer" interface for Windows (or Finder on a Mac), where tagging, categorization, and best-guess suggestions for whether to delete or where to move/rename content is updated in real time as you work explore your files. Every time you make a correction (fix tags, reassign categories, etc.), it improves the training data and updates any automatic tags, categories, or move/delete recommendations. The more you work through your data set, the more accurate all future recommendations become.

I believe this is entirely achievable with today's technology, but I don't (yet) know how to put the pieces together to make it happen. Until then, I'll keep brainstorming as I do the process by hand through my own data set.

Unless...

A NAS Powered by AI

Yesterday, as I was quickly deleting the latest batch of promotional emails that landed in my inbox, I happened to note a featured Kickstarter project: the Zettlab AI NAS, and AI-powered Network Attached Storage unit (a.k.a. big on-premise backup device).

I'm not saying that you should go back this device on Kickstarter; crowdfunded projects are always a bit of a gamble, and I don't know enough about Zettlab or their offering here to recommend it. But what caught my attention was their claim to use AI-powered categorizing and tagging--the same idea I've been toying with for the last week! One early-access reviewer shows a little of this in action here (link jumps to 19:53 in a video), and I can say at least that it seems to be on the right track.

However, unless I had $1k+ to throw at an expensive experiment, I wouldn't jump on this yet. One reason is that I already have a functional Synology NAS, but the bigger reason is that whatever they're doing with AI-powered tagging and categorization is 100% not dependent on their specific hardware. If I had the same (or similar) software on my desktop PC, I could do basically the same thing. Granted, I wouldn't have all of the fancy transcription and video editing tools and other handy features they built into their "ZettOS" platform. But for now, I don't need all of that anyway. I'm just happy to see file management and organization tools moving in the direction I want to see.

What would be the "holy grail" of data organization for you? Every time you receive an email, bookmark a web page, download a file, create a new document, type a new note, or transfer a photo, that it would be timestamped, analyzed, tagged, linked to other related data, named intelligently, and moved to a logical place on your computer (or cloud storage)?

Such a system feels almost frighteningly powerful, but I predict we're actually not far from achieving it even without needing to hand over your data to other companies (Google, Apple, etc.) to do the AI-powered analysis. I'll keep my eyes open for more interesting developments.

Until next week, happy data-taming!

Speak of the devil

Defining a System for Data Organization

A NAS Powered by AI

Is there room for AI?

The impact of your physical environment

Speak of the devil

Defining a System for Data Organization

A NAS Powered by AI

Read next

Is there room for AI?

The impact of your physical environment