It's Tidy Tuesday again.
Although I didn't get anywhere on my data organization project last time, this week was considerably better:
- I deleted 251,470 files comprising ~169 GB
- I sorted 302,441 files comprising ~405 GB
- Overall progress: 29% (from 19% last week!)
This was a productive week. I got to focus on the task for a good while on just one day, and I made a huge dent. Hopefully, I can do the same again next time.
AI and Data Organization
========================
I haven't brought up AI in Tidy Bytes before, and that might seem a little odd since it's everywhere these days. There are two main reasons for this:
- I don't have a tool or method that I personally trust and use yet
- I suspect that any tools that do exist aren't broadly applicable to most readers
Now, there may be compelling options that I simply don't know about. If anyone has a recommendation, even just a name of an app or platform you think you might have heard of, let me know--I'll gladly dive in and look.
What I'd like to do today is briefly walk through some of the relevant points involved with using AI for data organization. You don't have to know everything about AI (I certainly don't) in order to appreciate the discussion; whether it's something you have considered, might consider, or would never in a million years consider yourself, I hope you'll still find it interesting.
Also, for simplicity, I'm playing fast-and-loose with terminology, generally lumping many distinct technologies under the single vague "AI" classification. This is not technically accurate, but being technically accurate at this point won't help most of you better consider the points I intend to bring up. If you'd like a cheat sheet for actual definitions of terms and concepts involved in the AI world these days, check out this page on the FROMDEV blog (or many others like it).
However, I will take a moment to define two terms that you've probably all encountered:
- Large Language Model (LLM) - A large language model is a type of computer program that has been trained to understand and generate human language. It learns from huge amounts of text, so it can answer questions, write stories, and hold conversations in a way that sounds natural.
- Generative Pretrained Transformer (GPT) - A specific kind of large language model that can create new text ("generative") based on a large set of input data fine-tuned for specific contexts ("pretrained") using a particular type of technology to understand relationships between words ("transformer").
⚠️ Spoiler alert: these definitions came directly from ChatGPT (see, now you know what the "GPT" in ChatGPT stands for). You can see my conversation, including a friendly and fun explanation for kids and even The Story of Robo Read-A-Lot, right here. Let me know if one of you turns it into a real printed book. ChatGPT never ceases to amaze me with how much it can do with so little effort.
Fundamentally, "AI" these days generally means a computer program that has consumed a ton of well-classified input data and can use that "knowledge" to create useful output based on new unclassified input data. For example, ChatGPT has read so much written material of every conceivable variety that when you ask it a question about a historical figure, or scientific process, or mathematical concept, or almost anything else, it can respond in a believable and (generally) accurate manner.
Training a GPT is not terribly unlike training a person. You provide a bunch of input, explain what it is and what it means, and then expect reasonable answers based on that knowledge. If you ask within the knowledge domain you've trained, you can expect good output. But if you ask about something you haven't trained, you can expect either no output or bad output.
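The in-domain/out-of-domain point can be shown with a toy sketch. This is not how a real LLM works internally; it's a deliberately tiny nearest-neighbor "model" with made-up training data, just to illustrate why a model answers well inside its training and poorly outside it:

```python
# Toy illustration (NOT a real LLM): a nearest-neighbor "model" trained on a
# tiny, invented labeled corpus. In-domain questions overlap the training
# data and get a confident, correct answer; out-of-domain questions overlap
# nothing, so the model has no basis for a good answer.

TRAINING = {
    "invoice tax receipt payment": "finance",
    "vacation beach photo album": "photos",
    "homework essay thesis class": "school",
}

def classify(text: str) -> tuple[str, float]:
    """Return (best_label, overlap_score) using word overlap as 'knowledge'."""
    words = set(text.lower().split())
    best_label, best_score = "unknown", 0.0
    for sample, label in TRAINING.items():
        sample_words = set(sample.split())
        # Jaccard similarity: shared words / total distinct words
        score = len(words & sample_words) / len(words | sample_words)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

print(classify("old tax payment receipt"))        # in-domain: ('finance', 0.6)
print(classify("quantum chromodynamics notes"))   # out-of-domain: ('unknown', 0.0)
```

Real models are vastly more sophisticated, but the failure mode is the same in spirit: no training coverage, no reliable output.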
And therein lies the problem...
Is AI a Magic Bullet?
=====================
Whether AI is a good candidate for data organization depends very much on what you want to do with it. Ignoring privacy concerns (which are significant), the challenge is finding a model that is good enough to move you towards your goal. Everyone has unique ideas about what a "good" outcome is--if they can articulate it in the first place. However, if it can do even a portion of the work quickly and painlessly--especially if that part is the most repetitive and tedious--it might be worth using such a tool.
Pretrained organizational models already exist in the email world with tools like Superhuman, Canary Mail, and SaneBox. However, email is a relatively narrow space compared to all personal data management. And even within that space, the three tools I mentioned may not add much value, depending on how you actually use email. If you run a business primarily out of your inbox, Superhuman might be great. But for individual home users, the impressive AI-powered feature set would likely be an expensive waste.
Photo management is another specific area where pretrained models can work wonders if used correctly. Tools like Excire Foto have come a long way toward streamlining the photo culling process with powerful AI-assisted content, quality, and aesthetic identification. (I'll dig into Excire once my massive data archive organization project is done.)
But outside of some narrowly definable segments, achieving good data organization results is still pretty elusive. The "holy grail" for me, right now, would be if I could apply an AI-powered tool as follows:
- "Here's a hard drive full of 30 years' worth of personal, school, and business data of all kinds: documents, music files, letters, photos, videos, programming projects, artwork, correspondence, instant message transcripts, journals, random notes, old operating system files, and more. Categorize everything, identify distinct groups of files, suggest relevant tags for each category and group, and identify which files are likely pointless and should be deleted."
An objective pretrained model that could summarize text content decently well and infer reasonable meanings based on directory structure and filenames could likely do a pretty good job at this, though not without mistakes. If I could then fine-tune the results by correcting false positives and negatives or incorrect tags, I could probably end up with something that saved me dozens if not hundreds of hours of poring over files by hand.
While my data collection is personal and unique, most of the high-level organization I'm doing right now doesn't require any special knowledge about it. Any decently trained model would be able to automatically do at least 90% of what I'm doing by hand. The last 10% of the work, the really fine-detail classification and organization, will inevitably need manual effort.
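To make that "first 90%" concrete, here's a minimal sketch of the kind of coarse triage pass I mean, using nothing smarter than extensions and filename hints. The categories, extension lists, and junk patterns below are illustrative assumptions of mine, not a tested taxonomy, and a real tool would add content analysis on top:

```python
# Hedged sketch of a coarse first-pass triage: bucket files by extension and
# flag likely-junk files for human review (never auto-delete). All category
# and junk lists here are illustrative assumptions.
from pathlib import PurePath

CATEGORIES = {
    "documents": {".doc", ".docx", ".pdf", ".txt", ".rtf"},
    "photos": {".jpg", ".jpeg", ".png", ".heic", ".tif"},
    "music": {".mp3", ".flac", ".wav", ".m4a"},
    "video": {".mp4", ".mov", ".avi", ".mkv"},
    "code": {".py", ".c", ".js", ".html", ".css"},
}
LIKELY_JUNK_NAMES = {"thumbs.db", ".ds_store", "desktop.ini"}
LIKELY_JUNK_SUFFIXES = {".tmp", ".bak", ".cache"}

def triage(path_str: str) -> tuple[str, bool]:
    """Return (category, likely_junk) for one file path."""
    path = PurePath(path_str)
    suffix = path.suffix.lower()
    junk = suffix in LIKELY_JUNK_SUFFIXES or path.name.lower() in LIKELY_JUNK_NAMES
    for category, extensions in CATEGORIES.items():
        if suffix in extensions:
            return category, junk
    return "uncategorized", junk

print(triage("Archive/1998/School/essay-final.doc"))  # ('documents', False)
print(triage("Old Backups/Thumbs.db"))                # ('uncategorized', True)
```

Even something this crude, run over a few hundred thousand files, would knock out a surprising share of the repetitive sorting; the judgment calls it can't make are exactly the last 10% I mentioned.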
So far, I don't know of a tool that will do this. The closest thing I'm aware of is a platform called Dokkio, which is promising but not a perfect fit for an existing data archive that needs to be culled and organized. However, I can't comment too much on Dokkio yet because I haven't thrown my whole dataset at it to see how it performs. There's a cost involved, but I'd honestly be happy to spend $120 for a year of the Pro version if it would allow me to complete in a month what would otherwise take a year.
But there's another issue that they probably won't ever overcome: TRUST. ☠️
What About Privacy?
===================
With Dokkio, it isn't that I believe they're nefarious people who will steal my data. I don't.
But I also don't like the idea of providing 30 years of the digital history of half a dozen people to be used as training data for an AI model under someone else's control. Would it achieve the technical goal I want? Possibly, yes. But I'm not willing to give up that much deeply personal information to achieve that goal.
This is a subjective choice, and not one that I think everyone must necessarily make. Each of us prioritizes privacy in our own way. For example, I happily use 3rd-party AI tools for research, occasional image generation, and other benign activities where I don't honestly care if anyone sees (or even profiles) what I'm working on. But I'm not ready to hand over hundreds of gigabytes of childhood and family data to anyone or anything that isn't 100% under my control. This may be seen as a paranoid choice, but the world of AI is moving too fast for me to be comfortable with any other decision.
I've talked about self-hosting before, and the same concept applies to AI. It's technically possible to host your own AI-powered tools, provided you know how to set them up. However, the hardware requirements for good, fast AI performance are steep. Because of the computation power involved, it's easy to spend well over $2k for a generally powerful personal AI server that you can adapt to a variety of different tasks. (The key is having a powerful GPU with plenty of video memory, which is the special combination that lets you load and run AI models efficiently.)
This cost will inevitably come down as technology improves and AI implementations become more efficient, as has happened throughout the history of computers. But right now, good local AI is not accessible for most people because of both the cost and the technical knowledge required to set it up.
There are some platforms built with this concern in mind--encrypted and secured tools that let you rent online access to powerful AI-capable hardware without giving anyone else visibility into your training data or the outputs from your models. However, these are tools for developers, not the general public; they're available to anyone, but only practical for a few.
Until local-only AI advances, hardware costs drop, and a wider variety of tools emerges into the consumer space, the more security-conscious among us will have to do without the magic-like functionality that public AI platforms already provide.
In the meantime, I'm paying attention as well as I can to new AI-related developments, especially those that might be useful for data organization. Part of me would love to dive into the development side of things and try to create something exactly like what I want, but right now, that's a pipe dream.
Are there any AI tools that you use for organization or other related activities? Are there any activities you wish you could hand off to such a tool--the boring, tedious, repetitive tasks that LLMs tend to be good at, or perhaps the tasks that are so big you don't even know where to start? I'd love to know. It might help me hone my focus as I keep an eye on the organizational AI landscape.
I'm also more than happy to be corrected on any point by someone who knows more than I do. I'm interested in AI, but I haven't spent enough time on it to consider myself truly knowledgeable even as a layman, let alone a developer. If anything I've said above is wrong or might benefit from clarification, please reach out. I promise you I will appreciate it.
Until next week, happy data-taming!