No Data is Data
A void tells you important things
I find myself in an interesting position. I am a staunch believer in the religion of AI. I have been studying the topic, in one way or another, for more than 50 years. The creation of actual intelligence would be the crowning achievement of our species.
This current round of technology won’t get us there.
At best, we are going to get a deep glimpse into how our world actually works. Language about something is descriptive. Living that same something is an entirely different story. No matter how hard you try, you cannot explain everything using representative tools. The actual world is not representative.
So, I am both a fanboy and a critic.
I’m sure that the distinction I am drawing is of little use to people for whom measurement is everything. I don’t think all people are that shallow.
Please don’t get me wrong. Descriptive tools like Math and Language are at the core of our remarkable accomplishments in science and technology. Our ability to communicate and understand depends on them. But, they depend on generalizations (or models).
I am increasingly aware of the limitations of my mental models.
There was (and is, to a limited extent) a movement called ‘General Semantics.’ It originated the ideas that “the map is not the territory” and “the word is not the thing.” When we analyze and categorize things, we create a layer of abstraction that loses important detail in the process.
No two things are identical and they change over time. So, when you count or describe them, you invariably lose important information.
Today, the primary focus of AI technical development is an ever deeper manipulation of language, aimed not at its ability to describe but at the intricate connections between words. Large Language Models consume words without digesting them. The result is (and always will be) something that sounds right, said by someone who doesn’t understand what they are talking about.
It’s not unlike the critiques and opinions that people with no parenting experience try to give parents. Current forms of AI can do the magic trick of saying things without the slightest clue about their meaning.
There’s a kind of tinny sound to it.
In the high-tech world, you are always learning about the tech. Very few people ever completely understand the entirety of the tech they are working on. There are generally social norms that allow an engineer with more complete knowledge to correct someone with a less comprehensive view. In that world, it is important that the most accurate ideas be applied.
That’s the heart and soul of engineering.
=======================================================
If you ask around, you are sure to hear that data quality is the key to AI success. Every time I ask someone what they mean by data quality, I get answers that involve completeness, timeliness, accuracy, and cleanliness.
Initially, that seems right. The highest quality data should be complete and clean. Except, the only ‘complete’ data sets contain machine-generated data to fill in the blanks. And cleaning data means tossing away the outliers and standardizing the answers.
Both approaches assume that great data is comprehensive and standardized. I think of this as the fast-food view of data. While comprehensiveness and regimentation improve the ability to make predictions, the value of difference is left out of the equation.
Take a standard HR Information System (HRIS). These systems are usually something like 75% incomplete. The holes in the data are consistently blamed on the people who are supposed to fill in those blanks (employees). The major HRIS providers find it difficult to deliver planning. In the absence of ‘complete’ data, the reasoning goes, systems are going to be inaccurate.
But what if they aren’t incomplete at all?
In art, empty space is called negative space. It is every bit as important as the space that is full. Negative space is what gives the other space its outline and definition. What if the empty space in databases is at least as important as the cells that are filled in?
In other words, the absence of data is data.
A rigorous analysis of the degrees and kinds of empty cells will reveal a great deal about what the emptiness means. Department, age group, gender, profession, location, tenure, and performance rating are all variables through which you can examine the empty space.
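To make that concrete, here is a minimal sketch in Python of what a first pass at examining the empty space might look like. Everything in it is an assumption made for illustration: the records, the field names (`department`, `tenure_band`, `skills`, `emergency_contact`), and the helper `missing_rate_by` are all invented, not taken from any real HRIS. The idea is simply to treat "this cell is empty" as a value in its own right and to see how it varies from group to group.

```python
from collections import defaultdict

# Hypothetical HR records. None (or a blank string) marks an empty cell.
# All names and values here are invented for illustration.
records = [
    {"department": "Sales", "tenure_band": "0-2", "skills": None, "emergency_contact": "on file"},
    {"department": "Sales", "tenure_band": "3-5", "skills": "", "emergency_contact": None},
    {"department": "Engineering", "tenure_band": "0-2", "skills": "python", "emergency_contact": "on file"},
    {"department": "Engineering", "tenure_band": "3-5", "skills": None, "emergency_contact": "on file"},
]


def is_empty(value):
    """Treat None and blank strings as empty cells."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_rate_by(rows, group_key, field):
    """Return the share of empty cells in `field` for each value of `group_key`."""
    total = defaultdict(int)
    empty = defaultdict(int)
    for row in rows:
        group = row.get(group_key)
        total[group] += 1
        if is_empty(row.get(field)):
            empty[group] += 1
    return {group: empty[group] / total[group] for group in total}


# Look at the same empty field through two different variables.
for key in ("department", "tenure_band"):
    print(key, missing_rate_by(records, key, "skills"))
```

If one department leaves a field blank far more often than another, that difference is itself a finding. It might point to a form that doesn't fit the work, a question people don't trust, or a process nobody owns. A sketch like this only surfaces the pattern; working out what the pattern means is the part that takes judgment.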
When I spoke with the greatest data person in our industry, Stacey Harris, she noticed that this question about data quality looked like an east-west idea. Completion and regimentation are what quality looks like if precision is the key to high quality. The other view is that imperfection is the heart of real quality.
The key to AI that can deliver on its promise is the ability to make sense out of the data that isn’t there. Our current tools can’t begin to do this.
=========================================
Photo by Adlan



