Someone asked me the other day, after reading a bit of my book,
“What is dirty data?” I was taken by surprise, because I thought
that anyone under 30 knew what that was, you know, “digital natives” and all.
Guess I should throw a definition around:
“Dirty data is inaccurate, incomplete or erroneous data, especially in a
computer system or database. In reference to databases, this is data that
contain errors. Sometimes called noise, as in signal noise, and is cleaned
by a data janitor.” –wiki
Dirty data is part and parcel of Big Data and the
Information Age. It’s inevitable and it’s everywhere. My autocorrect, for
example, has had some misspelled words accidentally added to it, and that
messes up my texts unless, of course, I clean out my personal dictionary. My
phonebook has two different people named Nicole, obviously with two different
numbers, and unless I go and disambiguate them, there is no way for me to know
which is which.
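For the curious, here is a minimal sketch in Python of what spotting that
phonebook problem might look like. The contact list, the phone numbers, and the
function name find_ambiguous_names are all made up for illustration; this is
one simple way to flag the problem, not how any real phone does it.

```python
from collections import defaultdict


def find_ambiguous_names(contacts):
    """Return every name that maps to more than one distinct number."""
    numbers_by_name = defaultdict(set)
    for name, number in contacts:
        # Normalize case and whitespace so "nicole " and "Nicole" collide.
        numbers_by_name[name.strip().lower()].add(number)
    # Keep only the names that need a human to disambiguate them.
    return {
        name: sorted(numbers)
        for name, numbers in numbers_by_name.items()
        if len(numbers) > 1
    }


# Hypothetical sample entries, for illustration only.
contacts = [
    ("Nicole", "555-0101"),
    ("Nicole", "555-0199"),
    ("Sam", "555-0123"),
]

ambiguous = find_ambiguous_names(contacts)
print(ambiguous)
```

The code can tell me that "nicole" is ambiguous, but it cannot tell me which
Nicole is which. That part of the cleaning still takes a person.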
As a database, the “language of smell” is a stellar
example of what it means to be dirty. Ask somebody what “musky” smells like, or
“musty.” These are two very different smells, but because their names sound
similar, people often substitute one for the other. Or give someone the smell
of an orange and then of a lemon, without telling them in advance which is
which, and ask them to name each one. Either could just as well be called
“citrus.” As a database, the corpus of words we use to describe smells is a
powerfully rich example of dirty data in action.
Snippets from Hidden Scents:
If knowledge is supposed to tell us which is which, and
what is what, then how do we use it to study a thing that is inherently
ambiguous? Smell is such a thing. In it, we have an example of an
information-processing system that makes it its sole purpose to ascertain
ambiguous information. Moreover, the entire process, from primitive sensation
to cognitive verbalization, is fuzzy, noisy, and dirty.