Friday, September 2, 2016

Dirty Data


Someone asked me the other day, after reading a bit from my book, “What is dirty data?” I was taken by surprise, because I thought that anyone under 30 knew what that was, you know, “digital natives” and all. Guess I should throw some definitions around:

“Dirty data is inaccurate, incomplete or erroneous data, especially in a computer system or database. In reference to databases, this is data that contain errors. Sometimes called noise, as in signal noise, and is cleaned by a data janitor.” –wiki

Dirty data is part and parcel of Big Data and the Information Age. It’s inevitable and it’s everywhere. My autocorrect, for example, has had some misspelled words accidentally added to it, and that messes up my texts unless, of course, I clean my personal dictionary. My phonebook has two different people named Nicole, obviously with two different numbers, and unless I go and disambiguate them, there is no way for me to know which is which.
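
For the programmers in the room, here is a rough sketch of what that phonebook problem looks like in Python. Everything in it is made up for illustration: the contact list, the numbers, and the helper name `find_ambiguous` are all assumptions, not anything from a real phonebook app. It just flags names that point to more than one number, which is the first step before anyone can disambiguate.

```python
from collections import defaultdict

# Hypothetical contact list; names and numbers are invented for illustration.
contacts = [
    ("Nicole", "555-0101"),
    ("Nicole", "555-0102"),
    ("Sam", "555-0103"),
]

def find_ambiguous(entries):
    """Return the names that map to more than one distinct number."""
    numbers_by_name = defaultdict(set)
    for name, number in entries:
        # Normalize case and whitespace so "nicole " and "Nicole" count as one name.
        numbers_by_name[name.strip().lower()].add(number)
    # Keep only the names with two or more different numbers.
    return {
        name: sorted(numbers)
        for name, numbers in numbers_by_name.items()
        if len(numbers) > 1
    }

ambiguous = find_ambiguous(contacts)
print(ambiguous)
```

Note that the code can only tell me that “Nicole” is ambiguous. It cannot tell me which Nicole is which; that part still takes a human adding a last name or a note.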

As a database, the “language of smell” is a stellar example of what it means to be dirty. Ask somebody what “musky” smells like, or “musty.” These are two very different smells, but because their names sound similar, people often substitute one for the other. Or give someone the smell of an orange and then a lemon, without telling them in advance which is which, and ask them to tell the two apart. To muddy things further, both can simply be called “citrus.” The corpus of words we use to describe smells is a powerfully rich example of dirty data in action.

Snippets from Hidden Scents:

If knowledge is supposed to tell us which is which, and what is what, then how do we use it to study a thing that is inherently ambiguous? Smell is such a thing. In it, we have an example of an information-processing system that makes its sole purpose to ascertain ambiguous information. Moreover, during the entire process from primitive sensation to cognitive verbalization, it is fuzzy, noisy, and dirty.
