Data normalization

Data normalization is one of those words that I've been intimidated of for a while. My initial reaction is that it's about making the data "normal", i.e. standardized, so you can't have some rows where the date is a timestamp (e.g. 14141231) and others where it's a string (e.g. "January 23, 2015"). I think that initial intuition was along the right tracks but data normalization seems to be more focused on making sure no particular piece of data is stored in more than one place. Essentially, if I can boil it down, data normalization is about having a "single source of truth" for any given piece of information (e.g. Bill Clinton's date of birth).

There are three forms of data normalization that each build on each other, with the second being more strict than the first, and so on. The examples in the wikipedia page were actually very easy to understand and I highly recommend skimming through the pages and reading through the examples:

https://en.wikipedia.org/wiki/Database_normalization

https://en.wikipedia.org/wiki/First_normal_form

https://en.wikipedia.org/wiki/Second_normal_form

https://en.wikipedia.org/wiki/Third_normal_form

I initially got interested in what "normalization" meant, when Dan Abramov mentioned his library normalizr, which normalizes nested JSON data.

As a business analyst in my last job, I think this notion of de-duplicating data is second nature and storing the same piece of information in multiple places is the bane of any data analyst managing a complex Excel workbook. For example, sometimes we had to build an Excel model really quickly and take some shortcuts. Later when our boss would ask us "what would be the impact if factors A and B were adjusted by 5%?", it wouldn't be as simple as changing a single cell in one place. The difficulty would be in remembering all the places where you would need to manually update the data. Of course, as you get better at Excel modeling, you would utilize cell references as much as possible, and try to consolidate all the various inputs ("levers" in consulting-speak) in one area, ideally the first worksheet of an Excel file.

views