Database Normalization vs Data Cleansing

(Part III of the study notes for the Data Warehousing Concepts course on DataCamp (link))

Database Normalization is the process of structuring a relational database according to a series of so-called "Normal Forms". The goal is to reduce data redundancy and improve data integrity. The set of normal forms is listed on Wikipedia with examples, which I don't think need repeating here.
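As a concrete illustration of the redundancy-reduction idea, here is a minimal sketch using an in-memory SQLite database. The table and column names are my own invented example (not from the course): a flat orders table repeats each customer's details on every row, and normalizing it splits the customer facts into their own table, referenced by key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: customer details repeated on every order row.
cur.execute("""CREATE TABLE orders_flat (
    order_id INTEGER, customer_name TEXT, customer_email TEXT, item TEXT)""")
cur.executemany(
    "INSERT INTO orders_flat VALUES (?, ?, ?, ?)",
    [(1, "Ann", "ann@example.com", "book"),
     (2, "Ann", "ann@example.com", "pen"),
     (3, "Bob", "bob@example.com", "book")])

# Normalized: each customer stored exactly once, orders reference it by key.
cur.execute("""CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY, name TEXT, email TEXT UNIQUE)""")
cur.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers, item TEXT)""")

cur.execute("""INSERT INTO customers (name, email)
               SELECT DISTINCT customer_name, customer_email FROM orders_flat""")
cur.execute("""INSERT INTO orders (order_id, customer_id, item)
               SELECT f.order_id, c.customer_id, f.item
               FROM orders_flat f
               JOIN customers c ON c.email = f.customer_email""")

# "Ann" now appears once; correcting her email touches a single row.
n_customers = cur.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
n_orders = cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(n_customers, n_orders)
```

The point of the split is exactly the stated goal: less redundancy (each customer fact lives in one place) and better integrity (an email update cannot leave two rows disagreeing).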

A very closely related concept is Data Cleansing, which involves identifying and correcting (or removing) corrupt, inaccurate, or irrelevant records from a dataset, table, or database.
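A small sketch of what "identifying and correcting/removing" can look like in practice. The records, field names, and validity rules below are illustrative assumptions, not from any real dataset:

```python
# Toy records with typical defects: inconsistent formatting, a duplicate,
# an impossible value, and an incomplete row.
raw = [
    {"name": "Ann ", "age": "29"},
    {"name": "ann ", "age": "29"},   # duplicate once name is normalized
    {"name": "Bob",  "age": "-3"},   # corrupt: impossible age
    {"name": "",     "age": "41"},   # incomplete record
]

def clean(records):
    seen, out = set(), []
    for r in records:
        name = r["name"].strip().title()   # correct: normalize formatting
        try:
            age = int(r["age"])
        except ValueError:
            continue                       # remove: unparseable value
        if not name or not (0 <= age <= 120):
            continue                       # remove: incomplete or corrupt
        key = (name, age)
        if key not in seen:                # remove: duplicates
            seen.add(key)
            out.append({"name": name, "age": age})
    return out

print(clean(raw))  # only Ann's record survives
```

Note that the thresholds (e.g. a valid age range) are judgment calls tied to the data at hand, which already hints at the task-dependence discussed below.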

My understanding of their differences is:

  • Database normalization specifically operates on an entire database, while data cleansing can be done at a much lower level: on a single table, or even a subset of rows.
  • Database normalization is more of a data engineering task, with a set of standardized normal forms to adhere to. We might not yet have a specific analytics or modelling task at hand, but database normalization can still be done. The ultimate goal is simply to get the raw data into a widely accepted standard form for future use.
  • Data cleansing is more for data analytics and modelling. We typically have some questions to ask of the data, or a model to be fitted that requires a specific set of transformations. Depending on the ultimate goal, the steps of data cleansing can differ even for the same dataset (e.g. standardization of a column may or may not be required).
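The standardization example from the last point can be made concrete. The numbers below are made up; the idea is that a distance-based model would typically want the column on a standard scale, while, say, a tree-based model could use it as-is, so this step may or may not belong in the cleansing pipeline:

```python
from statistics import mean, pstdev

# An illustrative numeric column on a large, skewed scale.
incomes = [30_000, 45_000, 60_000, 120_000]

# Z-score standardization: subtract the mean, divide by the std deviation.
mu, sigma = mean(incomes), pstdev(incomes)
standardized = [(x - mu) / sigma for x in incomes]

# After standardization the column has mean ~0 and standard deviation ~1.
print(round(mean(standardized), 6), round(pstdev(standardized), 6))
```

Whether this transformation is applied depends entirely on the downstream task, which is the sense in which cleansing, unlike normalization, has no single standard endpoint.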

A few more resources on database normalization (from commercial companies): link 1, link 2 and link 3.

