Database Normalization vs Data Cleansing
January 9, 2025•264 words
(Part III of the study notes for the Data Warehousing Concepts course on DataCamp (link))
Database Normalization is the process of structuring a relational database according to a series of so-called "Normal Forms". The goal is to reduce data redundancy and improve data integrity. The set of normal forms is listed on Wikipedia with examples, which I don't think are necessary to repeat here.
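To make the redundancy point concrete, here is a minimal sketch using Python's built-in `sqlite3` (the table and column names are made up for illustration): a flat table that repeats customer details on every order row is split into a `customers` table and an `orders` table that references it by key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: customer details repeat on every order row, so a
# change of address would have to be applied in many places.
cur.execute("""
CREATE TABLE orders_flat (
    order_id INTEGER PRIMARY KEY,
    customer TEXT,
    city     TEXT,
    item     TEXT
)""")
cur.executemany(
    "INSERT INTO orders_flat VALUES (?, ?, ?, ?)",
    [(1, "Alice", "Oslo",   "pen"),
     (2, "Alice", "Oslo",   "ink"),
     (3, "Bob",   "Bergen", "paper")],
)

# Normalized: each customer is stored exactly once; orders point to it.
cur.execute("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    city TEXT
)""")
cur.execute("""
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    item        TEXT
)""")
cur.execute("""
INSERT INTO customers (name, city)
SELECT DISTINCT customer, city FROM orders_flat
""")
cur.execute("""
INSERT INTO orders (order_id, customer_id, item)
SELECT f.order_id, c.customer_id, f.item
FROM orders_flat f JOIN customers c ON c.name = f.customer
""")

# Customer details now live in exactly one row each.
n_customers = cur.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
n_orders = cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(n_customers, n_orders)  # 2 customers, 3 orders
```

The same idea scales up: each normal form adds further constraints on what may repeat and where, but the redundancy-removal mechanism is always this kind of table decomposition.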
A very closely related concept is Data Cleansing, which involves identifying and correcting or removing corrupt, inaccurate, or irrelevant records from a dataset, table, or database.
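A minimal cleansing sketch in plain Python (the records and rules are made up): standardize a text field, then drop corrupt, empty, and duplicate records.

```python
# Toy raw records; in practice these would come from a table or file.
raw = [
    {"name": " Alice ", "age": "34"},
    {"name": "alice",   "age": "34"},   # duplicate once names are standardized
    {"name": "Bob",     "age": "n/a"},  # corrupt value
    {"name": "",        "age": "29"},   # empty / irrelevant record
]

def clean(records):
    seen, out = set(), []
    for r in records:
        name = r["name"].strip().title()  # standardize the text field
        if not name:
            continue                      # remove irrelevant records
        try:
            age = int(r["age"])
        except ValueError:
            continue                      # remove corrupt records
        if name in seen:
            continue                      # remove duplicates
        seen.add(name)
        out.append({"name": name, "age": age})
    return out

print(clean(raw))  # [{'name': 'Alice', 'age': 34}]
```

Note that, unlike normalization, nothing here is dictated by a standard form; each rule is a judgment call about what counts as corrupt or irrelevant for the task at hand.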
My understanding of their differences is:
- Database normalization operates on an entire database schema, while data cleansing can be applied at any level, down to a single table or even a subset of rows.
- Database normalization is more like a data engineering task, with a set of standardized normal forms to adhere to. We might not yet have a specific analytics / modelling task at hand, but database normalization can still be done. The ultimate goal is just to get the raw data into some widely-accepted standard form for future use.
- Data cleansing is more for data analytics and modelling. We typically have some questions to ask about the data, or there is a model to be fitted which requires a specific set of data transformations. Depending on the ultimate goal, the steps of data cleansing can differ even for the same dataset (e.g. standardization of a column may or may not be required).
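As a small illustration of the last point, here is a z-score standardization step (stdlib only; the numbers are made up): it would typically be applied before fitting a distance-based model, but could be skipped entirely for a tree-based model on the same column.

```python
import statistics

values = [10.0, 12.0, 11.0, 50.0]

def standardize(xs):
    # z-score: subtract the mean, divide by the (population) std dev.
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

z = standardize(values)
# After standardization the column has mean ~0 and std dev ~1.
print(statistics.mean(z), statistics.pstdev(z))
```

Whether this step belongs in the pipeline is decided by the downstream model, not by any normal form, which is exactly the difference being drawn above.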
A few more resources on database normalization (from commercial companies): link 1, link 2 and link 3.