Database Normalization vs Data Cleansing
January 9, 2025•264 words
(Part III of the study notes for the Data Warehousing Concepts course on DataCamp (link))
Database Normalization is the process of structuring a relational database according to a series of so-called "Normal Forms". The goal is to reduce data redundancy and improve data integrity. The set of normal forms is listed on Wikipedia with examples, which I don't think are necessary to repeat here.
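To make the redundancy point concrete, here is a minimal sketch using Python's built-in `sqlite3` (the table and column names are made up for illustration): a flat table that repeats customer details on every order row is split into a `customers` table and an `orders` table that references it by key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Denormalized: customer details repeat on every order row, so a
# change of address would have to be applied in many places.
cur.execute("""
CREATE TABLE orders_flat (
    order_id INTEGER PRIMARY KEY,
    customer TEXT,
    city     TEXT,
    item     TEXT
)""")
cur.executemany(
    "INSERT INTO orders_flat VALUES (?, ?, ?, ?)",
    [(1, "Alice", "Oslo",   "pen"),
     (2, "Alice", "Oslo",   "ink"),
     (3, "Bob",   "Bergen", "paper")],
)

# Normalized: each customer is stored exactly once; orders point to it.
cur.execute("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    city TEXT
)""")
cur.execute("""
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    item        TEXT
)""")
cur.execute("""
INSERT INTO customers (name, city)
SELECT DISTINCT customer, city FROM orders_flat
""")
cur.execute("""
INSERT INTO orders (order_id, customer_id, item)
SELECT f.order_id, c.customer_id, f.item
FROM orders_flat f JOIN customers c ON c.name = f.customer
""")

# Customer details now live in exactly one row each.
n_customers = cur.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
n_orders = cur.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(n_customers, n_orders)  # 2 customers, 3 orders
```

The same idea scales up: each normal form adds further constraints on what may repeat and where, but the redundancy-removal mechanism is always this kind of table decomposition.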
A very closely related concept is Data Cleansing, which involves identifying and correcting or removing corrupt, inaccurate, or irrelevant records from a dataset, table, or database.
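A minimal cleansing sketch in plain Python (the records and rules are made up): standardize a text field, then drop corrupt, empty, and duplicate records.

```python
# Toy raw records; in practice these would come from a table or file.
raw = [
    {"name": " Alice ", "age": "34"},
    {"name": "alice",   "age": "34"},   # duplicate once names are standardized
    {"name": "Bob",     "age": "n/a"},  # corrupt value
    {"name": "",        "age": "29"},   # empty / irrelevant record
]

def clean(records):
    seen, out = set(), []
    for r in records:
        name = r["name"].strip().title()  # standardize the text field
        if not name:
            continue                      # remove irrelevant records
        try:
            age = int(r["age"])
        except ValueError:
            continue                      # remove corrupt records
        if name in seen:
            continue                      # remove duplicates
        seen.add(name)
        out.append({"name": name, "age": age})
    return out

print(clean(raw))  # [{'name': 'Alice', 'age': 34}]
```

Note that, unlike normalization, nothing here is dictated by a standard form; each rule is a judgment call about what counts as corrupt or irrelevant for the task at hand.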
My understanding of their differences is:
- Database normalization operates on an entire database schema, while data cleansing can be applied at any level, down to a single table or even a subset of rows.
- Database normalization is more like a data engineering task, with a set of standardized normal forms to adhere to. We might not yet have a specific analytics / modelling task at hand, but database normalization can still be done. The ultimate goal is just to get the raw data into some widely-accepted standard form for future use.
- Data cleansing is more for data analytics and modelling. We typically have some questions to ask about the data, or there is a model to be fitted which requires a specific set of data transformations. Depending on the ultimate goal, the steps of data cleansing can differ even for the same dataset (e.g. standardization of a column may or may not be required).
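As a small illustration of the last point, here is a z-score standardization step (stdlib only; the numbers are made up): it would typically be applied before fitting a distance-based model, but could be skipped entirely for a tree-based model on the same column.

```python
import statistics

values = [10.0, 12.0, 11.0, 50.0]

def standardize(xs):
    # z-score: subtract the mean, divide by the (population) std dev.
    mu, sigma = statistics.mean(xs), statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

z = standardize(values)
# After standardization the column has mean ~0 and std dev ~1.
print(statistics.mean(z), statistics.pstdev(z))
```

Whether this step belongs in the pipeline is decided by the downstream model, not by any normal form, which is exactly the difference being drawn above.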
A few more resources on database normalization (from commercial companies): link 1, link 2 and link 3.