Podcast Review: Scam Inc
I recently finished listening to Scam Inc, a podcast series by the Economist on the "booming" industry of online scams. To show how good this series is: I subscribed immediately after the free episodes and finished the entire series over a two-day weekend. This is despite having previously unsubscribed from the Economist due to a lack of time for reading (but perhaps podcasts are a better option for me anyway). Online scams aren't exactly new to me. I've watched many content creators in this space,...
Read post
Catch-22
This came up when I was watching a YouTube video about credit cards and credit history. The video described how you need a good credit history to get approved for a credit card, but you also need a first credit card to start building a credit history. The YouTuber then said it is a classic "Catch-22" (obviously I couldn't figure out the hyphen in between at the time). I could sort of guess the meaning of this slang from the context, but just to be sure, I looked it up on Urban Dictionary. ...
Read post
Cube and Rollup for GroupBy
Many popular SQL engines, such as Apache Spark and PostgreSQL, have convenient functions for multiple groupby manipulations, namely cube and rollup. I will use cube in PySpark as an illustration. Suppose we have the following course enrolment data across different years for different courses. Note CS is not available for year 2021.
+----+---------+---------+
|year|   course|enrolment|
+----+---------+---------+
|2021|     Math|      100|
|2021|  Physics|      200|
|2021|Chemistry|      300|
|2...
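To make the idea concrete without a Spark session, here is a minimal pure-Python sketch of what a cube over year and course computes: one aggregate for every subset of the grouping columns. It uses only the three rows visible in the table above. The helper name cube_sum is hypothetical, and None stands in for a column that has been aggregated away. In PySpark itself the corresponding starting point would be df.cube("year", "course") followed by an aggregation.

```python
from itertools import combinations

# Only the three rows visible in the excerpt; the rest of the data is not shown.
rows = [
    {"year": 2021, "course": "Math", "enrolment": 100},
    {"year": 2021, "course": "Physics", "enrolment": 200},
    {"year": 2021, "course": "Chemistry", "enrolment": 300},
]


def cube_sum(rows, keys, value):
    """Sum `value` for every subset of `keys`; None marks an aggregated-away column."""
    totals = {}
    for size in range(len(keys), -1, -1):
        for subset in combinations(keys, size):
            for row in rows:
                group = tuple(row[k] if k in subset else None for k in keys)
                totals[group] = totals.get(group, 0) + row[value]
    return totals


result = cube_sum(rows, ["year", "course"], "enrolment")
for group, total in result.items():
    print(group, total)
```

A rollup would keep only the hierarchical groupings (year and course, year alone, and the grand total), whereas the cube also includes the per-course totals across all years.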
Read post
Model Ablation
I came across this term when reading OpenAI's paper on GPT (Generative Pre-Training), which is one of the references in the book Build a Large Language Model (From Scratch). It's a new word/terminology for me, so I thought I might as well note it down. According to Wikipedia, ablation is originally a medical term meaning the surgical removal of body tissue. In the context of machine learning, its usage is credited to Allen Newell. An ablation study means removing a component of an AI system in...
Read post
Apple Dictionary/Calculator is Underrated
Since starting to use Kagi, I've become aware that I look up the meaning of a lot of unfamiliar English words. As an attempt to divert this portion of my search usage, I tried to use the built-in Dictionary app on my Mac. I was pleasantly surprised and satisfied - this free app installed by default is actually very much underrated! Here are a few highlights I've found in the past few days: It comes with a few default dictionaries (which I assume depends on the system languages). Fo...
Read post
Book Review: Build a Large Language Model (From Scratch)
I've just finished reading Build a Large Language Model (From Scratch) by Sebastian Raschka. Overall, I highly recommend the book to anyone interested in an introduction to / a refresher on large language models. The "From Scratch" part in the title is actually what drew my attention in the first place. Nowadays, so many online tutorials on language models are too application-focused (e.g. how to download and use open-source models, how to use APIs, etc.). I am not saying they aren't useful...
Read post
Beam Search
Suppose we have a distribution over a sequence of words, and we want to generate a sentence from it. It can be very challenging to generate the sequence that has the highest probability overall - the number of possible sequences grows exponentially with the length of the sequence. One simple idea is to use the greedy algorithm, where we always aim for the local optimum. In this context, the next word is chosen based on the highest probability. We continue until we reach the end-...
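As a rough illustration of the greedy idea described here, the sketch below decodes from a tiny, entirely hypothetical next-word table (NEXT, the probabilities and both function names are made up for this example). It also includes an exhaustive search over every path, which is only feasible because the toy vocabulary is so small, to show that the greedy choice is not always the most probable sentence overall.

```python
START, END = "<s>", "</s>"

# Hypothetical toy model: P(next word | previous word).
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def greedy_decode(max_len=10):
    """Pick the single most probable next word at every step."""
    words, prob, cur = [], 1.0, START
    for _ in range(max_len):
        nxt, p = max(NEXT[cur].items(), key=lambda kv: kv[1])
        prob *= p
        if nxt == END:
            break
        words.append(nxt)
        cur = nxt
    return words, prob


def exhaustive_decode(cur=START):
    """Try every path; the cost grows exponentially with sentence length."""
    best_words, best_prob = [], 0.0
    for nxt, p in NEXT[cur].items():
        if nxt == END:
            words, prob = [], p
        else:
            tail, tail_prob = exhaustive_decode(nxt)
            words, prob = [nxt] + tail, p * tail_prob
        if prob > best_prob:
            best_words, best_prob = words, prob
    return best_words, best_prob


greedy_words, greedy_prob = greedy_decode()
best_words, best_prob = exhaustive_decode()
print("greedy:", greedy_words, greedy_prob)
print("best:  ", best_words, best_prob)
```

In this toy table the greedy path starts with the locally best word but ends up with a lower overall probability than the best path, which is the gap that keeping several candidate sequences at each step tries to narrow.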
Read post
Basics of Vector Databases
In a traditional database with structured tables, querying data is easy because each table has a key column that uniquely identifies the rows. In contrast, the same task may be harder for unstructured data such as text documents, audio files and images. A vector database is one solution for storing and querying unstructured data. Given the popularity of large language models, let's use text documents as an example: In a vector database, documents (or parts of them) are stored as high-dimensio...
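A minimal sketch of the idea, assuming documents have already been turned into vectors: the store below is a plain dictionary, the three-dimensional vectors and document names are hypothetical, and the query ranks documents by cosine similarity. Real embeddings come from a model and have far more dimensions, and real vector databases typically use approximate nearest-neighbour indexes rather than this brute-force scan.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical document vectors; names and values are made up for illustration.
store = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}


def query(vector, k=2):
    """Return the names of the k stored documents most similar to `vector`."""
    ranked = sorted(store, key=lambda name: cosine(vector, store[name]), reverse=True)
    return ranked[:k]


top = query([1.0, 0.0, 0.0])
print(top)
```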
Read post
Kagi Trial Log Pt. 1
Kagi is a paid search engine focusing on privacy and better search results without advertisements. Privacy and quality of results are the primary reasons I wanted to try out this kind of paid search engine. I originally signed up for a Kagi trial account in November 2024. It came with 100 free searches and 50 AI interactions. I did 7 searches, but then discovered Whoogle, an open-source and (sort of) private way to indirectly use Google. Prior to that, I had long left Google and switched ...
Read post
ML Model Deployment Strategies
(Part I of the study notes for the MLOps Concepts course on DataCamp (link)) Say you have updated an ML model and want to deploy it in production to test its performance on real, unseen data. There are three possible strategies: Basic, Shadow, and Canary. Basic Deployment: Retire the old model immediately, and direct all production data to the new model. Shadow Deployment: Keep the old model running, and pass production data to both the old and new models. Real data are still processed by the o...
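The difference between the first two strategies can be sketched as two routing functions. Everything here is hypothetical (the two stand-in models, the log, the function names), and the shadow version assumes, as is usual for shadow deployments, that only the old model's output is returned while the new model's output is merely recorded for comparison.

```python
def old_model(x):
    """Hypothetical model currently in production."""
    return x * 2


def new_model(x):
    """Hypothetical updated model under test."""
    return x * 2 + 1


shadow_log = []  # (input, old output, new output) triples for later comparison


def basic_deploy(x):
    """Basic: the old model is retired and the new model serves everything."""
    return new_model(x)


def shadow_deploy(x):
    """Shadow: both models see the input, but only the old output is served."""
    served = old_model(x)
    shadow_log.append((x, served, new_model(x)))
    return served


for x in [1, 2]:
    shadow_deploy(x)
print(shadow_log)
```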
Read post
Fewer Decisions, More Energy
I recently watched a very informative video called Why you're so tired by Johnny Harris. Perhaps the most (un)surprising takeaway for me is the impact of decision making on our overall level of energy. I have already forgotten the exact reasons behind it, but it turns out that our brains are likely to be overwhelmed and fatigued when there are more decisions to make in our daily life. Just think about what you want to order/cook for dinner every day, and what video you want to watch to accompany that meal....
Read post
So what is customer churn?
While I have not dealt with datasets involving consumer behaviour, the term "Customer Churn" comes up a lot whenever I read articles/resources on machine learning. So exactly what is the definition of customer churn? Today I finally have the time to search it on Wikipedia. Voila! Churn rate (also known as attrition rate, turnover, customer turnover, or customer defection) is a measure of the proportion of individuals or items moving out of a group over a specific period. Basically, it meas...
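Following the definition quoted here, a churn rate can be sketched as a simple proportion. The function name and the numbers are hypothetical, and this version divides by the group size at the start of the period, which is one common convention; others divide by the average group size over the period.

```python
def churn_rate(customers_at_start, customers_lost):
    """Proportion of the starting group that left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical example: 200 customers at the start of the month, 10 of them left.
rate = churn_rate(200, 10)
print(f"{rate:.1%}")
```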
Read post
nextToken in AWS boto3 Request
In one of my projects, I was trying to obtain a list of all available files in an AWS S3 bucket via an API call through boto3. However, I noticed that the returned list wasn't complete, and that there was an additional field nextToken. A quick search returned this StackOverflow page that solved my problem. Basically, the API may not return the complete result when it is too long. Instead, it returns a partial result + nextToken. You can use nextToken in the next API call to con...
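The general pattern can be sketched as a loop that keeps calling the API until no token comes back. To stay runnable without AWS credentials, fetch_page below is a hypothetical stand-in for a single API call, and the PAGES data and file names are made up. In real boto3 code the exact request and response field names depend on the service and operation, so check the documentation for the call you use; boto3 also provides paginators (client.get_paginator) that handle this loop for many operations.

```python
# Hypothetical paginated backend: token -> (items on this page, next token).
PAGES = {
    None: (["a.csv", "b.csv"], "t1"),
    "t1": (["c.csv"], "t2"),
    "t2": (["d.csv"], None),
}


def fetch_page(next_token=None):
    """Stand-in for one API call that returns a partial result."""
    items, token = PAGES[next_token]
    response = {"items": items}
    if token is not None:
        response["nextToken"] = token  # present only when more results remain
    return response


def fetch_all():
    """Follow nextToken until the API stops returning one."""
    items, token = [], None
    while True:
        response = fetch_page(token)
        items.extend(response["items"])
        token = response.get("nextToken")
        if token is None:
            break
    return items


all_files = fetch_all()
print(all_files)
```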
Read post
Monolithic vs Microservices Architecture
(Part I of the study notes for the DevOps Concepts course on DataCamp (link)) Imagine you are building software. You could consider the monolithic architecture, which is basically a single unified software application that is self-contained and independent from other applications. A monolithic architecture may be suitable for small-scale projects, but most likely it gets clunky, unscalable, and almost impossible to maintain at a larger scale. A good solution for large-scale projects is the m...
Read post
Understanding RAG (with math symbols)
Better rendered version hosted on Github: link TLDR: it is a finite mixture model with a bit of handwaving. Retrieval-Augmented Generation (RAG) is a popular technique to improve the quality of generated texts from language models. According to OpenAI (link), RAG can be quite promising if you have additional documents to provide as context relevant to the question you ask a language model. There are many online tutorials on RAG, e.g. by OpenAI, LangChain and NVIDIA, most of which use flowch...
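Taking the TLDR literally, a finite mixture model averages per-component probabilities using mixture weights. Applied to RAG, one way to read this is that the probability of an answer is a retrieval-weighted average of the probabilities the language model assigns to that answer given each retrieved document. The sketch below is only that reading, not the post's own derivation (its notation is not visible in this excerpt), and the function name and all numbers are hypothetical.

```python
def mixture_probability(retrieval_weights, generation_probs):
    """Sum over documents of P(document | question) * P(answer | question, document)."""
    if len(retrieval_weights) != len(generation_probs):
        raise ValueError("need one generation probability per document")
    if abs(sum(retrieval_weights) - 1.0) > 1e-9:
        raise ValueError("retrieval weights must sum to 1")
    return sum(w * p for w, p in zip(retrieval_weights, generation_probs))


# Hypothetical numbers for three retrieved documents.
weights = [0.7, 0.2, 0.1]  # how relevant each document is to the question
probs = [0.5, 0.1, 0.0]    # how likely the answer is given each document
answer_prob = mixture_probability(weights, probs)
print(answer_prob)
```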
Read post
Database Normalization vs Data Cleansing
(Part III of the study notes for the Data Warehousing Concepts course on DataCamp (link)) Database Normalization is the process of structuring a relational database according to a series of so-called "Normal Forms". The goal is to reduce data redundancy and improve data integrity. The normal forms are listed on Wikipedia with examples, so I don't think it is necessary to repeat them here. A very closely related concept is Data Cleansing, which involves identifying and correcting/removing cor...
Read post
Fact & Dimension, Star & Snowflake Schema
(Part II of the study notes for the Data Warehousing Concepts course on DataCamp (link)) There are two types of tables in a data warehouse: fact table and dimension table. A fact table contains the measurements, metrics and facts of a business process. Consider the fact table of a car sales business. Each row would contain information such as the date of transaction, what car model is sold, the sales price, details about the buyer, etc. I think of it as a table that records and updates all tr...
Read post
Data Warehouse vs Mart vs Lake
(Part I of the study notes for the Data Warehousing Concepts course on DataCamp (link)) The following table summarizes some key differences between Data Warehouse, Data Mart and Data Lake. I've also found this useful resource by AWS on the same comparison but it gives a lot more details.
Feature | Data Warehouse | Data Mart | Data Lake
Data Structure | Structured | Structured | Structured & Unstructured
Complexity to Change | Complex | Complex | Less Complex
Purpose of Data | Known | Known | May not be k...
Read post
Wrapping up 2024 with selfcare
I decided to leave my PhD program in November 2024, more than six years after I first landed in Canada. I've learned many valuable lessons throughout these years, but I'd say 2024 is the most important year for my growth as a person. If I had to pick one thing from this year, self-care is definitely the most valuable lesson I learned. While the administrative procedures started only a few weeks ago, I'd made up my mind a lot earlier. Leaving wasn't a particularly easy decision, but things be...
Read post
Better not pip freeze
I recently bumped into a head-scratching issue at work when using pip freeze -r requirements.txt to generate Python package dependencies. As a very short recap of the incident, the generated requirements.txt file borked one of my virtual machines at work because the pip freeze output actually contained every package in the environment, including some mysterious lines that are only specific to a particular session of the virtual machine. When I shut down the previous session and tried to restart the virtu...
Read post
* and ** in Python
A short tutorial on the splat operators * and ** in Python: link. In summary, * unpacks positional arguments, while ** unpacks keyword arguments. ...
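A small sketch of both operators, on each side of a function call. The function describe and the sample values are hypothetical and exist only for illustration.

```python
def describe(name, *args, **kwargs):
    """In a definition, * collects extra positionals and ** collects extra keywords."""
    return name, args, kwargs


numbers = [1, 2, 3]
options = {"sep": "-", "end": "!"}

# In a call, * unpacks a sequence into positional arguments
# and ** unpacks a mapping into keyword arguments.
result = describe("demo", *numbers, **options)
print(result)

# The same operators also work inside list and dict literals.
merged_list = [*numbers, 4]
merged_dict = {**options, "flush": True}
print(merged_list, merged_dict)
```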
Read post
Fixing MacOS C compiler error in pyenv install
The solution is to add CC=gcc before pyenv install. The answer comes from this Stack Overflow question. ...
Read post
Beyond Jupyter: a tutorial for data science software design
Beyond Jupyter is an online tutorial on software design with a focus on machine learning tasks. It gives a gentle introduction to object-oriented programming (OOP), and then walks through an example that refactors a Jupyter notebook into reusable and maintainable modules. I particularly appreciate the walkthrough examples on code refactoring. It is very helpful and informative that the overarching principles are accompanied by code examples and commentaries. ...
Read post
A Quick Tutorial on Bash Scripts
Bash scripts, if properly written, can be extremely convenient and efficient for automating repetitive tasks, while making your code more reproducible. For example, with a bash script, you no longer need to manually input the same command line arguments each time, nor do you need to keep a record of all the command line arguments used. Here is an awesome online tutorial for Bash scripting written by Ryan. It is super user-friendly and provides an excellent introduction to Bash. I finished the tut...
Read post
Tools for Customizing Zsh
Oh My Zsh is a great tool for customizing your zsh. I love the flexibility to change themes, especially when working with git repositories. zsh-autosuggestions allows for better auto-completion in zsh, which saves many a keystroke. I have used these tools since 2022 and absolutely love them! ...
Read post
Minimal Example for Python Multiprocessing
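The following is a minimal example for multiprocessing in Python. A very useful guide written by Jan Bodnar is here.

    from multiprocessing import Process


    def fun(i):
        # Runs in the child process; the return value is not sent back to the parent.
        return i


    def main():
        proc = []
        # Start one process per value from 1 to 100.
        for i in range(1, 101):
            p = Process(target=fun, args=(i, ))
            p.start()
            proc.append(p)
        # Wait for every child process to finish.
        for p in proc:
            p.join()


    if __name__ == '__main__':
        main()
...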
Read post
# %% scripts are superior to Jupyter Notebooks
Jupyter notebooks, while popular as an entry to data science, have many shortcomings. The drawbacks and dangers of over-reliance on Jupyter notebooks are best summarized in this 2018 talk by Joel Grus. The percent format for Python scripts, denoted by # %%, is a great way to replace Jupyter notebooks. The percent format is supported by many editors, most notably VS Code. The official guide offers intuitive examples as introduction to the percent format and allows for a smooth transition from no...
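For a sense of what this looks like, here is a tiny script in the percent format. Each # %% comment starts a new cell in editors that support the format, while the file itself stays an ordinary Python script that runs from top to bottom. The data and variable names are hypothetical.

```python
# %% Load data (a hypothetical in-memory sample instead of a real file)
values = [3, 1, 4, 1, 5]

# %% Compute a summary
total = sum(values)
mean = total / len(values)

# %% Report
print(f"n={len(values)} total={total} mean={mean:.2f}")
```

Because the cell markers are just comments, the script can be run from the command line and reviewed in version control like any other .py file.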
Read post
venv for Python Virtual Environments
The venv module is part of the Python standard library, for creating reproducible environments in terms of package dependencies. Suppose you have a folder for a Python project. The following command creates a new virtual environment.
    python -m venv /path/to/new/virtual/environment
Then, use the following command to activate (or "enter") this virtual environment.
    source /path/to/new/virtual/environment/bin/activate
Now we are working in an isolated Python environment: everything Python-...
Read post
Book Review: Data Science at the Command Line
Back in 2022, I came across a free e-book called Data Science at the Command Line written by Jeroen Janssens. I had a quick and light read. My immediate thoughts after reading it were: Well, this is interesting... I didn't know you could do this much data manipulation and analysis through command line alone. Nah... I will probably stick with scripts and notebooks. Command lines still have their limits. Overall, the book is quite well-written: a friendly and easy introduction, abundant code exam...
Read post
Whoogle: Google but no ads
Whoogle is an open-source project that allows you to get "clean" Google search results, e.g., no ads/sponsored content, no JavaScript, no cookies, limited tracking, etc. It is very easy to set it up locally with Docker - literally just two commands to pull it down and spin it up. I've been using it for weeks with a great experience - primarily to remove ads, disassociate search results from my Google account, and remove URL tracking when clicking results. However, my IP address is still visibl...
Read post
VB-Cable for Virtual Audio Cables
VB-Cable is probably the simplest way to set up a virtual audio cable between two applications. The software is free and easy to download and install. It adds two virtual audio devices to your computer: CABLE Input (a playback device, like a virtual speaker) and CABLE Output (a recording device, like a virtual microphone). The following is a typical use case. You have some program to add filters and tweaks to your microphone input, e.g. using OBS or specialized software. You prefer this processed audio and would like to use it el...
Read post
Inconsistency in Python library names
It is well known (or perhaps not) that some Python packages have seemingly inconsistent names. Keeping this in mind might save you a lot of unnecessary debugging headaches. For example, the package multiprocessing can speed up certain tasks through parallelism. It is imported as follows. import multiprocessing Meanwhile, a nonsensical error would occur if you try to use pip install multiprocessing to install said package. It turns out the installation should be: pip install multiproces...
Read post