All posts from Spark Tseung

Shelving things up

March 15, 2025•164 words

I went to pick up my master's diploma yesterday (Pie Day!). It wasn't anything ceremonious - ten minutes in and out of the Convocation Office. I got my piece of paper stuffed in a huge envelope. I will probably shelve it up deep in my storage space. I don't think that the past six odd years as a PhD student were particularly a failure, or at least I try to convince myself so. It was a learning experience. I got to take a peek into the world of academia, realized how it wasn't what I expected, a...

Mortgage

February 20, 2025•30 words

Just a random piece of information. The origin of the word "mortgage" is Old French: mort (dead) + gage (pledge). So a mortgage is literally a death pledge. How fitting! ...

Podcast Review: Scam Inc

February 18, 2025•190 words

I recently finished listening to Scam Inc, a podcast series by the Economist on the "booming" industry of online scams. Just to show how good this series is, I subscribed immediately after the free episodes and finished the entire series in a two-day weekend. This is despite me previously unsubscribing from the Economist due to a lack of time for reading (but perhaps a podcast is a better option for me anyways). Online scams aren't exactly new to me. I've watched many content creators in this space,...

Catch-22

February 13, 2025•154 words

This came up when I was watching a YouTube video about credit cards and credit history. The video described how you need a good credit history to get approved for a credit card, but then you also need a first credit card to start building a credit history. The YouTuber then said it is a classic "Catch-22" (obviously I couldn't figure out the hyphen in-between at the time). I could sort of guess the meaning of this slang from the context, but just to be sure, I searched it on Urban Dictionary. ...

Cube and Rollup for GroupBy

February 8, 2025•500 words

Many popular SQL engines, such as Apache Spark and PostgreSQL, have convenient functions for multiple groupby manipulations, namely cube and rollup. I will use cube in PySpark as an illustration. Suppose we have the following course enrolment data across different years for different courses. Note CS is not available for year 2021.

+----+---------+---------+
|year|   course|enrolment|
+----+---------+---------+
|2021|     Math|      100|
|2021|  Physics|      200|
|2021|Chemistry|      300|
|2...
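
To make the idea concrete without a Spark session, here is a minimal plain-Python sketch of what cube computes: one aggregate for every subset of the grouping columns, with None standing in for "all values". The three 2021 rows are copied from the table above; the 2022 rows and the helper name cube_sum are made up for illustration and are not from the original post.

```python
from collections import defaultdict
from itertools import combinations

# The 2021 rows come from the table above; the 2022 rows are hypothetical.
rows = [
    {"year": 2021, "course": "Math", "enrolment": 100},
    {"year": 2021, "course": "Physics", "enrolment": 200},
    {"year": 2021, "course": "Chemistry", "enrolment": 300},
    {"year": 2022, "course": "Math", "enrolment": 150},
    {"year": 2022, "course": "CS", "enrolment": 250},
]


def cube_sum(rows, keys, value):
    """Sum `value` for every subset of `keys`; None marks a rolled-up column."""
    totals = defaultdict(int)
    for size in range(len(keys) + 1):
        for subset in combinations(keys, size):
            for row in rows:
                group = tuple(row[k] if k in subset else None for k in keys)
                totals[group] += row[value]
    return dict(totals)


result = cube_sum(rows, keys=("year", "course"), value="enrolment")
print(result[(2021, None)])    # total for 2021 across all courses
print(result[(None, "Math")])  # total for Math across all years
print(result[(None, None)])    # grand total
```

In PySpark itself the same aggregation would look roughly like df.cube("year", "course").sum("enrolment"). A rollup would keep only the hierarchical subsets: the grand total, per year, and per (year, course).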

Model Ablation

February 7, 2025•129 words

I came across this term when reading OpenAI's paper on GPT (Generative Pre-Training), which is one of the references in the book Build a Large Language Model (From Scratch). It's a new word/terminology for me, so I thought I might as well note it down. According to Wikipedia, ablation is originally a medical term meaning the surgical removal of body tissue. In the context of machine learning, its usage is credited to Allen Newell. An ablation study means removing a component of an AI system in...

Apple Dictionary/Calculator is Underrated

February 6, 2025•241 words

Since starting to use Kagi, I've become aware that I search a lot of unfamiliar English words for their meaning. As an attempt to divert this portion of my search usage, I tried to use the built-in Dictionary App on my Mac. I was pleasantly surprised and satisfied - this free app installed by default is actually very much underrated! Here are a few highlight features I've found in the past few days: It comes with a few default dictionaries (which I assume depend on the system languages). Fo...

Book Review: Build a Large Language Model (From Scratch)

February 2, 2025•405 words

I've just finished reading Build a Large Language Model (From Scratch) by Sebastian Raschka. Overall, I highly recommend the book to anyone interested in an introduction to / a refresher on large language models. The "From Scratch" part in the title is actually what drew my attention in the first place. Nowadays, so many online tutorials on language models are too application-focused (e.g. how to download and use open-source models, how to use APIs, etc.). I am not saying they aren't useful...

Beam Search

January 29, 2025•296 words

Suppose we have a distribution over a sequence of words, and we want to generate a sentence from it. It can be very challenging to generate the sequence that has the highest probability overall - the number of possible sequences actually grows exponentially relative to the length of the sequence. One simple idea is to use the Greedy Algorithm where we always aim for the local optimum. In this context, the next word is chosen based on the highest probability. We continue until we reach the end-...
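
As a rough illustration of the difference, the sketch below runs both a greedy decoder and a beam search over a tiny next-word table. The table NEXT, its probabilities and the function names are all invented for this example; it is only meant to show how keeping several candidates can find a more probable sentence than always taking the local optimum.

```python
import math

# Hypothetical toy model: next-word probabilities given the previous word.
NEXT = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def greedy(start="<s>", end="</s>"):
    """Always pick the single most probable next word."""
    seq, logp = [start], 0.0
    while seq[-1] != end:
        word, p = max(NEXT[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(word)
        logp += math.log(p)
    return seq, logp


def beam_search(width, start="<s>", end="</s>"):
    """Keep the `width` most probable partial sequences at every step."""
    beams = [([start], 0.0)]
    while any(seq[-1] != end for seq, _ in beams):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == end:
                candidates.append((seq, logp))  # finished: carry over as-is
                continue
            for word, p in NEXT[seq[-1]].items():
                candidates.append((seq + [word], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0]


greedy_seq, greedy_logp = greedy()
beam_seq, beam_logp = beam_search(width=2)
print(greedy_seq, round(math.exp(greedy_logp), 2))
print(beam_seq, round(math.exp(beam_logp), 2))
```

In this toy table the greedy decoder commits to "the" because it is the most likely first word, while a beam of width 2 keeps "a" alive long enough to discover that "a cat" is more probable overall.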

Basics of Vector Databases

January 26, 2025•278 words

In a traditional database with structured tables, querying data is easy because each table has a key column that uniquely identifies the rows. In contrast, the same task may be harder for unstructured data such as text documents, audio files and images. A vector database is one solution for storing and querying unstructured data. Given the popularity of large language models, let's use text documents as an example: In a vector database, documents (or parts of them) are stored as high-dimensio...
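
A minimal sketch of the querying side, assuming cosine similarity as the closeness measure (one common choice, not necessarily the one the post goes on to use). The document names and the three-dimensional vectors are hypothetical; a real system would obtain embeddings from a model and would use an approximate index rather than a full scan.

```python
import math

# Hypothetical 3-dimensional "embeddings"; real ones come from an embedding
# model and typically have hundreds or thousands of dimensions.
store = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_tax": [0.0, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def query(vector, k=2):
    """Return the names of the k stored documents closest to `vector`."""
    ranked = sorted(store, key=lambda name: cosine(store[name], vector), reverse=True)
    return ranked[:k]


print(query([1.0, 0.0, 0.0]))
```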

Kagi Trial Log Pt. 1

January 24, 2025•629 words

Kagi is a paid search engine focusing on privacy and better search results without advertisements. Privacy and quality of results are the primary reasons I wanted to try out this operating model of search engine. I originally signed up for a Kagi trial account in November 2024. It came with 100 free searches and 50 AI interactions. I did 7 searches, but then discovered Whoogle, an open-source and (sort of) private way to indirectly use Google. Prior to that, I had long left Google and switched ...

ML Model Deployment Strategies

January 20, 2025•257 words

(Part I of the study notes for the MLOps Concepts course on DataCamp (link)) Say you have updated an ML model and want to deploy it in production to test its performance on real, unseen data. There are three possible strategies: Basic, Shadow, and Canary. Basic Deployment: Retire the old model immediately, and direct all production data to the new model. Shadow Deployment: Keep the old model running, and pass production data to both the old and new models. Real data are still processed by the o...

Fewer Decisions, More Energy

January 19, 2025•143 words

I recently watched a very informative video called Why you're so tired by Johnny Harris. Perhaps the most (un)surprising takeaway for me is the impact of decision making on our overall level of energy. I have already forgotten the exact reasons behind it, but it turns out that our brains are likely to be overwhelmed and fatigued when there are more decisions to make in our daily life. Just think about what you want to order/cook for dinner every day, and what video you want to watch to accompany that meal....

So what is customer churn?

January 18, 2025•269 words

While I have not dealt with datasets involving consumer behaviour, the term "Customer Churn" comes up a lot whenever I read articles/resources on machine learning. So exactly what is the definition of customer churn? Today I finally have the time to search it on Wikipedia. Voila! Churn rate (also known as attrition rate, turnover, customer turnover, or customer defection) is a measure of the proportion of individuals or items moving out of a group over a specific period. Basically, it meas...
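
As a quick worked example of the quoted definition, one simple way to compute it is to divide the number of customers who left during a period by the number present at the start. The function name and the numbers below are made up for illustration, and other conventions for the denominator exist.

```python
def churn_rate(customers_at_start, customers_lost):
    """Share of the starting group that left during the period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical numbers: 200 subscribers at the start of the month, 10 cancel.
print(churn_rate(200, 10))
```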

nextToken in AWS boto3 Request

January 17, 2025•112 words

In one of my projects, I was trying to obtain a list of all available files in an AWS S3 bucket, which is done with an API call through boto3. However, I noticed that the returned list isn't complete, and that there is an additional field nextToken. A quick search returns this StackOverflow page that solves my problem. Basically, the API may not return the complete result when it is too long. Instead, it returns a partial result + nextToken. You can use nextToken in the next API call to con...
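
The loop this implies can be sketched as follows. To keep the example runnable without AWS credentials, list_files_page is a hypothetical stand-in for the real API call, not an actual boto3 function, and the response keys other than nextToken are assumptions; the exact parameter and field names vary between AWS services.

```python
# Hypothetical stand-in for a paginated API; this is not a real boto3 call.
_PAGES = {
    None: {"files": ["a.csv", "b.csv"], "nextToken": "t1"},
    "t1": {"files": ["c.csv"], "nextToken": "t2"},
    "t2": {"files": ["d.csv"]},  # last page: no nextToken field
}


def list_files_page(next_token=None):
    """Return one page of results, plus nextToken if more pages remain."""
    return _PAGES[next_token]


def list_all_files():
    """Keep calling the API with the latest token until none is returned."""
    files, token = [], None
    while True:
        page = list_files_page(token)
        files.extend(page["files"])
        token = page.get("nextToken")
        if token is None:
            return files


print(list_all_files())
```

For many operations boto3 also provides paginators (client.get_paginator), which wrap this loop for you.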

Monolithic vs Microservices Architecture

January 13, 2025•218 words

(Part I of the study notes for the DevOps Concepts course on DataCamp (link)) Imagine you are building a piece of software. You could consider the monolithic architecture, which is basically a single unified software application that is self-contained and independent from other applications. A monolithic architecture may be suitable for small-scale projects, but most likely it gets clunky, unscalable, and almost impossible to maintain at a larger scale. A good solution for large scale projects is the m...

Understanding RAG (with math symbols)

January 10, 2025•1,075 words

Better rendered version hosted on GitHub: link TLDR: it is a finite mixture model with a bit of handwaving. Retrieval-Augmented Generation (RAG) is a popular technique to improve the quality of generated texts from language models. According to OpenAI (link), RAG can be quite promising if you have additional documents to provide as contexts relevant to your question asked to a language model. There are many online tutorials on RAG, e.g. by OpenAI, LangChain and NVIDIA, most of which use flowch...

Database Normalization vs Data Cleansing

January 9, 2025•264 words

(Part III of the study notes for the Data Warehousing Concepts course on DataCamp (link)) Database Normalization is the process of structuring a relational database according to a series of so-called "Normal Forms". The goal is to reduce data redundancy and improve data integrity. The set of normal forms is listed on Wikipedia with examples, which I don't think is necessary to repeat here. A very closely related concept is Data Cleansing, which involves identifying and correcting/removing cor...

Fact & Dimension, Star & Snowflake Schema

January 6, 2025•373 words

(Part II of the study notes for the Data Warehousing Concepts course on DataCamp (link)) There are two types of tables in a data warehouse: fact table and dimension table. A fact table contains the measurements, metrics and facts of a business process. Consider the fact table of a car sales business. Each row would contain information such as the date of transaction, what car model is sold, the sales price, details about the buyer, etc. I think of it as a table that records and updates all tr...

Data Warehouse vs Mart vs Lake

January 3, 2025•145 words

(Part I of the study notes for the Data Warehousing Concepts course on DataCamp (link)) The following table summarizes some key differences between Data Warehouse, Data Mart and Data Lake. I've also found this useful resource by AWS on the same comparison but it gives a lot more details.

+----------------------+----------------+------------+---------------------------+
| Feature              | Data Warehouse | Data Mart  | Data Lake                 |
+----------------------+----------------+------------+---------------------------+
| Data Structure       | Structured     | Structured | Structured & Unstructured |
| Complexity to Change | Complex        | Complex    | Less Complex              |
| Purpose of Data      | Known          | Known      | May not be k...

Wrapping up 2024 with selfcare

December 29, 2024•499 words

I decided to leave my PhD program in November 2024, more than six years after I first landed in Canada. I've learned many valuable lessons throughout these years, but I'd say 2024 is the most important year for my growth as a person. If I have to pick one thing from this year, self-care is definitely the most valuable lesson I learned. While the administrative procedures started only a few weeks ago, I'd made up my mind a lot earlier. Leaving wasn't a particularly easy decision, but things be...

Better not pip freeze

December 24, 2024•178 words

I recently bumped into a head-scratching issue at work when using pip freeze -r requirements.txt to generate Python package dependencies. As a very short recap of the incident, the generated requirements.txt file borked one of my virtual machines at work because pip freeze actually contained every package in the environment, including some mysterious lines that are only specific to a particular session of the virtual machine. When I shut down the previous session and tried to restart the virtu...

* and ** in Python

December 14, 2024•24 words

A short tutorial on the splat operators * and ** in Python: link. In summary, * unpacks positional arguments, while ** unpacks keyword arguments. ...
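
A small sketch of both directions: unpacking at the call site, and collecting in a function signature. The function name describe and the sample values are made up for illustration.

```python
def describe(name, *args, **kwargs):
    """Collect extra positional arguments in args and keyword arguments in kwargs."""
    return name, args, kwargs


numbers = [1, 2, 3]
options = {"sep": "-", "end": "!\n"}

# Unpacking at the call site: same as print(1, 2, 3, sep="-", end="!\n")
print(*numbers, **options)

# The same unpacked arguments are collected again by *args and **kwargs.
collected = describe("demo", *numbers, **options)
print(collected)
```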

Fixing MacOS C compiler error in pyenv install

December 12, 2024•18 words

The solution is to add CC=gcc before pyenv install. The answer is found from this Stack Overflow question. ...

Beyond Jupyter: a tutorial for data science software design

December 3, 2024•67 words

Beyond Jupyter is an online tutorial on software design with a focus on machine learning tasks. It gives a gentle introduction to object-oriented programming (OOP), and then walks through an example that refactors a Jupyter notebook into reusable and maintainable modules. I particularly appreciate the walkthrough examples on code refactoring. It is very helpful and informative that the overarching principles are accompanied by code examples and commentaries. ...

A Quick Tutorial on Bash Scripts

November 26, 2024•108 words

Bash scripts, if properly written, can be extremely convenient and efficient for automating repetitive tasks, while making your code more reproducible. For example, with a bash script, you no longer need to manually input the same command line arguments each time, nor do you need to keep a record of all the command line arguments used. Here is an awesome online tutorial for Bash scripting written by Ryan. It is super user-friendly and provides an excellent introduction to Bash. I finished the tut...

Tools for Customizing Zsh

November 26, 2024•48 words

Oh My Zsh is a great tool for customizing your zsh. I love the flexibility to change themes especially when working with git repositories. zsh-autosuggestions allows for better auto-completion in zsh, which saves many a keystroke. I have used these tools since 2022 and absolutely love them! ...

Minimal Example for Python Multiprocessing

November 18, 2024•56 words

The following is a minimal example for multiprocessing in Python. A very useful guide written by Jan Bodnar is here.

    from multiprocessing import Process

    def fun(i):
        return i

    def main():
        proc = []
        # Start one process per value of i, from 1 to 100 inclusive
        for i in range(1, 101):
            p = Process(target=fun, args=(i,))
            p.start()
            proc.append(p)
        # Wait for every process to finish
        for p in proc:
            p.join()

    if __name__ == '__main__':
        main()

...

# %% scripts are superior to Jupyter Notebooks

November 13, 2024•104 words

Jupyter notebooks, while popular as an entry to data science, have many shortcomings. The drawbacks and dangers of over-reliance on Jupyter notebooks are best summarized in this 2018 talk by Joel Grus. The percent format for Python scripts, denoted by # %%, is a great way to replace Jupyter notebooks. The percent format is supported by many editors, most notably VS Code. The official guide offers intuitive examples as an introduction to the percent format and allows for a smooth transition from no...

venv for Python Virtual Environments

November 3, 2024•189 words

The venv module is part of the Python standard library and is used for creating reproducible environments in terms of package dependencies. Suppose you have a folder for a Python project. The following command creates a new virtual environment. python -m venv /path/to/new/virtual/environment Then, use the following command to activate (or "enter") this virtual environment. source /path/to/new/virtual/environment/bin/activate Now we are working in an isolated Python environment: everything Python-...
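
The same environment can also be created from Python code through the standard library's venv.create function, which may be handy in setup scripts. This is only a sketch: the directory name demo-env is made up, and the environment is placed in a temporary folder so that nothing in your project is touched.

```python
import tempfile
import venv
from pathlib import Path

# Create the environment in a throwaway directory; with_pip=False skips
# installing pip, which keeps this example fast.
env_dir = Path(tempfile.mkdtemp()) / "demo-env"
venv.create(env_dir, with_pip=False)

# Every virtual environment has a pyvenv.cfg marker file at its root.
print((env_dir / "pyvenv.cfg").exists())
```

Activation is still done from the shell with the source command shown above; a Python script cannot activate an environment for its parent shell.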

Book Review: Data Science at the Command Line

November 2, 2024•411 words

Back in 2022, I came across a free E-book called Data Science at the Command Line written by Jeroen Janssens. I had a quick and light read. My immediate thoughts after reading it are: Well, this is interesting... I didn't know you could do this much data manipulation and analysis through command line alone. Nah... I will probably stick with scripts and notebooks. Command lines still have their limits. Overall, the book is quite well-written: a friendly and easy introduction, abundant code exam...

Whoogle: Google but no ads

November 1, 2024•147 words

Whoogle is an open-source project that allows you to get "clean" Google search results, e.g., no ads/sponsored content, no JavaScript, no cookies, limited tracking, etc. It is very easy to set it up locally with Docker - literally just two commands to pull it down and spin it up. I've been using it for weeks with a great experience - primarily to remove ads, disassociate search results from my Google account, and remove URL tracking when clicking results. However, my IP address is still visibl...

VB-Cable for Virtual Audio Cables

October 31, 2024•257 words

VB-Cable is probably the simplest way to set up a virtual audio cable between two applications. The software is free and easy to download and install. It adds two virtual audio devices to your computer: CABLE Input (like a virtual microphone) and CABLE Output (like a virtual speaker). The following is a typical use case. You have some program to add filters and tweaks to your microphone input, e.g. using OBS or specialized software. You prefer this processed audio and would like to use it el...

Inconsistency in Python library names

October 30, 2024•126 words

It is well known (or perhaps not) that some Python packages have seemingly inconsistent names. Keeping this in mind might save you a lot of unnecessary debugging headaches. For example, the package multiprocessing can speed up certain tasks by parallelism. It is imported as follows. import multiprocessing Meanwhile, a nonsensical error would occur if you try to use pip install multiprocessing for actually installing said package. It turns out the installation should be: pip install multiproces...