Spark Tseung

Tech/non-tech notes and random thoughts. Views and opinions expressed here are mine only. They do not represent those of my former and current employers, or any other organizations I am affiliated with.

* and ** in Python

A short tutorial on the splat operators * and ** in Python: link. In summary, * unpacks positional arguments, while ** unpacks keyword arguments. ...
Read post
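
As a minimal sketch of the summary above, the following shows both operators at a call site and in a function signature. The functions `describe` and `collect` and the sample values are hypothetical, made up purely for illustration.

```python
def describe(name, age, city="unknown"):
    """Hypothetical helper used only for this illustration."""
    return f"{name} ({age}) from {city}"


args = ("Ada", 36)           # positional arguments stored in a tuple
kwargs = {"city": "London"}  # keyword arguments stored in a dict

# At a call site, * unpacks the tuple into positional arguments
# and ** unpacks the dict into keyword arguments.
result = describe(*args, **kwargs)


def collect(*positional, **keyword):
    # In a signature, the same operators gather any extra arguments.
    return positional, keyword


gathered = collect(1, 2, x=3)
```

Here `describe(*args, **kwargs)` is equivalent to `describe("Ada", 36, city="London")`.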

Fixing macOS C compiler error in pyenv install

The solution is to add CC=gcc before pyenv install. The answer comes from this Stack Overflow question. ...
Read post

Beyond Jupyter: a tutorial for data science software design

Beyond Jupyter is an online tutorial on software design with a focus on machine learning tasks. It gives a gentle introduction to object-oriented programming (OOP), and then walks through an example that refactors a Jupyter notebook into reusable and maintainable modules. I particularly appreciate the walkthrough examples on code refactoring: the overarching principles are accompanied by code examples and commentary, which makes them very helpful and informative. ...
Read post

A Quick Tutorial on Bash Scripts

Bash scripts, if properly written, can be extremely convenient and efficient for automating repetitive tasks, while making your work more reproducible. For example, with a bash script, you no longer need to manually input the same command line arguments each time, nor do you need to keep a record of all the command line arguments used. Here is an awesome online tutorial for Bash scripting, written by Ryan. It is super user-friendly and provides an excellent introduction to Bash. I finished the tut...
Read post

Tools for Customizing Zsh

Oh My Zsh is a great tool for customizing your zsh. I love the flexibility to change themes, especially when working with git repositories. zsh-autosuggestions allows for better auto-completion in zsh, which saves many a keystroke. I have used these tools since 2022 and absolutely love them! ...
Read post

Minimal Example for Python Multiprocessing

The following is a minimal example for multiprocessing in Python. A very useful guide written by Jan Bodnar is here.

    from multiprocessing import Process

    def fun(i):
        return i

    def main():
        proc = []
        # Start one process per value of i from 1 to 100
        for i in range(1, 101):
            p = Process(target=fun, args=(i, ))
            p.start()
            proc.append(p)
        # Wait for all processes to finish
        for p in proc:
            p.join()

    if __name__ == '__main__':
        main()

...
Read post
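
With `Process`, the value returned by the target function is discarded. As a hedged sketch of one alternative (not the approach used in the post), `multiprocessing.Pool` can collect the return values. The names `square` and `run` and the default arguments are hypothetical.

```python
from multiprocessing import Pool


def square(i):
    # Work done in each worker process (hypothetical example task)
    return i * i


def run(n=10, workers=2):
    # Pool.map splits the inputs across workers and returns
    # the results in the same order as the inputs.
    with Pool(processes=workers) as pool:
        return pool.map(square, range(1, n + 1))


if __name__ == "__main__":
    print(run())
```

The `if __name__ == "__main__":` guard matters here for the same reason as in the example above: on platforms that start workers by re-importing the script, it prevents each worker from spawning its own pool.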

# %% scripts are superior to Jupyter Notebooks

Jupyter notebooks, while popular as an entry point to data science, have many shortcomings. The drawbacks and dangers of over-reliance on Jupyter notebooks are best summarized in this 2018 talk by Joel Grus. The percent format for Python scripts, denoted by # %%, is a great way to replace Jupyter notebooks. The percent format is supported by many editors, most notably VS Code. The official guide offers intuitive examples as an introduction to the percent format and allows for a smooth transition from no...
Read post
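
For a sense of what the percent format looks like, here is a minimal sketch of a script split into cells. The cell titles and sample values are hypothetical. The file is still an ordinary Python script, so it also runs top to bottom with `python script.py`.

```python
# %% [markdown]
# A markdown cell: in editors that support the percent format,
# these comment lines are rendered as notes.

# %% Imports
import statistics

# %% Load data (hypothetical sample values for illustration)
values = [2.0, 4.0, 4.0, 5.0]

# %% Compute a summary
mean_value = statistics.mean(values)
print(mean_value)
```

Each `# %%` line starts a new cell that a supporting editor can run on its own, much like a notebook cell, while the file stays a plain text script that works well with version control.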

venv for Python Virtual Environments

The venv module, part of the Python standard library, creates isolated environments so that package dependencies are reproducible. Suppose you have a folder for a Python project. The following command creates a new virtual environment.

    python -m venv /path/to/new/virtual/environment

Then, use the following command to activate (or "enter") this virtual environment.

    source /path/to/new/virtual/environment/bin/activate

Now we are working in an isolated Python environment: everything Python-...
Read post
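
The same module can also be driven from Python. The following is a minimal sketch, assuming a throwaway temporary directory and a hypothetical helper named `make_env`. It passes `with_pip=False` to stay fast, whereas the `python -m venv` command installs pip by default.

```python
import tempfile
import venv
from pathlib import Path


def make_env(env_dir):
    # Create the environment; with_pip=False skips installing pip
    # (an assumption made here to keep the sketch quick).
    venv.create(env_dir, with_pip=False)
    # Every venv contains a pyvenv.cfg file at its root.
    return Path(env_dir) / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    cfg = make_env(Path(tmp) / "demo-env")
    created = cfg.exists()
```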

Book Review: Data Science at the Command Line

Back in 2022, I came across a free e-book called Data Science at the Command Line, written by Jeroen Janssens. I had a quick and light read. My immediate thoughts after reading it are: Well, this is interesting... I didn't know you could do this much data manipulation and analysis through the command line alone. Nah... I will probably stick with scripts and notebooks. The command line still has its limits. Overall, the book is quite well-written: a friendly and easy introduction, abundant code exam...
Read post

Whoogle: Google but no ads

Whoogle is an open-source project that allows you to get "clean" Google search results, e.g., no ads or sponsored content, no JavaScript, no cookies, limited tracking, etc. It is very easy to set up locally with Docker - literally just two commands to pull it down and spin it up. I've been using it for weeks with a great experience - primarily to remove ads, disassociate search results from my Google account, and remove URL tracking when clicking results. However, my IP address is still visibl...
Read post

VB-Cable for Virtual Audio Cables

VB-Cable is probably the simplest way to set up a virtual audio cable between two applications. The software is free and easy to download and install. It adds two virtual audio devices to your computer: CABLE Input (a playback device, like a virtual speaker) and CABLE Output (a recording device, like a virtual microphone). The following is a typical use case. You have some program to add filters and tweaks to your microphone input, e.g. using OBS or specialized software. You prefer this processed audio and would like to use it el...
Read post

Inconsistency in Python library names

It is well known (or perhaps not) that some Python packages have seemingly inconsistent names. Keeping this in mind might save you a lot of unnecessary debugging headaches. For example, the package multiprocessing can speed up certain tasks through parallelism. It is imported as follows.

    import multiprocessing

Meanwhile, a nonsensical error would occur if you try to use pip install multiprocessing to actually install said package. It turns out the installation should be: pip install multiproces...
Read post