Book Review: Data Science at the Command Line
November 2, 2024•411 words
Back in 2022, I came across a free e-book called Data Science at the Command Line, written by Jeroen Janssens. I gave it a quick, light read. My immediate thoughts after reading it were:
- Well, this is interesting... I didn't know you could do this much data manipulation and analysis through command line alone.
- Nah... I will probably stick with scripts and notebooks. Command lines still have their limits.
Overall, the book is quite well-written: a friendly and easy introduction, abundant code examples with good explanations, and even an accompanying Docker image for the readers to try out everything written in it. The book structure and writing style are definitely suitable for someone like me, coming from statistics/data analytics with little to no knowledge of many great (or in this case, basic) tools in programming.
However, the command line cannot do everything, or, I'd venture to say, most things in data analysis. It also comes down to personal preference: I'm still much more comfortable and efficient analyzing data interactively, either by running a script line by line or, more conveniently, in a notebook.
At the very least, when my code is run interactively, it would be much easier for me to scroll up and check what code has been run and to make changes here and there. I wouldn't say it'd be fun moving the cursor through multiple lines in the command line interface only to change one parameter in a 10-line visualization function. Not to mention other great utilities coming from many well-developed and well-maintained tools for interactive data analysis (e.g. the VS Code extensions for Python and Julia).
That said, certain tasks are best done from the command line, such as running a script that fits multiple machine learning models non-interactively in the background. Obviously, it wouldn't be convenient to keep a Jupyter notebook open in a browser tab for hours just to wait for a model to be fitted! Besides, command line tools also ensure a higher level of reproducibility, e.g., when a bash script is properly version-controlled.
In short, I really like this book and highly recommend it! It inspires me to think of use cases for the command line in my work where it actually is more convenient and efficient. On the topic of command lines, below are some related learning resources that I find particularly useful.