April 10, 2021•1,180 words
Lost in the spectral domain
Starting my journey with deep learning for audio processing
As I'm getting into machine learning applied to audio signals, I realize that there are very few resources available to craft yourself a curriculum. On the other hand, googling for deep learning applied to image processing or NLP returns more results than you can use.
Of course, more content, blog posts, GitHub repos, online courses, etc. will become available over time. But still, how do you get started in the field in early 2021?
Motivating examples: speech-to-text and text-to-speech
Speech-to-text (stt) and text-to-speech (tts) are two problems where deep learning seems promising, especially compared to traditional ML approaches built on complex feature engineering.
Even if academia and Big Tech are moving fast, like FAIR with wav2vec, those models are still not as widely deployed and used as their image/NLP counterparts. We don't have nearly as many tutorials and as much feedback on how to use them and which kinds of problems they answer well or not.
Learning tts and stt: your options
Ok, let's say you want to become a text-to-speech expert; there are several ways to get ramped up:
- Reading research papers
- Watching YouTube videos
- Forking GitHub repos
- Attending online courses
- Reading books
Let's review each method and identify its pros and cons:
Research papers are great if you already know the basics of the field. Otherwise, prepare to be overwhelmed with new concepts on every line you read.
Having done some signal processing in college, I started my journey reading papers, like the ones from Google's Tacotron series.
- Pro - Reading papers gives you access to the "ground truth" that inspires many implementations you'll find on GitHub.
- Con - Academic jargon doesn't aim to explain concepts in layman's terms. It's hard to distinguish between the key concepts the authors use to solve a problem and the low-level tricks they use to beat the previous SOTA.
- Papers can be biased towards their proposed architecture. Researchers have little incentive to fairly acknowledge the advantages of other approaches, especially those from a competing university or Big Tech lab.
- Read the abstract and conclusion first. Only then decide whether the full paper is worth reading.
I had strong priors against watching YouTube videos for educational purposes. It's so easy to consume an endless stream of media without being proactive. Still, when you start in a domain, what you look for is intuition about the key concepts. Armed with this intuition, you'll develop your understanding of the field enough to formulate specific questions.
I found this video from Ubisoft La Forge to be particularly good at conveying intuition behind implementing a tts pipeline. (Note: La Forge is the Ubisoft team responsible for making viable products out of AI proof of concepts.)
They describe their ongoing work to have a text-to-speech model generate non-playable characters' voices in real time while you're playing an open world like Assassin's Creed, Watch Dogs, or Far Cry. That's badass. And entertaining.
- Pro - Entertaining, low barrier to entry (whereas committing to a research paper requires lots of focus and mental energy).
- Con - Watch out for the rabbit hole. YouTube's goal is to maximize your watch time.
- As you watch such videos, be active. Take notes of the concepts you don't understand and check them out once the video is over. Ideally, don't stop the video while taking notes. You just need to write down a few words like "Griffin-Lim algorithm" when you hear a new concept.
- Watch videos at 1.5x speed to optimize your time.
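To make those jotted-down concepts concrete: the Griffin-Lim algorithm I mentioned recovers a time-domain waveform from a magnitude spectrogram (which has no phase) by bouncing back and forth between the time and spectral domains. Here's a minimal sketch of the idea using numpy and scipy; this is my own toy illustration, not code from any of the videos or papers above.

```python
import numpy as np
from scipy.signal import stft, istft


def griffin_lim(magnitude, n_iter=32, nperseg=256):
    """Estimate a waveform whose STFT magnitude matches `magnitude`
    (Griffin & Lim, 1984), starting from random phase."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Back to the time domain with the current phase guess...
        _, signal = istft(magnitude * phase, nperseg=nperseg)
        # ...then forth again, keeping only the phase of the result.
        _, _, spectrum = stft(signal, nperseg=nperseg)
        phase = np.exp(1j * np.angle(spectrum))
    _, signal = istft(magnitude * phase, nperseg=nperseg)
    return signal


# Toy demo: throw away the phase of a 440 Hz tone, then reconstruct it.
sr = 8000
t = np.arange(sr) / sr
original = np.sin(2 * np.pi * 440 * t)
_, _, Z = stft(original, nperseg=256)
reconstructed = griffin_lim(np.abs(Z))
```

Production tts systems historically used exactly this trick to turn a predicted spectrogram into audio, before neural vocoders took over.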
I'm a big fan of Andrew Ng's online courses on Coursera. His courses contributed to making Coursera the success it is today. But while Coursera has plenty of resources for image and NLP, it doesn't have a single course on deep learning for audio (as of April 2021).
The existing audio courses take an old-fashioned signal processing approach that isn't fit to tackle problems such as text-to-speech or speech-to-text. Guess we'll have to wait.
Sure enough, Coursera isn't the only online course provider. YouTube offers a promising alternative until the "established" providers catch up. Valerio Velardo has made audio for deep learning his sweet spot, with several series mixing theoretical concepts and hands-on implementation.
- Pro - Courses offer a well-defined syllabus that you can cherry-pick from.
- Con - With an unknown teacher, it's harder to assess their expertise in the domain.
- Run a brief background check on the course teacher to assess their credentials in the field. Previous related blog posts? Videos?
If well-crafted courses for DL audio are scarce, there are plenty of GitHub repos implementing research papers, some from the papers' authors themselves, some from outsiders who want to contribute to open source software.
E.g., this voice cloning app built by a student during their Master's thesis.
- Pro - If you have a clear use case and the repo addresses it, it's your lucky day. 🍀
- Con - Open source is messy: you're likely to spend hours struggling with the project installation and replicating the maintainer's results. Hello, Python dependency hell. Let alone retraining the model for your use case. Implementing research models in an industry-usable fashion is a skill to practice.
- Some repos offer an interesting collection of cherry-picked papers, blog articles, and videos. They're a good starting point. See this repo for a collection of curated papers and blog posts on tts and stt.
Books are great reference companions, but they're expensive. Both in $ and in reading time.
Worse, for technical concerns like library versions, they're outdated before they reach print. Who wants to follow a TensorFlow v1 tutorial in 2021?
- Pro - They're worth it for understanding key concepts and previous SOTA approaches, and books are structured documents that make information retrieval easy. Plus, you can check reviews of the book before committing to it.
- Con - Books get outdated fast for technical considerations and implementations. Prefer online documentation.
- Keep some reading notes; they will help you digest the content.
- One I've been recommended is Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications by Meinard Müller. After checking it out, it's not that useful for speech processing beyond the general concepts of audio signal processing and the Fourier transform.
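Those general concepts do matter, though: the Fourier transform is the doorway into the spectral domain this whole post is named after. As a quick illustration (my own toy example, not taken from the book), numpy's FFT is enough to see it in action:

```python
import numpy as np

# A one-second, 440 Hz tone sampled at 8 kHz; a toy stand-in for real speech.
sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

# rfft projects the signal onto sinusoids: this is the "spectral domain".
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The dominant frequency bin lands on the tone we synthesized.
peak_hz = freqs[np.argmax(spectrum)]
```

Spectrograms, mel features, and the inputs to most tts/stt models are all built on this one operation applied to short, overlapping windows of audio.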
Closing thoughts 🧭
The best way to get lost is to jump on the train without knowing where you want to go.
Having a well-defined motivating use case will give you a clear compass to establish your syllabus. This compass will help you when you're wondering whether you should watch another YouTube video, or start getting your hands on the code.
I hope 2021 will see the rise of audio-expert YouTubers making the field more accessible.