Beam Search

Suppose we have a model that defines a probability distribution over sequences of words, and we want to generate a sentence from it.

It can be very challenging to generate the sequence with the highest overall probability: with a vocabulary of size V, there are V^T possible sequences of length T, so the search space grows exponentially with the length of the sequence.

One simple idea is to use a greedy algorithm where we always aim for the local optimum. In this context, the next word is chosen as the one with the highest conditional probability given the words generated so far. We continue until we reach the end-of-sentence token or a pre-specified maximum sequence length.
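To make this concrete, here is a minimal sketch of greedy decoding. The `next_word_probs` function is a hypothetical stand-in for the language model; I'm assuming it returns a dictionary mapping each candidate next word to its conditional probability given the prefix:

```python
def greedy_decode(next_word_probs, max_len=20, eos="<eos>"):
    """Greedy decoding: at each step, append the single most probable next word.

    `next_word_probs(prefix)` is assumed to return a dict mapping each
    candidate next word to P(word | prefix).
    """
    sequence = []
    for _ in range(max_len):
        probs = next_word_probs(sequence)
        # Local optimum: take the single highest-probability next word.
        word = max(probs, key=probs.get)
        sequence.append(word)
        if word == eos:  # stop at the end-of-sentence token
            break
    return sequence
```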

This greedy search can be very fast, but it doesn't always produce the best sequence. In fact, it can get stuck at a rather bad local optimum.

Beam search can be viewed as a generalization of the above. When generating the next word, instead of keeping only the single most probable continuation, we expand every candidate sequence with all possible next words, and keep the K partial sequences (e.g. K = 5, the beam width) with the highest cumulative probabilities. This process is iterated until the same stopping criteria are reached.

Beam search is also a greedy-type algorithm, but it is allowed to try out more paths and explore a wider portion of the search space. Like best-first search, it is not guaranteed to reach the global optimum, but in practice it tends to generate better results than greedy search. A sketch follows below.
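Here is a minimal sketch of beam search, reusing the same hypothetical `next_word_probs` interface as the greedy version above. It works in log space so that multiplying probabilities becomes addition, which avoids numerical underflow on long sequences:

```python
import math

def beam_search(next_word_probs, beam_width=5, max_len=20, eos="<eos>"):
    """Beam search decoding (a sketch; assumes the hypothetical
    `next_word_probs(prefix)` interface described above).

    Each beam entry is a (log_prob, sequence) pair.
    """
    beams = [(0.0, [])]  # start with a single empty sequence
    completed = []

    for _ in range(max_len):
        candidates = []
        for log_prob, seq in beams:
            if seq and seq[-1] == eos:
                completed.append((log_prob, seq))  # set finished beams aside
                continue
            # Expand this beam with every possible next word.
            for word, p in next_word_probs(seq).items():
                candidates.append((log_prob + math.log(p), seq + [word]))
        if not candidates:
            break  # all beams have finished
        # Keep only the top-K partial sequences by cumulative log-probability.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]

    completed.extend(beams)
    # Return the highest-scoring sequence found.
    return max(completed, key=lambda c: c[0])[1]
```

With `beam_width=1` this reduces exactly to greedy search, which makes the "generalization" relationship explicit. Real decoders typically add refinements such as length normalization, since longer sequences accumulate more negative log-probability and would otherwise be penalized.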

This website offers a concrete implementation of beam search in the context of decoding from a language model. I enjoyed reading it along with the Wikipedia article.

I tried to write my own implementation after reading the principles, but unfortunately wasn't able to finish it within half an hour (arguably a rather arbitrary timeframe). Beyond conceptual understanding, it seems I need a bit more practice coding basic algorithms from the ground up.

