bio 2021 question 1

Part A

This problem can be solved simply using a recursive algorithm:

function is_pat(string: String) -> bool
    if string.len() == 1 then
        return True
    else
        is_pat = False
        for i=0 to string.len() - 2 do
            let (first_half, second_half) = string.split_at(i)
            if NOT min(first_half) > max(second_half) then
                return False
            endif
            if is_pat(reversed(first_half)) AND is_pat(reversed(second_half)) then
                is_pat = True
                break
            endif
        endfor
        return is_pat
endfunction

It is a good idea to check that this algorithm is fast enough by testing some down pats that are twelve letters long.

Part B

This question is easily solved once you have solved 1A by generating all the permutations of ABCD and checking which ones are pats (you can even do this by hand).

Part C

This question is a bit more involved (I would argue that it is harder than question 1.A) and requires some combinatorics.

The trick to thinking about this question is to generalise – it's much easier to think about a down pat in the abstract than it is to thinking about different specific down pats and try to draw conclusions about the number of permutations.

First we can simplify the problem – we know that the down pat will be in the form B___A (this is because B comes one letter after A in the alphabet, so it is not possible to put a letter after A because this letter would be after B in the alphabet, and thus not all the characters in the left string would be higher in the alphabet than B). So we can split B___A into ___B and A. For ___B we know that the letter before B will be after B in the alphabet (C or higher) this means that we can only split ___B into ___ and B (because if we had CB, when we reverse this we get BC, and B is not higher than C in the alphabet, so this is not a down pat – the argument can be extended for other letters and greater splits).

We can conclude that we have 24 letters, and we would like to see how many different ways we can arrange them that are also valid down pats. We could think about individual permutations, and whether or not these are valid down pats – this is really messy though!

There is a better way; we can consider the structure of every down pat simultaneously. Let's think about a 24 letter (valid – this is the crucial assumption; one way that we can define a pat is that for a pat of n letters, we can split it n-1 times in total, before we have just single letters – i.e. we have a binary tree with n-1 nodes with children) down pat – we can split it in 23 different places.

You might find these diagrams useful as a visual aid:

A visualisation of how the down-pat "BCA" can be decomposed

  • 3 letters, 3-1 nodes with children

A visualisation of how the down-pat "BDCA" can be decomposed

  • 4 letters, 4-1 nodes with children

So we could have splits in the form (1, 23), (2, 22), (3, 23), ..., (23, 1) (where each number represents the length of the pat). How does this help us? Well, it means the number of ways this "tree" (think about it – every down pat has two "children" and all of those childen have either two children or no children) can be arranged is:

arrangements(1) * arrangements(23) + arrangements(2) * arrangements(22) + arrangements(3) * arrangemenents(23) + ... + arrangements(23) * arrangements(1).

Hopefully this is looking quite solvable now! But why is this true? Well, let's think about the first case, where we place one node on the right of the tree, and one node on the left of the tree (see my diagram below).

Diagram showing a binary tree with one node with value 24, connceted to two child nodes, one node with value 23 and the other with value 1

The number of ways that we can rearrange this tree is the number of ways that we can rearrange a tree with one node (i.e. one) times the number of ways that we can rearrange a tree with twenty-three nodes (it's not immediately obvious how we do this, but note that we can break this problem down recursively in the same way that we did with the 24 letter one). Why? Because for every possible arrangement of the left node, we can have every other arrangement of the right node (in this case, that's one). I've emphasised some of the words there because hopefully they help to make it obvious where this program comes from:

total_options = 0
// for every arrangement of the left node
for i=1 to arrangements(left_node) do
    // we have every arrangement of the right node
    for i=1 to arrangements(right_node) do
        // and this is a distinct arrangement of the whole tree
        total_options += 1
    endfor
endfor

Note that this is equivalent to arrangements(left_node) * arrangements(right_node).

So overall, we have an algorithm that looks a bit like

function arrangements(nodes: nat) -> nat
    // this is our basis case
    if nodes == 1 then
        return 1
    endif

  // we reduce the other cases in terms of the basis case
    total = 0
    for i=1 to nodes - 1 do
        total += arrangements(i) * arrangements(nodes - i)
    endfor
    return total
endfunction

This takes into account all the ways we can build distinct trees.

We are starting with 24 nodes, so the value we want is arrangements(24) (this equals 343059613650). Unfortunately, this function above (although very simple) is too slow as-is. It can easily be sped up using dynamic programming; see a possible solution that runs which I wrote.

The key is never to compute the same value twice – we just save the output value of the function into (for example) a hashtable (aka hashmap, aka dictionary).

A really easy way to do this in Python is to use the built-in cache decorator (cache documentation) – note that this seems to only exist in Python 3.10 (so you may need to replace cache with lru_cache):

from functools import cache

@cache
def arrangements(nodes: int) -> int:
    if nodes == 1:
        return 1
    total = 0
    for i in range(1, nodes):
        total += arrangements(i) * arrangements(nodes - i)
    return total

print(arrangements(24))

Using very statistically dubious measurement techniques, I measured the time taken for each approach:

  • With cacheing: 0.0001087188720703125s
  • Without cacheing: you know, my machine is still trying to work this out. We'll just say a lot. I think this a great illustration of just how effective dynamic programming can be. Update: it took 24854.752412080765s (ouch) which is about 7 hours!

An aside

It is sometimes joked that

There are two hard problems in computer science – naming things, cache invalidation [working out when to remove things from a cache] and off by one errors.

And not completely without merit – cacheing can be hard. Although it is not the case for our function, sometimes we might have too many values to store in our cache. In this case, we could use an LRU cache.

LRU stands for "least recently used" and is an eviction strategy (i.e. if our cache is too big, we have to remove an element, which element do you choose?) for cacheing. A cache stores values that have previously been computed – however, sometimes (not in this case) it turns out that there are too many values (relative to the amount of memory we have). One way of removing values is to store when each value was last used, and remove the value that was (you guessed it) least recently used.

For example, if our cache could only support five values, and we already have five values in it:

value | last_used
arrangements(1) = 1  | 4
arrangements(2) = 1  | 3
arrangements(3) = 2  | 2
arrangements(4) = 5  | 1
arrangements(5) = 14 | 0

Then we would select the value with the greatest last_used value (i.e. arrangements(1)) and remove it from the cache, instead inserting the new value that we wanted to insert. This can be useful if you find that you use this method and run out of memory. For this function, though, it is not needed (at least given the size of our input) and we can just use an "unbounded" (i.e. there is no maximum number of items) cache.


You'll only receive email when they publish something new.

More from Teymour Aldridge
All posts