Letter Coverage

Problem Description

The letter/word coverage problem is the problem of finding the minimum number of words from the five-letter-words set (see Five Letter Words) to cover X letters of the alphabet.

The problem has a couple of variations: we might provide a set of letters, and search for the smallest number of words that can cover those particular letters. Or we might give an integer N <= 26, and search for the smallest number of words that can cover the first N letters of the alphabet.

We may also wish to restrict our search to the first M words of the 5757 total five letter words.

For the sake of simplicity, we will start with the problem of considering the first N letters of the alphabet, and the shortest sequence of words that will provide coverage of those first N letters of the alphabet.

Solution

Description

This dynamic programming problem requires us to both maximize and minimize two different quantities. Our top priority is to maximize the number of letters covered; our second priority is to minimize the number of words used. Thus, we are always trying to be greedy and cover more letters; but if we have two solutions that both cover the same number of letters, we break the tie by comparing the number of words used.

The procedure is as follows:

For each word (word_i), we must examine all prior words (word_j, where j < i)
For each pair of words, we need to take the union (OR) of the character coverage for word_i and the solution bit vector for word_j (that is, the best-covered solution so far for word_j)
(Note: for word_i we need to store one of these unions as the best-covered solution so far for word_i, but we aren't sure which one yet.)
For the given pair of words, word_i and word_j, we are looking at word_j and the possibility of extending that with word_i
Compute the number of 1s in the union with, e.g., sum(bit_vector)
Compute the number of words in the solution for word_j and add 1 to it (add word_i to the solution)
Searching for the prior solution word_j that leads to the maximum number of 1s
Break ties by picking the word_j that will minimize number of words
Once you find the best word_j, save the union bit vector for word_i and word_j under word_i combined solution bit vector; save the length of 1s in the combined solution bit vector; save the number of words so far in the solution

Once you've gone through all words, you're ready to find the minimum:

Search the solutions for every word, and pick out the one that maximizes the number of 1s and minimizes the number of words.
To get the actual sequence of words, we also need to save the prior word that leads to the maximum number of 1s and minimum number of words, for each word. Then at the end we can backtrack through the words that compose the solution.

Pseudocode

word2bitvector method:

Turn a five-letter word into a bit vector representing character coverage.

main method:

Initialization step
- Initialize best coverage bit vector
- Initialize maximum number of 1s
- Initialize number of words in current solution
- Initialize backtracking array (for constructing final solution)
Fencepost step: set things up with word 0 (base case)
For each remaining word i:
- Fencepost step: set things up with word i-1
- For each prior word j < i:
  - Compute the new potential best coverage bitvector
  - Compute the number of ones in the new potential best coverage bitvector
  - Compute number of words in new potential best solution
  - Check if this solution is better than the current one - if so, overwrite the old solution
Get the solution:
- Find maximum indices of vector of number of ones (this is potentially multiple indices)
- Find minimum number of words corresponding to the maximum indices
- Backtrack through the solution indices

Code

Link to code: https://git.charlesreid1.com/cs/five-letter-words/src/master/letter_coverage.py

word2bitvector method:

def word2bitvector(word,N):
    """
    Turns a five-letter word into a bit vector representing character coverage.
    Uses 26 letters by default.
    """
    bit_vector = [False,]*N
    for c in word:
        i = ord(c)-ord('a')
        try:
            bit_vector[i] = True
        except IndexError:
            pass
    return np.array(bit_vector)

The main method:

    # Initialization:
    # ----------------

    # Store best coverage bitvectors for each word
    bestcoverage_bv = [[False]*N for k in range(len(words))]

    # Store number of 1s for best coverage vector for each word
    ones_bv = [0]*len(words)

    # Store number of words in best solution for each word
    ws = [0]*len(words)

    # Store prior word for backtracking
    bt = [-1]*len(words)

Fencepost initialization step:

    # Fencepost: Initial Step
    # Word 0
    # ----------------

    # Start with word 0
    wi = words[0]

    # Best letter coverage bit vector
    bestcoverage_bv[0] = word2bitvector(words[0],N)

    # Length of 1s
    ones_bv[0] = sum(bestcoverage_bv[0])

    # Number of words in best solution:
    ws[0] = 1

    # Backtracking: first word has no prior word
    bt[0] = 0

Loop over words i:

    for i in range(1,len(words)):
        wi = words[i]

        # Start with bitvector of word i's coverage
        wi_bv = word2bitvector(wi,N)

        # Fencepost: Initial Step 
        # Word i-1 
        # Assume this is the solution,
        # and if we find a better one later,
        # we can overwrite it.
        # ------------------------

        # word i-1
        wj = words[i-1]

        # Get the best coverage bitvector for i-1
        wj_bv = bestcoverage_bv[i-1]

        # Get combined bitvector for word i and best coverage bitvector for word j
        bestcoverage_bv[i] = np.logical_or(wi_bv,wj_bv)

        # Count ones in new combined bitvector
        ones_bv[i] = sum(bestcoverage_bv[i])

        # Number of words in new best solution:
        ws[i] = ws[i-1]+1

        # Backtracking
        bt[i] = i-1

Loop over words j < i:

        # Now loop over the rest of the words,
        # and look for a better solution.
        for j in reversed(range(1,i)):

            # Get the prior word
            wj = words[j]

            # Get best coverage bitvector 
            wj_bv = bestcoverage_bv[j]

            # (potential) new combined coverage vector
            bestcoverage_bv_i = np.logical_or(wi_bv, wj_bv)

            # Number of ones in (potential) new combined coverage vector
            ones_bv_i = sum(bestcoverage_bv_i)

            # Number of words in (potential) new best solution
            ws_i = ws[j]+1

            # If this solution is better than our current one,
            # overwrite the current solution.
            # (Better means, more ones or fewer words.)

            if(ones_bv_i > ones_bv[i]):
                bestcoverage_bv[i] = bestcoverage_bv_i
                ones_bv[i] = ones_bv_i
                ws[i] = ws_i
                bt[i] = j

            # It's tempting to stop early,
            # but what if we find the perfect 
            # solution right at the end?!?

Construct the actual solution:

    # Okay, now actually get the solution.
    # The solution is the maximum of ones_bv and the minimum of ws
    # 
    # Start by finding the maximum(s) of ones_bv
    # Then check each corresponding index of ws
    ones_bv_indices = [k for k,v in enumerate(ones_bv) if v==max(ones_bv)]

    min_key = ones_bv_indices[0]
    min_val = ones_bv[ones_bv_indices[0]]
    for ix in reversed(ones_bv_indices[1:]):
        if(ones_bv[ix] < min_key):
            min_key = ix
            min_val = ones_bv[ix]

    print("Min key:")
    print(min_key)
    print("Min val:")
    print(min_val)

Example

Simple Manual Example

Let's walk through an example manually to illustrate the approach.

Suppose we are considering 2-letter words taken from a 5-letter alphabet abcde. We can represent a given word as a binary string/bit vector: for example, the two-letter word ab would be represented by the bit vector 11000.

Now let's look at a set of words and step through the algorithm:

W1 = ab = 11000
W2 = bc = 01100
W3 = aa = 10000
W4 = dd = 00010
W5 = de = 00011
W6 = bb = 01000

Now, we wish to write a dynamic program that will find the smallest set of words such that taking the union of each bit vector gives 11111. At each step, we seek the words that will maximize the length of the sequence of 1s in the union of the bit vectors, for the minimum number of words. We take the union of the "largest sequence of 1s" bit vector from the prior step, plus the bit vector for the current step.

W1:

Start with word W1: this is the only bit vector, so it sets the starting "largest sequence of 1s" bit vector. We wish to maximize "largest sequence of 1s" and minimize number of words.

W1 solution is therefore $11000$ . The number of 1s is 1. The number of words is 1. (W1 SOLUTION)

W2:

Next is word W2: the "largest sequence of 1s" bit vector is the union of the prior step's "largest sequence of 1s" bit vector and the current word's bit vector. One option:

union of W1 solution and W2 $11000\bigcup 01100=11100$ . The number of 1s is 3. The number of words is 2. (W2 SOLUTION)

W3:

Next is word W3: the "largest sequence of 1s" bit vector is the union that maximizes the number of 1s and minimizes the number of words. Two options:

union of W1 solution and W3 $11000\bigcup 10000=11000$ . The number of 1s is 2. The number of words is 2.
union of W2 solution and W3 $11100\bigcup 10000=11100$ . The number of 1s is 3. The number of words is 3. (W3 SOLUTION)

W4:

Next is word W4: the "largest sequence of 1s" bit vector is the union that maximizes the number of 1s and minimizes the number of words. Three options:

union of W1 solution and W4 $11000\bigcup 00010=11010$ . The number of 1s is 3. The number of words is 2.
union of W2 solution and W4 $11100\bigcup 00010=11110$ . The number of 1s is 4. The number of words is 3. (W4 SOLUTION)
union of W3 solution and W4 $11100\bigcup 00010=11110$ . The number of 1s is 4. The number of words is 4.

W5:

Next is word W5: the "largest sequence of 1s" bit vector is the union maximizing number of 1s and minimizing number of words. Four options:

union of W1 solution and W5 $11000\bigcup 00011=11011$ . The number of 1s is 4. The number of words is 2.
union of W2 solution and W5 $11100\bigcup 00011=11111$ . The number of 1s is 5. The number of words is 3. (W5 SOLUTION)
union of W3 solution and W5 $11100\bigcup 00011=11111$ . The number of 1s is 5. The number of words is 4.
union of W4 solution and W5 $11110\bigcup 00111=11111$ . The number of 1s is 5. The number of words is 4.

W6:

Next is word W6: the "largest sequence of 1s" bit vector is the union maximizing number of 1s and minimizing number of words. Five options:

union of W1 solution and W6 $11000\bigcup 01000=11000$ . The number of 1s is 2. The number of words is 2.
union of W2 solution and W6 $11100\bigcup 01000=11100$ . The number of 1s is 3. The number of words is 3.
union of W3 solution and W6 $11100\bigcup 01000=11100$ . The number of 1s is 3. The number of words is 4.
union of W4 solution and W6 $11110\bigcup 01000=11110$ . The number of 1s is 4. The number of words is 4.
union of W5 solution and W6 $11111\bigcup 01000=11111$ . The number of 1s is 5. The number of words is 4. (W6 SOLUTION)

(NOTE: We don't need to consider every possible combination of W1, W2, W3, W4, W5, and W6; we only need to consider each word once, because each word's current solution can be written in terms of the prior word's solution, so we only need to consider solutions for each word. We've already considered the non-solutions and can therefore ignore them because they don't maximize number of 1s and minimize number of words.)

Thus far, we have found a local solution for each word. We can now compare all of these local solutions to find a global solution. The global solution will maximize the number of 1s found (meaning we can toss out any solutions that have less than 5 1s), and minimizes the total number of words (meaning, our W5 solution gives us the global optimum).

Therefore our global solution is the W5 solution: 5 1s, and 3 words. Thus, backtracking, we see that the words W1, W2, W5 cover all of the first five letters, with the minimum number of total words.

W1 = ab = 11000
W2 = bc = 01100
W5 = de = 00011

Flags