Letter Coverage
From charlesreid1
Contents
Problem Description
The letter/word coverage problem is the problem of finding the minimum number of words from the five-letter-words set (see Five Letter Words) to cover X letters of the alphabet.
The problem has a couple of variations: we might provide a set of letters, and search for the smallest number of words that can cover those particular letters. Or we might give an integer N <= 26, and search for the smallest number of words that can cover the first N letters of the alphabet.
We may also wish to restrict our search to the first M words of the 5757 total five letter words.
For the sake of simplicity, we will start with the problem of considering the first N letters of the alphabet, and the shortest sequence of words that will provide coverage of those first N letters of the alphabet.
Solution
Description
This dynamic programming problem requires us to both maximize and minimize two different quantities. Our top priority is to maximize the number of letters covered; our second priority is to minimize the number of words used. Thus, we are always trying to be greedy and cover more letters; but if we have two solutions that both cover the same number of letters, we break the tie by comparing the number of words used.
The procedure is as follows:
- For each word (word_i), we must examine all prior words (word_j, where j < i)
- For each pair of words, we need to take the union (OR) of the character coverage for word_i and the solution bit vector for word_j (that is, the best-covered solution so far for word_j)
- (Note: for word_i we need to store one of these unions as the best-covered solution so far for word_i, but we aren't sure which one yet.)
- For the given pair of words, word_i and word_j, we are looking at word_j and the possibility of extending that with word_i
- Compute the number of 1s in the union with, e.g., sum(bit_vector)
- Compute the number of words in the solution for word_j and add 1 to it (add word_i to the solution)
- Searching for the prior solution word_j that leads to the maximum number of 1s
- Break ties by picking the word_j that will minimize number of words
- Once you find the best word_j, save the union bit vector for word_i and word_j under word_i combined solution bit vector; save the length of 1s in the combined solution bit vector; save the number of words so far in the solution
Once you've gone through all words, you're ready to find the minimum:
- Search the solutions for every word, and pick out the one that maximizes the number of 1s and minimizes the number of words.
- To get the actual sequence of words, we also need to save the prior word that leads to the maximum number of 1s and minimum number of words, for each word. Then at the end we can backtrack through the words that compose the solution.
Pseudocode
word2bitvector method:
- Turn a five-letter word into a bit vector representing character coverage.
main method:
- Initialization step
- Initialize best coverage bit vector
- Initialize maximum number of 1s
- Initialize number of words in current solution
- Initialize backtracking array (for constructing final solution)
- Fencepost step: set things up with word 0 (base case)
- For each remaining word i:
- Fencepost step: set things up with word i-1
- For each prior word j < i:
- Compute the new potential best coverage bitvector
- Compute the number of ones in the new potential best coverage bitvector
- Compute number of words in new potential best solution
- Check if this solution is better than the current one - if so, overwrite the old solution
- Get the solution:
- Find maximum indices of vector of number of ones (this is potentially multiple indices)
- Find minimum number of words corresponding to the maximum indices
- Backtrack through the solution indices
Code
Link to code: https://git.charlesreid1.com/cs/five-letter-words/src/master/letter_coverage.py
word2bitvector method:
def word2bitvector(word,N): """ Turns a five-letter word into a bit vector representing character coverage. Uses 26 letters by default. """ bit_vector = [False,]*N for c in word: i = ord(c)-ord('a') try: bit_vector[i] = True except IndexError: pass return np.array(bit_vector)
The main method:
# Initialization: # ---------------- # Store best coverage bitvectors for each word bestcoverage_bv = [[False]*N for k in range(len(words))] # Store number of 1s for best coverage vector for each word ones_bv = [0]*len(words) # Store number of words in best solution for each word ws = [0]*len(words) # Store prior word for backtracking bt = [-1]*len(words)
Fencepost initialization step:
# Fencepost: Initial Step # Word 0 # ---------------- # Start with word 0 wi = words[0] # Best letter coverage bit vector bestcoverage_bv[0] = word2bitvector(words[0],N) # Length of 1s ones_bv[0] = sum(bestcoverage_bv[0]) # Number of words in best solution: ws[0] = 1 # Backtracking: first word has no prior word bt[0] = 0
Loop over words i:
for i in range(1,len(words)): wi = words[i] # Start with bitvector of word i's coverage wi_bv = word2bitvector(wi,N) # Fencepost: Initial Step # Word i-1 # Assume this is the solution, # and if we find a better one later, # we can overwrite it. # ------------------------ # word i-1 wj = words[i-1] # Get the best coverage bitvector for i-1 wj_bv = bestcoverage_bv[i-1] # Get combined bitvector for word i and best coverage bitvector for word j bestcoverage_bv[i] = np.logical_or(wi_bv,wj_bv) # Count ones in new combined bitvector ones_bv[i] = sum(bestcoverage_bv[i]) # Number of words in new best solution: ws[i] = ws[i-1]+1 # Backtracking bt[i] = i-1
Loop over words j < i:
# Now loop over the rest of the words, # and look for a better solution. for j in reversed(range(1,i)): # Get the prior word wj = words[j] # Get best coverage bitvector wj_bv = bestcoverage_bv[j] # (potential) new combined coverage vector bestcoverage_bv_i = np.logical_or(wi_bv, wj_bv) # Number of ones in (potential) new combined coverage vector ones_bv_i = sum(bestcoverage_bv_i) # Number of words in (potential) new best solution ws_i = ws[j]+1 # If this solution is better than our current one, # overwrite the current solution. # (Better means, more ones or fewer words.) if(ones_bv_i > ones_bv[i]): bestcoverage_bv[i] = bestcoverage_bv_i ones_bv[i] = ones_bv_i ws[i] = ws_i bt[i] = j # It's tempting to stop early, # but what if we find the perfect # solution right at the end?!?
Construct the actual solution:
# Okay, now actually get the solution. # The solution is the maximum of ones_bv and the minimum of ws # # Start by finding the maximum(s) of ones_bv # Then check each corresponding index of ws ones_bv_indices = [k for k,v in enumerate(ones_bv) if v==max(ones_bv)] min_key = ones_bv_indices[0] min_val = ones_bv[ones_bv_indices[0]] for ix in reversed(ones_bv_indices[1:]): if(ones_bv[ix] < min_key): min_key = ix min_val = ones_bv[ix] print("Min key:") print(min_key) print("Min val:") print(min_val)
Example
Simple Manual Example
Let's walk through an example manually to illustrate the approach.
Suppose we are considering 2-letter words taken from a 5-letter alphabet abcde. We can represent a given word as a binary string/bit vector: for example, the two-letter word ab
would be represented by the bit vector 11000
.
Now let's look at a set of words and step through the algorithm:
W1 = ab = 11000 W2 = bc = 01100 W3 = aa = 10000 W4 = dd = 00010 W5 = de = 00011 W6 = bb = 01000
Now, we wish to write a dynamic program that will find the smallest set of words such that taking the union of each bit vector gives 11111. At each step, we seek the words that will maximize the length of the sequence of 1s in the union of the bit vectors, for the minimum number of words. We take the union of the "largest sequence of 1s" bit vector from the prior step, plus the bit vector for the current step.
W1:
Start with word W1: this is the only bit vector, so it sets the starting "largest sequence of 1s" bit vector. We wish to maximize "largest sequence of 1s" and minimize number of words.
- W1 solution is therefore . The number of 1s is 1. The number of words is 1. (W1 SOLUTION)
W2:
Next is word W2: the "largest sequence of 1s" bit vector is the union of the prior step's "largest sequence of 1s" bit vector and the current word's bit vector. One option:
- union of W1 solution and W2 . The number of 1s is 3. The number of words is 2. (W2 SOLUTION)
W3:
Next is word W3: the "largest sequence of 1s" bit vector is the union that maximizes the number of 1s and minimizes the number of words. Two options:
- union of W1 solution and W3 . The number of 1s is 2. The number of words is 2.
- union of W2 solution and W3 . The number of 1s is 3. The number of words is 3. (W3 SOLUTION)
W4:
Next is word W4: the "largest sequence of 1s" bit vector is the union that maximizes the number of 1s and minimizes the number of words. Three options:
- union of W1 solution and W4 . The number of 1s is 3. The number of words is 2.
- union of W2 solution and W4 . The number of 1s is 4. The number of words is 3. (W4 SOLUTION)
- union of W3 solution and W4 . The number of 1s is 4. The number of words is 4.
W5:
Next is word W5: the "largest sequence of 1s" bit vector is the union maximizing number of 1s and minimizing number of words. Four options:
- union of W1 solution and W5 . The number of 1s is 4. The number of words is 2.
- union of W2 solution and W5 . The number of 1s is 5. The number of words is 3. (W5 SOLUTION)
- union of W3 solution and W5 . The number of 1s is 5. The number of words is 4.
- union of W4 solution and W5 . The number of 1s is 5. The number of words is 4.
W6:
Next is word W6: the "largest sequence of 1s" bit vector is the union maximizing number of 1s and minimizing number of words. Five options:
- union of W1 solution and W6 . The number of 1s is 2. The number of words is 2.
- union of W2 solution and W6 . The number of 1s is 3. The number of words is 3.
- union of W3 solution and W6 . The number of 1s is 3. The number of words is 4.
- union of W4 solution and W6 . The number of 1s is 4. The number of words is 4.
- union of W5 solution and W6 . The number of 1s is 5. The number of words is 4. (W6 SOLUTION)
(NOTE: We don't need to consider every possible combination of W1, W2, W3, W4, W5, and W6; we only need to consider each word once, because each word's current solution can be written in terms of the prior word's solution, so we only need to consider solutions for each word. We've already considered the non-solutions and can therefore ignore them because they don't maximize number of 1s and minimize number of words.)
Thus far, we have found a local solution for each word. We can now compare all of these local solutions to find a global solution. The global solution will maximize the number of 1s found (meaning we can toss out any solutions that have less than 5 1s), and minimizes the total number of words (meaning, our W5 solution gives us the global optimum).
Therefore our global solution is the W5 solution: 5 1s, and 3 words. Thus, backtracking, we see that the words W1, W2, W5 cover all of the first five letters, with the minimum number of total words.
W1 = ab = 11000 W2 = bc = 01100 W5 = de = 00011
Flags
The Art of Computer Programming notes from reading Donald Knuth's Art of Computer Programming
Part of the 2017 CS Study Plan.
Mathematical Foundations: AOCP/Infinite Series · AOCP/Binomial Coefficients · AOCP/Multinomial Coefficients AOCP/Harmonic Numbers · AOCP/Fibonacci Numbers Puzzles/Exercises:
Volume 2: Seminumerical Algorithms
Volume 3: Sorting and Searching AOCP/Combinatorics · AOCP/Multisets · Rubiks Cube/Permutations
AOCP/Combinatorial Algorithms · AOCP/Boolean Functions AOCP/Five Letter Words · Rubiks Cube/Tuples AOCP/Generating Permutations and Tuples
|
Algorithms Part of Computer Science Notes
Series on Algorithms
Algorithms/Sort · Algorithmic Analysis of Sort Functions · Divide and Conquer · Divide and Conquer/Master Theorem Three solid O(n log n) search algorithms: Merge Sort · Heap Sort · Quick Sort Algorithm Analysis/Merge Sort · Algorithm Analysis/Randomized Quick Sort
Algorithms/Search · Binary Search · Binary Search Modifications
Algorithms/Combinatorics · Algorithms/Combinatorics and Heuristics · Algorithms/Optimization · Divide and Conquer
Algorithms/Strings · Algorithm Analysis/Substring Pattern Matching
Algorithm complexity · Theta vs Big O Amortization · Amortization/Aggregate Method · Amortization/Accounting Method Algorithm Analysis/Matrix Multiplication
Estimation Estimation · Estimation/BitsAndBytes
Algorithm Practice and Writeups Project Euler · Five Letter Words · Letter Coverage
|