I was introduced to the shortest superstring problem by a request I got on Fiverr. The request turned out to be a waste of time but I really enjoyed learning about it.
Problem statement
The shortest common superstring problem aims to find a string with a minimal length that contains every string in a given set.
Example
Suppose the given set has two strings: ABA and AAB.
We could say ABAAB contains both ABA and AAB by simply merging the last character in ABA and the first character in AAB, both being A. The resulting superstring has 5 characters. This is a possible “solution”.
Another way to do it is to flip the order and merge the last two characters in AAB and the first two in ABA. We then get AABA. This contains both ABA and AAB and has only 4 characters. So this one is shorter than the first “solution”. We then say this one is a better solution, a more optimal solution.
This one is after all the best solution for the given set.
Relevance
As a professor in grad school used to say: “So what?”. We care because this can help in various ways. The most commonly cited uses are in DNA research and in the development of data compression algorithms.
The difficulty
This problem is a lot harder than it looks. The problem will grow exponentially more complex with more strings in the set, longer strings, and more distinct characters.
It is a somewhat known and classical problem in computer science and can be “solved” but it is considered an NP-complete problem. Basically, it means that it will be very hard to solve in a timely manner. I mentioned similarly hard problems in a previous post.
These problems are so hard to solve that they are usually tackled with a “greedy algorithm” which simply provides an approximate, good (but not the best) solution in a very efficient and quick way.
“Future work”
In a separate post, I will show how I implemented the greedy algorithm using VBA, as I am not a programmer but I repeatedly end up doing programmatic things, and VBA is very easy to work with.
And hopefully, in a third post, I will show how I combined the greedy algorithm with some optimization to get a better solution than with the greedy algorithm alone.