GLH compression

Over lunch today, I was talking to Gwyn and Laurie, two of the Ruby coders from New Bamboo - I work with them on a Top Secret Web Project for work.

We started talking about geeky things, and somehow ended up discussing a compression algorithm, where you would alphabetise each character of a message, and then replacing the number of instances of a character with a number, thereby drastically compressing how much space it takes up.

The first paragraph of Lorem Ipsum, which looks a little something like this:

Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Nulla at neque. Maecenas id velit id lacus volutpat dictum. Sed tellus erat, semper eget, euismod ac, feugiat vitae, nisi. Integer eget purus vel nulla adipiscing commodo. Ut id ante. Nullam lobortis mollis lacus. Quisque ac sem eu lectus scelerisque molestie. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Fusce urna. Ut a felis et est hendrerit hendrerit.

Could be expressed as:

a24 b4 c20 d12 e51 f4 g6 h2 i38 l27 m16 n16 o15 p9 q4 r19 s32 t32 u35 v5

Which is a reduction of 457 to 60 characters - or a 86.9% reduction in space.

Of course, this version is quite crude, and puts all text into lower case, and destroys spacing, which means that you lose all readability. We can easily add 3 more characters, for space, comma and full stop, which adds a bit of length to the compressed data, but makes the post-uncompressed data easier to read:

a24 b4 c20 d12 e51 f4 g6 h2 i38 l27 m16 n16 o15 p9 q4 r19 s32 t32 u35 v5 A70 B5 C10

If you wanted to, you could also allow for more characters and capital letters, but this would immediately double the length of the compressed string, which ruins the point a bit.

Some compression examples

100,000 words of Lorem Ipsum:

a4530 b660 c2406 d1675 e6456 f377 g706 h312 i5462 j71 l3332 m2594 n3478 o2417 p1466 q772 r3045 s4848 t4437 u4983 v863 A9890 B1328 C1630

MD5 hash is 4a1a63d4b2b17733cc64422d77eaa496
SHA1 hash is d063ea401b508e2a8b0467bdb2be67f6e5c6d693

Encoding took about 13 milliseconds.

That’s a compression rate from 67643 to 114 characters, or a reduction of 99.8%

The entire King James V bible:

a282602 b50007 c58750 d159527 e422235 f83096 g57191 h286915 i196033 j13754 k25476 l132226 m85246 n227142 o247290 p46617 q953 r175732 s197573 t318954 u86542 v32428 w65213 x2663 y58249 z4784 A789645 B70684 C26147

MD5 hash is 6336aff90949d987ae7c36ee4c868399
SHA1 hash is 13b0391b632f4300eed7d77a6fc6719fba1832b0

Encoding took just over a second.

Compressing the King James bible like this reduces it from 4,404,445 characters down to 184 characters… In comparison, TAR/GZ, which is one of the industry standards of file compression, gets the same text file down to 1,316,216 characters…

Uncompressing this algorithm

Of course, it’s no good only being able to compress something: you need to be able to decompress it as well. By its very nature, the above compression is lossy (you lose all capitalisation plus special characters, for example), and by the time you alphabetise a string by character, all hope is lost… Right?

The discussion around the lunch table continued. I came up with the suggestion of using a hash of the original file, taken before the file is sorted, but after it has had all its lossy conversions applied. In theory, if you had a good enough strong one-way hash, you could tell a computer the hash and the number of times each character needs to appear, and the computer could brute-force the message, eventually finding the correct message. The process would be like so:

The computer…

  1. takes the compressed message and decompresses it by replacing ‘a3b2′ with aaabb
  2. generates the first possibility of the order of these letters
  3. confirms whether this particular possibility is correct.

If the hash matches, you have, in theory, uncompressed and decoded the message.

But how can you be sure that you have the right message, and not a freak hash collision?

Gwyn came up with the suggestions of using two hashes - if you hash the final file with, say, MD5 and SHA1, it becomes extremely unlikely that a solution which matches with both hashes would be a false positive.

Laurie suggested to, instead, add the first, say, 30 characters of the message in plaintext - this would be about as secure as using two hashes, but a lot less expensive in terms of processing power. In addition, it’d be a cool nod to what Alan Turing’s team did when they were breaking the Enigma codes of World War 2, as we’re essentially using a crib.

So, in effect, the computer…

  1. takes the compressed message and decompresses it by replacing ‘a3 b2′ with aaabb
  2. generates the first possibility of the order of these letters
  3. checks the possibility against the hash
  4. if the hash matches, checks the second hash
  5. if the second hash matches, checks the crib text
  6. if all three match up, we’ve successfully decompressed the original text

Wow, that’s amazing! Have you guys revolutionised the world of text compression?

Why thank you. To be honest, probably not.

The problem is that brute-forcing through large swathes of text is bloody difficult: The string like the Lorem Ipsum example above has 457 characters. To put them in the right order, there are 45729 different combinations - that’s a 1 with 77 zeroes behind it. (and this is also the reason why capital letters is a bad idea. If we were to allow capitals, the number would grow to 45755, which is an even more ludicrously large number of possibilities)

Under ideal circumstances, we can check about 5 million hashes per second, which sounds pretty good, but to re-organise the simple 457 character item, it would still take one of today’s computers 8*1056 million years.

So the great news is that we’ve found a way of drastically reducing the size of plaintext. The second item of great news is that it’s relatively easy to decompress this compression algorithm. The massive red flag of bad news is that, while decompression is easy, it takes an unfathomable amount of time to do so, which makes the whole exercise prohibitively difficult.

So is all of this a waste of time?

Well, yes and no. There’s a competition out there which aims to compress a 100MB plain text version of Wikipedia. The current record stands at about 16MB, which is about twice as efficient as the commonly-used ZIP compression format. The GLH compression algorithm outlined here trounces the record - instead of taking up 16,481,655 bytes, we can describe all of the Hutter Prize data in only 225 bytes:

a6051355 b1212851 c2624828 d2569957 e8178365 f1509695 g1611390 h2988658 i5479606 j159395 k458992 l3169420 m2129600 n5028882 o5266127 p1724317 q270967 r4648610 s4643363 t6463244 u2095880 v739642 w1032926 x234215 y1088981 z124570 A13519824 B787826 C794548

Now if only we could find a way of decompressing this again in under 10 hours (hahahahahah), we could claim the 50,000 euro bounty!

(of course, none of this is new… But it does make me happy that we managed to come up with this stuff on our own!)

Fancy having a go yourself?

No problem, head to the GLH compression booth!

I want to see your code!

Sure thing. It cleans up a string called $toencode, and passes it to this function:

	$encoding = strtolower($toencode); // all to lower case

	$avoid   = array(" ", ",", "."); // Replace whitespace w/ A,B,C
	$replace = array("A", "B", "C");
	$encoding = str_replace($avoid, $replace, $encoding);

// strip all non-alphanumerics
	$encoding = preg_replace("/[^ABCa-z]/", "", $encoding); 

	$checkchars = array("a", "b", "c", "d", "e", "f", "g", "h",
	"i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u",
	"v", "w", "x", "y", "z", "A", "B", "C");

	foreach ($checkchars as &$character) {
	    $numberfound = substr_count($encoding, $character);
		if ($numberfound) { $outputstring .= $character . $numberfound . " "; }
		unset ($number);
	}

Now, $outputstring contains the compressed data, and all you need to do is to md5($encoding) and sha1($encoding) to get the hashes for both of them.

This piece of code is all you need to compress stuff. The bible and the 100MB of Wikipedia were done with this code too, taking about 1 second and 27 seconds, respectively, but this needed quite a serious amount of memory and processing power, so in the interest of keeping slicehost happy, maximum upload is a lot less than those two files :)

Could this be improved any further?

Well, yes, you could, in fact, in theory, remove the original text altogether. Take the original text and hash it against a series of different hashing mechanisms. To help it out, tell it a) how many characters there were in the original text, b) what the first 50 characters are, and c) what the character set is (i.e. only lower-case letters). You could then brute-force the whole message, which means that you could compress a message even further, down to just a series of hashes.

The problem here is that your reconstitution of the message goes from being silly to ludicrous: You’re now, in effect, taking blind guesses at the message, and checking it against a series of hashes. The bigger the message, the longer it will take. I predict it’s going to take a very long time before we can start using hashes-only to compress our data.

… Unless we get those quantum computers, of course.

10 comments.

  1. This is what a free coffee machine does to a development team.

    There are some optimisations here. If we know that the plaintext is syntactically valid English text, we need only check possible messages in this space. Essentially, we’re going from a brute-force attack to a dictionary attack.

    We should probably not stop decompressing when we find the first match - we need to search the whole space in order to guard against hash collisions like “CEASEFIRE NOW” –> “FIANCEE WORSE”.

  2. Whilst I agree that your decompression is going to be painfully slow, I’ve got to question the numbers on just how slow.

    You argue that with a 29 letter alphabet, you have 457^29 possible 457-character strings, of which only one is your lorem ipsum example. This is true, but ignores your compressed data entirely: for instance, one of those strings is 457 z’s, which you can clearly eliminate from consideration by comparing to your a4530 b660 c2406.. data. In fact, taking this approach you’d have an even shorter compression algorithm: just tell someone the message length and the hash, and let them generate strings of that length and test for collision to recover the message.

    By specifying the aggregate data, you’ve already told me what 457 letters I need, just not the order; so I have a pool to draw from. Were all those letters distinct, I’d have a 1/457 chance of getting the first one right, then 1/456 of the next and so on… for 457! possibilities. Still awful, but better- and it would also means that capitals or other symbols make no difference to the odds of success- imagine pulling unique scrabble tiles out of a bag and lining them up; whether the string is right or wrong depends on the order out the bag, not the alphabet employed to make the tiles.

    However, as we don’t have 457 distinct tiles, but some selection of 457 from 29 (or 55, or whatever) types, a lot of the strings generated will be the same. For instance, v1w1×1y1z1 is 5 letters, with 5!=120 possible arrangements. But a3b2 only gives you 10 possible strings- choose 2 places out of 5 to put the b’s, and the a’s fill in the rest. Working out the possible unique arrangements of 29 symbols with multiplicity sounds a bit tough for a Friday morning, but I’ll keep it in mind if I’m teaching Discrete Math again next semester!

  3. Oops, and now I’m making mistakes myself. For a 29 letter alphabet, there are 29^457 (669 digit number) possible 457-character strings, not 457^29 (78 digit number). So our comparison is 29^457 (naive string generation) against something smaller than 457! for selection without replacement via the compression data; but that’s a much bigger number so I really need to factor in those repeated strings. Also I shouldn’t have stopped reading at the code as I see you considered the hash-only approach…

  4. Some pencil and paper work shows it’s actually easy to calculate the number of distinct n-character strings featuring x1 of symbol 1, x2 of symbol 2… x29 of symbol 29 by (n!)/(x1!)(x2!)….(x29!) . So for the particular xi’s for your lorem ipsum example this gives a search space of size k where k is a 648 digit number that probably won’t make it through the spam filter, but is about 10^21 times smaller than the naive space of all 457-character strings.

  5. [...] once again got the chance yesterday to take 30 mins out of the otherwise stressful day to create a new compression algorithm which can reduce the content of Wikipedia (100Mb) down to 225 bytes, beating the previous record of [...]

  6. It certainly sounds plausible, but unfortunately it’s not quite. The killer lies in this sentence:

    “Gwyn came up with the suggestions of using two hashes - if you hash the final file with, say, MD5 and SHA1, it becomes extremely unlikely that a solution which matches with both hashes would be a false positive.”

    What’s true to say is that it’s less likely a double-match would be a false positive than a single match is. If the second hash is, say, 128 bits, then it’s 2^128 times less likely that it’s a false positive.

    Unfortunately, when the message length (in terms of the amount of information in the message) is so much longer than the hash, the chances of any given positive result being the correct result are are so much smaller than the chance of a it being a hash collision, that there are still far more double-hash collisions than there are genuine positives.

    It’s just about plausible for the first number you mention, 457^29 case, which is about the same as 2^256, (which is to say, there’s just as much information in the hash as there is in the ) assuming you have a total of 256 bits worth of hash, but it doesn’t scale up: the probability (and therefore number) of false positives increases exponentially with the amount of information you add, or linearly with the number of permutations.

    To break it down to a trivial case to make it obvious, say you have some letters you’re trying to arrange in such a way that there’s only 256 combinations of them. This naturally takes up 8 bits, and you try to encode this using a 2-bit hash. No matter how good, algorithmically, that 2-bit hash is, it can only ever reduce the number of possibilities by a factor of 22, so you’ll still have 256/4 = 64 positives, 63 of which are false. The same things applies when you scale up to 128-bit hashes with 2^(128+8-2)-bit messages.

    Wowwie, I’m such a killjoy. But at least it’s Friday, right? :)

  7. Actually, I was thinking about 457^29 while doing the dishes, and realised that this is probably intended to represent the raw number of combinations of 457 characters before cutting them down with the set, yes? If that’s the case, the corect number would be 29^457, which is much, much bigger.

    The number I was thinking you meant was the number of permutations after considering your letter set, which is actually

    n!/(Π:l&memb;L.nl!)

    (and if html doesn’t have capital π or set membership entities, I shall be upset)

    where L is the set of letters in your dictionary and nl is the number of instances of the letter l in the message.

    I don’t know how big that is for your 457 character message, but it’s probably quite big, although it should be significantly smaller than 29^457.

  8. Also: I was thinking about the same thing earlier today, actually. My
    calculation in the post presumes that a 4-letter plaintext message has
    4^26 possible options, but if the plaintext message is aaab, there are,
    in fact, only 4 possible combinations - because 4^26 doesn’t take into
    account that we don’t care which of the a’s are in which position.

    This drastically reduces the number of bruteforce possibilities, but I
    don’t really know how to express this mathematically - I believe
    ‘permutations’ is the mathematical field I’m looking at, but it’s well
    beyond the calculus level I stopped at, I think.

  9. Actually, for 4 elements of either a or b, there’s 6 possible combinations:
    Each of A and B have 2 elements, and we’re selecting 4, so that’s 4! / (2! . 2!), which is six. Iterating over the equivalent ones, there’s aabb abab abba and the complements of these, bba, baba and baab.

    Using your first set of numbers for your lorem ipsum without punctuation, we can calculate that in bc:

    define f(x) {
    if (x > 1) {
    return x * (f(x-1))
    }
    return (1)
    }
    f(457) / ( f(24) *f(4) *f(20) *f(12) *f(51) *f(4) *f(6) *f(2) *f(38) *f(27) *f(16) *f(16) *f(15) *f(9) *f(4) *f(19) *f(32) *f(32) *f(35) *f(5) )

    This comes out as the rather scary looking number:

    24979676534385811028057980738663929391311867252395764304516513933684\
    07606995096461303297817015351090284443259502380977579433712838416708\
    45657900380181004097489062467705784954149469973163067321876126281229\
    95907974733847026378002417559136955864960723139702457366278743422910\
    76636180848502418461402452003145682339219616138491354732354517049290\
    51034504318969704763573143447683688097296432449198854267451150579323\
    49228888314823997292233314267179896978363474923474304843679668468255\
    73366601896030037525678558877315402684142226202299224543792913670945\
    70000940885751377191355034261912546595703027389104226535408140288000\
    000000000000000000000000000000000000

    …which is rather a lot of permutations.

    If we assume 256 bits of hash data, we can divide this by 256 to get the number of permutations which will have the ‘correct’ checksum:

    21572869700269331363124886806112228769291733255504027739303788591328\
    37593298077865107496638668440706109266915612818099283527034418172707\
    01882144552441912911611801039139944057017098915043865571617967016035\
    92589594716339665906531835614143615869909227133379553874233921484767\
    47750403478423115256020571862056979617852935949009025260523624674180\
    80639965507183997390890815727819484329822253815589392749146335206762\
    86330737257959530160336903679419768794361552184405895656881407312010\
    61956861464362784040149412132287866816164180237900313291906666105277\
    855060107509415077378619659

    That gives us one ‘true’ positive, and 21572869700269331363124886806112228769291733255504027739303788591328\
    37593298077865107496638668440706109266915612818099283527034418172707\
    01882144552441912911611801039139944057017098915043865571617967016035\
    92589594716339665906531835614143615869909227133379553874233921484767\
    47750403478423115256020571862056979617852935949009025260523624674180\
    80639965507183997390890815727819484329822253815589392749146335206762\
    86330737257959530160336903679419768794361552184405895656881407312010\
    61956861464362784040149412132287866816164180237900313291906666105277\
    855060107509415077378619658 false positives.

    Using a dictionary is a fairly smart way of eliminating a lot of that, which reduces the order of the problem to something close to putting the correct words in the right order, but I don’t know if dictionaries are allowed in the competition’s rules? If they are I’d imagine most approaches exploit a dictionary in some way.

  10. I feel I must point out that the above comments attributed to me were cut and paste by Haje, and that some of those numbers ought to be superscript ;) Also, and that the HTML entity I was looking for was ‘isin’:

    n!/(Π:l∈L.nl!)

Post a comment.