Compression

My interests in compression have recently been focused on compressing text in embedded applications. In this scenario, the code size of the decompresser is as important as the effectiveness of the compressed text.

Smalltext

Smalltext, my best attempt thus far, compiles into 150 bytes of 32-bit Intel assembly (using gcc -Os). It uses the letter frequency of English to map certain characters to 4-bits. Each 4-bit nibble is either a short character or a reference to a lookup table. The next 4-bits are the table index. In smalltext we have 10 short characters, including the space character, and 6 lookup tables, covering 106 possible characters.

Tinyzip

Tinyzip was my first attempt at compression. It uses some of the techniques found in zip, consuming 391 bytes of code when compiled. This algorithm treats all characters as 7-bits, and can back reference repeated sequences that are more than 3 characters long. A back reference is a 21-bit chunk identified by the bits 01, which no printable characters use, and followed by a 12-bit offset and a 7-bit length.

Future Direction

I would like to combine these two methods. I could replace the least frequent short char with a code that signals a back reference. I could also reduce the space used by a back reference to 16-bits, with a 10-bit offset and a 6-bit length. The reference would need to refer to nibble index and not byte index.