Compression
My interests in compression have recently been focused on compressing text in
embedded applications. In this scenario, the code size of the decompresser is
as important as the effectiveness of the compressed text.
Smalltext
Smalltext, my best attempt thus far, compiles
into 150 bytes of 32-bit Intel assembly (using gcc -Os). It uses the
letter frequency
of English to map certain characters to 4-bits. Each 4-bit nibble is either a
short character or a reference to a lookup table. The next 4-bits are the table
index. In smalltext we have 10 short characters, including the space character,
and 6 lookup tables, covering 106 possible characters.
Tinyzip
Tinyzip was my first attempt at compression. It uses
some of the techniques found in zip, consuming 391 bytes of code when compiled.
This algorithm treats all characters as 7-bits, and can back reference repeated
sequences that are more than 3 characters long. A back reference is a 21-bit
chunk identified by the bits 01, which no printable characters use, and
followed by a 12-bit offset and a 7-bit length.
Future Direction
I would like to combine these two methods. I could replace the least frequent
short char with a code that signals a back reference. I could also reduce the
space used by a back reference to 16-bits, with a 10-bit offset and a 6-bit
length. The reference would need to refer to nibble index and not byte index.