This presentation aims to explain in the most simple way the base principle of a compression algorithm.
The mathematical aspect is purposely set aside. We start from the observation that for a given file size, the entropy of a sequence is inversely proportional to its length. Said differently: if the size of a file is shorter than all the possible combinations of a given sequence, something may be done. End of mathematical part.
So we represent 2 consecutive data extracted from a stream or a file in a graphical way:
here the pair [11;20] or [20;11]
0 | .... | 10 | 11 | 12 | 13 | 14 | .... | 255 |
---|---|---|---|---|---|---|---|---|
1 | ||||||||
. | ||||||||
. | ||||||||
. | ||||||||
20 | ||||||||
21 | ||||||||
22 | ||||||||
. | ||||||||
. | ||||||||
. | ||||||||
255 |
now we suppose that the stream or file only contains lower case text: according to the ascii table, it gives us the 97 to 122 range. If we represent it as a table, it forms a region which looks like:
0 | .... | 97 | 98 | 99 | ... | 120 | 121 | 122 | .... | 255 |
---|---|---|---|---|---|---|---|---|---|---|
1 | ||||||||||
. | ||||||||||
. | ||||||||||
. | ||||||||||
97 | ||||||||||
98 | ||||||||||
99 | ||||||||||
. | ||||||||||
120 | ||||||||||
121 | ||||||||||
122 | ||||||||||
. | ||||||||||
. | ||||||||||
. | ||||||||||
255 |
It is obvious that we don't need 8bits to code the region, and that 5bits are enough (26 equals to 11010 in binary).
. | . | . | 0 | 1 | 2 | ... | 23 | 24 | 25 | . | . |
---|---|---|---|---|---|---|---|---|---|---|---|
. | 0 | .... | 97 | 98 | 99 | ... | 120 | 121 | 122 | .... | 255 |
. | 1 | ||||||||||
. | . | ||||||||||
. | . | ||||||||||
. | . | ||||||||||
0 | 97 | ||||||||||
1 | 98 | ||||||||||
2 | 99 | ||||||||||
. | . | ||||||||||
23 | 120 | ||||||||||
24 | 121 | ||||||||||
25 | 122 | ||||||||||
. | . | ||||||||||
. | . | ||||||||||
. | . | ||||||||||
. | 255 |
Therefore, instead of being coded with 2x8 bits, every pair of data in this region can be coded with 2x5 bits, plus an extra bit which indicates if the biggest number is being read the first or not.
2*8=16 bits
becomes after compression
2*5+1=11 bits
So every pair in this range saves 5 bits But it's not enough
As we suppose that the whole file is not in that region, we must add a heading which we can set as:
compressed data
1 bit to indicate that data are coded
15 bits to give the length of the coded data
8 bits to indicate the table number (it is far better if there are several)
if the data are not coded
1 bit to indicate that data are not coded
23 bits to indicate the number of data to be copied as is
So we have a 24 bits heading which implies that the coding will only be usefull if the system compresses more than 2*24 bits. In our situation, we will need more than 9 pairs or about 4 words to begin with a 5 bits saving for each pair
An example working with a slightly different heading
several tables [ (0.255),(n*0,n*63) ] with n=1 to 5
for those who are interested, there is a code exemple under GPL here
with the muse of all compression algorithms Well to make it short, this implementation is only intended to be used as tests or leisure
it is long, very long and that's normal
structure of the heading
2^15 =0 not coded data
2^14...2^0 number of bytes to be read and copied without compression
2^15=1 coded data
2^13...2^3 number of bytes to be read and uncompressed
2^2...2^0 number of 63 steps offsets
javac Main.java
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
java Main c lena_std.tif
compression
49 15 49 15 49 15 49 15 49 15 ...
java Main d lena_std.tif.ar
decompression
ls -l lena*
-rw------- 1 remy remy 786572 2009-01-16 16:23 lena_std.tif
-rw-r--r-- 1 remy remy 777629 2009-02-23 15:03 lena_std.tif.ar
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
java Main c lena_std.tif
compression
49 15 49 15 49 15 49 15 49 15 ...
java Main d lena_std.tif.ar
decompression
ls -l lena*
-rw------- 1 remy remy 786572 2009-01-16 16:23 lena_std.tif
-rw-r--r-- 1 remy remy 777629 2009-02-23 15:03 lena_std.tif.ar
the advantages of the system
this algorithm make it possible to compress data on the fly. It is well adapted to stream compression as there is no frequency computation. It can also be adapted to specific cases (eg HTML) as the region used may neither be continuous nor made of an only block. It can even be multi-format as there are necessarily several tables.
As far as I know, this approach is not used in the compression field of application. I go even further and claim that it is a new method: I leave to the ones interested the research of anteriority...
a link about the most famous and used algorithms as a startup (about anteriority), thanks wikipedia: http://fr.wikipedia.org/wiki/Compression_de_données
thank you for your attention
If you ever have any remark about this, I would be grateful and interested.
free algo "public domain" If you like it, feel free to make (even a small) donation. Thanks