algo de compression

This presentation aims to explain in the most simple way the base principle of a compression algorithm.

The mathematical aspect is purposely set aside. We start from the observation that for a given file size, the entropy of a sequence is inversely proportional to its length. Said differently: if the size of a file is shorter than all the possible combinations of a given sequence, something may be done. End of mathematical part.

So we represent 2 consecutive data extracted from a stream or a file in a graphical way:

here the pair [11;20] or [20;11]

0 .... 1011121314....255
1
.
.
.
20
21
22
.
.
.
255

now we suppose that the stream or file only contains lower case text: according to the ascii table, it gives us the 97 to 122 range. If we represent it as a table, it forms a region which looks like:

0 .... 979899...120121122....255
1
.
.
.
97
98
99
.
120
121
122
.
.
.
255

It is obvious that we don't need 8bits to code the region, and that 5bits are enough (26 equals to 11010 in binary).

... 012...232425..
.0 .... 979899...120121122....255
.1
. .
..
. .
0 97
1 98
2 99
..
23 120
24 121
25 122
. .
. .
. .
. 255

Therefore, instead of being coded with 2x8 bits, every pair of data in this region can be coded with 2x5 bits, plus an extra bit which indicates if the biggest number is being read the first or not.

2*8=16 bits

becomes after compression

2*5+1=11 bits

So every pair in this range saves 5 bits But it's not enough
As we suppose that the whole file is not in that region, we must add a heading which we can set as:

compressed data
1 bit to indicate that data are coded
15 bits to give the length of the coded data
8 bits to indicate the table number (it is far better if there are several)

if the data are not coded
1 bit to indicate that data are not coded
23 bits to indicate the number of data to be copied as is

So we have a 24 bits heading which implies that the coding will only be usefull if the system compresses more than 2*24 bits. In our situation, we will need more than 9 pairs or about 4 words to begin with a 5 bits saving for each pair

An example working with a slightly different heading
several tables [ (0.255),(n*0,n*63) ] with n=1 to 5
for those who are interested, there is a code exemple under GPL here
with the muse of all compression algorithms Well to make it short, this implementation is only intended to be used as tests or leisure
it is long, very long and that's normal

2^15 =0 not coded data
2^14...2^0 number of bytes to be read and copied without compression
2^15=1 coded data
2^13...2^3 number of bytes to be read and uncompressed
2^2...2^0 number of 63 steps offsets

javac Main.java
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

java Main c lena_std.tif
compression
49 15 49 15 49 15 49 15 49 15 ...

java Main d lena_std.tif.ar
decompression

ls -l lena*
-rw------- 1 remy remy 786572 2009-01-16 16:23 lena_std.tif
-rw-r--r-- 1 remy remy 777629 2009-02-23 15:03 lena_std.tif.ar 