A short version of this article was published in 2000
in the popular Russian "Computer Press" magazine.
With the publication of the "Data Compression Methods" book,
this text is of historical interest only.

The Green Tree Of Compression Methods:
A Practical Introduction To Data Compression

Prologue
========

What is information? According to the most popular definition, any interaction between objects, in which one of them acquires some substance while the other(s) do not lose it, is called information interaction, and the transmitted substance is called information. Information that falls under this definition may better be called "physical". But the definition does not cover non-physical, "absolute" information, which does not depend on material objects in any way. The simplest example is the multiplication table, known by everyone from the first years of life to the last. The fact that the sum of the squares of the legs equals the square of the hypotenuse; that the ratio of a circle's circumference to its diameter equals "pi", and that the ratio of a circle's area to the area of a square whose side equals the circle's radius is also "pi"; and so on. This is exactly "absolute" information about objects of the non-material world. If the material world disappears, all this information stays immutable: the multiplication table, and all formulas and theorems of all branches of mathematics.

More complex examples: the set of numbers coding the DNA of some man Nika, and the set of numbers resulting from scanning his photo. Such information, of course, has no sense outside the material world, but, unlike "chaotic" info, it gets sense in the material world and thereby, as it were, raises its own status and "develops" in it, displaying certain properties, aspects, attributes, characteristics. Only due to this, only in the material world (and also in its models), does the "best" part of information become "knowledge" and even "wisdom", while other pieces remain as unmanifested "chaos". Thus, the difference between analogue (physical) information and digital information (data) is much more fundamental and deep than it originally seems.

Answers
=======

Question 1: How is information compressed? There are hundreds of links at http://www.compression-pointers.com , and hundreds of compressors at http://act.by.net . What are all these compression methods, programs and standards about, in short?

Answer 1: That's rather easy. But some definitions first.

A BIT is an "atom" of digital information (DATA): imagine a small "box" somewhere in space-time (in a memory chip, on a CD-ROM, floppy or hard disk, a communication line...) with only two possible states: full (1, one, true, yes, exists) or empty (0, zero, false, no, doesn't exist).

A finite sequence of bits we call A CODE.

A (traditional) BYTE is a sequence of eight bits. A GENERALIZED N-bit BYTE is a sequence of N bits; it has 2^N (two raised to the N-th power) possible states (thus, a traditional 8-bit byte can have 256 different values: 0, 1, ..., 255).

A finite sequence of bytes we call A WORD, and the number of bytes in this finite sequence is the word's length. In the general case, a word consists of N-bit bytes, not 8-bit ones (thus, a code is a word of generalized 1-bit bytes).

Q2: What's a symbol then? Exactly this word is present in the numerous patents about devices and methods for data compression.

A2: A symbol is an atom of a language (a letter, a note, a figure...). ASCII (American Standard Code for Information Interchange) sets a correspondence between 8-bit bytes and symbols, but when we meet national alphabets, we need more: 16 bits per symbol (the UNICODE standard).

Finally, (lossless) COMPRESSION of a BLOCK (a finite piece of digital information) is a description of how to bit-exactly recreate the given block, put into another, shorter (compressed) block occupying fewer bits than the initial block we describe.
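To make these definitions concrete, here is a minimal Python sketch (the function name is purely illustrative) that regroups a code, i.e. a finite bit sequence, into generalized N-bit bytes; an N-bit byte has 2^N possible values.

    # A minimal illustration of the definitions above: a CODE (finite bit
    # sequence) regrouped into generalized N-bit BYTES. An N-bit byte has
    # 2**N possible values. The function name is illustrative only.

    def split_into_nbit_bytes(code, n):
        """Split a string of '0'/'1' characters into N-bit bytes (as integers)."""
        assert len(code) % n == 0, "code length must be a multiple of N"
        return [int(code[i:i + n], 2) for i in range(0, len(code), n)]

    if __name__ == "__main__":
        code = "0110100001101001"                # a 16-bit code
        print(split_into_nbit_bytes(code, 1))    # word of 1-bit bytes = the code itself
        print(split_into_nbit_bytes(code, 8))    # word of two 8-bit bytes: [104, 105]
        print(split_into_nbit_bytes(code, 4))    # word of four 4-bit bytes, each in 0..15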
The so-called LOSSY compression consists of two different processes (not independent, and usually running in parallel): (1) extracting the useful parts from blocks, building a model that depends on the purpose of compression and on the peculiarities of both the source and the destination of the information, thereby preparing, as well as possible, the data for step (2), lossless compression. When measuring physical parameters (brightness, frequency, voltage, amplitude, resistance and so on), inexactitude is inevitable, therefore "rounding off" is quite permissible.

Q3: What's the difference between these definitions and existing terms?

A3: From time immemorial, when 8-bit processors outnumbered 16-bit ones and there was no need for UNICODE, people usually meant "byte" when saying "symbol", and meant "8-bit byte" when saying "byte". A sequence of two bytes the computer world calls "a word", and a sequence of four bytes a "double word" (dword). Very many non-Intel processors are constructed so that every addressable memory unit contains a 16-, 24- or 32-bit byte, not an 8-bit one as in all IBM-compatible computers. If the "compressed" block is longer than the initial one, it is not compression but recoding, by definition, regardless of what nearly half of currently existing compressors "think" about it (increasing the size by 1% - 5%, instead of 1 bit, in the "worst cases" of incompressible data). If lossy "compression" doesn't contain true compression algorithms, it merely removes information not needed for a known purpose, within the limits of the chosen model. However, there are so many definitions of "compression", and especially of "code", that misunderstandings appear much more often than they should.

Now, with such clear definitions, the following seems rather evident:

A) Any base compression method is good for compressing either blocks of bits, or blocks of bytes, or blocks of words, because these are very different objects. Exact mathematical definitions can be given for these three base types of digital information, but in human words: when compressing a block, the algorithm assumes either that the bits in the given block appear according to some law (a block of bits), or that not bits but bytes appear according to some function (a block of bytes), or that neither bits nor bytes but words appear according to some rule (a block of words).

B) A given block will be compressed better if its structure is known: whether it is a block of bits, a block of N-bit bytes, or a block of words of N-bit bytes. And, of course, if we know "the answer": the rules it was created with, what the length of records and bytes was, how they were formed. In other words, the more the compression process "knows" about the block creation process, the better it compresses. Practically all blocks are created as either blocks of bits (1), blocks of bytes (2), or blocks of words (3). Two "boundary" types can be added: (0) "chaos" (no laws can be detected, neither for words nor for bytes nor for bits) and (4) "texts" (sets of words appearing according to some rule; for example, a block containing English sentences, Russian sentences, and the texts of functions in the C programming language). Other types, like mixtures of bytes and bits, or words and bits, are in practice much less likely.

C) In each of the three cases, two variants are possible: the probabilities of the elements (bits (1), bytes (2), words (3)) are different, and they either a) do not depend on the previous and following elements, the "context" (a so-called "Bernoulli source" of elements, without memory); or b) highly depend on the previous and following elements (a "Markov source", with memory).
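The difference between cases a) and b) can be estimated numerically: a memoryless (Bernoulli) model only counts how often each byte value occurs, while a model with memory (Markov) counts how often each value occurs after a given preceding value. The following Python sketch (names are illustrative) estimates the order-0 and order-1 entropies of a block of 8-bit bytes; a large gap between the two values suggests strong dependence on context.

    # Estimate order-0 ("Bernoulli", memoryless) and order-1 ("Markov", one byte
    # of context) entropies of a block of 8-bit bytes, in bits per byte.
    # A large gap between the two values suggests case b): strong context dependence.
    from collections import Counter
    from math import log2

    def order0_entropy(block: bytes) -> float:
        counts = Counter(block)
        n = len(block)
        return -sum(c / n * log2(c / n) for c in counts.values())

    def order1_entropy(block: bytes) -> float:
        pair_counts = Counter(zip(block, block[1:]))   # (previous byte, next byte)
        ctx_counts = Counter(block[:-1])               # previous byte alone
        n = len(block) - 1
        return -sum(c / n * log2(c / ctx_counts[a])
                    for (a, b), c in pair_counts.items())

    if __name__ == "__main__":
        text = b"abracadabra " * 200     # bytes with strong context dependence
        print(order0_entropy(text), order1_entropy(text))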
Other cases are so rare that they can be considered "practically impossible" (for example: elements sometimes depend on context and sometimes not; they depend, but not for all values; they depend on selected elements of the context, not the nearest ones). Note that case 2b (probabilities of bytes are different and depend on the context) is identical to case 3 (probabilities of words are different), while case 1b (bits, dependent on context) is not identical to case 2 (the simplest example is poorly compressible data: files like *.zip, *.gif). That's because, by definition, a finite sequence of bytes is a word, while a finite sequence of bits is a code, not a byte. In other words, a byte is a record of fixed length, and a word is a record of variable length.

D) To simplify the data structure and improve compression, the main algorithm, when compressing blocks of words, creates blocks of bytes and/or bits; when compressing blocks of bytes, it creates blocks of bits; and only when compressing blocks of bits does it output to the final compressed block. For several reasons, the majority of compression programs today simply skip the third step (compression of blocks of bits), or merge it with the second (compression of blocks of bytes), starting from the assumption "a block of words is given". Only archivers with options for "multimedia compression" check for blocks of bytes. That's why people speak in terms of "primary" and "secondary" compression: "modeling" and/or "sorting" (changing the order of bytes), and "encoding" (bytes to codes). We can often see phrases like "it uses LZ77 followed by an arithmetic coder" in the comp.compression newsgroup, in archiver manuals, and in compression-related articles. As we all know, all the first compressors, like ARC, ARJ, LHA, PAK, ZIP, ZOO etc., were using LZ77 or LZ78 followed by a Huffman or a Shannon-Fano encoder.

Q4: What are these reasons?

A4: The first main reason is that most "natural" blocks appearing in computers from input devices (keyboard, scanner, microphone, camera...) are really blocks of words; in the general case, words of N-bit bytes: modern graphic devices operate on 24- or 32-bit bytes ("pixels"), while audio devices operate on 8-, 16- or 32-bit bytes ("samples"). The same holds for executable instructions for the Central Processing Unit (CPU) and Floating Point Unit (FPU) (files with executable code: .exe, .dll ...). Blocks of bytes and bits appear, as a rule, when computers process that "natural" data. Thus, the main problem is to find out how those _words_ were created, what the laws were. It is the primary compression algorithm that plays the main role in achieving best-in-the-world compression of certain types of files (see http://act.by.net ).

Second, it's much easier for computers to deal with bytes than with bits. Each byte has a unique address in memory, and each address points to a byte (an 8-bit one, if it's an IBM-compatible PC). It was only 20 years ago that the first 16-bit processors appeared, able to process more than 8-bit units with one instruction.

Third, there are practically no good algorithms for compressing blocks of bits (partly as a consequence of the first two reasons). Appropriate primary and secondary algorithms give excellent results, while the third step takes much time and gives little improvement.

More than 10 years ago, when the "universal" .zip format was developed, nearly all readable texts were blocks of 8-bit ASCII symbols, nearly all executable modules were for 16- or even 8-bit processors, and nearly all multimedia files were blocks of 8- or even 4-bit bytes. That's why it seemed that one algorithm was enough for all data types: 8-bit bytes (multimedia files) and blocks of words (texts, executables). Many special algorithms have been developed since that time, but not very many universal programs achieving compression close to world records on all types of files (RK, UHArc, RAR, ERI, ACE).
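As an illustration of how much such knowledge helps, here is a hedged Python sketch: a synthetic block of 16-bit "audio" samples is compressed with zlib (standing in for a generic byte coder) once as raw bytes and once after delta coding of the samples; on smooth signals the second variant is usually noticeably smaller. Both the synthetic signal and the choice of zlib are assumptions of this illustration.

    # Why knowing the block structure helps: a synthetic block of 16-bit
    # audio-like samples compresses better if we delta-code the samples
    # (a plain case of LPC) before handing the bytes to a generic coder.
    # zlib is only a stand-in generic byte compressor.
    import math
    import struct
    import zlib

    # A smooth "audio" signal: 16-bit signed samples of a slow sine wave.
    samples = [int(20000 * math.sin(i / 50.0)) for i in range(20000)]
    raw = struct.pack("<%dh" % len(samples), *samples)

    # Delta coding: store differences between neighbouring samples (mod 2**16).
    deltas = [samples[0]] + [(samples[i] - samples[i - 1]) & 0xFFFF
                             for i in range(1, len(samples))]
    delta_bytes = struct.pack("<%dH" % len(deltas), *deltas)

    print("raw   :", len(zlib.compress(raw, 9)))
    print("delta :", len(zlib.compress(delta_bytes, 9)))   # typically much smaller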
Q5: Are there any other methods of classifying digital data?

A5: Of course, many others exist; only the most relevant one is described above. The following can also be mentioned:

- One-dimensional, two-dimensional (pictures, for example), three-dimensional (video, 3D images) and so on. We know that 2D-3D-ND files are presently met in less than 10% of cases, while the probability of seeing them in an "unknown data" set is less than 1% (as a rule, they are stored separately and processed with special algorithms). But in this case (when it is detected that the data is not one-dimensional), the more important question is immediately solved: whether it contains bits, bytes or words - bytes, with probability greater than 0.999.

- The so-called "textual" and "binary" data: the most classical classification. Under our definitions, the first are blocks of words, and the second is everything else, including mixtures of all types.

- Extension-based classification (.ARC, .AVI, .BMP etc. files): the way of those who are trying to make commercial archivers. (People who know the PPM and BlockSorting families of compression methods may add that order-1, order-2, order-3 and so on types of data exist.)

Q6: What compression methods are known nowadays?

A6: For blocks of words or bytes: LZ77, LZ78, LZW (Lempel-Ziv-Welch), LZSS and many other variants of the "sliding window"; BlockSorting - full (BWT, the Burrows-Wheeler Transform) or partial (using N preceding elements) sorting of the given block, or sorting with "parallel" dependent block(s) of the same size; PPM (Prediction by Partial Match); Markov and other modeling.

For bytes: MTF (move to front); the methods of Huffman and Shannon-Fano; LPC (Linear Prediction Coding) and its plain case with one-element memory, Delta Coding (compute/code differences between neighbouring bytes); Distance Coding (compute/code distances between equal values), also known as Homala Benesa ("how many larger (bytes) before next same (byte)"); SEM (separate exponents and mantissas).

For bytes or bits: RLE (run-length encoding); arithmetic coding with all possible modifications (range coder, ELS - entropy logarithmic scale, and so on); ENUC (enumerative coding); ERI93.

These are only the main families (branches from the three main branches of the tree) of base compression methods. Each family contains lots of different methods, each method has many variants, and each variant has some unique internal parameters.

Q7: They are usually combined somehow, aren't they?

A7: Practice shows that for blocks of words it's good to first apply one or a few algorithms from the first family: in the simplest case a fixed, pre-selected one; in a better case, one selected once per block, before starting to compress the given block; in the best case, somehow synthesizing them during the compression of the block of words (a synthesis of LZ and BS is particularly useful). In the second step, one or a few methods from the second family are applied, and in the third step, methods from the third. In the case of blocks of bytes, the first and most difficult step had better be skipped, and in the case of blocks of bits it's better to go immediately to the third step, skipping both the first and the second.
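As a toy illustration of this chaining (an assumption-laden sketch, not the method of any particular archiver): a naive Burrows-Wheeler Transform from the first family followed by move-to-front from the second; in a real chain the result would then go to RLE and/or an entropy coder from the third family. The BWT here is quadratic and omits the row index needed for inversion, so it is not a usable codec.

    # First family (block sorting) followed by second family (move-to-front).
    # Naive O(n^2 log n) rotation sort, for illustration only.

    def bwt(block: bytes) -> bytes:
        """Return the last column of the sorted rotation matrix (no index stored)."""
        assert block, "block must be non-empty"
        rotations = sorted(block[i:] + block[:i] for i in range(len(block)))
        return bytes(rot[-1] for rot in rotations)

    def mtf(block: bytes) -> bytes:
        """Move-to-front: recently seen byte values become small numbers."""
        alphabet = list(range(256))
        out = bytearray()
        for b in block:
            idx = alphabet.index(b)
            out.append(idx)
            alphabet.pop(idx)
            alphabet.insert(0, b)
        return bytes(out)

    if __name__ == "__main__":
        data = b"abracadabra abracadabra abracadabra"
        transformed = mtf(bwt(data))
        # After BWT+MTF the block is dominated by small values (many zeroes),
        # which an RLE and/or entropy coder then compresses well.
        print(list(transformed))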
It's not good to apply algorithms intended for words to bytes or bits. One example is PNG (which shows good performance; see http://go.to/artest ); another is 7-Zip ( http://www.7-zip.com ), giving compression up to 20% better than PKzip, Info-zip, GZip, WinZip etc. (for example, BINARIES.DAT from http://artest1.tripod.com/artest18.zip : 457724 bytes with "7zip a -mx", 548844 with "pkzip -exx", 551255 with "gzip -9"), because it has the knowledge that sometimes even very long strings are better compressed if regarded as bytes, not as words.

It's not good to apply algorithms intended for bytes to words or bits, either (the very first compressors applied the Huffman method, the only one known at that time, to all types of data; the question of their structure didn't even appear: one method existed, there was nothing to choose from). But we shouldn't completely forget about "incorrect", "impractical" data: for example, blocks of words that are better compressed if RLE is applied first, before all other methods, are certainly possible (RLE for words assumes that every word is written N times, with N > 1). Or blocks of bits that will be compressed better with PPM: poorly compressible data - .ZIP, .MP3, .GIF, .JPG, .PNG files and so on.

Q8: What can be said about static and adaptive coding?

A8: In that distant past, when computers were huge and files small, the problem of extracting logically different fragments from a given file did not arise. First, there were relatively fewer files with diverse data, and also fewer types of data: byte width rarely reached even 24 or 32 bits. Second, those logically different fragments were so small that the expense of describing their borders was comparable with the gain from using them. Recording a new binary tree (for the methods of Huffman and Shannon-Fano, with 8-bit bytes) takes a few hundred bytes; moreover, the complexity, time and memory requirements of both the compression and decompression algorithms can increase two or three times. It was therefore meaningless to pay attention to fragments shorter than a few kilobytes, especially while the average file length was less than 100 kilobytes and the probability of diverse file contents was less than 1/100. It's different now: the average length is about ten times bigger, and substantial diversity of contents is also about ten times more probable. Theoretically, any compression algorithm can be made both "entirely static" (all parameters are assigned at the start only) and "entirely adaptive" (all parameters are periodically recalculated). But in practice, saying "static or adaptive", people do honour to the very first compression method, described by David Huffman in 1952.

Q9: How can we detect whether a given file contains blocks of bits, bytes or words?

A9: This is one of the most relevant, lengthy and interesting questions. As can be seen from the reports at http://compression.ca and http://go.to/artest , wrong treatment of blocks leads to compression up to 50 times worse, especially on so-called "multimedia" files (take, for example, ASTAV.DAT from ftp://ftp.cdrom.com/pub/simtelnet/msdos/astronmy/mpla_eng.zip , 125K). Most files contain blocks of words of 8- or 16-bit bytes; "multimedia" (graphics, video, audio) files are, as a rule, blocks of generalized bytes (or very short words) of 8, 16, 24 or 32 bits. That's why the easiest way is to look at the file's extension and/or its header, and/or to test a fragment of a few kilobytes and take the method that performs best on this fragment. It's not nice to apply algorithms intended for words to bytes or even to bits. It's not much better to apply algorithms intended for bytes to words or bits.
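The "test a fragment and take the best" approach can be sketched as follows; in this illustrative Python fragment, zlib stands in for a generic coder, the candidate treatments (raw, 8-bit delta, 16-bit delta) are an arbitrary small set, and "unknown.dat" is a hypothetical input file.

    # Sketch of "test a few-kilobyte fragment and take the best method":
    # try several candidate treatments of the fragment and keep whichever
    # compresses it best. zlib is a stand-in generic coder; the candidates
    # are illustrative, not an exhaustive set.
    import zlib

    def delta8(data: bytes) -> bytes:
        return bytes([data[0]] + [(data[i] - data[i - 1]) & 0xFF
                                  for i in range(1, len(data))])

    def delta16(data: bytes) -> bytes:
        out = bytearray(data[:2])
        for i in range(2, len(data) - 1, 2):
            d = (int.from_bytes(data[i:i + 2], "little")
                 - int.from_bytes(data[i - 2:i], "little")) & 0xFFFF
            out += d.to_bytes(2, "little")
        return bytes(out)

    def guess_best(fragment: bytes) -> str:
        candidates = {"raw": fragment,
                      "delta8": delta8(fragment),
                      "delta16": delta16(fragment)}
        return min(candidates, key=lambda name: len(zlib.compress(candidates[name], 9)))

    if __name__ == "__main__":
        with open("unknown.dat", "rb") as f:   # hypothetical input file
            fragment = f.read(8192)            # test only a few kilobytes
        print("best treatment on fragment:", guess_best(fragment))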
Q10: Are there any other methods besides this easiest one?

A10: The following way takes much more time and programming. Assume that we have an algorithm to measure the orderliness in a given block of words, returning a value from 0 (complete chaos, all words are different) to 1 (complete order, all words are identical); an algorithm measuring the orderliness in a given block of bytes; and also an algorithm for the orderliness in a given block of bits. Then we simply assume that we've got a block of words and compute the first value, W, applying the first algorithm; then assume it's a block of bytes and get the second number, Y; and finally do the same with the third measure, B. Comparing these three numbers (with 0 <= W <= 1, 0 <= Y <= 1, 0 <= B <= 1), we easily see which structure the examined block corresponds to best. If we admit that a byte can be either 8-, 16- or 32-bit, we need to calculate three values for bytes and three for words, seven in all.

Q11: It looks like all three algorithms do something very similar, but with different elements - bits, bytes, words?

A11: Exactly so. The simplest example, when the elements are bits, and the algorithm first splits the initial block into logically different fragments (so that, for example, a block with a hundred "zeroes" followed by a hundred "ones" is processed as two "good" fragments, not as one "bad" one):

  B = ( |n[1] - N/2| + |n[0] - N/2| ) / N

where N is the number of bits in the block, n[0] the number of "zeroes", n[1] the number of "ones". As we see, B=0 if n[0]=n[1]=N/2, and B=1 if n[0]=N or n[1]=N. The same trivial example for the case of 8-bit bytes:

  Y = ( |n[255] - N/256| + ... + |n[0] - N/256| ) * 128 / (N*255)

Y=0 if n[255]=...=n[0]=N/256; Y=1 if n[255]=N, or n[254]=N, ... or n[0]=N.

Q12: Quite simple for bits and bytes, but what about words? They can consist of 8-, 16- or 32-bit bytes, while their lengths can be 2, 3, ..., L bytes, where L is close to 10.

A12: As in any research - ideas, assumptions, modeling, testing, optimization... Searching for a better algorithm is a very fascinating process. You'll surely like it! The future of data compression is certain, and success is always near. New compression tasks will appear 50 and 100 years after these words are written, and new algorithms will be developed to better solve those tasks. First, 8-bit elements will soon become as rare as 16-bit elements were ten or twenty years ago. Second, we are presently limited to one-dimensional, non-content-addressable memory; though some processors can treat memory as two-dimensional, they are relatively rare and novel. Finally, if we look at the definition of compression, there is no further restriction on what a description should be. It can certainly reference everything that the destination can always access. But when that referenced data is inaccessible, the description is undecipherable. Thus, if we only assume that all references used in the description will be accessible, but don't know it for certain, we do "potentially lossy" compression. Only non-physical information is always accessible: it doesn't depend on material objects (see the prologue), and its size is infinite. And when we study how it can be applied (wavelets, fractals, for example...), we actually study how we depend on it.
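The two measures from A11 translate directly into code; this small Python sketch computes B and Y exactly as in the formulas above, leaving out the preliminary splitting into logically different fragments for brevity.

    # The two orderliness measures from A11, computed as in the formulas above
    # (the preliminary splitting into logically different fragments is omitted).
    from collections import Counter

    def orderliness_bits(bits: str) -> float:
        """B = ( |n[1] - N/2| + |n[0] - N/2| ) / N  for a string of '0'/'1'."""
        n = len(bits)
        n1 = bits.count("1")
        n0 = n - n1
        return (abs(n1 - n / 2) + abs(n0 - n / 2)) / n

    def orderliness_bytes(block: bytes) -> float:
        """Y = sum(|n[v] - N/256| for v in 0..255) * 128 / (N*255)."""
        n = len(block)
        counts = Counter(block)
        total = sum(abs(counts.get(v, 0) - n / 256) for v in range(256))
        return total * 128 / (n * 255)

    if __name__ == "__main__":
        print(orderliness_bits("0" * 100 + "1" * 100))   # 0.0: perfectly balanced
        print(orderliness_bits("1" * 200))               # 1.0: complete order
        print(orderliness_bytes(bytes(range(256)) * 4))  # 0.0: all values equally frequent
        print(orderliness_bytes(b"\x00" * 1024))         # 1.0: a single byte value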
Epilogue
========

RHLAOC
~~~~~~
 HLAO   http://www.faqs.org/faqs/compression-faq/part2/
 HL     http://www.cs.sfu.ca/CC/365/li/squeeze/
    O   http://www.data-compression.com/theory.html
 HL     http://www.data-compression.com/lossless.html
RHL-O-  http://www.dspguide.com/datacomp.htm
RH-     http://www.go2net.com/internet/deep/1997/01/01/
RHL     http://www.ganssle.com/articles/acompres.htm
RHLAO   http://www.newmediarepublic.com/dvideo/compression/adv04.html
 H  O-  http://www.rdrop.com/~cary/html/data_compression.html
RHLAO-  http://www.arturocampos.com/cp_ch1.html
RHLA-   http://w3.antd.nist.gov/cag/lecser_ay1/lecser_ay1.html
RH- -   http://www.ccs.neu.edu/groups/honors-program/freshsem/19951996/jnl22/jeff.html
RHLAO   http://www.ece.umn.edu/users/kieffer/ece5585.html
RHLAO-  http://www.ics.uci.edu/~dan/pubs/DataCompression.html
RHL-O   http://www.ils.unc.edu/~willc/dcfulltext.html
 L  O   http://www.cs.pdx.edu/~idr/unbzip2.cgi?compression/acb_compress.html.bz2
R-L-O   http://cswww.willamette.edu/~sarnold/cs443/compression.html
RHLAO-  http://www.cs.tut.fi/~albert/Dev/pucrunch/packing.html
RHLA    http://vectorsite.tripod.com/ttdcmp1.html
R L O   http://members.aol.com/breunling/obcompr.htm
RHLAO-  http://www.rasip.fer.hr/research/compress/algorithms/index.html
 HLA -  http://www.ifi.uio.no/~ftp/publications/cand-scient-theses/SHuseby/html/node41.html
RHLAO   http://www.cs.su.oz.au/~symvonis/teaching/cs4-data-compression.html
-L- -   http://www.image.cityu.edu.hk/~loben/thesis/node22.html
 L  O   http://home.uleth.ca/~borrtj/pres/index.htm
RHL-O-  http://www.scit.wlv.ac.uk/~c9581158/main.html
RHLA -  http://www.eee.bham.ac.uk/WoolleySI/All7/body0.htm
RHLAO-  http://wannabe.guru.org/alg/node163.html

R - the page contains a description of Run-Length Encoding
H - a description of Huffman Coding
L - Lempel-Ziv methods
A - Arithmetic Coding
O - (at least some words about) other methods
C - attempts to classify types of data and/or compression methods

Unfortunately, in none of these overviews and introductions were useful words found about:
- the classification of types of digital data;
- the classification of compression algorithms, the "tree" of methods;
- which algorithms are appropriate for which data types, i.e. an answer to the most pressing question:

**********************************************************************
*    How to choose the most effective algorithm,                    *
*    having nothing but the data itself ?                           *
*    (among hundreds existing and new appearing every day)          *
**********************************************************************

This text (English, ver.1.4 now) can be found at http://geocities.com/eri32/int.htm
The Russian version (ver.1.4) is at http://geocities.com/eri32/intro.htm

With best kind regards, and excuses for misprints,
A.Ratushnyak, http://go.to/artest