A short version of this article
was published in spring 2001
in the popular Russian
"Hard'n'Soft" magazine.
It is of historical interest only,
now that the "Data Compression Methods"
book has been published.

Compression of Multimedia Information

version 1.0

Multimedia information (MMI) is understood, as a rule, to be sound (an audio stream), two-dimensional pictures, video (a stream of 2D pictures), and three-dimensional images. Within this century some other types, such as streams of 3D images or streams of smells, will become more relevant. What is the difference between MMI and other types of information?

1) The SOURCE of MMI is a device measuring parameters of the physical world. With some stretch, any information received by a digital device from the physical world via measuring (quantization) devices can be called MMI (for example, current strength, speed of movement, bearing of an apparent wind). This process is called analog-to-digital conversion (ADC), or quantization (the inverse process is digital-to-analog conversion, DAC). The result of the quantization process is multimedia data (MMD), the recorded form of MMI. Measurement errors are inevitable, so rounding off in any chosen way, including one leading to better compression, is valid. If it is assumed that the RECEIVER of a block of MMD is the sense organs of a human (eyes and ears, first of all), it is possible to round off in view of their properties. The features of the communication channel that will be used for transmission (copper wires, optical fiber, air, water) can also be taken into consideration.

2) Dimensionality. From 4 bits (16 values) up to 32 bits (2^32 is more than 4 billion) are used for storing the result of every measurement, but as a rule 16, 24 or 32. That is, the dimensionality of bytes is R = 16, 24 or 32 - not 8, as in the majority of other cases: executable code for x86 processors, eight-bit ASCII text, databases (16-bit Unicode text and executable code for other processors are less common at present). Besides, a block of MMD consists exclusively of R-bit bytes, in contrast to databases, code for x86, etc.

3) Linearity. Measurements are taken at equal time intervals (a stream of sound is audio, a stream of pictures is video, etc.)
and at equal spatial distances (two-dimensional pictures, three-dimensional images...). The measurement results are written to the MMD block sequentially. Sometimes, when recording MMD, they are periodically aligned, so that the distance from the beginning of the block to the beginning of an aligned fragment satisfies some criterion. For instance, in .BMP pictures every line of the image must start at an address divisible by four, and in .RAS files at one divisible by two. But the bytes lying between fragments carry no useful information about the MMD block. If they do, it is a block of processed MMD (PMMD), and the compression algorithm should take that processing into account, otherwise compression will be much worse than when the processing is taken into account. Except for palettized pictures and other blocks of PMMD, the value of every byte depends _linearly_ and _only_ on the values of the bytes neighbouring it in space-time. In the case of texts it depends also on the language (Russian, English, Chinese...) and the type of file (DOS text, Unix text, C source, HTML). In the case of executable code, on the type of processor and the compiler. And compression/decompression algorithms should either know the features of the processed blocks beforehand, or encode-transmit-receive-decode-use those features during operation. With MMD this complication is absent: the properties of the physical world are always the same.

4) Channels. The three previous items describe a block of one-channel MMD. As a rule, an MMD block contains information about several channels of MMD. For example, in stereo sound we have two channels - left and right. In the case of a picture it is possible to regard either each line or each column of the digital image as a channel. Or not to treat columns or lines as different channels at all, but to consider the vicinity of every byte two-dimensional, not one-dimensional.
This is because, first, if a two-dimensional picture describes the surface of a three-dimensional object (say, a sphere), the lengths of the channels differ, and their choice is problematic. And second, every pixel of an image often consists of several components: for instance RGB - red, green, blue; or HSB - hue, saturation and brightness. These components can also be regarded as different channels, and as a rule that is more convenient: there are usually 2, 3 or 4 components, rather than the number of lines or columns (up to a few thousand). If a block of MMD describes N channels, it consists either of N fragments of X bytes, or of X fragments of N bytes. That is, the N channels are laid out either sequentially or interleaved. This will be our definition of a channel. The contents (set of bytes) of each channel depend either on all other channels (RGB, HSB, etc.) or only on adjacent ones (lines, columns). That is, actually, all. We will take the first item as the definition of MMD, and deviations from properties 2, 3 or 4 as special cases of MMD. How can the four main features of MMD be used for better compression? First, some words about compression in general. I hope the reader is familiar with the main concepts and ideas of compression (description eliminating redundancy) of information, stated in [1]. There are two approaches to compression - lossless and lossy. The second is applied to MMD more often, because of the features of the Source and the Receiver. Theoretically any lossy algorithm can be made lossless, but such a reoriented algorithm is usually less effective than methods having the same aim but other paths, initially oriented on lossless compression (that is because "lossy" transformations increase the dimensionality R of bytes).
On the other hand, blocks/streams of data of any type .ABZ can be compressed with "losses" (possibly even improving the contents from the point of view of the Receiver) by an algorithm specially designed for data of that type .ABZ (for example, DocumentPress by Igor Pavlov[22], or optimization of tables in the headers of EXE files). And the three base strategies of compression:

1) "Window - dictionary" (Transformation of Stream). Description of incoming data using data already seen. For instance, LZ methods for streams/blocks of words, i.e. when combinations of incoming bytes are predictable from already processed combinations. Table conversions, RLE, LPC, DC, MTF, SEM, Subband Coding - for bytes, i.e. when it is meaningless to consider combinations of two or more bytes (or to "remember" such combinations, in the case of Linear Prediction Coding). No probabilities are calculated, in contrast to the second case. As a result of the transformation, several streams are formed, as a rule (except for LZ78, LZW, etc.). Even if the total size of the streams increases, their structure is improved, and further compression goes better and faster.

2) Statistics. a) Adaptive (stream). Calculation of probabilities for incoming data using statistics on data already processed ("modelling"), then coding with these calculated probabilities. For example, the family of PPM methods[2] for words; adaptive Huffman, Shannon-Fano and Arithmetic Coding for bytes. In contrast to the first case, old statistics carries the same weight as recent, unless the algorithm specially strives against it. Besides, all combinations, including those that never appeared before and most likely never will, are considered possible. b) Block. Statistics is encoded separately and added to the compressed block. Static Huffman/Shannon-Fano and Arithmetic Coding for bytes; static (pre-conditioned) PPM for words[9].

3) Transformation of Block.
Incoming data is split into blocks which are then transformed as a whole (and in the case of homogeneous data, it is often better to take the whole block to be compressed). BlockSorting methods, including BWT[3], the Fourier Transform, the Discrete Cosine Transform, the Discrete Wavelet Transform, Fractal transforms, Enumerative Coding. As in the first case, several blocks can be formed instead of one, and even if their total length does not decrease, their structure is considerably improved, and further compression goes easier, faster and better. In one sentence: a compression algorithm can be either statistical or transforming, and can handle incoming data either as a stream or as blocks; the more homogeneous and the bigger the data and the memory, the more effective block algorithms are; the less homogeneous and the smaller the data and the memory, the better stream methods work; the more complex the Source, the more the optimal transformation improves compression; the simpler the Source (Bernoulli and Markov sources), the more effective a straightforward statistical solution is. Complex, more sophisticated compression algorithms will synthesize all the strategies, but for now there are no such algorithms, and there remain three base strategies. Nowadays all existing compressors use, for blocks of words, either a method from the LZ set, or from the BlockSorting set, or from the PPM family, but not a synthesis of all three or at least two strategies (see [4] for example, and also [5] and [6]). However, some synthesizing algorithms have existed for a long time. First, the net favourites - MP3, MPEG and JPEG - successfully combine stream and block strategies. Thousands of scientific and popular articles have been written about them and about the Discrete Cosine Transform, the main algorithm they use, therefore they will not be discussed hereafter: see [7] and [8].
Second, a proper compressor sequentially applies three or four algorithms (preprocessing/filtering, compression of words, then bytes, and finally bits), and the last one is statistical, as a rule, though the first two or three can be transforming. Because of the distinctive features of MMD - linearity, dimensionality, channels - transformations practically always improve its compression. Besides, MMD comes in big enough and homogeneous blocks, as a rule, and more and more memory is available to process it. That is why the more effective MMD compressors use one of the methods of the third strategy (Transformation of Block) as the primary method applied to true MMD and defining the efficiency of the whole process (secondary algorithms work with PMMD). What transformations exist?

1. RLE, Run Length Encoding. The most trivial way of compression, using the least memory and time. The compressor finds repetitions (runs) of identical elements Z and substitutes them with the set {element Z, a flag of repetition, number of elements in the repetition N}. All other elements are simply copied from the input stream to the output. If there is more memory than a few bytes, it is possible to save distances to the next repetitions instead of flags of repetitions. Like any stream method, RLE is applicable to blocks of data. In general, block methods differ only in that they cannot start working before the length of the buffer with data is given. The rest are their properties, not defining signs:
- the smaller the block size, the less effective a block method is (this property is inherent to stream methods to a lesser degree);
- any element of a block is processed in the same way, regardless of its distance to the beginning and the end of the block (this is also peculiar to many stream methods - SEM and Delta Coding, for instance);
- almost all block methods are two-pass (LZ77 etc. are much more effective with the two passes "search for words" and "analysis and conversion", for example flexible parsing [18],[19]).
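As a minimal sketch of the run-finding idea above (function names are mine, and real implementations use flags or run-length thresholds instead of encoding every element as a pair):

```python
def rle_encode(data):
    """RLE sketch: each run of identical elements Z becomes a pair (Z, N)."""
    out = []
    i = 0
    while i < len(data):
        z, n = data[i], 1
        while i + n < len(data) and data[i + n] == z:
            n += 1                      # extend the current run
        out.append((z, n))
        i += n
    return out

def rle_decode(pairs):
    """Inverse: expand each (Z, N) pair back into N copies of Z."""
    out = []
    for z, n in pairs:
        out.extend([z] * n)
    return out
```

For example, [5, 5, 5, 2, 7, 7] becomes [(5, 3), (2, 1), (7, 2)]; a practical coder would avoid spending a length field on the isolated 2.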
If further compression of the RLE output is expected, it is possible to create four output streams/blocks: lengths of repetitions N[i]; elements of repetitions Z[i]; lengths of the other fragments, where repetitions were not found, R[i]; the other fragments themselves OF[i]. It can turn out that compression is better if the first elements of the other fragments (the elements that "break off" repetitions) are put into a fifth stream/block, and perhaps the last elements of OF into a sixth. The main principles of data compression are clear from this most trivial example. It is worth mentioning that RLE is a special case of the "sliding window", dictionary methods like LZ*, where the size of the window-dictionary is one - that is, the offset can only be one element back, while the length is not limited. RLE is more effective on MMD with small dimensionality R, from 1 to 8 bits. The greater the dimensionality R of the MMD bytes, the less probable it is that applying RLE to that MMD will improve compression.

2. SEM, Separate Exponents and Mantissas. A big set of methods usually called Integer Coding[10], including the structured model of Peter Fenwick, Rice codes[11], Fibonacci codes, Golomb codes[12], Elias Gamma and Delta codes[13] and many other, more complex variants, for example DAKX[14]. In the most basic case two streams are formed: the higher N bits and the lower (R-N) bits. It is obvious that the first stream - Most Significant Bits - compresses better if N is properly chosen, especially in the case of MMD. But separating exponents (the positions of the highest non-zero bits B) from mantissas (the other bits lower than B) is even more effective. The first stream is byte-wide, the second is bit-wide. It is possible to separate the resulting streams further, especially the first one. SEM is often applied to MMD compression in combination with other methods, and it is always present in lossy MMD compression algorithms.

3. MTF, Move To Front, and DC, Distance Coding, are well described in Vadim Yoockin's BWT-FAQ[3].
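Returning for a moment to item 2: the exponent/mantissa separation can be sketched as follows (a simplified illustration for non-negative values, with names of my own choosing; real coders pack the mantissa stream bit-wise):

```python
def sem_split(values):
    """SEM sketch: exponent = position of the highest set bit B,
    mantissa = the bits below B (the top bit itself is implicit)."""
    exponents, mantissas = [], []
    for v in values:
        e = v.bit_length()              # 0 for v == 0
        exponents.append(e)
        mantissas.append(v - (1 << (e - 1)) if e > 0 else 0)
    return exponents, mantissas

def sem_join(exponents, mantissas):
    """Inverse: restore the implicit top bit and add the mantissa back."""
    return [(1 << (e - 1)) + m if e > 0 else 0
            for e, m in zip(exponents, mantissas)]
```

The exponent stream takes small values from a narrow range and compresses well on its own; the mantissa of exponent e needs only e-1 bits.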
Let me add what I have not yet seen in descriptions of these methods. MTF can in general use a sliding window of N elements, instead of one element, as in the commonly considered variant. Frequencies of elements in this window are calculated, and the conversion table of elements from the input to the output stream is built from them. Elements leaving the window are otherwise processed exactly as in the usual variant. The weights of elements in the window can decrease, for instance, linearly: from N for the element that has just entered the window, to 1 for the one that leaves the window on the next step. Or exponentially, or not at all: the same weight for all elements, regardless of their distance from the top of the window. In contrast to the Huffman/Shannon-Fano methods, a "straight" tree is built, not a "branchy" one: it is created so that there is only one element on every level. Such MTF is closer to 1st-order PPM, but, unlike PPM, it outputs bytes instead of probabilities in the range from 0 to 1.

4. DC, unlike the three methods described above (RLE, SEM and MTF), can be applied to N-dimensional data, not just one-dimensional. In the two-dimensional case it is possible to encode the distance to the nearest equal element either as a pair of distances (X, Y) in the Cartesian coordinate system, or as a pair (angle, distance), or as a "distance in a helix":

 28 23 18 11 19 24
 17  6  2  7 20 27
 10  1  0  3 12 25
 16  5  4  8 13 22
 15  9 14 21 26

Similarly, in the case when only the elements above, and those at the same height and to the left, are known:

       17 14 18
    11  8  6  9 12 16
     7  3  2  4 10 15
 13  5  1  0

Apparently nobody has yet tried to apply two-dimensional DC to compression of MMD like graphics, but such experiments could lead to good results on a wide class of data. Theoretically, "plain" one-dimensional DC, like RLE and MTF, is more effective on MMD with small byte dimensionality R, but practical results are missing.
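A two-dimensional DC distance of this kind can be sketched by enumerating the already-coded (causal) neighbours in order of increasing Euclidean distance and outputting the rank of the nearest equal element. This is one possible ordering convention, not necessarily the exact helix shown above, and the names are mine:

```python
def dc_distance_2d(grid, y, x):
    """2D Distance Coding sketch: rank (1-based, in a fixed distance-ordered
    enumeration of causal neighbours) of the nearest earlier cell holding the
    same value as grid[y][x]. Returns 0 if the value has not occurred yet."""
    target = grid[y][x]
    # causal cells: rows above, or same row strictly to the left
    cells = [(yy, xx)
             for yy in range(len(grid)) for xx in range(len(grid[0]))
             if yy < y or (yy == y and xx < x)]
    # order by squared Euclidean distance; ties broken by position
    cells.sort(key=lambda p: ((p[0] - y) ** 2 + (p[1] - x) ** 2, p))
    for rank, (yy, xx) in enumerate(cells, start=1):
        if grid[yy][xx] == target:
            return rank
    return 0
```

As with one-dimensional DC, the output ranks cluster around small values when equal elements tend to lie close together, which is exactly the situation in graphics.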
Even examples of applying DC to one-dimensional MMD are unknown to me, and most likely they are still absent, as DC has only recently drawn the attention of researchers, and by now it is studied notably less than other methods.

5. LPC, Linear Prediction Coding, including the very complex but very effective variants CELP (Code-Excited Linear Prediction)[15,16] and MELP (Mixed Excitation Linear Prediction)[17], developed for strong lossy compression of speech. Much simpler and more widespread is the elementary variant called Delta Coding - for example ADPCM, Adaptive Delta PCM (Pulse Code Modulation). The approximation B[i]=B[i-1] is used, and the stream {B[i]} is transformed into {B[i] - B[i-1]}. Because of the third property of MMD ("Linearity"), the difference between neighbouring elements is small, as a rule. Therefore Delta Coding is applicable to all "typical" MMD, regardless of their origin, dimensionality and all the rest. Moreover, this third property of MMD is usually exploited to detect MMD among unknown data, because of all methods Delta Coding consumes the least resources - memory and time. Via the second simple variant - the approximation B[i-1] = (B[i]+B[i-2])/2, i.e. the conversion to B[i] - 2*B[i-1] + B[i-2] - we come to the general form: B[i] = sum( K[j]*B[j] ), 0 < j < i. The dependence of B[i] on preceding elements usually decays exponentially, and it is absolutely useless to consider more than ten previous B[j]. In the two-dimensional case the second variant is: B[i,k] = (B[i-1,k] + B[i,k-1])/2; having B[i,k], we create {B[i,k] - (B[i-1,k]+B[i,k-1])/2}. The third: B[i,k] = (B[i-1,k] + B[i,k-1] + B[i-1,k-1])/3. The fourth: B[i,k] - B[i-1,k] = B[i,k-1] - B[i-1,k-1]. Further complications are not so effective in the case of MMD. Taking into account two or three neighbouring elements leads to a qualitative improvement of tens of percent, while all other elements add one or two percent.
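The first two one-dimensional variants above can be written down in a few lines (an illustrative sketch; function names are mine):

```python
def delta_encode(samples):
    """First variant: B[i] -> B[i] - B[i-1]; the first sample is kept as-is."""
    return [samples[0]] + [samples[i] - samples[i - 1]
                           for i in range(1, len(samples))]

def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def order2_encode(samples):
    """Second variant: residual B[i] - 2*B[i-1] + B[i-2],
    i.e. prediction of B[i] by linear extrapolation of the two
    previous samples; the first two samples are kept as-is."""
    return samples[:2] + [samples[i] - 2 * samples[i - 1] + samples[i - 2]
                          for i in range(2, len(samples))]
```

On a smooth ramp like [10, 12, 15, 15, 13] the deltas [10, 2, 3, 0, -2] stay near zero, which is what makes the subsequent statistical stage effective.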
Nevertheless, it is useful to consider them, but for another purpose: accounting for noise significantly improves MMD compression with Delta Coding. Every MMD element B[i] deviates from its predicted value not only because of strong deterministic influences, but also because of weak background oscillations, that is, noise: B[i]=S[i]+N[i]. It is clear that, first, the smaller the contribution of the noise N[i], the more effective LPC is, and on fragments of "silence", containing only noise, LPC gives no advantage (since differences of R-bit elements will be (R+1)-bit, not R-bit). And second, the more elements are considered, the easier it is to separate the main tendency from the background noise.

6. Studying Delta Coding, it is easy to invent one more fundamental method - Subband Coding. Indeed, if we have two "parallel" blocks (of the same length) or streams (synchronous) B[i] and C[i], and know that they are interdependent, how can we use it? If B and C are the higher and lower bits after SEM, it is difficult to contrive anything better than PBS (described below, in the next item). But if B and C are parallel channels of MMD (readings of closely placed measuring devices), more often they will be rather similar: the measured values will be of the same order (though the contributions of noise can differ). For instance, the left and right channels of stereo sound. Let us try to take this into account with the same approximation as in the LPC case: B[i]=C[i]. If it turns out that it is really better to encode D[i]=B[i]-C[i] than B[i] or C[i], what function can be applied to the other channel? Exactly: it is possible to store the sum A[i]=B[i]+C[i] instead of the second channel, and this also gives a gain in compression ratio on the majority of MMD. The main disadvantage of such a decomposition into average and difference (delta) was already mentioned above: if R bits are used for the elements of streams B[i] and C[i], their sums and differences will be (R+1)-bit.
If B[i] and C[i] are from 0 to 255, the sums are from 0 to 510 and the differences from -255 to 255. What can be done about this? First, we see that A[i] and D[i] are either both even or both odd; that is, the low bit of A[i] coincides with the low bit of D[i]. Therefore it is possible to create either D[i]=B[i]-C[i] and A[i]=(B[i]+C[i])/2, or D[i]=(B[i]-C[i])/2 and A[i]=B[i]+C[i]. Getting rid of the second "superfluous" bit is possible only at the cost of some loss of compression ratio:

V[i]=(B[i]+C[i])mod(2^R) and E[i]=(B[i]-V[i]/2)mod(2^R), or
V[i]=(B[i]+C[i])mod(2^R) and E[i]=(C[i]-V[i]/2)mod(2^R), or
E[i]=(B[i]-C[i])mod(2^R) and V[i]=(B[i]-E[i]/2)mod(2^R), or
E[i]=(B[i]-C[i])mod(2^R) and V[i]=(C[i]+E[i]/2)mod(2^R)

(division is integer, (A)mod(B) is the remainder of A divided by B). Prove that B[i] and C[i] can be restored from any of these pairs of arrays, and cannot be restored from these:

A[i]=(B[i]+C[i])/2 and E[i]=(B[i]-A[i])mod(2^R),
A[i]=(B[i]+C[i])/2 and E[i]=(C[i]-A[i])mod(2^R),
D[i]=(B[i]-C[i])/2 and V[i]=(B[i]-D[i])mod(2^R),
D[i]=(B[i]-C[i])/2 and V[i]=(C[i]+D[i])mod(2^R).

If there are not two channels but more, say nine, the search for the optimal transform becomes a very non-trivial task. Let us remark only that it is possible to divide the channels into pairs, apply the Average+Delta conversion to these pairs, and then to the results. Or to sequentially take the third, the fourth, etc. channel, and on every N-th step save the difference between the N-th channel and the average of all N-1 previous ones. However, all this is only a foreword to Subband Coding. Let us return to the case where only one channel is present. Is Average+Delta applicable? Yes. We divide all elements B[i] into pairs (B[2*j], B[2*j+1]) and transform them into two streams - Averages and Differences. In the case of MMD the physical sense of the Averages is the lower frequencies, and of the Differences the higher frequencies.
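The first of the four lossless pairs above can be checked in a few lines; this sketch (R fixed at 8, names mine) shows both directions of the transform:

```python
R = 8
MOD = 1 << R                      # 2^R

def fwd(b, c):
    """V = (B+C) mod 2^R, E = (B - V/2) mod 2^R: both outputs stay R-bit."""
    v = (b + c) % MOD
    e = (b - v // 2) % MOD
    return v, e

def inv(v, e):
    """Restore (B, C): B follows directly from E's definition,
    then C from V's definition, all modulo 2^R."""
    b = (v // 2 + e) % MOD
    c = (v - b) % MOD
    return b, c
```

Invertibility is immediate: E=(B-V/2) mod 2^R gives B=(V/2+E) mod 2^R, and then C=(V-B) mod 2^R. The "cannot be restored" pairs fail because A=(B+C)/2 discards the low bit of B+C, so two different (B, C) pairs map to the same outputs.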
Traditional Subband Coding consists in the multiple application of Average+Delta, each time to the first of the two created blocks, the lower frequencies.

7. PBS, Parallel Blocks Sorting. The description of this method is published here for the first time, as leading experts confirm. PBS is used for lossless compression of multimedia files in the ERI32 compressor, which set the world record on the most popular set of 24-bit pictures, Kodak True Color Images [21]. Consider two parallel blocks: "Basis" and "Sorted" (the one being sorted). First, pass through the first block, counting the elements in it (a cycle over the block size, calculating the "lengths of boxes" N[i]). Second, using the created array N[i], build the array P[i]=P[i-1]+N[i-1] (a cycle over 2^R, calculating the "beginnings of boxes" P[i] - an array of pointers). And finally, the main loop: go synchronously through "Basis" and "Sorted", take element B from "Basis" and element S from "Sorted", write S to address P[B], and increment P[B] (a cycle over the block size, "putting elements into their corresponding boxes"). Notice that two or even more "Bases" can be used; then in the first and third cycles the "Total Basis" is calculated as the sum B1 + B2*2^R + B3*2^(2*R) + ... The order of the bases is very important: which is first, which second, which third, etc. For example, having divided (by SEM) a block of 32-bit bytes into "high 8", "middle 8" and "low 16" - HI, MD and LO - compression is better if the total basis is MD+HI*256, not HI+MD*256. Obviously, "Basis" can coincide with "Sorted" cyclically rotated by C bytes. The most interesting case is C=1, that is, sorting every element by its predecessor. And even in this "rotated" case there can be many bases. If it is a two-dimensional picture, the second basis can be "Sorted" rotated by D bytes, so that the (i-D)-th element is the second of the two known neighbours - say, the left and the upper one. This is possible because a rectangular block of two-dimensional MMD can be stored in one-dimensional memory sequentially, line after line.
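The three passes described above amount to a stable counting sort of "Sorted" keyed by the parallel "Basis" block. A single-basis sketch (names mine, not ERI32's actual code):

```python
def pbs(basis, sorted_block, r=8):
    """Parallel Blocks Sorting sketch: reorder sorted_block by the values
    of the parallel basis block, preserving the order of equal keys."""
    n = [0] * (1 << r)
    for b in basis:                          # pass 1: lengths of boxes N[i]
        n[b] += 1
    p = [0] * (1 << r)
    for i in range(1, 1 << r):               # pass 2: beginnings of boxes
        p[i] = p[i - 1] + n[i - 1]           #         P[i] = P[i-1] + N[i-1]
    out = [0] * len(sorted_block)
    for b, s in zip(basis, sorted_block):    # pass 3: put S into box P[B]
        out[p[b]] = s
        p[b] += 1
    return out
```

With basis [2, 1, 2, 0] and data [10, 20, 30, 40], the boxes come out as 0:[40], 1:[20], 2:[10, 30], giving [40, 20, 10, 30]; the multi-basis variant only changes the key computed in passes 1 and 3.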
Then D will simply be the length of a line of the rectangular block, the (i-1)-th element a pixel to the left of the i-th, and the (i-D)-th a pixel above the i-th. Unfortunately, there is no space here for an examination of even more complex algorithms: even the briefest description of each of them would occupy one or two hundred lines.

8. VQ, Vector Quantization[23]
9. Wavelet Transform[24]
10. Fractal Transform[25]
11. Enumerative Coding[26]
12. ERI93

Further details and articles are at and

Summing up, let us return to the definition of MMD and consider an alternative variant of it.

I. MMI is information coming from the physical world; MMD is the result of its measurement by quantization devices. Who uses it, and how, is a secondary question. Computer-synthesized MMD (SMMD) also exist, but their total amount is smaller, and they are better regarded as databases: their artificiality is practically always easily seen, even by humans, and algorithms for detecting them are no more complex than the algorithms for synthesizing "hand-made" MMD. In particular, such pictures compress losslessly at more than 10:1, while those created by quantization compress at approximately 2:1 or 3:1. The optimal compression of artificial MMD is the algorithm of its creation. But an alternative, "reverse" definition (thanks to E.Shelvin) is also possible. It considers the Receiver more important than the Source.

II. MMI is information intended for perception by the sense organs of a human; MMD is data of arbitrary structure/format, being the representation of some amount of MMI. What the source is, is a secondary question.

But the alternative definition seems to have more disadvantages than the first. Here are some of them: (1) While a recorded block of MMD, say a sound, has not yet been heard by anyone, it is not MMD, but as soon as it has been heard, it is?
Of course, a similar "reverse" question can be asked of the first definition: if a block of MMD, say "white noise", is created by quantization of silence or of a white sheet of paper, it is MMD, but if it results from a computer algorithm, it is not MMD? Exactly so. Because, among other things, a real object (the physical world) is always more complex than its (computer) model - even elementary objects (silence, white noise) - and the more complex the object, the more it differs from the model. (2) A significant part of the data having the three distinctive features of MMD (dimensionality, linearity, channels), which come from the peculiarities of quantization and storage of MMI, will not be MMD solely because of the definition of MMD. (3) The main characteristic of MMI in the alternative case is "quality", that is, the relation of useful information to the noise component in it. And the main property of MMD is the presence of noise in it. Nevertheless, MMD without noise exist (moreover, in the ideal case MMD contains no noise and compresses losslessly). Example: tune a microphone in silence, and a scanner on a white sheet of paper, so that noise appears only around the thirtieth bit. After that, when recording sound and scanning images, save only 16 bits of every sample and pixel. If the contribution of noise is less than 0.001%, we can neglect it in most cases.

Many thanks to Vadim Yoockin and Eugene Shelvin for comments and questions on the first versions of this article.

References
==========
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26]

This text, unlike [1], was originally written in Russian; Russian ver.1.0 takes place at English ver.1.0 can be found at

With best kind regards, and excuses for misprints,
A.Ratushnyak