COMPUTING PRACTICES
Edgar H. Sibley
Panel Editor
The state
of
the art in data compression is arithmetic coding, not the better-
known
Huffman
method. Arithmetic coding gives greater compression, is
faster for adaptive models, and clearly separates the model from the channel
encoding.
ARITHMETIC CODING FOR
DATA COIUPRESSION
IAN H. WIllEN, RADFORD M. NEAL, and JOHN G. CLEARY
Arithmetic coding is superior in most respects to the
better-known Huffman [lo] method. It represents in-
formation at least as compactly-sometimes consid-
erably more so. Its performance is optimal without
the need for blocking of input data. It encourages a
clear separation between the model for representing
data and the encoding of information with respect to
that model. It accommodates adaptive models easily
and is computationally efficient. Yet many authors
and practitioners seem unaware of the technique.
Indeed there is a widespread belief that Huffman
coding cannot be improved upon.
We aim to rectify this situation by presenting an
accessible implementation of arithmetic coding and
by detailing its performance characteristics. We start
by briefly reviewing basic concepts of data compres-
sion and introducing the model-based approach that
underlies most modern techniques. We then outline
the idea of arithmetic coding using a simple exam-
ple, before presenting programs for both encoding
and decoding. In these programs the model occupies
a separate module so that different models can easily
be used. Next we discuss the construction of fixed
and adaptive models and detail the compression
efficiency and execution time of the programs,
including the effect of different arithmetic word
lengths on compression efficiency. Finally, we out-
line a few applications where arithmetic coding is
appropriate.
Financial support for this work has been provided by the Natural Sciences
and E@neering Research Council of Canada.
UNIX is a registered trademark of AT&T Bell Laboratories.
0 1987 ACM OOOl-0782/87/OtiOO-0520 750
DATA COMPRESSION
To many, data compression conjures up an assort-
ment of ad hoc techniques such as conversion of
spaces in text to tabs, creation of special codes for
common words, or run-length coding of picture data
(e.g., see [8]). This contrasts with the more modern
model-based paradigm for coding, where, from an
input string of symbols and a model, an encoded string
is produced that is (usually) a compressed version of
the input. The decoder, which must have access to
the same model, regenerates the exact input string
from the encoded string. Input symbols are drawn
from some well-defined set such as the ASCII or
binary alphabets; the encoded string is a plain se-
quence of bits. The model is a way of calculating, in
any given context, the distribution of probabilities
for the next input symbol. It must be possible for the
decoder to produce exactly the same probability dis-
tribution in the same context. Compression is
achieved by transmitting the more probable symbols
in fewer bits than the less probable ones.
For example, the model may assign a predeter-
mined probability to each symbol in the ASCII
alphabet. No context is involved. These probabilities
can be determined by counting frequencies in repre-
sentative samples of text to be transmitted. Such a
fixed model is communicated in advance to both en-
coder and decoder, after which it is used for many
messages.
Alternatively, the probabilities that an adaptive
model assigns may change as each symbol is trans-
mitted, based on the symbol frequencies seen so far
in the message. There is no need for a representative
520
Communications of the ACM
June 1987 Volume 30 Number 6
Computing Practices
sample of text, because each message is treated as
an independent unit, starting from scratch. The en-
coder’s model changes with each symbol transmit-
ted, and the decoder’s changes with each symbol
received, in sympathy.
More complex models can provide more accurate
probabilistic predictions and hence achieve greater
compression. For example, several characters of pre-
vious context could condition the next-symbol prob-
ability. Such methods have enabled mixed-case Eng-
lish text to be encoded in around 2.2 bits/character
with two quite different kinds of model [4,
61.
Tech-
niques that do not separate modeling from coding
so distinctly, like that of Ziv and Lempel
(231,
do
not seem to show such great potential for compres-
sion, although they may be appropriate when the
aim is raw speed rather than compression per-
formance
[22].
The effectiveness of any model can be measured
by the entropy of the message with respect to it,
usually expressed in bits/symbol. Shannon’s funda-
mental theorem of coding states that, given messages
randomly generated from a model, it is impossible to
encode them into less bits (on average) than the en-
tropy of that model [21].
A message can be coded with respect to a model
using either Huffman or arithmetic coding. The for-
mer method is frequently advocated as the best pos-
sible technique for reducing the encoded data rate.
It is not. Given that each symbol in the alphabet
must translate into an integral number of bits in the
encoding, Huffman coding indeed achieves “mini-
mum redundancy.” In other words, it performs opti-
mally if all symbol probabilities are integral powers
of %. But this is not normally the case in practice;
indeed, Huffman coding can take up to one extra bit
per symbol. The worst case is realized by a source
in which one symbol has probability approaching
unity. Symbols emanating from such a source con-
vey negligible information on average, but require at
least one bit to transmit [7]. Arithmetic coding dis-
penses with the restriction that each symbol must
translate into an integral number of bits, thereby
coding more efficiently. It actually achieves the the-
oretical entropy bound to compression efficiency for
any source, including the one just mentioned.
In general, sophisticated models expose the defi-
ciencies of Huffman coding more starkly than simple
ones. This is because they more often predict sym-
bols with probabilities close to one, the worst case
for Huffman coding. For example, the techniques
mentioned above that code English text in 2.2 bits/
character both use arithmetic coding as the final
step, and performance would be impacted severely
June 1987 Volume 30 Number 6
if Huffman coding were substituted. Nevertheless,
since our topic is coding and not modeling, the illus-
trations in this article all employ simple models.
Even so, as we shall see, Huffman coding is inferior
to arithmetic coding.
The basic concept of arithmetic coding can be
traced back to Elias in the early 1960s (see [l,
pp.
61-621).
Practical techniques were first intro-
duced by Rissanen
[16]
and Pasco [15], and de-
veloped further by Rissanen [17]. Details of the
implementation presented here have not appeared
in the literature before; Rubin [2O] is closest to our
approach. The reader interested in the broader class
of arithmetic codes is referred to
[18];
a tutorial is
available in [l3]. Despite these publications, the
method is not widely known. A number of recent
books and papers on data compression mention it
only in passing, or not at all.
THE IDEA OF ARITHMETIC CODING
In arithmetic coding, a message is represented by an
interval of real numbers between
0
and
1.
As the
message becomes longer, the interval needed’to rep-
resent it becomes smaller, and the number of bits
needed to specify that interval grows. Successive
symbols of the message reduce the size of the inter-
val in accordance with the symbol probabilities gen-
erated by the model. The more likely symbols re-
duce the range by less than the unlikely symbols
and hence add fewer bits to the message.
Before anything is transmitted, the range for the
message is the entire interval
[0, l),
denoting the
half-open interval
0
5 x <
1.
As each symbol is
processed, the range is narrowed to that portion of it
allocated to the symbol. For example, suppose the
alphabet is (a, e, i, O, u, !I, and a fixed model is used
with probabilities shown in Table I. Imagine trans-
TABLE I. Example Fixed Model for Alphabet (a, e, i, o, u, !)
Symbol Probability
Range
.2
LO,
0.2)
.3 [0.2, 0.5)
.l [0.5, 0.6)
.2 [0.6,0.8)
.l
[0.8, 0.9)
.l [0.9, 1.0)
mitting the message eaii!. Initially, both encoder
and decoder know that the range is
[0, 1).
After
seeing the first symbol, e, the encoder narrows it to
[0.2, 04,
the range the model allocates to this sym-
bol. The second symbol, a, will narrow this new
range to the first one-fifth of it, since a has been
Communications of the ACM
521
Computing Practices
allocated [0, 0.2). This produces [O.Z, 0.26), since the
previous range was 0.3 units long and one-fifth of
that is 0.06. The next symbol, i, is allocated [0.5, 0.6),
which when applied to [0.2, 0.26) gives the smaller
range [0.23, 0.236). Proceeding in this way, the en-
coded message builds up as follows:
Initially
After seeing e
a
i
i
!
1)
;::2, 0.5)
0.26)
p.2,
0.236)
[0.23,
[0.233, 0.2336)
[0.23354, 0.2336)
Figure la. The second symbol scales it again into the
range [0.2, 0.26). But the picture cannot be contin-
ued in this way without a magnifying glass! Conse-
quently, Figure lb shows the ranges expanded to
full height at every stage and marked with a scale
that gives the endpoints as numbers.
Suppose all the decoder knows about the message
is the final range, [0.23354, 0.2336). It can -immedi-
ately deduce that the first character was e! since the
range lies entirely within the space the model of
Table I allocates for e. Now it can simulate the oper-
ation of the encoder:
Initially
After seeing e
P, 1)
[0.2, 0.5)
Figure 1 shows another representation of the en-
coding process. The vertical bars with ticks repre-
sent the symbol probabilities stipulated by the
model. After the first symbol has been processed, the
model is scaled into the range [0.2, 0.5), as shown in
This makes it clear that the second character is a,
since this will produce the range
After seeing a
[0.2, 0.26),
After
seeing
Nothing
0
’ !
ri
U
0
e a
i
e
3
a
FIGURE la.
Representation of the Arithmetic Coding Process
which entirely encloses the given range [0.23354,
0.2336). Proceeding like this, the decoder can iden-
tify the whole message.
It is not really necessary for the decoder to know
both ends of the range produced by the encoder.
Instead, a single number within the range--for ex-
ample, 0.23355-will suffice. (Other numbers, like
0.23354, 0.23357, or even 0.23354321, would do just
as well.) However, the decoder will face the problem
of detecting the end of the message, to determine
when to stop decoding. After all, the single number
0.0 could represent any of a, aa, aaa, aaaa, . . . . To
resolve the ambiguity, we ensure that each message
ends with a special EOF symbol known to both en-
coder and decoder. For the alphabet of Table I, ! will
be used to terminate messages, and only to termi-
After
seeing
Nothing
1
!
u
0
e
a
i
!
U
0
/
i
e
i !
0.5
0.26
!
u
0
/
i
e
a
0
i
a
i
e
a
i
0.2
i
0.2
a
FIGURE lb. Representation of the Arithmetic Coding
Process with the interval Scaled Up at Each Stage
522
Communications
of
the ACM
June 1987 Volume 30 Number 6
Computing Practices
/* ARITHMETIC ENCODING ALGORITHM. */
/*
/*
Call
Ensure
encode-symbol
that a distinguished
repeatedly
"terminator"
for each symbol
symbol
in
is
the
encoded
message.
last, then
*/
/* transmit any value in the range [low, high).
*/
*/
encode-symbol(symbo1,
range
cum-freq)
high
= high - low
low
= low
= low
f
f
range*cum-freq[symbol-11
range*cum-freq(symbol1
/* ARITHMETIC DECODING ALGORITHM. */
/*
/* Continue
"Value" is
calling
the number that
decode-symbol
has been received.
until the terminator symbol is returned.
*/
*/
decode-symbol(cum-freq)
find symbol such that
cum-freq[symbol] <= (value-low)/(high-low)
/* This ensures that value
< cum-freqrsymbol-11
lies within the new l /
;* (low, high) range that will be calculated by */
/* the following lines of code.
*/
range = high - low
high
1OW
= low
return
= low
t range*cum-freq[symbol-11
symbol
t range*cum-freq[symbol]
FIGURE 2. Pseudocode for the Encoding and Decoding Procedures
nate messages. When the decoder sees this symbol,
A PROGRAM FOR ARITHMETIC CODING
it stops decoding.
Figure 2 shows a pseudocode fragment that summa-
Relative to the fixed model of Table I, the entropy
rizes the encoding and decoding procedures devel-
of the five-symbol message eaii! is
oped in the last section. Symbols are numbered, 1, 2,
-log 0.3 - log 0.2 - log 0.1 - log 0.1 - log 0.1
3 . . . The frequency
range for the ith symbol is
from cum-freq[i] to cum-freq[i - 11. As i decreases,
= -log 0.00006 = 4.22
cum-freq[i] increases, and cum-freq[O] = 1. (The
(using base 10, since the above encoding was per-
reason for this “backwards” convention is that
formed in decimal). This explains why it takes five
cum-freq[O] will later contain a normalizing factor,
decimal digits to encode the message. In fact, the
and it will be convenient to have it begin the array.]
size of the final range is 0.2336 - 0.23354 = 0.00006,
The “current interval” is [Zozu, high), and for both
and the entropy is the negative logarithm of this
encoding and decoding, this should be initialized
figure. Of course, we normally work in binary,
to [O, 1).
transmitting binary digits and measuring entropy
Unfortunately, Figure 2 is overly simplistic. In
in bits.
practice, there are several factors that complicate
Five decimal digits seems a lot to encode a mes-
both encoding and decoding:
sage comprising four vowels! It is perhaps unfortu-
Incremental transmission and reception. The encode
nate that our example ended up by expanding
algorithm as described does not transmit anything
rather than compressing. Needless to say, however,
until the entire message has been encoded; neither
different models will give different entropies. The
does the decode algorithm begin decoding until it
best single-character model of the message eaii! is
has received the complete transmission. In most
the set of symbol frequencies (e(O.2), a(0.2), i(O.4),
applications an incremental mode of operation is
!(0.2)), which gives an entropy of 2.89 decimal digits.
necessary.
Using this model the encoding would be only three The desire to use integer arithmetic. The precision
digits long. Moreover, as noted earlier, more sophis- required to represent the [low, high) interval grows
ticated models give much better performance with the length of the message. Incremental opera-
in general. tion will help overcome this, but the potential for
]une 1987 Volume 30 Number 6
Communications
of
the ACM
523
Computing Practices
overflow and underflow must still be examined
carefully.
Representing the model so thnt it can be consulted
efficiently. The representation used for the model
should minimize the time required for the decode
algorithm to identify the next symbol. Moreover,
an adaptive model should be organized to minimize
the time-consuming task of maintaining cumulative
frequencies.
Figure 3 shows working code, in C, for arithmetic
encoding and decoding. It is considerably lmore de-
tailed than the bare-bones sketch of Figure Z! Imple-
mentations of two different models are given in
Figure
4;
the Figure 3 code can use either one.
The remainder of this section examines the code
of Figure 3 more closely, and includes a proof that
decoding is still correct in the integer implementa-
tion and a review of constraints on word lengths in
the program.
arithmetic-coding-h
1
2
3
4
5
6
7
a
9
10
11
12
13
14
15
:6
/' DECLARATIONS USED FOR ARITHMETIC ENCODING AND DECODING
l
/
/* SIZE OF ARITHMETIC CODE VALUES.
l
/
#define
typedef
fdefine
Code-value-bits
long code-value:
Top-value
16 /* Number of bits in a code value
/* Type of an arithmetic code value
/* Largest code value
l
/
l
/
l
/ (((long)l< /' HALF AND QUARTER POINTS IN THE CODE VALUE RANGE. l / *define #define Idefine mode1.h First-qtr Half Third-qtr (Top-value/ltl) (Z'First-qtr) (3’Firat-qtr) /* Point after /* Point after /* Point after first first third quarter half quarter l / l / "/ 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 /' INTERFACE TO THE MODEL. '/ /' THE SET OF SYMBOLS THAT MAY BE ENCODED. l / #define #define #define No-of-chars EOF-symbol No-of-symbols 256 (No-of-charetl) (No-of-charstll /* Number of character symbols /* Index of EOF symbol /* Total number of symbols '/ '/ */ /' TRANSLATION TABLES BETWEEN CHARACTERS AND SYMBOL INDEXES. l / int char-to-index[No-of-chars]; unsigned char index_to_char[No_of_symbols+l]: /* CUMULATIVE FREQUENCY TABLE. */ Idefine int /* To index from character /* To character from index l / '/ Max-frequency 16383 cum_frsq[No_of_symbols+l]; /* Maximum allowed frequency count 2a14 - 1 /* /* Cumulative symbol frequencies l / l / l / FIGURE 3. C Implementation of Arithmetic Encoding and Decoding 524 Communicationso~ the ACM June 1987 Volume 30 Number 6 Computing Practices encode.c 39 40 /' MAIN PROGRAM FOR ENCODING. l / 41 42 #include #include "mode1.h" 43 44 45 main0 46 1 startmodel~): /* Set up other modules. l / 47 start-outputlng-bitsO; 40 start-encodingo; 49 for l / 50 int (::I ( /' Loop through characters. 51 ch - getc(stdin); ch; lnt symbol; Read the next 52 If (ch--EOF) break; /* on end-of-file.*/ character. l / Exit loop 53 symbol- /* Translate 54 encode-symbol(symbol,cum-freq); char-to.-lndex[ch]; /* Encode that to an index. l / 55 update-model(symbo1); /* /* Update the model. symbol. l / l / 56 57 encode_symbol(EOF_~symbol,cum_freq); 1 /* Encode the EOF symbol. l / 58 done-encoding(); /* Send the last few blts. l / 59 done-outputinebitso; 60 exit (0) : arithmetic - encode.c 61 /* ARITHMETIC ENCODING ALGORITHM. */ 62 63 64 finclude "arithmetic-cod1ng.h" 65 66 static vold bll-plus-follov(); /* Routine that follows ‘/ 67 68 69 /' CURRENT STATE OF THE ENCODING. '/ 70 71 static 72 static code value long-bits-to-follow; low, high; /* /* Ends of the current code region 73 /* Number of the next bit. opposite bits to output after */ l '/ / 74 75 76 /* START ENCODING A STREAM OF SYMBOLS. l / 17 78 start-encoding0 79 I low 80 hiah - 0; /* Full code range. '/ Bl bits - Top-value; to follow - 0; /* No bits to follow next. l / 82 1 -- 83 84 85 /* ENCODE A SYMBOL. l / 86 81 encode-symhoI(symbol,cum-freq) 88 int symbol: /* Symbol to encode '/ a9 int cum-freql]: '/ 90 I long range; /* Cumulative symbol frequencies /' Size of the current code region l / range - 91 high - low (long) + (high-low)+l; /* Narrow the code region l / 92 93 94 low (ranye'cum-freq[symbol-ll)/cum-freq[O)-1: /* /' to symbol. that allotted to this l / ~ranpe*cum~freq(symbol))/cum~freq(O); - low + l / FIGURE 3. C Implementation of Arithmetic Encoding and Decoding (continued) ]une 1987 Volume 30 Number 6 Communications of the ACM 525 100 101 102 103 104 105 106 107 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 108 95 96 97 98 99 for (:;I I if (high bit~plus~follow(0); I else if (low>-Half) ( bit~plus~follow(1): low -- Half: high -* Half: I else if (low>-Flrst qtr sc hiQh bits-to-foliow /' Loop to ouLput bits. /* Output 0 if In low half. /* Output 1 if in hiQh half.*/ /+ Subtract ( offset to top. bit half. '1 l / */ l / ./ /* Output an opposite /. later if in middle /* Subtract /+ Otherwise offset exit low -- First-qtr; high -- First-qtr; t* 1; to middle'/ loop. '1 l / 1 else break: low * 2'low; high * 2*hightl; /* Scale up code range. /' FINISH ENCODING THE STREAM. l / done-encoding0 bits-to-follow +* 1: I if (low else bit-plus-follow(l): 1 /* OUTPUT BITS PLUS FOLLOWING OPPOSITE BITS. l / static void bitglus-foIlow(blt) int bit: output-bitcbit); f while (bits-to-follow>O) i output-bit (Ibit); bits to-follow -* 1; 1 - decode.c /* /' /* /* Output two bits that Select the quarter that the current code range contains. '1 '/ l / l / /* /* Output the bit. l / l / '/ l / /* Output bits-to-follow opposfte bits. Set /' bits-to-follow to zero. 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 /* MAIN PROGPAM FOR DECODING. */ #include (include "mode1.h" /* Set up other /* Loop through /* /' /* /' /* modules. characters. '/ '/ '/ 'I '/ l / main () start-model 0 : I start-inpUtinQ-bitso; start-decodingo: for (;;) ( lnt ch: int symboi; symbol - decode-symbolicurn-freq): if (symbol**EOF symbol) break; ch * lndex_to_c?;ar[symbolJ; putc(ch,stdout): update-model(symbo1); 1 exit. 1 (0) ; Decode next symbol. Exit loop if EOF symbol. Translate to a character.*/ Write that character. Update the model. FIGURE 3. C Implementation of Arithmetic Encoding and Decoding (continued) 526 Communications of the ACM ]une 1987 Volume 30 Number 6 Computing Practices arithmetic - decode.c -- -- 155 /* ARITHMETIC DECODING ALGORITHM. l / 156 157 #include “arithmetic-cod1ng.h” 158 159 160 /' CURRENT STATE OF THE DECODING. l / 161 162 static code value value; /* Currently-seen code value l / 163 static code:value low, high; /* Ends of current code replan ‘/ 164 165 166 /* START DECODING A STREAM OF SYMBOLS. '/ 167 168 Start-deC”di”Q() 169 ( lnt i: 170 value - 0; /* Input bits to flll the l / 171 for (1 - 1; I<-Code-value-bits; it+) ( /* code value. l / 172 value - 2'value,lnput_bit(); 173 1 174 low " 0: /* Full code ranpe. l / 175 hiQh - Top-value; 176 1 177 178 179 /* DECODE THE NEXT SYMBOL. l '/ 180 181 lnt decode-symbol (cum-freq) 182 int cwn-freq[ I; /* Cumulative symbol frequencies l / 183 ! lO”Q ra”Qe; /* Size of current code region l / 184 int cum; /* Cumulative frequency calculated ‘/ 185 int symbol: /* Symbol decoded l / 186 ra”Qe - (lO"Q)(hlQh-1oW)tl; 187 cum - /* Flnd cum freq for value. l / 188 (((lonQl(value-lou)tl)*cum~freq[O]-l)/ranQe: 189 for ~eyn!bol - 1; cum-fraq[symbolJ%wm; symboltt) ; /* Then find symbol. l / 190 hiQh - low t /* Narrow the code reQlon l / 191 (ranQe*cum-freq[symbol-ll)/cum-freq[O]-1; /* to that allotted to this l / 192 low - low t /* symbol. l / 193 194 (ra"Qe'cum~freq[symbo1])/cum~freq[O]; for I::1 ( /* Loop to QOt rld of bits. l / 195 if (hiQh 196 /’ nothing l / /’ Expand low half. '/ 197 1 198 else If (low>-Half) ( /* Expand hlph half. '/ 199 value -- Half; 200 low -- Half: /* Subtract offset to top. *I 201 hlQh -- Half; 202 203 else if (low>-First-qtr /* Expand middle half. l / 204 ‘L hiQh 205 value -- First-qtr; 206 low -- First-qtr; /* Subtract offset to mIddlr*/ 207 high -- First-qtr; 208 1 209 else bxeak: /* Otherwise exit loop. l / 210 low - 2*1ow: 211 high - 2'hiQhtl: /* Scale up code range. l / 212 value - 2*valuetlnput_bit(): /* Move in next input blt. l / 213 214 relucn 1 symbol; 215 t FIGURE 3. C Implementation of Arithmetic Encoding and Decoding (continued) ]une 1987 Volume 30 Number 6 Communications of the ACM 527 bit-input-c 216 217 218 /* BIT INPUT ROUTINES. "/ flnclude +include 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 "arithmetic-cod1ng.h" /' THE BIT BUFFER. ‘/ static static static int lnt blts_to-go: lnt garbage-bits: BIT INPUT. */ - 0; - 0: buffer: /* Bits waiting to be /* Number of bits still /* Number of bit6 past input in buffer end-of-file l / l / ‘/ /* INITIALIZE St6rt_inpUting_blt6() bits-to-go l garbage-bits 1 /* Buffer /* no bits starts in it. out with l / l / /* INPUT A BIT. l / int input-bit0 lnt t: ( if (bits-to-go--O) 1 buffer - getc(stdln): if (buffer--EOF) garbage-bits /* /* 1; Read bit6 /* the next byte are left Return if no '/ in buffer. *'/ bits*/ if ) (garbage~,blts>Code~value~blts-2) fprlntf(stderr,“Bad input fllen’j; ( t- 1 exit (-1); 1 1 bits-to-go - 8: /* after eof, but check /* for too many such. arbitrary l / l / t - buffer&l; buffer a>- 1; bits-to-go return -* t; I: /* Return the next bit from l / l / /* the bottom of the byte. FIGURE 3. C Implementation of Arithmetic Encoding and Decoding (continued) 528 Communications of the ACM June 1987 Volume 30 Number 6 bit-output.c 257 /* BIT OUTPUT ROUTINES. '/ 258 259 260 #include 261 262 /" THE BIT BUFFER. */ 263 264 static l / 265 static int int buffer: bits-to-go; /* /* Bits Number buffered bits for free output of in buffer l / 266 267 268 /* INITIALIZE FOR BIT OUTPUT. l / 269 270 start_.outputing-..bits() 271 1 buffer - 0; /* Buffer Is empty to start '/ 272 bitS_tO-QO- 8; /' with. l / 273 274 ) 215 276 /* OUTPUT A BIT. l / 277 270 output-blt(bit) 279 lnt bit: 200 281 ( buffer /* Dot bit In top of buffer.*/ if (bit) >>- buffer 1; I- 0x00; 282 bits-to-go -- 1; 283 if (bits-to-go--O) /* Output buffer if It 1s l / 204 putc(buffer,stdout): ( /' now full. l / 285 286 bits-to-go - 8; 287 288 t t 209 290 291 /* FLUSH OUT THE LAST BITS. l / 292 293 done-outputlng-bits0 putc(buffer>>blts_to-go,stdout): 294 FIGURE 3. C Implementation of Arithmetic Encoding and Decoding (continued) ]une 1987 Volume 30 Number 6 Communications of the ACM 529 Computing Practices fixed-model. c 1 2 /* THE FIXED SDDRCE MODEL l / finclude 'mode1.h' 4 5 int freq[No-of-symbolstlj - ( 6 0, 7 1, 124, 1, 1, 1, 1, 1. 1, 1, 1, 1, 1, 1. 1, 1. a 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9 10 /* 1 l + + $ t ( I l 2, 2, 1, 7b, 19, 60, 11 1236, 9, 3, 1, 2:, 15, 2, 1, 21, 12 13 /'012 9:;C - > 3 4 5 6 7 S 14 6, 3, 2, 1, 1, 1, 5, 4, 7, 5, 4, 4, 15, 15, 8, 15 16 /*@ A B C D E F C H I J K L M N 17 9, 16, 16, 8, 6, 12, 23, 13, 1, 24, 15, 22, 12, 15, 10, 18 19 /'P Q R S T U V W X Y Z [ I .. 20 3, 1, 1, 1, 1, 1, 6, 3, 11, 1, 14, 1, 14, 26, 29, 21 22 /* ' i j k 1 m d f h b 23 6, 39, 250, 139, 42:, 293, 418, 1, 49:. 85, 17;, 232, 74:, 127, ll:, 24 25 /*p ( I ) - 8 t x Y q r 26 f. 2, 1, 2, 3. 111, 5, 388, 375, 531, 159, 5:, 91, 12, 101, 27 28 1. 1, 1. 1. 1. 1, 1, 1, 1. 1, 1. 1. 1. 1. 1, 29 1. 1, 1. 1. 1, 1. 1, 1, 1. 1, 1. 1, 1, 1, 1, 30 1. 1, 1, 1, 1, 1. 1, 1, 1, 1, 1. 1, 1. 1. 1. 31 1. 1. 1, 1, 1, 1, 1. 1, 1, 1, 1, 1. 1, 1, 1. 32 1, 1. 1. 1. 1. 1. 1, 1, 1, 1, 1, 1. 1. 1. 1. 33 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 34 1, 1, 1, 1, 1, 1. 1, 1, 1, 1, 1, 1. 1, 1, 1, 35 1, 1, 1, 1, i, I, I, I, 1, 1, 1, 1, 1, I, I, 36 37 1’ , 38 39 40 /* INITIALIZE THE MODEL. '/ 41 42 start-model0 43 ! lnt 1: /* Set up tables that 44 for (i - 0; i /* translate between symbol 45 char-to-index[i] - itl; /* indexes and characters. 46 index_to-char[i+l) - i; 47 I 40 cum-freq[No-of-symbols] - 0; /* Set up cumulative 49 for (1 - No-of-symbols; i>O; i--l ( 50 cum-freq[i-lj - cum-freq[i] t freq(iJ; /* frequency counts. 51 I /* Check counts within limit*/ 52 if (cum-freq[Ol > Max-frequency) abort0; 53 54 55 56 /* UPDATE THE MODEL TO ACCOUNT FOR A NEW SYMBOL. 'I 57 58 update-model (symbol) 59 lnt symbol; /* Do nothing. l / 60 I 61 I 1, 1, / l / 1, ?*/ 1, O*/ 11, 5, 446. 1. l / 0 l / */ 1, 1. 1, 1, 1. 1, 1. 1, l / l / l / l / '/ FIGURE 4. Fixed and Adaptive Models for Use with Figure 3 530 Communications of the ACM ]une 1987 Volume .30 Number 6 Computing Practices adaptive-mode1.c 1 2 /* THE ADAPTIVE SOURCE MODEL l / 3 4 #include "modo1.h" 5 6 int freq]No-of-symbolstl]: /+ symbol frequencies l / 7 0 9 /' INITIALIZE THE MODEL. l / 10 start-model0 11 12 I int i: 13 for (i 14 char-to-index[l] - 0; i - itl: it+) ( /* Set up tables that l / 15 index-to-char[itl] - i: /* translate between symbol l / /* indexes and characters. l / 16 for I (i - 0; 17 i<-No-of-symbols; it+) ( 18 freq[i] - 1; /* Set up initial frequency l / 19 cum-freq[i] - No-of-symbols-i; /* /* counts symbols. to be one for all l / l / 20 21 freq[O] t - 0; /* Freq[Ol must not be the l / 22 t /* same as freq[l]. l / 23 24 25 /' UPDATE THE MODEL TO ACCOUNT FOR A NEW SYMBOL. '/ 26 update~model(symbo1) 28 21 int symbol: /* Index '1 29 ( lnt 1; 30 if (cum-freq]O]--Max-frequency) /* New index of new symbol for symbol l / 31 int l / 32 cum - 0: cum; ( /* /* See are if at frequency thslr maxlawn. count8 l / 33 for 34 freq[i] (1 - No-of-symbols; l / 35 cum-freq[l] - (freq(i]+1]/2: la-O; i--) ( /* If so, halve all the - cum; /* /* counts non-zero). (keeplng them l / l / 36 cum t- freq[l]; 37 38 t t 39 ior 40 if (i (i - symbol; l--l : /* Find symbol's new Index. l / int ch-i, 41 ch-symbol ] freq(i]--freq[i-1): : 42 ch-i 43 ch-symbol - index-to-char[i]; /* Update the translation l / 44 index-to-char[i] - index-to-char[aymbol]; - ch-symbol: /* /* tables moved. lf the symbol has l / l / 45 index-to-char[symbol] 46 char-to-index]ch-i] - ch-1; 47 char-to-in' dex[ch-symbol] - symbol: - i; 40 49 freq[i] I +- 1; /* 50 while /* Increment count for the the frequency symbol and l / l / i -- (i>O) 1: ( /* update the cumulative l / 51 52 cum-freq]! 1 t- 1; /* frequencies. '/ 53 t t FIGURE 4. Fixed and Adaptive Models for Use with Figure 3 (continued) /me 1987 Volume 30 Number 6 Communications of the ACM 531 Computing Practices Representing the Model Implementations of models are discussed in the next section; here we are concerned only with the inter- face to the model (lines 20-38). In C, a byte is repre- sented as an integer between 0 and 255 (a char). Internally, we represent a byte as an integer be- tween 1 and 257 inclusive (an index), EOF being treated as a 257th symbol. It is advantageous to sort the model into frequency order, so as to minimize the number of executions of the decoding loop (line 189). To permit such reordering, the char/index translation is implemented as a pair of tables, index-to-char[] and char-to-index[]. In one of our models, these tables simply form the index by adding 1 to the char, but another implements a more com- plex translation that assigns small indexes to fre- quently used symbols. The probabilities in the model are represented as integer frequency counts, and cumulative counts are stored in the array cum- freq[]. As previously, this array is “backwards,” and the total frequency count, which is used to normalize all frequencies, appears in cum - fre9 [O]. Cumulative counts must not exceed a predetermined maximum, Max- frequency, and the model implementation must prevent overflow by scaling appropriately. It must also ensure that neigh- boring values in the cum- freq [ ] array differ by at least 1; otherwise the affected symbol cannot be transmitted. Incremental Transmission and Reception Unlike Figure 2 the program in Figure 3 repre- sents low and high as integers. A special data type, code-value, is defined for these quantities, together with some useful constants: Top-value, representing the largest possible code-value, and First-qtr, Half, and Third-@, representing parts of the range (lines 6-16). Whereas in Figure 2 the current inter- val is represented by [low, high), in Figure 3 it is [low, high]; that is, the range now includes the value of high. Actually, it is more accurate (though more confusing) to say that, in the program in Figure 3, the interval represented is [low, high + 0.11111 . .o). This is because when the bounds are scaled up to increase the precision, zeros are shifted into the low- order bits of low, but ones are shifted into high. Al- though it is possible to write the program to use a different convention, this one has some advantages in simplifying the code. As the code range narrows, the top bits of low and high become the same. Any bits that are the same can be transmitted immediately, since they cannot be affected by future narrowing. For encoding, since we know that low 5 high, this requires code like 532 Communications of the ACM for (;;I { if (high < Half) [ output-bit(O); low = 2*1ow; high = 2*high+l; else I if (low I Half) ( output-bit(l); low = 2*(low-Half); high = 2*(high-Half)+l; else I break; I which ensures that, upon completion, low < Half I high. This can be found in lines 95-113 of encode-symbol (1, although there are some extra com- plications caused by underflow possibilities (see the next subsection). Care is taken to shift ones in at the bottom when high is scaled, as noted above. Incremental reception is done using a number called value as in Figure 2, in which processed bits flow out the top (high-significance) end and newly received ones flow in the bottom. Initially, start-decoding0 (lines 168-176) fills value with re- ceived bits. Once decode-symbol () has identified the next input symbol, it shifts out now-useless high- order bits that are the same in low and high, shifting value by the same amount (and replacing lost bits by fresh input bits at the bottom end): for (;;) 1 if (high < Half) ( value = 2*value+input_bit( ); low = 2*1ow; high = 2*high+l; else I if (low > Half) ( value = 2*(value-Half)+input-bit(); low = 2*(low-Half); high = 2*(high-Half)+l; else I break; t (see lines 194-213, again complicated by precautions against underflow, as discussed below). Proof of Decoding Correctness At this point it is worth checking that identification of the next symbol by decode-symbol () works properly. Recall from Figure 2 that decode-symbol () must use value to find the symbol that, when en- coded, reduces the range to one that still includes value. Lines 186-188 in decode-symbol0 identify the symbol for which ]une 1987 Volume 30 Number 6 Computing Practices cum-freq[symbol] ~ (value - low + 1) * cum-freq[O] - 1 i high - low + 1 1 < cum-freq[symbol- 11, where L J denotes the “integer part of” function that comes from integer division with truncation. It is shown in the Appendix that this implies low + (high - low + 1) * cum-freq[symboZ] i cum-freq[O] 1 lV5lOW + (high - low + 1) * cum-freq[symbol- l] _ 1 L cum-freq[O] J ’ so that value lies within the new interval that decode-symbol () calculates in lines 190-193. This is sufficient to guarantee that the decoding operation identifies each symbol correctly. Underflow As Figure 1 shows, arithmetic coding works by scal- ing the cumulative probabilities given by the model into the interval [low, high] for each character trans- mitted. Suppose low and high are very close to- gether-so close that this scaling operation maps some different symbols of the model onto the same integer in the [low, high] interval. This would be disastrous, because if such a symbol actually oc- curred it would not be possible to continue encod- ing. Consequently, the encoder must guarantee that the interval [low, high] is always large enough to prevent this. The simplest way to do this is to ensure that this interval is at least as large as Max- frequency, the maximum allowed cumulative frequency count (line 36). How could this condition be violated? The bit- shifting operation explained above ensures that low and high can only become close together when they straddle Half. Suppose in fact they become as close as First-qtr I low < Half I high c Third-qtr. Then the next two bits sent will have opposite polar- ity, either 01 or 10. For example, if the next bit turns out to be zero (i.e., high descends below Half and [0, Half] is expanded to the full interval), the bit after that will be one, since the range has to be above the midpoint of the expanded interval. Con- versely, if the next bit happens to be one, the one after that will be zero. Therefore the interval can safely be expanded right now, if only we remember June 1987 Volume 30 Number 6 that, whatever bit actually comes next, its opposite must be transmitted afterwards as well. Thus lines 104-109 expand [First-qtr, Third-qtr] into the whole interval, remembering in bits-to- follow that the bit that is output next must be followed by an oppo- site bit. This explains why all output is done via bit-plus- foZZow() (lines 128-135), instead of directly with output-bit(). But what if, after this operation, it is still true that First-qtr I low < Half I high < Third-qtr? Figure 5 illustrates this situation, where the current [low, high] range (shown as a thick line) has been expanded a total of three times. Suppose the next bit turns out to be zero, as indicated by the arrow in Figure 5a being below the halfway point. Then the next three bits will be ones, since the arrow is not only in the top half of the bottom half of the original range, but in the top quarter, and moreover the top eighth, of that half-this is why the expansion can occur three times. Similarly, as Figure 6b shows, if the next bit turns out to be a one, it will be followed by three zeros. Consequently, we need only count the number of expansions and follow the next bit by that number of opposites (lines 106 and 131-134). Using this technique the encoder can guarantee that, after the shifting operations, either low < First-qtr < Half I high (14 or low < Half < Third-qtr I high. (lb) Therefore, as long as the integer range spanned by the cumulative frequencies fits into a quarter of that provided by code-values, the underflow problem cannot occur. This corresponds to the condition Max- frequency 5 Top-value + 1 + 1 4 which is satisfied by Figure 3, since Max- frequency = 214 - 1 and Top-value = 216 - 1 (lines 36, 9). More than 14 bits cannot be used to represent cumulative frequency counts without increasing the number of bits allocated to code-values. We have discussed underflow in the encoder only. Since the decoder’s job, once each symbol has been decoded, is to track the operation of the encoder, underflow will be avoided if it performs the same expansion operation under the same conditions. Overflow Now consider the possibility of overflow in the integer multiplications corresponding to those of Figure 2, which occur in lines 91-94 and 190-193 Communications of the ACM 533 (4 /- 08 Top-value - Third-qtr high Half First+ o- FIGURE 5. Scaling the Interval to Prevent Underflow of Figure 3. Overflow cannot occur provided the product fits within the integer word length available, since cumulative frequencies cannot exceed Max- frequewy. Rarjge might be as large as Top-value + 1. so the largest possible product in Figure 3 is 2”‘(2’4 - 1). which is less than 2”“. Long declarations are used for code-zwlue (line 7) and range (lines 89, 183) to ensure that arithmetic is done to 32-bit preci- sion. Constraints on the Implementation The constraints on word length imposed by under- flow and overflow can be simplified by assuming that frequency counts are represented in f bits, and code-zwlurs in c bits, The implementation will work correctly provided fsc-2 f+CSp, the precision to which arithmetic is performed. In most C implementations, p = 31 if long integers 534 Cotnrnu~~icatiom of the ACM lune 1987 Volume 30 Number 6 Computing Practices are used, and p = 32 if they are unsigned long. In Figure 3, f = 14 and c = 16. With appropriately modified declarations, unsigned long arithmetic with f = 15 and c = 17 could be used. In assembly lan- guage, c = 16 is a natural choice because it expedites some comparisons and bit manipulations (e.g., those of lines 65-113 and 164-213). If p is restricted to 16 bits, the best values possible are c = 8 and f = 7, making it impossible to encode a full alphabet of 256 symbols, as each symbol must have a count of at least one. A smaller alphabet (e.g., the 26 letters, or 4-bit nibbles) could still be handled. Termination To finish the transmission, it is necessary to send a unique terminating symbol (EOF-symbol, line 56) and then follow it by enough bits to ensure that the encoded string falls within the final range. Since done-encoding0 (lines 116-123) can be sure that low and high are constrained by either Eq. (la) or (lb) above, it need only transmit 01 in the first case or 10 in the second to remove the remaining ambiguity. It is convenient to do this using the bit-plus- foZZow() procedure discussed earlier. The input- bit () proce- dure will actually read a few more bits than were sent by output-bit(), as it needs to keep the low end of the buffer full. It does not matter what value these bits have, since EOF is uniquely determined by the last two bits actually transmitted. MODELS FOR ARITHMETIC CODING The program in Figure 3 must be used with a model that provides a pair of translation tables index-to-char[] and char-to-index[], and a cumula- tive frequency array cum - freq [ 1, The requirements on the latter are that 0 cum-freq[i - l] 2 cum-freq[i]; l an attempt is never made to encode a symbol i for which cum-freq[i - l] = cum-freq[i]; and l cum- freq [0] 5 Max- frequency. Provided these conditions are satisfied, the values in the array need bear no relationship to the actual cumulative symbol frequencies in messages. Encod- ing and decoding will still work correctly, although encodings will occupy less space if the frequencies are accurate. (Recall our successfully encoding eaii! according to the model of Table I, which does not actually reflect the frequencies in the message.) Fixed Models The simplest kind of model is one in which symbol frequencies are fixed. The first model in Figure 4 has symbol frequencies that approximate those of English (taken from a part of the Brown Corpus [12]). June 1987 Volume 30 Number 6 However, bytes that did not occur in that sample have been given frequency counts of one in case they do occur in messages to be encoded (so this model will still work for binary files in which all 256 bytes occur). Frequencies have been normal- ized to total 8000. The initialization procedure start-model () simply computes a cumulative version of these frequencies (lines 48-51), having first initial- ized the translation tables (lines 44-47). Execution speed would be improved if these tables were used to reorder symbols and frequencies so that the most frequent came first in the cum _ freq [ ] array. Since the model is fixed, the procedure update-model 0, which is called from both encode.c and decode.c, is null. An exact model is one where the symbol frequen- cies in the message are exactly as prescribed by the model. For example, the fixed model of Figure 4 is close to an exact model for the particular excerpt of the Brown Corpus from which it was taken. To be truly exact, however, symbols that did not occur in the excerpt would be assigned counts of zero, rather than one (sacrificing the capability of transmitting messages containing those symbols). Moreover, the frequency counts would not be scaled to a predeter- mined cumulative frequency, as they have been in Figure 4. The exact model can be calculated and transmitted before the message is sent. It is shown by Cleary and Witten [3] that, under quite general conditions, this will not give better overall compres- sion than adaptive coding (which is described next). Adaptive Models An adaptive model represents the changing symbol frequencies seen so fur in a message. Initially all counts might be the same (reflecting no initial infor- mation), but they are updated, as each symbol is seen, to approximate the observed frequencies. Pro- vided both encoder and decoder use the same initial values (e.g., equal counts) and the same updating algorithm, their models will remain in step. The en- coder receives the next symbol, encodes it, and up- dates its model. The decoder identifies it according to its current model and then updates its model. The second half of Figure 4 shows such an adap- tive model. This is the type of model recommended for use with Figure 3, for in practice it will outper- form a fixed model in terms of compression effi- ciency. Initialization is the same as for the fixed model, except that all frequencies are set to one. The procedure update-model (symbol) is called by both encode-symbol () and decode-symbol () (Figure 3, lines 54 and 151) after each symbol is processed. Updating the model is quite expensive because of the need to maintain cumulative totals. In the code Communications of the ACM 535 Computing Practices of Figure 4, frequency counts, which must be main- tained anyway, are used to optimize access by keep- ing the array in frequency order-an effective kind of self-organizing linear search [9]. Update-model0 first checks to see if the new model will exceed the cumulative-frequency limit, and if so scales all fre- quencies down by a factor of two (taking care to ensure that no count scales to zero) and recomputes cumulative values (Figure 4, lines 29-37). Then, if necessary, update-model () reorders the symbols to place the current one in its correct rank in the fre- quency ordering, altering the translation tables to reflect the change. Finally, it increments the appro- priate frequency count and adjusts cumulative fre- quencies accordingly. PERFORMANCE Now consider the performance of the algorithm of Figure 3, both in compression efficiency and execu- tion time. Compression Efficiency In principle, when a message is coded using arith- metic: coding, the number of bits in the encoded string is the same as the entropy of that message with respect to the model used for coding. Three factors cause performance to be worse than this in practice: (1) message termination overhead; (2) the use of fixed-length rather than infinite- precision arithmetic; and (3) scaling of counts so that their total is at most Max- frequency. None of these effects is significant, as we now show. In order to isolate the effect of arithmetic coding, the model will be considered to be exact (as defined above). Arithmetic coding must send extra bits at the end of each message, causing a message termina- tion overhead. Two bits are needed, sent by done-encoding() (Figure 3, lines 119-123), in order to disambiguate the final symbol. In cases where a bit stream must be blocked into 8-bit characters before encoding, it will be necessary to round out to the end of a block. Combining these, an extra 9 bits may be required. The overhead of using fixed-length arithmetic oc- curs because remainders are truncated on division. It can be assessed by comparing the algorithm’s per- formance with the figure obtained from a theoretical entropy calculation that derives its frequencies from counts scaled exactly as for coding. It is completely negligible-on the order of lop4 bits/symbol. 536 Communications of the ACM The penalty paid by scaling counts is somewhat larger, but still very small. For short messages (less than 214 bytes), no scaling need be done. E:ven with messages of 105-lo6 bytes, the overhead was found experimentally to be less than 0.25 percent of the encoded string. The adaptive model in Figure 4 scales down all counts whenever the total threatens to exceed Max- frequency. This has the effect of weighting re- cent events more heavily than events from earlier in the message. The statistics thus tend to track changes in the input sequence, which can be very beneficial. (We have encountered cases where limit- ing counts to 6 or 7 bits gives better results than working to higher precision.) Of course, this depends on the source being modeled. Bentley et al. [2] con- sider other, more explicit, ways of incorporating a recency effect. Execution Time The program in Figure 3 has been written for clarity rather than for execution speed. In fact, with the adaptive model in Figure 4, it takes about 420 PS per input byte on a VAX-11/780 to encode a text file, and about the same for decoding. However, easily avoidable overheads such as procedure calls account for much of this, and some simple optimizations in- crease speed by a factor of two. The following altera- tions were made to the C version shown: (1) The procedures input-bif(), output-bit(), and bit-plus- follow() were converted to macros to eliminate procedure-call overhead. (2) Frequently used quantities were put in register variables. (3) Multiplies by two were replaced by additions (C ‘I,=“). (4) Array indexing was replaced by pointer manip- ulation in the loops at line 189 in Figure 3 and lines 49-52 of the adaptive model in Figure 4. This mildly optimized C implementation has an execution time of 214 ps/252 ps per input byte, for encoding/decoding 100,000 bytes of English text on a VAX-11/780, as shown in Table II. Also given are corresponding figures for the same program on an Apple Macintosh and a SUN-3/75. As can be seen, coding a C source program of the same length took slightly longer in all cases, and a binary object pro- gram longer still. The reason for this will be dis- cussed shortly. Two artificial test files were included to allow readers to replicate the results. “Alphabet” consists of enough copies of the 26-letter alphabet to fill out 100,000 characters (ending with a partially completed alphabet). “Skew statistics” contains Ime 1987 Volume 30 Number 6 Computing Practices TABLE II. Results for Encoding and Decoding 100,000-Byte Files VAX-H/788 Macintosh 512 K SUN-3175 Encode time Decodetlme Encode time oeclxletlme Encode time oecodetime w (I4 (ccs) (as) (PS) b4 Mildly optimized C implementation Text file 57,718 214 262 687 98 121 C program 82,991 230 288 729 105 131 VAX object program 73,501 313 406 950 145 190 Alphabet 59,292 223 277 719 105 130 Skew statistics 12,092 143 170 507 70 85 Carefully optimized assembly-language implementation Text file 57,718 104 135 194 243 46 58 C program 62,991 109 151 208 266 51 65 VAX object program 73,501 158 241 280 402 75 107 Alphabet 59,292 105 145 204 264 51 65 Skew statistics 12,092 63 81 126 160 28 36 Notes: Times are measured in microseconds per byte of uncompressed data. The VAX-l l/760 had a floating-point accelerator, which reduces integer multiply and divide times. The Macintosh uses an E-MHz MC66000 with some memory wait states. The SUN-3175 uses a 16.67-MHz MC66020. All times exclude I/O and operating-system overhead in support of l/O. VAX and SUN figures give user time from the UNIX* time command; on the Macintosh, l/O was explicitly directed to an array. The 4.2BSD C compiler was used for VAX and SUN; Aztec C 1.06g for Macintosh. 10,000 copies of the string aaaabaaaac; it demon- TABLE III. Breakdown of Timings for the VAX-U/780 strates that files may be encoded into less than one Assembly-Language Version bit per character (output size of 12,092 bytes = 96,736 bits). All results quoted used the adaptive Encods time Decode the (PSI (4 model of Figure 4. A further factor of two can be gained by repro- Text file Bounds calculation 32 31 gramming in assembly language. A carefully opti- Bit shifting 39 30 mized version of Figures 3 and 4 (adaptive model) Model update 29 29 was written in both VAX and M68000 assembly lan- Symbol decode - 45 guages. Full use was made of registers, and advan- Other 4 0 tage taken of the Is-bit code-value to expedite some 104 135 crucial comparisons and make subtractions of Half C program trivial. The performance of these implementations Bounds calculation 30 28 on the test files is also shown in Table II in order Bit shifting 42 35 Model update to give the reader some idea of typical execution - 33 36 Symbol decode 51 speeds. Other 4 1 The VAX-11/780 assembly-language timings are 109 151 broken down in Table III. These figures were ob- VAX object program tained with the UNIX profile facility and are accu- Bounds calculation 34 31 rate only to within perhaps 10 percent. (This mech- Bit shifting 46 40 anism constructs a histogram of program counter Model update - 75 75 Symbol decode 94 values at real-time clock interrupts and suffers from Other 3 1 statistical variation as well as some systematic er- 158 241 rors.) “Bounds calculation” refers to the initial parts of encode-symbol () and decode-symbol () (Figure 3, lines 90-94 and 190-l%), which contain multiply update” refers to the adaptive update-model () proce- and divide operations. “Bit shifting” is the major dure of Figure 4 (lines 26-53). loop in both the encode and decode routines As expected, the bounds calculation and model (lines 96-113 and 194-213). The cum calculation in update take the same time for both encoding and decode-symbol (1, which requires a multiply/divide, decoding, within experimental error. Bit shifting was and the following loop to identify the next symbol quicker for the text file than for the C program and (lines 187-189), is “Symbol decode.” Finally, “Model object file because compression performance was ]une 1987 Volume 30 Number 6 Communications of the ACM - 537 Computing Practices better. The extra time for decoding over encoding is due entirely to the symbol decode step. This takes longer in the C program and object file tests because the loop of line 189 was executed more often (on average 9 times, 13 times, and 35 times, respec- tively). This also affects the model update time be- cause it is the number of cumulative counts that must be incremented in Figure 4, lines 49-52. In the worst case, when the symbol frequencies are uni- formly distributed, these loops are executed an aver- age of 128 times. Worst-case performance would be improved by using a more complex tree representa- tion for frequencies, but this would likely be slower for text files. SOME APPLICATIONS Applications of arithmetic coding are legion. By lib- erating coding with respect to a model from the mod- eling required for prediction, it encourages a whole new view of data compression [19]. This separation of function costs nothing in compression perfor- mance, since arithmetic coding is (practically) opti- mal with respect to the entropy of the model. Here we intend to do no more than suggest the scope of this view by briefly considering (1) (2) (3) (4) adaptive text compression, nonadaptive coding, compressing black/white images, and coding arbitrarily distributed integers. Of course, as noted earlier, greater coding efficien- cies could easily be achieved with more sophisti- cated models. Modeling, however, is an extensive topic in its own right and is beyond the scope of this article. Adaptive text compression using single-character adaptive frequencies shows off arithmetic coding to good effect. The results obtained using the pro- gram in Figures 3 and 4 vary from 4.8-5.3 bits/char- acter for short English text files (103-lo4 bytes) to 4.5-4.7 bits/character for long ones (105-lo6 bytes). Although adaptive Huffman techniques do exist (e.g., [5, i’]), they lack the conceptual simplicity of arithmetic coding. Although competitive :in compres- sion efficiency for many files, they are slower. For example, Table IV compares the performance of the mildly optimized C implementation of arithmetic coding with that of the UNIX compact program that implements adaptive Huffman coding using a similar model. (Compact’s model is essentially the same for long files, like those of Table IV, but is better for short files than the model used as an example in this article.) Casual examination of compact in’dicates that the care taken in optimization is roughly lcomparable for both systems, yet arithmetic coding hallves exe- cution time. Compression performance is somewhat better with arithmetic coding on all the example files. The difference would be accentuated with more sophisticated models that predict symbols with probabilities approaching one under certain circum- stances (e.g., the letter u following 9). Nonadaptive coding can be performed adthmeti- tally using fixed, prespecified models like that in the first part of Figure 4. Compression performance will be better than Huffman coding. In order to mini- mize execution time, the total frequency count, cum- freq[O], should be chosen as a power of two so the divisions in the bounds calculations (Figure 3, lines 91-94 and 190-193) can be done as shifts. En- code/decode times of around 60 ps/90 PS should then be possible for an assembly-language imple- mentation on a VAX-11/780. A carefully written im- plementation of Huffman coding, using table lookup for encoding and decoding, would be a bit faster in this application. Compressing black/white images using arithmetic coding has been investigated by Langdon and Rissanen [14], who achieved excellent results using TABLE IV. Comparison of Arithmetic and Adaptive Huffman Coding Arithmeticcadlng Encodetime Deco&time Adaptive Huffman coding 22 Text file C program VAX object program Alphabet Skew statistics 57,718 62,991 73,546 59,292 12,092 (4 214 230 313 223 143 (f4 262 288 406 277 170 St 57,781 63,731 76,950 60,127 16,257 Encode time b4 550 596 822 598 215 oscodetime CS) 414 441 606 411 132 Notes: The mildly optimized C implementation was used for arithmetic coding. UNIX compact was used for adaptive Huffman coding. Times are for a VAX-l l/780 and exclude I/O and operating-system overhead in support of I/O. 538 Commlkcations of the ACM June 1987 Volume 30 Number 6 Computing Practices a model that conditioned the probability of a pixel’s being black on a template of pixels surrounding it. The template contained a total of 10 pixels, selected from those above and to the left of the current one so that they precede it in the raster scan. This cre- ates 1024 different possible contexts, and for each the probability of the pixel being black was esti- mated adaptively as the picture was transmitted. Each pixel’s polarity was then coded arithmetically according to this probability. A 20-30 percent im- provement in compression was attained over earlier methods. To increase coding speed, Langdon and Rissanen used an approximate method of arithmetic coding that avoided multiplication by representing probabilities as integer powers of %. Huffman cod- ing cannot be directly used in this application, as it never compresses with a two-symbol alphabet. Run- length coding, a popular method for use with two- valued alphabets, provides another opportunity for arithmetic coding. The model reduces the data to a sequence of lengths of runs of the same symbol (e.g., for picture coding, run-lengths of black followed by white followed by black followed by white . . .). The sequence of lengths must be transmitted. The CCITT facsimile coding standard [ll] bases a Huffman code on the frequencies with which black and white runs of different lengths occur in sample documents. A fixed arithmetic code using these same frequencies would give better performance: adapting the fre- quencies to each particular document would be better still. Coding arbitrarily distributed integers is often called for in use with more sophisticated models of text, image, or other data. Consider, for instance, the lo- cally adaptive data compression scheme of Bentley et al. [2], in which the encoder and decoder cache the last N different words seen. A word present in the cache is transmitted by sending the integer cache index. Words not in the cache are transmitted by sending a new-word marker followed by the characters of the word. This is an excellent model for text in which words are used frequently over short intervals and then fall into long periods of dis- use. Their paper discusses several variable-length codings for the integers used as cache indexes. Arithmetic coding allows any probability distribution to be used as the basis for a variable-length encod- ing, including-among countless others-the ones implied by the particular codes discussed there. It also permits use of an adaptive model for cache in- dexes, which is desirable if the distribution of cache hits is difficult to predict in advance. Furthermore, with arithmetic coding, the code spaced allotted to the cache indexes can be scaled down to accommo- ]une 1987 Volume 30 Number 6 date any desired probability for the new-word marker. APPENDIX. Proof of Decoding Inequality Using one-letter abbreviations for cum- freq, symbol, low, high, and value, suppose c[s] 5 (V-Z + 1) x c[O] - 1 L h-I+1 1 < c[s - 11; in other words, c[s] 5 (v - I + 1) x c[O] - 1 r -& 5 c[s - l] - 1, (1) where r=h-l+l, OSEIT-. 1 r (The last inequality of Eq. (1) derives from the fact that c[s - l] must be an integer.) Then we wish to show that I ’ I v I h’, where 1’ and h’ are the updated values for low and high as defined below. ( I : - r (v - I + 1) X c[O] - 1 CPI [ r -e 1 from Eq. (l), 5v+l-- w ’ so 1’ 5 v since both v and 1’ are integers and c[O] > 0. (b) h’-‘+L’X:;“o,“J-I z-z+T (v - I + 1) x c[O] - 1 + 1 _ c _ 1 CPI [ r 1 from Eq. (1) REFERENCES 1. Abramson, N. Information Theory and Coding. McGraw-Hill, New York, 1963. This textbook contains the first reference to what was to become the method of arithmetic coding (pp. 61-62). 2. Bentley. J.L., Sleator, D.D., Tarjan, R.E., and Wei, V.K. A locally adaptive data compression scheme. Commun. ACM 29, 4 (Apr. 1986), 320-330. Shows how recency effects can be incorporated explicitly into a text compression system. Communications of the ACM 939 Computing Practices 16. Rissanen, J.J. Generalized Kraft inequality and arithmmatic coding. IBM 1. Res. Dev. 20 (May 1976), 198-203. Another early exposition of the idea of arithmetic coding. 17. Rissanen. J.J. Arithmetic codings as number representations. Acta Polytech. Stand. Math. 31 (Dec. 1979), 44-51. Further develops arith- metic coding as a practical technique for data representation. 18. Rissanen, J., and Langdon, G.G. Arithmetic coding. IBM J. Res. Dev. 23, 2 (Mar. 1979). 149-162. Describes a broad class of arithmetic codes. 3. Cleary, J.C., and Witten, I.H. A comparison of enumerative and adaptive codes. IEEE Trans. Inf. Theory IT-30,2 (Mar. 1984). 306-315. Demonstrates under quite general conditions that adaptive coding outperforms the method of calculating and transmitting an exact model of the message first. 4. Clearv. I.G.. and Witten. I.H. Data comoression using adaptive cod- ing and’partial string matching. IEEE T>ans. Commu~ C&-32,4 (Apr. 1984). 396-402. Presents an adaptive modeling method that reduces a large sample of mixed-case English text to around 2.2 bits/character when arithmetically coded. 5 Cormack, G.V., and Horspool, R.N. Algorithms for adaptive Huffman Lett. 18, 3 (Mar. 19841, 159-166. Describes how codes. Ir$ Process. adaptive Huffman coding can be implemented efficiently. 6. Cormack, G.V., and Horspool, R.N. Data compression using dynamic Markov modeling. Res. Rep., Computer Science Dept., Univ. of Walerloo, Ontario, Apr. 1985. Also to be published in Cornput. 1. Presents an adaptive state-modeling technique that, in conjunction with arithmetic coding, produces results competitive with those of [4]. 7. Gallager, R.G. Variations on a theme by Huffman. IEEE Trans. tnf Theory IT-24, 6 (Nov. 19781, 668-674. Presents an adaptive Huffman coding algorithm, and derives new bounds on the redundancy of Huffman codes. 8. Held, G. Data Compression: Techniques and Applications. Wiley, New York, 1984. Explains a number of ad hoc techniques for compressing text. 9. Hester, J.H., and Hirschberg, D.S. Self-organizing linear search. ACM Comput. SKI-V. 17, 3 (Sept. 1985), 295-311. A general analysis of the technique used in the present article to expedite access to an array of dynamically changing frequency counts. 10. Huffman, D.A. A method for the construction of minimum- redundancy codes. Proc. Inst. Electr. Radio Eng. 40, 9 (Sept. 1952), 1098-1101. The classic paper in which Huffman introduced his famous coding method. 11. Hunter, R., and Robinson, A.H. International digital facsimile coding standards. Proc. Inst. Electr. Electron. Eng. 68, 7 (July 1980), 854-867. Describes the use of Huffman coding to compress run lengths in black/white images. 12. Kucera, H., and Francis, W.N. Compufational Analysis of Present-Day American English. Brown University Press, Providence, R.I.. 1967. This large corpus of English is often used for computer experiments with text, including some of the evaluation reported in the present article. 13. Langdon, G.G. An introduction to arithmetic coding. IBMI. Res. Dev. 28, 2 (Mar. 1984), 135-149. Introduction to arithmetic coding from the point of view of hardware implementation. 14. Langdon, G.G., and Rissanen, J. Compression of black-white images with arithmetic coding. 1EEE Trans. Commun. COM 29,6 (June 1981), 858.-867. Uses a modeling method specially tailored to black/white pictures, in conjunction with arithmetic coding, to achieve excel- lent compression results. 15. Pasco, R. Source coding algorithms for fast data compression. Ph.D. thesis, Dept. of Electrical Engineering, Stanford Univ., Stanford, Calif., 1976. An early exposition of the idea of arithmetic coding, but lacking the idea of incremental operation. 19. Rissanen, J.. and Langdon, G.G. Universal modeling and coding. IEEE Trans. Znf Theory IT-27, 1 (Jan. 1981), 12-23. Shows how data corn- pression can be separated into modeling for prediction and coding with respect to a model. 20. Rubin, F. Arithmetic stream coding using fixed precision registers. IEEE Trans. If. Theory IT-25, 6 (Nov. 1979), 672-675. One of the first papers to present all the essential elements of practical arithmetic coding, including fixed-point computation and incremental opera- tion. 21. Shannon, C.E.. and Weaver, W. The Mathematical Theory of Commu- nication. University of Illinois Press, Urbana, Ill., 1949. A classic book that develops communication theory from the ground up. 22. Welch, T.A. A technique for high-performance data compression. Computer 17, 6 (June 1984), 8-19. A very fast coding technique based on the method of [23], but whose compression performance is poor by the standards of [4] and [6]. An improved implementation of this method is widely used in UNIX systems under the name compress. 23. Ziv, J., and Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. InJ Theory IT-24, 5 (Sept. 1978), 530-536. Describes a method of text compression that works by replacing a substring with a pointer to an earlier occurrence of the same substring. Although it performs quite well, it does not provide a clear separation between modeling and coding. CR Categories and Subject Descriptors: E.4 [Data]: Coding and Infor- mation Theory-d&a compaction and compression; H.l.l [Models and Principles]: Systems and Information Theory-information theory General Terms: Algorithms, Performance Additional Key Words and Phrases: Adaptive modeling. arithmetic coding, Huffman coding Received 8/86; accepted Z2/86 Authors’ Present Address: Ian H. Witten, Radford M. Neal, and John G. Cleary, Dept. of Computer Science, The University of Calgary, 2500 University Drive NW, Calgary, Canada T2N lN4. Permission to copy without fee all or part of this material IS granted provided that the copies are not made or distributed for direct commer- cial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. In response to membership requests . . . CURRlCUl.A RECOMMENDATIONS FOR COMPUTING; Volume I: Curricula Recommendations for Computer Science Volume II: Curricula Recommendations for Information Systems Volume III: Curricula Recommendations for kelated Computer Science Programs in Vocational- Technical Schools, Community and Junior Colleges and Health Computing (212) 869-7440 ext. 309 Information available from Deborah Cotton-Single Copy Sales 540 Communications of the ACM ]une 1987 Volume 30 Number 6
本文发布于:2024-09-21 13:50:25,感谢您对本站的认可!
本文链接:https://www.17tex.com/fanyi/27728.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
留言与评论(共有 0 条评论) |