首页 > 外语翻译

Arithmetic Coding For Data Compression

2023年12月23日发(作者：聒噪的拼音)

COMPUTING PRACTICES

Edgar H. Sibley

Panel Editor

The state

the art in data compression is arithmetic coding, not the better-

known

Huffman

method. Arithmetic coding gives greater compression, is

faster for adaptive models, and clearly separates the model from the channel

encoding.

ARITHMETIC CODING FOR

DATA COIUPRESSION

IAN H. WIllEN, RADFORD M. NEAL, and JOHN G. CLEARY

Arithmetic coding is superior in most respects to the

better-known Huffman [lo] method. It represents in-

formation at least as compactly-sometimes consid-

erably more so. Its performance is optimal without

the need for blocking of input data. It encourages a

clear separation between the model for representing

data and the encoding of information with respect to

that model. It accommodates adaptive models easily

and is computationally efficient. Yet many authors

and practitioners seem unaware of the technique.

Indeed there is a widespread belief that Huffman

coding cannot be improved upon.

We aim to rectify this situation by presenting an

accessible implementation of arithmetic coding and

by detailing its performance characteristics. We start

by briefly reviewing basic concepts of data compres-

sion and introducing the model-based approach that

underlies most modern techniques. We then outline

the idea of arithmetic coding using a simple exam-

ple, before presenting programs for both encoding

and decoding. In these programs the model occupies

a separate module so that different models can easily

be used. Next we discuss the construction of fixed

and adaptive models and detail the compression

efficiency and execution time of the programs,

including the effect of different arithmetic word

lengths on compression efficiency. Finally, we out-

line a few applications where arithmetic coding is

appropriate.

Financial support for this work has been provided by the Natural Sciences

and E@neering Research Council of Canada.

UNIX is a registered trademark of AT&T Bell Laboratories.

0 1987 ACM OOOl-0782/87/OtiOO-0520 750

DATA COMPRESSION

To many, data compression conjures up an assort-

ment of ad hoc techniques such as conversion of

spaces in text to tabs, creation of special codes for

common words, or run-length coding of picture data

(e.g., see [8]). This contrasts with the more modern

model-based paradigm for coding, where, from an

input string of symbols and a model, an encoded string

is produced that is (usually) a compressed version of

the input. The decoder, which must have access to

the same model, regenerates the exact input string

from the encoded string. Input symbols are drawn

from some well-defined set such as the ASCII or

binary alphabets; the encoded string is a plain se-

quence of bits. The model is a way of calculating, in

any given context, the distribution of probabilities

for the next input symbol. It must be possible for the

decoder to produce exactly the same probability dis-

tribution in the same context. Compression is

achieved by transmitting the more probable symbols

in fewer bits than the less probable ones.

For example, the model may assign a predeter-

mined probability to each symbol in the ASCII

alphabet. No context is involved. These probabilities

can be determined by counting frequencies in repre-

sentative samples of text to be transmitted. Such a

fixed model is communicated in advance to both en-

coder and decoder, after which it is used for many

messages.

Alternatively, the probabilities that an adaptive

model assigns may change as each symbol is trans-

mitted, based on the symbol frequencies seen so far

in the message. There is no need for a representative

520

Communications of the ACM

June 1987 Volume 30 Number 6

Computing Practices

sample of text, because each message is treated as

an independent unit, starting from scratch. The en-

coder’s model changes with each symbol transmit-

ted, and the decoder’s changes with each symbol

received, in sympathy.

More complex models can provide more accurate

probabilistic predictions and hence achieve greater

compression. For example, several characters of pre-

vious context could condition the next-symbol prob-

ability. Such methods have enabled mixed-case Eng-

lish text to be encoded in around 2.2 bits/character

with two quite different kinds of model [4,

61.

Tech-

niques that do not separate modeling from coding

so distinctly, like that of Ziv and Lempel

(231,

not seem to show such great potential for compres-

sion, although they may be appropriate when the

aim is raw speed rather than compression per-

formance

[22].

The effectiveness of any model can be measured

by the entropy of the message with respect to it,

usually expressed in bits/symbol. Shannon’s funda-

mental theorem of coding states that, given messages

randomly generated from a model, it is impossible to

encode them into less bits (on average) than the en-

tropy of that model [21].

A message can be coded with respect to a model

using either Huffman or arithmetic coding. The for-

mer method is frequently advocated as the best pos-

sible technique for reducing the encoded data rate.

It is not. Given that each symbol in the alphabet

must translate into an integral number of bits in the

encoding, Huffman coding indeed achieves “mini-

mum redundancy.” In other words, it performs opti-

mally if all symbol probabilities are integral powers

of %. But this is not normally the case in practice;

indeed, Huffman coding can take up to one extra bit

per symbol. The worst case is realized by a source

in which one symbol has probability approaching

unity. Symbols emanating from such a source con-

vey negligible information on average, but require at

least one bit to transmit [7]. Arithmetic coding dis-

penses with the restriction that each symbol must

translate into an integral number of bits, thereby

coding more efficiently. It actually achieves the the-

oretical entropy bound to compression efficiency for

any source, including the one just mentioned.

In general, sophisticated models expose the defi-

ciencies of Huffman coding more starkly than simple

ones. This is because they more often predict sym-

bols with probabilities close to one, the worst case

for Huffman coding. For example, the techniques

mentioned above that code English text in 2.2 bits/

character both use arithmetic coding as the final

step, and performance would be impacted severely

June 1987 Volume 30 Number 6

if Huffman coding were substituted. Nevertheless,

since our topic is coding and not modeling, the illus-

trations in this article all employ simple models.

Even so, as we shall see, Huffman coding is inferior

to arithmetic coding.

The basic concept of arithmetic coding can be

traced back to Elias in the early 1960s (see [l,

pp.

61-621).

Practical techniques were first intro-

duced by Rissanen

[16]

and Pasco [15], and de-

veloped further by Rissanen [17]. Details of the

implementation presented here have not appeared

in the literature before; Rubin [2O] is closest to our

approach. The reader interested in the broader class

of arithmetic codes is referred to

[18];

a tutorial is

available in [l3]. Despite these publications, the

method is not widely known. A number of recent

books and papers on data compression mention it

only in passing, or not at all.

THE IDEA OF ARITHMETIC CODING

In arithmetic coding, a message is represented by an

interval of real numbers between

and

As the

message becomes longer, the interval needed’to rep-

resent it becomes smaller, and the number of bits

needed to specify that interval grows. Successive

symbols of the message reduce the size of the inter-

val in accordance with the symbol probabilities gen-

erated by the model. The more likely symbols re-

duce the range by less than the unlikely symbols

and hence add fewer bits to the message.

Before anything is transmitted, the range for the

message is the entire interval

[0, l),

denoting the

half-open interval

5 x <

As each symbol is

processed, the range is narrowed to that portion of it

allocated to the symbol. For example, suppose the

alphabet is (a, e, i, O, u, !I, and a fixed model is used

with probabilities shown in Table I. Imagine trans-

TABLE I. Example Fixed Model for Alphabet (a, e, i, o, u, !)

Symbol Probability

Range

LO,

0.2)

.3 [0.2, 0.5)

.l [0.5, 0.6)

.2 [0.6,0.8)

[0.8, 0.9)

.l [0.9, 1.0)

mitting the message eaii!. Initially, both encoder

and decoder know that the range is

[0, 1).

After

seeing the first symbol, e, the encoder narrows it to

[0.2, 04,

the range the model allocates to this sym-

bol. The second symbol, a, will narrow this new

range to the first one-fifth of it, since a has been

Communications of the ACM

521

Computing Practices

allocated [0, 0.2). This produces [O.Z, 0.26), since the

previous range was 0.3 units long and one-fifth of

that is 0.06. The next symbol, i, is allocated [0.5, 0.6),

which when applied to [0.2, 0.26) gives the smaller

range [0.23, 0.236). Proceeding in this way, the en-

coded message builds up as follows:

Initially

After seeing e

;::2, 0.5)

0.26)

p.2,

0.236)

[0.23,

[0.233, 0.2336)

[0.23354, 0.2336)

Figure la. The second symbol scales it again into the

range [0.2, 0.26). But the picture cannot be contin-

ued in this way without a magnifying glass! Conse-

quently, Figure lb shows the ranges expanded to

full height at every stage and marked with a scale

that gives the endpoints as numbers.

Suppose all the decoder knows about the message

is the final range, [0.23354, 0.2336). It can -immedi-

ately deduce that the first character was e! since the

range lies entirely within the space the model of

Table I allocates for e. Now it can simulate the oper-

ation of the encoder:

Initially

After seeing e

P, 1)

[0.2, 0.5)

Figure 1 shows another representation of the en-

coding process. The vertical bars with ticks repre-

sent the symbol probabilities stipulated by the

model. After the first symbol has been processed, the

model is scaled into the range [0.2, 0.5), as shown in

This makes it clear that the second character is a,

since this will produce the range

After seeing a

[0.2, 0.26),

After

seeing

Nothing

’ !

e a

FIGURE la.

Representation of the Arithmetic Coding Process

which entirely encloses the given range [0.23354,

0.2336). Proceeding like this, the decoder can iden-

tify the whole message.

It is not really necessary for the decoder to know

both ends of the range produced by the encoder.

Instead, a single number within the range--for ex-

ample, 0.23355-will suffice. (Other numbers, like

0.23354, 0.23357, or even 0.23354321, would do just

as well.) However, the decoder will face the problem

of detecting the end of the message, to determine

when to stop decoding. After all, the single number

0.0 could represent any of a, aa, aaa, aaaa, . . . . To

resolve the ambiguity, we ensure that each message

ends with a special EOF symbol known to both en-

coder and decoder. For the alphabet of Table I, ! will

be used to terminate messages, and only to termi-

After

seeing

Nothing

i !

0.5

0.26

0.2

FIGURE lb. Representation of the Arithmetic Coding

Process with the interval Scaled Up at Each Stage

522

Communications

the ACM

June 1987 Volume 30 Number 6

Computing Practices

/* ARITHMETIC ENCODING ALGORITHM. */

Call

Ensure

encode-symbol

that a distinguished

repeatedly

"terminator"

for each symbol

symbol

the

encoded

message.

last, then

/* transmit any value in the range [low, high).

encode-symbol(symbo1,

range

cum-freq)

high

= high - low

low

= low

range*cum-freq[symbol-11

range*cum-freq(symbol1

/* ARITHMETIC DECODING ALGORITHM. */

/* Continue

"Value" is

calling

the number that

decode-symbol

has been received.

until the terminator symbol is returned.

decode-symbol(cum-freq)

find symbol such that

cum-freq[symbol] <= (value-low)/(high-low)

/* This ensures that value

< cum-freqrsymbol-11

lies within the new l /

;* (low, high) range that will be calculated by */

/* the following lines of code.

range = high - low

high

1OW

= low

return

= low

t range*cum-freq[symbol-11

symbol

t range*cum-freq[symbol]

FIGURE 2. Pseudocode for the Encoding and Decoding Procedures

nate messages. When the decoder sees this symbol,

A PROGRAM FOR ARITHMETIC CODING

it stops decoding.

Figure 2 shows a pseudocode fragment that summa-

Relative to the fixed model of Table I, the entropy

rizes the encoding and decoding procedures devel-

of the five-symbol message eaii! is

oped in the last section. Symbols are numbered, 1, 2,

-log 0.3 - log 0.2 - log 0.1 - log 0.1 - log 0.1

3 . . . The frequency

range for the ith symbol is

from cum-freq[i] to cum-freq[i - 11. As i decreases,

= -log 0.00006 = 4.22

cum-freq[i] increases, and cum-freq[O] = 1. (The

(using base 10, since the above encoding was per-

reason for this “backwards” convention is that

formed in decimal). This explains why it takes five

cum-freq[O] will later contain a normalizing factor,

decimal digits to encode the message. In fact, the

and it will be convenient to have it begin the array.]

size of the final range is 0.2336 - 0.23354 = 0.00006,

The “current interval” is [Zozu, high), and for both

and the entropy is the negative logarithm of this

encoding and decoding, this should be initialized

figure. Of course, we normally work in binary,

to [O, 1).

transmitting binary digits and measuring entropy

Unfortunately, Figure 2 is overly simplistic. In

in bits.

practice, there are several factors that complicate

Five decimal digits seems a lot to encode a mes-

both encoding and decoding:

sage comprising four vowels! It is perhaps unfortu-

Incremental transmission and reception. The encode

nate that our example ended up by expanding

algorithm as described does not transmit anything

rather than compressing. Needless to say, however,

until the entire message has been encoded; neither

different models will give different entropies. The

does the decode algorithm begin decoding until it

best single-character model of the message eaii! is

has received the complete transmission. In most

the set of symbol frequencies (e(O.2), a(0.2), i(O.4),

applications an incremental mode of operation is

!(0.2)), which gives an entropy of 2.89 decimal digits.

necessary.

Using this model the encoding would be only three The desire to use integer arithmetic. The precision

digits long. Moreover, as noted earlier, more sophis- required to represent the [low, high) interval grows

ticated models give much better performance with the length of the message. Incremental opera-

in general. tion will help overcome this, but the potential for

]une 1987 Volume 30 Number 6

Communications

the ACM

523

Computing Practices

overflow and underflow must still be examined

carefully.

Representing the model so thnt it can be consulted

efficiently. The representation used for the model

should minimize the time required for the decode

algorithm to identify the next symbol. Moreover,

an adaptive model should be organized to minimize

the time-consuming task of maintaining cumulative

frequencies.

Figure 3 shows working code, in C, for arithmetic

encoding and decoding. It is considerably lmore de-

tailed than the bare-bones sketch of Figure Z! Imple-

mentations of two different models are given in

Figure

the Figure 3 code can use either one.

The remainder of this section examines the code

of Figure 3 more closely, and includes a proof that

decoding is still correct in the integer implementa-

tion and a review of constraints on word lengths in

the program.

arithmetic-coding-h

/' DECLARATIONS USED FOR ARITHMETIC ENCODING AND DECODING

/* SIZE OF ARITHMETIC CODE VALUES.

#define

typedef

fdefine

Code-value-bits

long code-value:

Top-value

16 /* Number of bits in a code value

/* Type of an arithmetic code value

/* Largest code value

/ (((long)l<

/' HALF AND QUARTER POINTS IN THE CODE VALUE RANGE.

*define

#define

Idefine

mode1.h

First-qtr

Half

Third-qtr

(Top-value/ltl)

(Z'First-qtr)

(3’Firat-qtr)

/* Point after

first

third

quarter

half

quarter

l /

/' INTERFACE TO THE MODEL. '/

/' THE SET OF SYMBOLS THAT MAY BE ENCODED.

#define

No-of-chars

EOF-symbol

No-of-symbols

256

(No-of-charetl)

(No-of-charstll

/* Number of character symbols

/* Index of EOF symbol

/* Total number of symbols

/' TRANSLATION TABLES BETWEEN CHARACTERS AND SYMBOL INDEXES.

int char-to-index[No-of-chars];

unsigned char index_to_char[No_of_symbols+l]:

/* CUMULATIVE FREQUENCY TABLE. */

Idefine

int

/* To index from character

/* To character from index

Max-frequency 16383

cum_frsq[No_of_symbols+l];

/* Maximum allowed frequency count

2a14 - 1

/* Cumulative symbol frequencies

l /

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding

524

Communicationso~ the ACM

June 1987 Volume 30 Number 6

Computing Practices

encode.c

/' MAIN PROGRAM FOR ENCODING.

#include

"mode1.h"

main0

startmodel~):

Set

other modules.

start-outputlng-bitsO;

start-encodingo;

for

int

(::I (

/' Loop through characters.

ch - getc(stdin);

ch; lnt symbol;

Read the next

If (ch--EOF) break;

on end-of-file.*/

character.

Exit

loop

symbol-

Translate

encode-symbol(symbol,cum-freq);

char-to.-lndex[ch];

Encode that

to an index.

update-model(symbo1);

/* Update the model.

symbol.

l /

encode_symbol(EOF_~symbol,cum_freq);

/* Encode the EOF symbol.

done-encoding();

/* Send the last few blts.

done-outputinebitso;

exit (0) :

arithmetic

encode.c

/* ARITHMETIC ENCODING ALGORITHM. */

finclude "arithmetic-cod1ng.h"

static vold bll-plus-follov(); /* Routine that follows

‘/

/' CURRENT STATE OF THE ENCODING. '/

static

code value

long-bits-to-follow;

low, high;

Ends of the current code region

Number of

the next bit.

opposite bits to output after

/* START ENCODING A STREAM OF SYMBOLS.

start-encoding0

low

hiah

- 0;

/* Full code

range.

bits

- Top-value;

to follow - 0;

/* No bits to follow next.

1 --

/* ENCODE A SYMBOL.

encode-symhoI(symbol,cum-freq)

int symbol:

/* Symbol to encode '/

int cum-freql]:

long range;

/* Cumulative symbol frequencies

/' Size of the current code region

l /

range -

91 high - low

(long)

(high-low)+l;

/* Narrow the code region

low

(ranye'cum-freq[symbol-ll)/cum-freq[O)-1: /*

symbol.

that allotted to this

~ranpe*cum~freq(symbol))/cum~freq(O);

- low +

l /

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding (continued)

]une 1987 Volume 30 Number 6

Communications of the ACM

525

100

101

102

103

104

105

106

107

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

108

for (:;I I

if (high

bit~plus~follow(0);

else if (low>-Half) (

bit~plus~follow(1):

low -- Half:

high -* Half:

else if (low>-Flrst qtr

sc hiQh

bits-to-foliow

/' Loop to ouLput bits.

/* Output 0 if In low half.

/* Output 1 if in hiQh half.*/

/+ Subtract

(

offset to top.

bit

half.

/* Output an opposite

/. later

in middle

/* Subtract

/+ Otherwise

offset

exit

low -- First-qtr;

high -- First-qtr;

t* 1;

to middle'/

loop.

l /

else break:

low * 2'low;

high * 2*hightl;

/* Scale up code range.

/' FINISH ENCODING THE STREAM.

done-encoding0

bits-to-follow +* 1:

if (low

else bit-plus-follow(l):

/* OUTPUT BITS PLUS FOLLOWING OPPOSITE BITS.

static void bitglus-foIlow(blt)

int bit:

output-bitcbit);

while (bits-to-follow>O) i

output-bit

(Ibit);

bits to-follow -* 1;

1 -

decode.c

Output two bits that

Select the quarter that

the current code

range

contains.

Output

the bit.

l /

/* Output bits-to-follow

opposfte

bits. Set

/' bits-to-follow to zero.

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

/* MAIN PROGPAM FOR DECODING. */

#include

(include

"mode1.h"

/* Set up other

/* Loop through

modules.

characters.

l /

main ()

start-model

0 :

start-inpUtinQ-bitso;

start-decodingo:

for (;;) (

lnt ch:

int symboi;

symbol - decode-symbolicurn-freq):

if (symbol**EOF symbol) break;

ch * lndex_to_c?;ar[symbolJ;

putc(ch,stdout):

update-model(symbo1);

exit.

(0) ;

Decode next symbol.

Exit

loop

if EOF symbol.

Translate to a character.*/

Write that character.

Update the model.

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding

(continued)

526

Communications of the

ACM

]une 1987 Volume 30 Number 6

Computing Practices

arithmetic

decode.c

-- --

155

/* ARITHMETIC DECODING ALGORITHM. l /

156

157

#include

“arithmetic-cod1ng.h”

158

159

160

/' CURRENT STATE OF THE DECODING. l /

161

162

static

code

value value;

/* Currently-seen

code value

l /

163 static code:value low,

high; /* Ends of current code

replan

‘/

164

165

166 /* START

DECODING A

STREAM OF SYMBOLS. '/

167

168

Start-deC”di”Q()

169

(

lnt i:

170 value - 0;

/* Input

bits to flll the

l /

171 for (1 - 1; I<-Code-value-bits; it+) (

/* code

value. l /

172 value - 2'value,lnput_bit();

173

174

low " 0:

/* Full

code ranpe.

l /

175

hiQh -

Top-value;

176

177

178

179

/* DECODE THE NEXT SYMBOL. l '/

180

181 lnt

decode-symbol

(cum-freq)

182

int cwn-freq[ I; /* Cumulative symbol frequencies

l /

183

lO”Q ra”Qe;

/* Size of current

code

region

l /

184 int cum; /*

Cumulative

frequency

calculated

‘/

185 int symbol:

/* Symbol decoded

l /

186

ra”Qe -

(lO"Q)(hlQh-1oW)tl;

187

cum -

/* Flnd

cum freq for value. l /

188 (((lonQl(value-lou)tl)*cum~freq[O]-l)/ranQe:

189 for ~eyn!bol - 1; cum-fraq[symbolJ%wm;

symboltt)

; /* Then find symbol. l /

190

hiQh - low t /* Narrow the code reQlon

191 (ranQe*cum-freq[symbol-ll)/cum-freq[O]-1; /* to that allotted to this l /

192

low - low t

/* symbol.

l /

193

194

(ra"Qe'cum~freq[symbo1])/cum~freq[O];

for I::1 (

/* Loop to QOt rld of bits.

195

if (hiQh

196

/’ nothing

/’ Expand

low half. '/

197

198

else If (low>-Half) (

/* Expand hlph

half.

199

value -- Half;

200 low

-- Half:

/* Subtract offset to top. *I

201

hlQh -- Half;

202

203

else if

(low>-First-qtr

/* Expand

middle half.

l /

204

‘L hiQh

205 value -- First-qtr;

206 low --

First-qtr;

/* Subtract offset to mIddlr*/

207

high -- First-qtr;

208

209

else

bxeak:

/* Otherwise exit

loop.

l /

210 low - 2*1ow:

211

high -

2'hiQhtl:

/* Scale up

code range.

l /

212

value - 2*valuetlnput_bit():

/* Move in next

input

blt. l /

213

214

relucn

symbol;

215

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding

(continued)

]une 1987 Volume 30 Number 6

Communications

the ACM

527

bit-input-c

216

217

218

/* BIT INPUT ROUTINES. "/

flnclude

+include

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

"arithmetic-cod1ng.h"

/' THE BIT BUFFER.

‘/

static

int

lnt blts_to-go:

lnt

garbage-bits:

BIT INPUT. */

- 0;

- 0:

buffer:

/* Bits waiting to

/* Number

bits

still

/* Number of bit6 past

input

buffer

end-of-file

l /

‘/

/* INITIALIZE

St6rt_inpUting_blt6()

bits-to-go

garbage-bits

/* Buffer

/* no bits

starts

it.

out with

l /

/* INPUT A BIT.

int input-bit0

lnt t:

(

if (bits-to-go--O) 1

buffer - getc(stdln):

if (buffer--EOF)

garbage-bits

Read

bit6

the next byte

are left

Return

if no '/

in buffer. *'/

bits*/

)

(garbage~,blts>Code~value~blts-2)

fprlntf(stderr,“Bad input fllen’j;

(

exit (-1);

bits-to-go - 8:

/* after eof,

but check

for too

many

such.

arbitrary

t - buffer&l;

buffer a>- 1;

bits-to-go

return

/* Return the next bit

from l

/* the bottom of the

byte.

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding (continued)

528

Communications of the ACM

June 1987 Volume 30 Number 6

bit-output.c

257

/* BIT OUTPUT ROUTINES. '/

258

259

260

#include

261

262

/" THE BIT BUFFER. */

263

264 static

l /

265

static

int

buffer:

bits-to-go;

Bits

Number

buffered

bits

for

free

output

in buffer

l /

266

267

268

/* INITIALIZE FOR BIT OUTPUT.

269

270

start_.outputing-..bits()

271

buffer - 0;

/* Buffer Is empty to start '/

272 bitS_tO-QO- 8;

/' with.

l /

273

274

)

215

276 /* OUTPUT

BIT.

277

270

output-blt(bit)

279

lnt bit:

200

281

(

buffer

/* Dot bit In top of buffer.*/

if (bit)

>>-

buffer

I- 0x00;

282

bits-to-go -- 1;

283

if (bits-to-go--O)

/* Output buffer if It 1s

204

putc(buffer,stdout):

(

/' now full.

l /

285

286

bits-to-go - 8;

287

288

209

290

291

/* FLUSH OUT THE LAST BITS.

292

293

done-outputlng-bits0

putc(buffer>>blts_to-go,stdout):

294

FIGURE 3. C Implementation of Arithmetic Encoding and Decoding (continued)

]une 1987 Volume 30 Number 6

Communications of the ACM 529

Computing Practices

fixed-model. c

/* THE FIXED SDDRCE MODEL

finclude 'mode1.h'

int freq[No-of-symbolstlj - (

1, 124, 1, 1, 1, 1,

1. 1, 1, 1, 1, 1,

1. 1, 1.

1, 1, 1, 1, 1, 1,

1, 1, 1,

1 l +

+ $ t (

I l

2, 2, 1, 7b, 19, 60, 11

1236,

9, 3, 1, 2:, 15, 2,

1, 21,

/'012

9:;C - >

3 4 5 6 7 S

14 6, 3, 2, 1, 1, 1,

5, 4, 7, 5, 4, 4,

15, 15, 8,

/*@ A B C D E F C H I J K L M N

16,

16, 8, 6, 12, 23, 13,

1, 24, 15, 22,

12, 15, 10,

/'P Q R S T U V W X Y Z [ I ..

3, 1, 1, 1, 1, 1,

6, 3, 11, 1,

14, 1, 14, 26,

29,

/* '

i j k

1 m

d f h

6, 39, 250, 139, 42:,

293,

418,

1, 49:. 85, 17;,

232, 74:, 127, ll:,

/*p

( I ) -

f. 2, 1, 2, 3.

111, 5, 388, 375,

531, 159, 5:, 91, 12,

101,

1. 1, 1. 1. 1. 1,

1, 1, 1. 1, 1. 1.

1. 1. 1,

1. 1, 1. 1. 1, 1.

1, 1, 1. 1, 1. 1,

1, 1, 1,

1. 1, 1, 1, 1, 1.

1, 1, 1, 1, 1. 1,

1. 1. 1.

1. 1. 1, 1, 1, 1,

1. 1, 1, 1, 1, 1.

1, 1, 1.

1, 1. 1. 1. 1. 1.

1, 1, 1, 1, 1, 1.

1. 1. 1.

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

1, 1, 1, 1,

1, 1, 1, 1, 1, 1.

1, 1, 1,

1, 1, 1, 1,

i, I, I, I,

1, 1, 1, 1,

I, I,

1’ ,

/* INITIALIZE THE MODEL. '/

start-model0

! lnt 1:

/* Set up tables that

for (i - 0; i

/* translate between symbol

char-to-index[i] - itl;

/* indexes and

characters.

index_to-char[i+l) - i;

cum-freq[No-of-symbols] - 0;

/* Set up cumulative 49

for (1 - No-of-symbols;

i>O; i--l (

cum-freq[i-lj - cum-freq[i] t freq(iJ; /* frequency counts.

/* Check counts within limit*/ 52

if (cum-freq[Ol > Max-frequency) abort0;

/* UPDATE THE MODEL TO ACCOUNT FOR A NEW SYMBOL. 'I

update-model (symbol)

lnt

symbol;

/* Do nothing.

/ l /

?*/

O*/

11,

446.

l /

0 l /

l /

FIGURE 4. Fixed and Adaptive Models for Use with Figure 3

530

Communications of the ACM

]une 1987 Volume .30 Number 6

Computing Practices

adaptive-mode1.c

/* THE ADAPTIVE SOURCE MODEL

#include "modo1.h"

int freq]No-of-symbolstl]: /+ symbol frequencies

l /

/' INITIALIZE THE MODEL.

10 start-model0

int i:

for

char-to-index[l]

- 0; i

- itl:

it+) ( /* Set

tables that

l /

index-to-char[itl] - i:

/* translate between symbol

/* indexes and characters.

16 for

(i - 0;

i<-No-of-symbols; it+) (

freq[i] - 1;

/* Set up initial frequency

cum-freq[i] - No-of-symbols-i;

counts

symbols.

to be one for all

l /

freq[O]

- 0;

Freq[Ol must

not be the

/* same as freq[l].

l /

/' UPDATE THE MODEL TO ACCOUNT FOR A NEW SYMBOL. '/

26 update~model(symbo1)

int symbol: /* Index

(

lnt 1;

if (cum-freq]O]--Max-frequency)

/* New index

of new symbol

for

symbol

l /

int

l /

cum - 0:

cum;

(

See

are

frequency

thslr maxlawn.

count8

for

freq[i]

(1 - No-of-symbols;

l /

cum-freq[l]

(freq(i]+1]/2:

la-O; i--) ( /* If so, halve all the

- cum;

counts

non-zero).

(keeplng them

l /

cum t- freq[l];

ior

if (i

- symbol;

l--l : /*

Find symbol's new Index.

int ch-i,

ch-symbol

]

freq(i]--freq[i-1):

ch-i

ch-symbol

- index-to-char[i];

/* Update the translation

index-to-char[i]

- index-to-char[aymbol];

- ch-symbol:

tables

moved.

lf the symbol has

l /

index-to-char[symbol]

char-to-index]ch-i]

- ch-1;

char-to-in'

dex[ch-symbol]

- symbol:

- i;

freq[i]

+- 1; /*

while

Increment

count for

the

frequency

symbol and

i --

(i>O)

(

/* update the cumulative

cum-freq]!

1 t- 1; /* frequencies.

FIGURE 4. Fixed and Adaptive Models for Use with Figure 3 (continued)

/me 1987 Volume 30 Number 6

Communications of the ACM

531

Computing Practices

Representing the Model

Implementations of models are discussed in the next

section; here we are concerned only with the inter-

face to the model (lines 20-38). In C, a byte is repre-

sented as an integer between 0 and 255 (a

char).

Internally, we represent a byte as an integer be-

tween 1 and 257 inclusive (an

index),

EOF being

treated as a 257th symbol. It is advantageous to sort

the model into frequency order, so as to minimize

the number of executions of the decoding loop

(line 189). To permit such reordering, the

char/index

translation is implemented as a pair of tables,

index-to-char[]

and

char-to-index[].

In one of our

models, these tables simply form the

index

by adding

1 to the

char,

but another implements a more com-

plex translation that assigns small indexes to fre-

quently used symbols.

The probabilities in the model are represented as

integer frequency counts, and cumulative counts are

stored in the array cum- freq[]. As previously, this

array is “backwards,” and the total frequency count,

which is used to normalize all frequencies, appears

in cum - fre9 [O]. Cumulative counts must not exceed

a predetermined maximum,

Max- frequency,

and the

model implementation must prevent overflow by

scaling appropriately. It must also ensure that neigh-

boring values in the

cum-

freq

[ ] array differ by at

least 1; otherwise the affected symbol cannot be

transmitted.

Incremental Transmission and Reception

Unlike Figure 2 the program in Figure 3 repre-

sents

low

and

high

as integers. A special data type,

code-value,

is defined for these quantities, together

with some useful constants:

Top-value,

representing

the largest possible

code-value,

and

First-qtr,

Half,

and

Third-@,

representing parts of the range

(lines 6-16). Whereas in Figure 2 the current inter-

val is represented by

[low, high),

in Figure 3 it is

[low, high];

that is, the range now includes the value

high.

Actually, it is more accurate (though more

confusing) to say that, in the program in Figure 3,

the interval represented is

[low, high +

0.11111

. .o).

This is because when the bounds are scaled up to

increase the precision, zeros are shifted into the low-

order bits of low, but ones are shifted into

high.

Al-

though it is possible to write the program to use a

different convention, this one has some advantages

in simplifying the code.

As the code range narrows, the top bits of low and

high

become the same. Any bits that are the same

can be transmitted immediately, since they cannot

be affected by future narrowing. For encoding, since

we know that

low 5 high,

this requires code like

532

Communications of

the ACM

for (;;I {

if (high < Half) [

output-bit(O);

low = 2*1ow;

high = 2*high+l;

else

if (low I Half) (

output-bit(l);

low = 2*(low-Half);

high = 2*(high-Half)+l;

else

break;

which ensures that, upon completion,

low <

Half

I high.

This can be found in lines 95-113 of

encode-symbol

(1, although there are some extra com-

plications caused by underflow possibilities (see the

next subsection). Care is taken to shift ones in at the

bottom when

high

is scaled, as noted above.

Incremental reception is done using a number

called value as in Figure 2, in which processed

bits flow out the top (high-significance) end and

newly received ones flow in the bottom. Initially,

start-decoding0

(lines 168-176) fills value with re-

ceived bits. Once

decode-symbol

() has identified the

next input symbol, it shifts out now-useless high-

order bits that are the same in low and

high,

shifting

value

by the same amount (and replacing lost bits by

fresh input bits at the bottom end):

for (;;) 1

if (high < Half) (

value = 2*value+input_bit( );

low

= 2*1ow;

high

= 2*high+l;

else

if (low > Half) (

value = 2*(value-Half)+input-bit();

low

= 2*(low-Half);

high

= 2*(high-Half)+l;

else

break;

(see lines 194-213, again complicated by precautions

against underflow, as discussed below).

Proof of Decoding Correctness

At this point it is worth checking that identification

of the next symbol by

decode-symbol ()

works

properly. Recall from Figure 2 that

decode-symbol ()

must use value to find the symbol that, when en-

coded, reduces the range to one that still includes

value. Lines

186-188

decode-symbol0

identify the

symbol for which

]une 1987 Volume 30 Number 6

Computing Practices

cum-freq[symbol]

~ (value - low + 1) * cum-freq[O] - 1

high - low + 1

< cum-freq[symbol- 11,

where L J denotes the “integer part of” function that

comes from integer division with truncation. It is

shown in the Appendix that this implies

low + (high - low + 1) * cum-freq[symboZ]

cum-freq[O]

lV5lOW

+ (high - low +

* cum-freq[symbol-

_ 1

cum-freq[O]

J ’

so that value lies within the new interval that

decode-symbol () calculates in lines 190-193. This is

sufficient to guarantee that the decoding operation

identifies each symbol correctly.

Underflow

As Figure

shows, arithmetic coding works by scal-

ing the cumulative probabilities given by the model

into the interval [low, high] for each character trans-

mitted. Suppose low and high are very close to-

gether-so close that this scaling operation maps

some different symbols of the model onto the same

integer in the [low, high] interval. This would be

disastrous, because if such a symbol actually oc-

curred it would not be possible to continue encod-

ing. Consequently, the encoder must guarantee

that the interval [low, high] is always large enough

to prevent this. The simplest way to do this is

to ensure that this interval is at least as large as

Max- frequency, the maximum allowed cumulative

frequency count (line

36).

How could this condition be violated? The bit-

shifting operation explained above ensures that low

and high can only become close together when they

straddle Half. Suppose in fact they become as close

First-qtr I low < Half I high

Third-qtr.

Then the next two bits sent will have opposite polar-

ity, either 01 or

10.

For example, if the next bit turns

out to be zero (i.e., high descends below Half and

[0, Half] is expanded to the full interval), the bit

after that will be one, since the range has to be

above the midpoint of the expanded interval. Con-

versely, if the next bit happens to be one, the one

after that will be zero. Therefore the interval can

safely be expanded right now, if only we remember

June 1987 Volume 30 Number 6

that, whatever bit actually comes next, its opposite

must be transmitted afterwards as well. Thus lines

104-109 expand [First-qtr, Third-qtr] into the whole

interval, remembering in bits-to- follow that the

bit that is output next must be followed by an oppo-

site bit. This explains why all output is done via

bit-plus- foZZow() (lines 128-135), instead of directly

with output-bit().

But what if, after this operation, it is still true that

First-qtr I low < Half I high < Third-qtr?

Figure 5 illustrates this situation, where the current

[low, high] range (shown as a thick line) has been

expanded a total of three times. Suppose the next bit

turns out to be zero, as indicated by the arrow in

Figure 5a being below the halfway point. Then the

next three bits will be ones, since the arrow is not

only in the top half of the bottom half of the original

range, but in the top quarter, and moreover the top

eighth, of that half-this is why the expansion can

occur three times. Similarly, as Figure 6b shows, if

the next bit turns out to be a one, it will be followed

by three zeros. Consequently, we need only count

the number of expansions and follow the next bit by

that number of opposites (lines

106

and 131-134).

Using this technique the encoder can guarantee

that, after the shifting operations, either

low < First-qtr < Half I high

(14

low < Half < Third-qtr I high.

(lb)

Therefore, as long as the integer range spanned by

the cumulative frequencies fits into a quarter of that

provided by code-values, the underflow problem

cannot occur. This corresponds to the condition

Max- frequency 5

Top-value +

+ 1

which is satisfied by Figure 3, since Max- frequency

214

and Top-value =

216

(lines 36, 9). More

than 14 bits cannot be used to represent cumulative

frequency counts without increasing the number of

bits allocated to code-values.

We have discussed underflow in the encoder only.

Since the decoder’s job, once each symbol has been

decoded, is to track the operation of the encoder,

underflow will be avoided if it performs the same

expansion operation under the same conditions.

Overflow

Now consider the possibility of overflow in the

integer multiplications corresponding to those of

Figure 2, which occur in lines 91-94 and 190-193

Communications

the ACM 533

Top-value -

Third-qtr

high

Half

First+

FIGURE 5. Scaling the Interval to Prevent Underflow

of Figure 3. Overflow cannot occur provided the

product

fits within the integer word length available,

since cumulative frequencies cannot exceed

Max- frequewy.

Rarjge might be as large as Top-value

+ 1. so the largest possible product in Figure 3 is

2”‘(2’4 -

1). which is less than 2”“. Long declarations

are used for

code-zwlue

(line 7) and

range

(lines 89,

183) to ensure that arithmetic is done to 32-bit preci-

sion.

Constraints on the Implementation

The constraints on word length imposed by under-

flow and overflow can be simplified by assuming

that frequency counts are represented in f bits, and

code-zwlurs

in c bits, The implementation will work

correctly provided

fsc-2

f+CSp,

the precision to which arithmetic is performed.

In most C implementations, p = 31 if long integers

534

Cotnrnu~~icatiom of the ACM

lune 1987 Volume 30 Number 6

Computing Practices

are used, and p = 32 if they are unsigned long. In

Figure 3, f = 14 and c =

16.

With appropriately

modified declarations, unsigned long arithmetic with

f = 15 and c = 17 could be used. In assembly lan-

guage, c =

is a natural choice because it expedites

some comparisons and bit manipulations (e.g., those

of lines 65-113 and 164-213).

If p is restricted to

bits, the best values possible

are c = 8 and f = 7, making it impossible to encode a

full alphabet of 256 symbols, as each symbol must

have a count of at least one. A smaller alphabet (e.g.,

the 26 letters, or 4-bit nibbles) could still be handled.

Termination

To finish the transmission, it is necessary to send a

unique terminating symbol (EOF-symbol, line 56)

and then follow it by enough bits to ensure that the

encoded string falls within the final range. Since

done-encoding0 (lines 116-123) can be sure that low

and high are constrained by either Eq.

(la)

or (lb)

above, it need only transmit 01 in the first case or 10

in the second to remove the remaining ambiguity. It

is convenient to do this using the bit-plus-

foZZow()

procedure discussed earlier. The input- bit () proce-

dure will actually read a few more bits than were

sent by output-bit(), as it needs to keep the low end

of the buffer full. It does not matter what value these

bits have, since EOF is uniquely determined by the

last two bits actually transmitted.

MODELS FOR ARITHMETIC CODING

The program in Figure 3 must be used with a

model that provides a pair of translation tables

index-to-char[] and char-to-index[], and a cumula-

tive frequency array cum - freq [ 1, The requirements

on the latter are that

0 cum-freq[i -

2 cum-freq[i];

an attempt is never made to encode a symbol i for

which cum-freq[i - l] = cum-freq[i]; and

cum- freq [0] 5 Max- frequency.

Provided these conditions are satisfied, the values in

the array need bear no relationship to the actual

cumulative symbol frequencies in messages. Encod-

ing and decoding will still work correctly, although

encodings will occupy less space if the frequencies

are accurate. (Recall our successfully encoding eaii!

according to the model of Table I, which does not

actually reflect the frequencies in the message.)

Fixed Models

The simplest kind of model is one in which symbol

frequencies are fixed. The first model in Figure 4

has symbol frequencies that approximate those of

English (taken from a part of the Brown Corpus [12]).

June 1987 Volume 30 Number 6

However, bytes that did not occur in that sample

have been given frequency counts of one in case

they do occur in messages to be encoded (so this

model will still work for binary files in which

all 256 bytes occur). Frequencies have been normal-

ized to total 8000. The initialization procedure

start-model () simply computes a cumulative version

of these frequencies (lines 48-51), having first initial-

ized the translation tables (lines 44-47). Execution

speed would be improved if these tables were used

to reorder symbols and frequencies so that the most

frequent came first in the cum _ freq [ ] array. Since

the model is fixed, the procedure update-model 0,

which is called from both encode.c and decode.c, is

null.

An exact model is one where the symbol frequen-

cies in the message are exactly as prescribed by the

model. For example, the fixed model of Figure 4 is

close to an exact model for the particular excerpt of

the Brown Corpus from which it was taken. To be

truly exact, however, symbols that did not occur in

the excerpt would be assigned counts of zero, rather

than one (sacrificing the capability of transmitting

messages containing those symbols). Moreover, the

frequency counts would not be scaled to a predeter-

mined cumulative frequency, as they have been in

Figure 4. The exact model can be calculated and

transmitted before the message is sent. It is shown

by Cleary and Witten [3] that, under quite general

conditions, this will not give better overall compres-

sion than adaptive coding (which is described next).

Adaptive Models

An adaptive model represents the changing symbol

frequencies seen so fur in a message. Initially all

counts might be the same (reflecting no initial infor-

mation), but they are updated, as each symbol is

seen, to approximate the observed frequencies. Pro-

vided both encoder and decoder use the same initial

values (e.g., equal counts) and the same updating

algorithm, their models will remain in step. The en-

coder receives the next symbol, encodes it, and up-

dates its model. The decoder identifies it according

to its current model and then updates its model.

The second half of Figure 4 shows such an adap-

tive model. This is the type of model recommended

for use with Figure 3, for in practice it will outper-

form a fixed model in terms of compression effi-

ciency. Initialization is the same as for the fixed

model, except that all frequencies are set to one.

The procedure update-model (symbol) is called by

both encode-symbol () and decode-symbol () (Figure 3,

lines 54 and 151) after each symbol is processed.

Updating the model is quite expensive because of

the need to maintain cumulative totals. In the code

Communications of the

ACM 535

Computing Practices

of Figure 4, frequency counts, which must be main-

tained anyway, are used to optimize access by keep-

ing the array in frequency order-an effective kind

of self-organizing linear search [9]. Update-model0

first checks to see if the new model will exceed the

cumulative-frequency limit, and if so scales all fre-

quencies down by a factor of two (taking care to

ensure that no count scales to zero) and recomputes

cumulative values (Figure 4, lines 29-37). Then, if

necessary, update-model () reorders the symbols to

place the current one in its correct rank in the fre-

quency ordering, altering the translation tables to

reflect the change. Finally, it increments the appro-

priate frequency count and adjusts cumulative fre-

quencies accordingly.

PERFORMANCE

Now consider the performance of the algorithm of

Figure 3, both in compression efficiency and execu-

tion time.

Compression Efficiency

In principle, when a message is coded using arith-

metic: coding, the number of bits in the encoded

string is the same as the entropy of that message

with respect to the model used for coding. Three

factors cause performance to be worse than this in

practice:

(1) message termination overhead;

(2) the use of fixed-length rather than infinite-

precision arithmetic; and

(3) scaling of counts so that their total is at most

Max- frequency.

None of these effects is significant, as we now show.

In order to isolate the effect of arithmetic coding, the

model will be considered to be exact (as defined

above).

Arithmetic coding must send extra bits at the

end of each message, causing a message termina-

tion overhead. Two bits are needed, sent by

done-encoding() (Figure

lines 119-123), in order to

disambiguate the final symbol. In cases where a bit

stream must be blocked into 8-bit characters before

encoding, it will be necessary to round out to the

end of a block. Combining these, an extra 9 bits may

be required.

The overhead of using fixed-length arithmetic oc-

curs because remainders are truncated on division.

It can be assessed by comparing the algorithm’s per-

formance with the figure obtained from a theoretical

entropy calculation that derives its frequencies from

counts scaled exactly as for coding. It is completely

negligible-on the order of lop4 bits/symbol.

536 Communications of the ACM

The penalty paid by scaling counts is somewhat

larger, but still very small. For short messages (less

than 214 bytes), no scaling need be done. E:ven with

messages of 105-lo6 bytes, the overhead was found

experimentally to be less than 0.25 percent of the

encoded string.

The adaptive model in Figure

scales down all

counts whenever the total threatens to exceed

Max- frequency. This has the effect of weighting re-

cent events more heavily than events from earlier in

the message. The statistics thus tend to track

changes in the input sequence, which can be very

beneficial. (We have encountered cases where limit-

ing counts to 6 or 7 bits gives better results than

working to higher precision.) Of course, this depends

on the source being modeled. Bentley et al. [2] con-

sider other, more explicit, ways of incorporating a

recency effect.

Execution Time

The program in Figure 3 has been written for clarity

rather than for execution speed. In fact, with the

adaptive model in Figure

it takes about 420 PS per

input byte on a VAX-11/780 to encode a text file,

and about the same for decoding. However, easily

avoidable overheads such as procedure calls account

for much of this, and some simple optimizations in-

crease speed by a factor of two. The following altera-

tions were made to the C version shown:

(1)

The procedures input-bif(), output-bit(), and

bit-plus- follow() were converted to macros to

eliminate procedure-call overhead.

(2)

Frequently used quantities were put in register

variables.

(3)

Multiplies by two were replaced by additions

(C ‘I,=“).

(4)

Array indexing was replaced by pointer manip-

ulation in the loops at line 189 in Figure 3 and

lines 49-52 of the adaptive model in Figure 4.

This mildly optimized C implementation has an

execution time of 214 ps/252 ps per input byte, for

encoding/decoding 100,000 bytes of English text on

a VAX-11/780, as shown in Table II. Also given are

corresponding figures for the same program on an

Apple Macintosh and a SUN-3/75. As can be seen,

coding a C source program of the same length took

slightly longer in all cases, and a binary object pro-

gram longer still. The reason for this will be dis-

cussed shortly. Two artificial test files were included

to allow readers to replicate the results. “Alphabet”

consists of enough copies of the 26-letter alphabet to

fill out 100,000 characters (ending with a partially

completed alphabet). “Skew statistics” contains

Ime 1987 Volume 30 Number 6

Computing Practices

TABLE II. Results for Encoding and Decoding 100,000-Byte Files

VAX-H/788 Macintosh 512 K

SUN-3175

Encode time Decodetlme

Encode time oeclxletlme Encode time oecodetime

(I4 (ccs)

(as) (PS) b4

Mildly optimized C implementation

Text file 57,718 214 262 687

121

program

82,991 230 288 729

105 131

VAX object program 73,501 313 406 950

145 190

Alphabet 59,292 223 277 719 105

130

Skew statistics 12,092 143 170 507

Carefully optimized assembly-language implementation

Text file 57,718 104 135 194

243 46

program

62,991 109 151 208 266 51 65

VAX object program 73,501 158 241 280 402

75 107

Alphabet 59,292 105 145 204 264

Skew statistics 12,092 63

126

160 28

Notes: Times are measured in microseconds per byte of uncompressed data.

The VAX-l l/760 had a floating-point accelerator, which reduces integer multiply and divide times.

The Macintosh uses an E-MHz MC66000 with some memory wait states.

The SUN-3175 uses a 16.67-MHz MC66020.

All times exclude I/O and operating-system overhead in support of l/O. VAX and SUN figures give user time from the UNIX* time command;

on the Macintosh, l/O was explicitly directed to an array.

The 4.2BSD C compiler was used for VAX and SUN; Aztec C 1.06g for Macintosh.

10,000 copies of the string aaaabaaaac; it demon-

TABLE III. Breakdown of Timings for the VAX-U/780

strates that files may be encoded into less than one

Assembly-Language Version

bit per character (output size of 12,092 bytes =

96,736 bits). All results quoted used the adaptive

Encods time Decode the

(PSI (4

model of Figure 4.

A further factor of two can be gained by repro-

Text file

Bounds calculation 32

gramming in assembly language. A carefully opti-

Bit shifting

39 30

mized version of Figures 3 and 4 (adaptive model)

Model update 29

was written in both VAX and M68000 assembly lan-

Symbol decode

guages. Full use was made of registers, and advan-

Other

tage taken of the Is-bit code-value to expedite some

104 135

crucial comparisons and make subtractions of Half

C program

trivial. The performance of these implementations

Bounds calculation

on the test files is also shown in Table II in order

Bit shifting 42

Model update

to give the reader some idea of typical execution

Symbol decode

speeds.

Other

The VAX-11/780 assembly-language timings are

109 151

broken down in Table III. These figures were ob-

VAX object program

tained with the UNIX profile facility and are accu-

Bounds calculation 34

rate only to within perhaps 10 percent. (This mech-

Bit shifting

46 40

anism constructs a histogram of program counter

Model update

75 75

Symbol decode 94

values at real-time clock interrupts and suffers from

Other 3 1

statistical variation as well as some systematic er-

158

241

rors.) “Bounds calculation” refers to the initial parts

of encode-symbol () and decode-symbol () (Figure 3,

lines 90-94 and 190-l%), which contain multiply update” refers to the adaptive update-model () proce-

and divide operations. “Bit shifting” is the major dure of Figure

(lines

26-53).

loop in both the encode and decode routines As expected, the bounds calculation and model

(lines 96-113 and 194-213). The cum calculation in update take the same time for both encoding and

decode-symbol (1, which requires a multiply/divide, decoding, within experimental error. Bit shifting was

and the following loop to identify the next symbol quicker for the text file than for the C program and

(lines 187-189), is “Symbol decode.” Finally, “Model

object file because compression performance was

]une 1987 Volume 30 Number 6

Communications of the ACM

537

Computing Practices

better. The extra time for decoding over encoding is

due entirely to the symbol decode step. This takes

longer in the C program and object file tests because

the loop of line

189

was executed more often (on

average 9 times, 13 times, and 35 times, respec-

tively). This also affects the model update time be-

cause it is the number of cumulative counts that

must be incremented in Figure 4, lines 49-52. In the

worst case, when the symbol frequencies are uni-

formly distributed, these loops are executed an aver-

age of

128

times. Worst-case performance would be

improved by using a more complex tree representa-

tion for frequencies, but this would likely be slower

for text files.

SOME APPLICATIONS

Applications of arithmetic coding are legion. By lib-

erating coding with respect to a model from the mod-

eling required for prediction, it encourages a whole

new view of data compression [19]. This separation

of function costs nothing in compression perfor-

mance, since arithmetic coding is (practically) opti-

mal with respect to the entropy of the model. Here

we intend to do no more than suggest the scope of

this view by briefly considering

(1)

(2)

(3)

(4)

adaptive text compression,

nonadaptive coding,

compressing black/white images, and

coding arbitrarily distributed integers.

Of course, as noted earlier, greater coding efficien-

cies could easily be achieved with more sophisti-

cated models. Modeling, however, is an extensive

topic in its own right and is beyond the scope of this

article.

Adaptive text compression using single-character

adaptive frequencies shows off arithmetic coding

to good effect. The results obtained using the pro-

gram in Figures

and

vary from

4.8-5.3

bits/char-

acter for short English text files

(103-lo4

bytes) to

4.5-4.7

bits/character for long ones

(105-lo6

bytes).

Although adaptive Huffman techniques do exist

(e.g., [5, i’]), they lack the conceptual simplicity of

arithmetic coding. Although competitive :in compres-

sion efficiency for many files, they are slower. For

example, Table IV compares the performance of the

mildly optimized C implementation of arithmetic

coding with that of the UNIX compact program that

implements adaptive Huffman coding using a similar

model. (Compact’s model is essentially the same for

long files, like those of Table IV, but is better for

short files than the model used as an example in this

article.) Casual examination of compact in’dicates that

the care taken in optimization is roughly lcomparable

for both systems, yet arithmetic coding hallves exe-

cution time. Compression performance is somewhat

better with arithmetic coding on all the example

files. The difference would be accentuated with

more sophisticated models that predict symbols with

probabilities approaching one under certain circum-

stances (e.g., the letter u following 9).

Nonadaptive coding can be performed adthmeti-

tally using fixed, prespecified models like that in

the first part of Figure 4. Compression performance

will be better than Huffman coding. In order to mini-

mize execution time, the total frequency count,

cum- freq[O], should be chosen as a power of two so

the divisions in the bounds calculations (Figure 3,

lines 91-94 and 190-193) can be done as shifts. En-

code/decode times of around

ps/90 PS should

then be possible for an assembly-language imple-

mentation on a

VAX-11/780. A

carefully written im-

plementation of Huffman coding, using table lookup

for encoding and decoding, would be a bit faster in

this application.

Compressing black/white images using arithmetic

coding has been investigated by Langdon and

Rissanen [14], who achieved excellent results using

TABLE IV. Comparison of Arithmetic and Adaptive Huffman Coding

Arithmeticcadlng

Encodetime

Deco&time

Adaptive Huffman coding

Text file

C program

VAX object program

Alphabet

Skew statistics

57,718

62,991

73,546

59,292

12,092

214

230

313

223

143

(f4

262

288

406

277

170

57,781

63,731

76,950

60,127

16,257

Encode time

550

596

822

598

215

oscodetime

CS)

414

441

606

411

132

Notes: The mildly optimized C implementation was used for arithmetic coding.

UNIX

compact

was used for adaptive Huffman coding.

Times are for a VAX-l l/780 and exclude I/O and operating-system overhead in support of I/O.

538

Commlkcations of

the

ACM June 1987 Volume 30 Number 6

Computing Practices

a model that conditioned the probability of a pixel’s

being black on a template of pixels surrounding it.

The template contained a total of 10 pixels, selected

from those above and to the left of the current one

so that they precede it in the raster scan. This cre-

ates 1024 different possible contexts, and for each

the probability of the pixel being black was esti-

mated adaptively as the picture was transmitted.

Each pixel’s polarity was then coded arithmetically

according to this probability. A 20-30 percent im-

provement in compression was attained over earlier

methods. To increase coding speed, Langdon and

Rissanen used an approximate method of arithmetic

coding that avoided multiplication by representing

probabilities as integer powers of %. Huffman cod-

ing cannot be directly used in this application, as it

never compresses with a two-symbol alphabet. Run-

length coding, a popular method for use with two-

valued alphabets, provides another opportunity for

arithmetic coding. The model reduces the data to a

sequence of lengths of runs of the same symbol (e.g.,

for picture coding, run-lengths of black followed by

white followed by black followed by white . . .). The

sequence of lengths must be transmitted. The CCITT

facsimile coding standard [ll] bases a Huffman code

on the frequencies with which black and white runs

of different lengths occur in sample documents. A

fixed arithmetic code using these same frequencies

would give better performance: adapting the fre-

quencies to each particular document would be

better still.

Coding

arbitrarily distributed integers

is often called

for in use with more sophisticated models of text,

image, or other data. Consider, for instance, the lo-

cally adaptive data compression scheme of Bentley

et al. [2], in which the encoder and decoder cache

the last N different words seen. A word present in

the cache is transmitted by sending the integer

cache index. Words not in the cache are transmitted

by sending a new-word marker followed by the

characters of the word. This is an excellent model

for text in which words are used frequently over

short intervals and then fall into long periods of dis-

use. Their paper discusses several variable-length

codings for the integers used as cache indexes.

Arithmetic coding allows any probability distribution

to be used as the basis for a variable-length encod-

ing, including-among countless others-the ones

implied by the particular codes discussed there. It

also permits use of an adaptive model for cache in-

dexes, which is desirable if the distribution of cache

hits is difficult to predict in advance. Furthermore,

with arithmetic coding, the code spaced allotted to

the cache indexes can be scaled down to accommo-

]une 1987 Volume 30 Number 6

date any desired probability for the new-word

marker.

APPENDIX. Proof of Decoding Inequality

Using one-letter abbreviations for cum-

freq,

symbol, low, high,

and

value,

suppose

c[s] 5

(V-Z +

x c[O] -

h-I+1

< c[s - 11;

in other words,

c[s] 5

(v - I + 1) x c[O] - 1

5 c[s - l] - 1,

(1)

where

r=h-l+l, OSEIT-.

(The last inequality of Eq.

(1)

derives from the

fact that c[s -

must be an integer.) Then we

wish to show that I ’ I v I

h’,

where

1’

and

h’

are the updated values for low and

high

as defined

below.

( I :

r (v - I + 1) X c[O] - 1

CPI

[

-e

from Eq.

(l),

5v+l--

w ’

1’

5 v since both v and 1’ are integers and

c[O] > 0.

(b)

h’-‘+L’X:;“o,“J-I

z-z+T

(v - I +

1) x c[O] - 1 + 1 _ c _ 1

CPI

[

from Eq. (1)

REFERENCES

1. Abramson, N. Information Theory

and Coding.

McGraw-Hill, New

York, 1963. This textbook contains the first reference to what was to

become the method of arithmetic coding (pp. 61-62).

2. Bentley. J.L., Sleator, D.D., Tarjan, R.E., and Wei, V.K. A locally

adaptive data compression scheme. Commun.

ACM 29, 4

(Apr. 1986),

320-330.

Shows how recency effects can be incorporated explicitly

into a text compression system.

Communications of the ACM 939

Computing Practices

16.

Rissanen, J.J. Generalized Kraft inequality and arithmmatic coding.

IBM

Res. Dev.

20 (May 1976), 198-203. Another early exposition of

the idea of arithmetic coding.

17.

Rissanen. J.J. Arithmetic codings as number representations.

Acta

Polytech. Stand. Math. 31

(Dec. 1979), 44-51. Further develops arith-

metic coding as a practical technique for data representation.

18. Rissanen, J., and Langdon, G.G. Arithmetic coding.

IBM J. Res. Dev.

23, 2 (Mar. 1979). 149-162. Describes a broad class of arithmetic

codes.

3. Cleary, J.C., and Witten, I.H. A comparison of enumerative and

adaptive codes. IEEE Trans. Inf. Theory IT-30,2 (Mar. 1984). 306-315.

Demonstrates under quite general conditions that adaptive coding

outperforms the method of calculating and transmitting an exact

model of the message first.

4. Clearv. I.G.. and Witten. I.H. Data comoression using adaptive cod-

ing and’partial string matching.

IEEE T>ans. Commu~ C&-32,4

(Apr. 1984). 396-402. Presents an adaptive modeling method that

reduces a large sample of mixed-case English text to around

2.2 bits/character when arithmetically coded.

Cormack, G.V., and Horspool, R.N. Algorithms for adaptive Huffman

Lett. 18, 3 (Mar. 19841, 159-166. Describes how codes. Ir$

Process.

adaptive Huffman coding can be implemented efficiently.

Cormack, G.V., and Horspool, R.N. Data compression using dynamic

Markov modeling. Res. Rep., Computer Science Dept., Univ. of

Walerloo, Ontario, Apr. 1985. Also to be published in Cornput. 1.

Presents an adaptive state-modeling technique that, in conjunction

with arithmetic coding, produces results competitive with those

of [4].

7. Gallager, R.G. Variations on a theme by Huffman.

IEEE Trans. tnf

Theory

IT-24,

6 (Nov. 19781, 668-674. Presents an adaptive Huffman

coding algorithm, and derives new bounds on the redundancy of

Huffman codes.

8. Held,

G. Data Compression: Techniques and Applications.

Wiley, New

York, 1984. Explains a number of ad hoc techniques for compressing

text.

9. Hester, J.H., and Hirschberg, D.S. Self-organizing linear search.

ACM

Comput.

SKI-V. 17, 3 (Sept. 1985), 295-311. A general analysis of the

technique used in the present article to expedite access to an array

of dynamically changing frequency counts.

10. Huffman, D.A. A method for the construction of minimum-

redundancy codes.

Proc. Inst. Electr. Radio Eng. 40, 9

(Sept. 1952),

1098-1101. The classic paper in which Huffman introduced his

famous coding method.

11. Hunter, R., and Robinson, A.H. International digital facsimile coding

standards.

Proc. Inst. Electr. Electron. Eng. 68, 7

(July 1980), 854-867.

Describes the use of Huffman coding to compress run lengths in

black/white images.

12. Kucera, H., and Francis, W.N.

Compufational Analysis

of Present-Day

American English.

Brown University Press, Providence, R.I.. 1967.

This large corpus of English is often used for computer experiments

with text, including some of the evaluation reported in the present

article.

13. Langdon, G.G. An introduction to arithmetic coding. IBMI.

Res. Dev.

28, 2 (Mar. 1984), 135-149. Introduction to arithmetic coding from

the point of view of hardware implementation.

14.

Langdon, G.G., and Rissanen, J. Compression of black-white images

with arithmetic coding.

1EEE

Trans. Commun. COM 29,6 (June 1981),

858.-867. Uses a modeling method specially tailored to black/white

pictures, in conjunction with arithmetic coding, to achieve excel-

lent compression results.

15. Pasco, R. Source coding algorithms for fast data compression. Ph.D.

thesis, Dept. of Electrical Engineering, Stanford Univ., Stanford,

Calif., 1976. An early exposition of the idea of arithmetic coding, but

lacking the idea of incremental operation.

19.

Rissanen, J.. and Langdon, G.G. Universal modeling and coding.

IEEE

Trans.

Znf

Theory IT-27,

1 (Jan. 1981), 12-23. Shows how data corn-

pression can be separated into

modeling

for prediction and

coding

with respect to a model.

20.

Rubin, F. Arithmetic stream coding using fixed precision registers.

IEEE Trans. If. Theory IT-25, 6 (Nov.

1979), 672-675. One of the first

papers to present all the essential elements of practical arithmetic

coding, including fixed-point computation and incremental opera-

tion.

21.

Shannon, C.E.. and Weaver, W.

The Mathematical Theory of Commu-

nication.

University of Illinois Press, Urbana, Ill., 1949. A classic

book

that develops communication theory from the ground up.

22. Welch, T.A. A technique for high-performance data compression.

Computer 17,

6 (June 1984), 8-19. A very fast coding technique based

on the method of [23], but whose compression performance is poor

by the standards of [4] and [6]. An improved implementation of this

method is widely used in UNIX systems under the name

compress.

23. Ziv, J., and Lempel, A. Compression of individual sequences via

variable-rate coding.

IEEE Trans. InJ Theory IT-24, 5

(Sept. 1978),

530-536. Describes a method of text compression that works by

replacing a substring with a pointer to an earlier occurrence of the

same substring. Although it performs quite well, it does not provide

a clear separation between modeling and coding.

CR Categories and Subject Descriptors: E.4 [Data]: Coding and Infor-

mation Theory-d&a

compaction and compression;

H.l.l [Models and

Principles]: Systems and Information Theory-information

theory

General Terms: Algorithms, Performance

Additional Key Words and Phrases: Adaptive modeling. arithmetic

coding, Huffman coding

Received 8/86; accepted Z2/86

Authors’ Present Address: Ian H. Witten, Radford M. Neal, and John

G. Cleary, Dept. of Computer Science, The University of Calgary,

2500 University Drive NW, Calgary, Canada T2N lN4.

Permission to copy without fee all or part of this material IS granted

provided that the copies are not made or distributed for direct commer-

cial advantage, the ACM copyright notice and the title of the publication

and its date appear, and notice is given that copying is by permission of

the Association for Computing Machinery. To copy otherwise, or to

republish, requires a fee and/or specific permission.

In response to membership requests . . .

CURRlCUl.A RECOMMENDATIONS FOR COMPUTING;

Volume I: Curricula Recommendations for Computer Science

Volume II: Curricula Recommendations for Information Systems

Volume III: Curricula Recommendations for kelated Computer Science Programs in Vocational-

Technical Schools, Community and Junior Colleges and Health Computing

(212) 869-7440

ext.

309

Information available from Deborah Cotton-Single Copy Sales

540

Communications

the ACM

]une 1987 Volume 30 Number 6

本文发布于:2024-09-21 13:50:25，感谢您对本站的认可！

本文链接：https://www.17tex.com/fanyi/27728.html

上一篇：Shift Arithmetic

下一篇：python取模运算

标签：拼音作者

留言与评论（共有 0 条评论）