DATA COMPRESSION ALGORITHMS OF ARC, PKZIP, AND LHARC
            ====================================================


     Much of the typical  modem  user's  online  time  is  spent performing
 uploads or  downloads of files from BBS's, Online Services like Compuserve
 or GEnie, or Information  Networks like  Usenet or  Internet.   Given that
 this  always  takes  up  a  lot  of time, and usually costs a considerable
 amount of money, the need to  shorten the  time necessary  to perform file
 transfers, and  other modem  applications has  always been prevalent.  One
 innovation in this field has been  the development  of advanced Algorithms
 for compacting,  or compressing  data so  it takes up much less space, and
 packing multiple files into one Archive, or data  file, so  many files can
 be sent at one time.

     The current  technology, an  offspring of data encryption methods used
 in World War II, reduces the time it  takes to  transfer a  file through a
 modem, by  reducing the  size of the data itself.  Given the proliferation
 of many data compression methods (ARC, PKZIP, ZOO,  SIT, and  LHARC, for a
 few  examples)  that  try  to  provide  the  most efficient method of data
 compression, the topic has always been controversial in nature.

     Haruhiko Okumura provided  a  great  source  of  knowledge  about data
 compression algorithms  by writing this essay, which describes some of the
 effort involved in creating  a  data  compression  standard.    Except for
 modifications in its formatting, or presentation, and various notes placed
 in this text to provide more information on certain subjects,  the content
 of Haruhiko Okumura's text is identical....


 Introduction:  History of LHARC's Forefathers
 ---------------------------------------------

     In the  spring of 1988, I wrote a very simple data compression program
 named LZSS in C language, and uploaded it  to the  Science SIG  (forum) of
 PC-VAN, Japan's biggest personal computer network.  That program was based
 on Storer and Szymanski's slightly modified version of  one of  Lempel and
 Ziv's algorithms.   Despite its simplicity, for most files its compression
 outperformed the archivers then widely used.

     Kazuhiko Miki rewrote my LZSS in  Turbo Pascal  and assembly language,
 and soon made it evolve into a complete archiver, which he named LARC. The
 first versions of LZSS and LARC were rather  slow.   So I  rewrote my LZSS
 using a binary tree, and so did Miki.  Although LARC's encoding was slower
 than the fastest archiver available, its decoding was quite fast,  and its
 algorithm was  so simple that even self-extracting files (compressed files
 plus decoder) it created  were  usually  smaller  than non-self-extracting
 files from other archivers. 

     Soon many  hobby programmers joined the archiver project at the forum.
 Very many suggestions were made, and LARC was revised again and again.  By
 the summer  of 1988,  LARC's speed  and compression  have improved so much
 that LARC-compressed programs were beginning to be uploaded in many forums
 of PC-VAN and other networks. 

     In that summer I wrote another program, LZARI, which combined the LZSS
 algorithm with adaptive arithmetic  compression.   Although it  was slower
 than LZSS,  its compression  performance was amazing.  Miki, the author of
 LARC, uploaded LZARI to  NIFTY-Serve, another  big information  network in
 Japan.    In  NIFTY-Serve,  Haruyasu  Yoshizaki  replaced LZARI's adaptive
 arithmetic coding with a  version of  adaptive Huffman  coding to increase
 speed.   Based on  this algorithm, which he called LZHUF, he developed yet
 another archiver, LHarc. 


 Data Compression Algorithms, Lempel-Ziv, and ARC.TTP
 ----------------------------------------------------

       In what follows, I will review several of these algorithms and
 supply simplified codes in C language.


 1.  RLL Encoding

     Replacing several  (usually 8  or 4)  "space" characters  by one "tab"
 character is a very primitive method for data compression.  Another simple
 method is Run-Length coding  , which  encodes the  message "AAABBBBAACCCC"
 into "3A4B2A4C", for example.


 2. LZSS coding

     This scheme  is initiated  by Ziv and Lempel [1].  A slightly modified
 version is described by Storer and Szymanski [2].  An implementation using
 a binary  tree is  proposed by  Bell [3].   The algorithm is quite simple:
 Keep a ring buffer,  which  initially  contains  "space"  characters only.
 Read several  letters from the file to the buffer.  Then search the buffer
 for the longest string that matches  the letters  just read,  and send its
 length and position in the buffer. 

     If the  buffer size  is 4096  bytes, the position can be encoded in 12
 bits.  If we represent the  match  length  in  four  bits,  the <position,
 length> pair  is two bytes long.  If the longest match is no more than two
 characters, then we send just one character without  encoding, and restart
 the process with the next letter.  We must send one extra bit each time to
 tell the decoder whether we are  sending a  <position, length>  pair or an
 unencoded character. 


 3. LZW coding

     This scheme  was devised  by Ziv and Lempel [4], and modified by Welch
 [5].  The LZW coding has been adopted by most  of the  existing archivers,
 such as  ARC and PKZIP.  The algorithm can be made relatively fast, and is
 suitable for hardware implementation as well.  A  Pascal program  for this
 algorithm is given in Storer's book [6]. 


     The algorithm  can be  outlined as  follows: Prepare  a table that can
 contain several  thousand items.   Initially  register in  its 0th through
 255th positions  the usual  256 characters.  Read several letters from the
 file to be encoded, and search the table for the  longest match.   Suppose
 the longest  match is  given by  the string  "ABC".   Send the position of
 "ABC" in the table.  Read the next character from the file.  If it is "D",
 then register  a new  string "ABCD"  in the table, and restart the process
 with the letter "D".  If the table becomes full,  discard the  oldest item
 or, preferably, the least used. 


 4. Huffman coding

     Classical  Huffman  coding  is  invented  by  Huffman  [7].   A fairly
 readable account is given in  Sedgewick  [8].    Suppose  the  text  to be
 encoded is  "ABABACA", with four A's, two B's, and a C.  We represent this
 situation as follows:

                                 4    2    1
                                 |    |    |
                                 A    B    C

     Combine the least frequent two characters  into one,  resulting in the
 new frequency 2 + 1 = 3:

                                 4      3
                                 |     /  \
                                 A    B    C

     Repeat the above step until the whole characters combine into a tree:

                                     7
                                   /  \
                                  /     3
                                 /    /  \
                                A    B    C

     Start at  the top  ("root") of  this encoding  tree, and travel to the
 character you want to encode.  If you go left, send a  "0"; otherwise send
 a "1". Thus, "A" is encoded by "0", "B" by "10", "C" by "11".  Altogether,
 "ABABACA" will be encoded  into ten  bits, "0100100110".   To  decode this
 code,  the  decoder  must  know  the  encoding  tree,  which  must be sent
 separately.

     A modification to this  classical Huffman  coding is  the adaptive, or
 dynamic, Huffman  coding.   See, e.g.,  Gallager [9].  In this method, the
 encoder and the decoder processes the first letter of the  text as  if the
 frequency of  each character  in the  file were one, say.  After the first
 letter has been processed,  both parties  increment the  frequency of that
 character by  one.   For example,  if the  first letter  is 'C', then freq
 ['C'] becomes two, whereas every other  frequencies are  still one.   Then
 the both  parties modify  the encoding  tree accordingly.  Then the second
 letter will be encoded and decoded, and so on.


 5. Arithmetic coding

     The original concept of arithmetic coding is proposed by P. Elias.  An
 implementation in C language is described by Witten and others [10]. 

     Although  the  Huffman  coding  is  optimal  if each character must be
 encoded into a fixed (integer) number of bits,  arithmetic coding  wins if
 no such restriction is made.

     As an  example we  shall encode  "AABA" using  arithmetic coding.  For
 simplicity suppose we know beforehand that  the probabilities  for "A" and
 B" to appear in the text are 3/4 and 1/4, respectively.

     Initially, consider an interval:

                             0 <= x < 1.

     Since the  first character  is "A" whose probability is 3/4, we shrink
 the interval to the lower 3/4:

                             0 <= x < 3/4.

     The next character is "A" again, so we take the lower 3/4:

                             0 <= x < 9/16.

     Next comes "B" whose probability is 1/4, so we take the upper 1/4:

                             27/64 <= x < 9/16,

     Because "B" is the second element in our alphabet, {A, B}.  The last
 character is "A" and the interval is

                             27/64 <= x < 135/256,

     which can be written in binary notation

                             0.011011 <= x < 0.10000111.

     Choose from this interval any number that can be represented in fewest
 bits, say  0.1, and  send the  bits to  the right of "0."; in this case we
 send only one bit, "1".  Thus we have encoded  four letters  into one bit!
 With the  Huffman coding, four letters could not be encoded into less than
 four bits. 

     To decode the code "1", we just reverse the process:  First, we supply
 the "0."  to the  right of  the received  code "1",  resulting in "0.1" in
 binary notation, or 1/2.  Since  this number  is in  the first  3/4 of the
 initial interval  0 <= x < 1, the first character must be "A".  Shrink the
 interval into the lower 3/4.  In this new interval, the number 1/2 lies in
 the lower  3/4 part, so the second character is again "A", and so on.  The
 number of letters in the original  file  must  be  sent  separately  (or a
 special 'EOF' character must be appended at the end of the file).

     The  algorithm  described  above  requires  that  both  the sender and
 receiver know  the  probability  distribution  for  the  characters.   The
 adaptive  version  of  the  algorithm  removes  this  restriction by first
 supposing uniform  or  any  agreed-upon  distribution  of  characters that
 approximates  the  true  distribution,  and then updating the distribution
 after each character is sent and received.


 6. LZARI

     In each  step  the  LZSS  algorithm  sends  either  a  character  or a
 <position, length>  pair.  Among these, perhaps character "e" appears more
 frequently than "x", and a <position,  length> pair  of length  3 might be
 commoner than one of length 18, say.  Thus, if we encode the more frequent
 in fewer bits and the less frequent in more bits, the total  length of the
 encoded text  will be diminished.  This consideration suggests that we use
 Huffman or arithmetic coding, preferably of  an adaptive  kind, along with
 LZSS.   This is  easier said  than done,  because there  are many possible
 <position, length> combinations.   Adaptive compression  must keep running
 statistics  of  frequency  distribution.    Too many items make statistics
 unreliable.


           LZARI, and the Creation of a Data Compression Program
           -----------------------------------------------------

     What follows is not even an approximate solution to the  problem posed
 above, but anyway this was what I did in the summer of 1988. 

     I extended  the character set from 256 to three-hundred or so in size,
 and let characters 0 through 255  be the  usual 8-bit  characters, whereas
 characters 253  + n represent that what follows is a position of string of
 length n, where n = 3, 4 , ....  These extended set of  characters will be
 encoded with adaptive arithmetic compression.

     I also  observed that  longest-match strings  tend to be the ones that
 were read relatively recently.    Therefore,  recent  positions  should be
 encoded into  fewer bits.   Since  4096 positions  are too  many to encode
 adaptively,   I fixed  the probability  distribution of  the positions "by
 hand".  The  distribution  function  given  in the accompanying LZARI.C is
 rather tentative;  it  is  not  based  on  thorough  experimentation.   In
 retrospect, I could encode adaptively the most significant 6 bits, say, or
 perhaps  by  some  more  ingenious  method  adapt  the  parameters  of the
 distribution function to the running statistics.

     At any  rate, the present version of LZARI treats the positions rather
 separately, so that  the  overall  compression  is  by  no  means optimal.
 Furthermore,  the  string  length  threshold above which strings are coded
 into <position, length> pairs  is  fixed,  but  logically  its  value must
 change according  to the  length of  the <position,  length> pair we would
 get.


 7. LZHUF

     LZHUF, the algorithm of Haruyasu Yoshizaki's  archiver LHarc, replaces
 LZARI's adaptive  arithmetic coding  with adaptive Huffman.  LZHUF encodes
 the most significant 6  bits of  the position  in its  4096-byte buffer by
 table lookup.   More  recent, and hence more probable, positions are coded
 in less bits.  On the other hand, the remaining 6 bits are sent verbatim.

     Because Huffman coding encodes each  letter  into  a  fixed  number of
 bits,  table  lookup  can  be  easily  implemented.   Though theoretically
 Huffman cannot  exceed  arithmetic  compression,  the  difference  is very
 slight, and LZHUF is fairly fast. 


 References:
 -----------

   [1] J. Ziv and A. Lempel, IEEE Trans. IT-23, 337-343 (1977).

   [2] J. A. Storer and T. G. Szymanski, J. ACM, 29, 928-951
       (1982).

   [3] T. C. Bell, IEEE Trans. COM-34, 1176-1182 (1986).

   [4] J. Ziv and A. Lempel, IEEE Trans. IT-24, 530-536 (1978).

   [5] T. A. Welch, Computer, 17, No.6, 8-19 (1984).

   [6] J. A. Storer, Data Compression: Methods and Theory
       (Computer Science Press, 1988).

   [7] D. A. Huffman, Proc IRE 40, 1098-1101 (1952).

   [8] R. Sedgewick, Algorithms, 2nd ed. (Addison-Wesley, 1988).

   [9] R. G. Gallager, IEEE Trans. IT-24, 668-674 (1978).

  [10] I. E. Witten, R. M. Neal, and J. G. Cleary, Commun. ACM
       30, 520-540 (1987).
         ________________________________________________________

                  Utdrag ur ST Report av den 15 mars 1991

                     29 mars 1991 - Ghlenn Willard 6929