Links
Home
Photos
Donut Rules
Guestbook
Drunken Alien Hunters
Links

    Huffman Encoding

My program uses the Huffman algorithm to compress a text string.  Instead of using the English word frequencies given in the book, I wrote a program to scan through a text file to obtain these frequencies and build a Huffman code.  Then, I encoded sample strings with this code.  I did this on two different languages: the symbols in the English languange and the symbols in the C language.

 

     I first scanned through a sample text file to get a nice probability table.  I then used the following Huffman algorithm to build a prefix free code.  Following is the algorithm used to build the Huffman code:

1. Calcualte the probability for each symbol if not

   given and place into an array of tree nodes.

2. Sort the symbols by probability from greatest to

   least.

3. Merge the two smallest nodes, add the

   probabilities

and place parent tree back into array of nodes.

4. Since, two trees have been merged, decrease the

   size of the array of nodes by one.

5. Goto step 2.

This code will build a nice Huffman tree, to build the Huffman code one must use a tree traversal algorithm using a push down stack to keep track of left and right moves.  When moving left in the tree, a 0 is pushed onto the stack.  When moving right in the tree, a 1 is pushed onto the stack. Upon getting to a leaf, the contents of the stack is printed.

The algorithm is as follows:

1.  If the left child exist, go to it and push a

    zero.  Upon returning from the left child pop a

    zero.

 

2.  If both left and right child do not exist, save

    the symbol and the contents of the stack to the

    symbol table.  The content of the stack is the

    Huffman code for that symbol.

 

3.  If a right child exists push a one onto the

    stack, upon returning from a right child pop a

    one off of the stack.

This algorithm can be used to build a nice Huffman code.

 

To encode a symbol simply look it’s code up in the Huffman code table and write it to the output.

 

To decode a symbol keep reading in 1’s and 0’s until a match has been found in the Huffman code table.


 

Limitations

    Huffman’s code has two main limitations.  First, It cannot see when there is much repetition of a symbol.  For example, in the following string: aaaaaaaaaaa. One could encode it with an 11 followed by an a.  This would be more efficient than Huffman’s code.  Also, if one does not assume a set of frequencies, a code table must be sent along with the information.  With small files, this code table can be large compared to the size of the file.  For example, a file consisting of the following: ate, has only three symbols.  Sending the entire Huffman table for Ascii symbols would make this file larger than it’s original size.
Testing of the Algorithm

     I tested my algorithm by building a code table based on the occurrences of Ascii symbols in one of my history reports and my C program.  This produced two different Huffman codes.  I then took sample text out of these files and compressed it line by line.  After compressing a line, I showed the percentage compression.  Next, I showed the text decompressed. 

 

     First, I show each symbol along with it’s Ascii Hex value and number of occurrences of that symbol.  I did this because such characters a newline, tab, and Carriage return either print weirdly or not at all.  For example, the first symbol in the following table does not print because it is a tab, it simply moves that line of the code table over by a few space.  Carriage returns don’t print at all, so the Ascii hex value helps one to pin down exactly what symbol it is.

    
Conclusion

     Based on the results that I got from my program, It looks like one can get pretty good compression.  My code was able to compress strings about 16 to 50 percent.  Huffman code is quick, which makes it good for realtime applications.  However, for such applications as file compression the programmer can use a code that works slower but also gets better compression.  A Huffman code is a good beginning to compression, and I have learned a great deal with this assignment.

 


 

Compression of strings using probabilities from one of my History reports.

Character and Ascii hex Value Number of occurrences

         9                     18

P   50                    8

r   72                    279

a   61                    362

k   6b                    32

s   73                    264

h   68                    302

    20                    1087

T   54                    39

n   6e                    347

d   64                    222

o   6f                    374

'   27                    7

B   42                    19

e   65                    576

y   79                    85

u   75                    106

j   6a                    13

b   62                    62

i   69                    366

l   6c                    169

t   74                    419

c   63                    105

g   67                    82

I   49                    40

x   78                    13

p   70                    97

w   77                    104

.   2e                    76

m   6d                    100

U   55                    2

v   76                    63

f   66                    85

z   7a                    2

A   41                    10

,   2c                    54

 

    a                     52

"   22                    8

H   48                    14

;   3b                    5

W   57                    12

E   45                    5

q   71                    3

R   52                    4

-   2d                    2

C   43                    4

Y   59                    1

M   4d                    2

S   53                    3

G   47                    1

O   4f                    2

F   46                    1

J   4a                    1

1   31                    2

9   39                    1

4   34                    1

7   37                    1

L   4c                    2

3   33                    2

5   35                    2

%   25                    4

\   5c                    2

[   5b                    4

]   5d                    4

(   28                    5

)   29                    5

=   3d                    2

0   30                    1

<   3c                    1

+   2b                    2

{   7b                    1

}   7d                    1

?   3f                    2

 

 

Character and Ascii hex Value     Code String

e   65                    111

r   72                    1101

h   68                    1100

.   2e                    101111

T   54                    1011101

I   49                    1011100

g   67                    101101

f   66                    101100

n   6e                    1010

l   6c                    10011

y   79                    100101

p   70                    100100

a   61                    1000

i   69                    0111

o   6f                    0110

(   28                    0101111111

E   45                    0101111110

5   35                    01011111011

S   53                    01011111010

;   3b                    0101111100

W   57                    010111101

x   78                    010111100

 

    a                     0101110

m   6d                    010110

w   77                    010101

c   63                    010100

t   74                    0100

u   75                    001111

,   2c                    0011101

0   30                    001110011111

9   39                    0011100111101

<   3c                    0011100111100

q   71                    00111001110

'   27                    0011100110

j   6a                    001110010

H   48                    001110001

C   43                    00111000011

O   4f                    001110000101

1   31                    001110000100

P   50                    0011100000

d   64                    00110

b   62                    0010111

v   76                    0010110

k   6b                    00101011

"   22                    0010101011

+   2b                    001010101011

7   37                    0010101010101

{   7b                    0010101010100

z   7a                    001010101001

3   33                    001010101000

F   46                    0010101001111

}   7d                    0010101001110

\   5c                    001010100110

=   3d                    001010100101

4   34                    0010101001001

Y   59                    0010101001000

]   5d                    00101010001

M   4d                    001010100001

U   55                    001010100000

L   4c                    001010011111

J   4a                    0010100111101

G   47                    0010100111100

%   25                    00101001110

[   5b                    00101001101

R   52                    00101001100

         9                     001010010

B   42                    001010001

-   2d                    001010000111

?   3f                    001010000110

)   29                    00101000010

A   41                    0010100000

s   73                    00100

    20                    000

 

 

Starting Coding Process

 

Encoding String:

rajesh was here

 

Compressed String:

 1101100000111001011100100110000001010110000010000011001111101111

 

String has been compressed 47 percent

 

Decoding String:

rajesh was here

 

 

 

 

Encoding String:

 this report is good

 

Compressed String:

 0100110001110010000011011111001000110110101000000111001000001011010110011000110

 

String has been compressed 48 percent

 

Decoding String:

this report is good

 

 

 

 

Encoding String:

 i am writing this report

 

Compressed String:

 0111000100001011000001010111010111010001111010101101000010011000111001000001101111100100011011010100

 

String has been compressed 48 percent

 

Decoding String:

i am writing this report

 

 

 

 

Encoding String:

 give me an a

 

Compressed String:

 10110101110010110111000010110111000100010100001000

 

String has been compressed 48 percent

 

Decoding String:

give me an a

 


 

 

Compression of strings from my C program

Character and Ascii hex Value     Number of occurrences

#   23                    4

i   69                    240

n   6e                    255

c   63                    159

l   6c                    66

u   75                    132

d   64                    85

e   65                    297

    20                    1128

<   3c                    16

s   73                    107

t   74                    361

o   6f                    201

.   2e                    41

h   68                    83

>   3e                    21

 

    a                     323

b   62                    57

r   72                    327

g   67                    69

F   46                    7

I   49                    3

L   4c                    23

E   45                    7

*   2a                    38

,   2c                    85

;   3b                    171

y   79                    83

p   70                    177

f   66                    148

_   5f                    32

         9                     184

{   7b                    47

a   61                    205

}   7d                   47

(   28                    162

)   29                    162

m   6d                    86

v   76                    27

k   6b                    10

w   77                    18

[   5b                    67

]   5d                    67

q   71                    7

2   32                    12

0   30                    59

=   3d                    111

"   22                    72

5   35                    5

6   36                    2

-   2d                    40

1   31                    37

+   2b                    31

%   25                    11

x   78                    18

3   33                    3

\   5c                    28

z   7a                    9

j   6a                    11

/   2f                    27

N   4e                    10

U   55                    10

C   43                    9

!   21                    8

O   4f                    4

8   38                    2

7   37                    1

'   27                    24

|   7c                    4

&   26                    2

T   54                    10

P   50                    6

 

 

Character and Ascii hex Value       Code String

e   65                    1111

f   66                    11101

c   63                    11100

 

    a                     1101

)   29                    11001

(   28                    11000

r   72                    1011

C   43                    101011111

U   55                    101011110

k   6b                    101011101

T   54                    101011100

-   2d                    1010110

.   2e                    1010101

>   3e                    10101001

N   4e                    101010001

j   6a                    101010000

h   68                    101001

y   79                    101000

,   2c                    100111

d   64                    100110

;   3b                    10010

p   70                    10001

m   6d                    100001

L   4c                    10000011

%   25                    100000101

2   32                    100000100

{   7b                    1000000

t   74                    0111

         9                     01101

o   6f                    01100

}   7d                    0101111

'   27                    01011101

I   49                    01011100111

3   33                    01011100110

P   50                    0101110010

q   71                    0101110001

7   37                    010111000011

&   26                    010111000010

#   23                    01011100000

s   73                    010110

a   61                    01010

=   3d                    010011

/   2f                    01001011

v   76                    01001010

b   62                    0100100

i   69                    01000

\   5c                   00111111

E   45                    0011111011

F   46                    0011111010

!   21                    0011111001

O   4f                    00111110001

|   7c                    00111110000

0   30                    0011110

+   2b                    00111011

_   5f                    00111010

l   6c                    0011100

n   6e                    00110

u   75                    001011

]   5d                    0010101

[   5b                    0010100

g   67                    0010011

<   3c                    001001011

6   36                    001001010111

8   38                    001001010110

5   35                    00100101010

z   7a                    0010010100

w   77                    001001001

x   78                    001001000

"   22                    0010001

1   31                    00100001

*   2a                    00100000

    20                    000

 

 

Starting Coding Process

 

Encoding String:

#include <stdio.h>

 

Compressed String:

 010111000000100000110111000011100001011100110111100000100101101011001111001100100001100101010110100110101001

 

String has been compressed 25 percent

 

Decoding String:

#include <stdio.h>

 

 

 

 

Encoding String:

 #include <stdlib.h>

 

Compressed String:

 010111000000100000110111000011100001011100110111100000100101101011001111001100011100010000100100101010110100110101001

 

String has been compressed 23 percent

 

Decoding String:

#include <stdlib.h>

 

 

 

 

Encoding String:

 #include <string.h>

 

Compressed String:

 01011100000010000011011100001110000101110011011110000010010110101100111101101000001100010011101010110100110101001

 

String has been compressed 26 percent

 

Decoding String:

#include <string.h>

 

 

 

 

Encoding String:

 FILE *out,*in;

 

Compressed String:

 0011111010010111001111000001100111110110000010000001100001011011110011100100000010000011010010

 

String has been compressed 16 percent

 

Decoding String:

FILE *out,*in;

 

 

 

 

Encoding String:

 typedef struct node *node_ptr;

 

Compressed String:

 01111010001000111111001101111111010000101100111101100101111100011100000110011001001101111000001000000011001100100110111100111010100010111101110010

 

String has been compressed 39 percent

 

Decoding String:

typedef struct node *node_ptr;

 

 

Source Code

 

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

FILE *out,*in;

typedef struct node *node_ptr;

typedef struct node           //Tree node//

       {

       char character;

       int  prob ;          //number of occurence of the character

       node_ptr left;

       node_ptr right;

       char *code;          //huffman code stored here

       } nodetype;

 

int  encode(nodetype *array,int number);

int  decode(nodetype *array,char *code,int number,FILE *out2);

void stack_push(char *string,char number);

void stack_pop(char *string);

void binary_write();

void traverse(nodetype *array,nodetype something[], char *string);

void huffman(nodetype *array,int number);

int occurence(nodetype *pointer);

int oc_cmp(const void *vp, const void *vq);

int array_pt;

void main()

{

     char string[20]="";

     int i,number;

     nodetype array[256]={0}; //allocate space for up to 256 characters

     number=occurence(array); //count numbers of each symbol in a file

     fprintf(out,"Character and Ascii hex Value\tNumber of occurences\n");

     for(i=0;(i<=(number-1));i++)   //print out character and number of occurence

       {

       fprintf(out,"%c   %2x\t\t\t\t  %d\n",array[i].character,array[i].character ,array[i].prob,array[i].code);

       }

     array_pt=number-1;

     qsort(array,number,sizeof(nodetype),oc_cmp); //sort symbols by weight for huffman

 

     huffman(array,number);  //build huffman tree

 

 

     traverse(array,array,string);  //print out huffman codes

 

 

     fprintf(out,"\n\nCharacter and Ascii hex Value\tCode String\n");

     for(i=0;(i<=(number-1));i++)

       {

       fprintf(out,"%c   %2x\t\t\t\t  %s\n",array[i].character,array[i].character ,array[i].code);

       }

     fclose(out); //close output file

     encode(array,number-1); //encode strings line by line

     fclose(in);             //close input file

 

 

}

 

 

 

int  decode(nodetype *array,char *code,int number,FILE *out2)

     {

 

     char string[20]="";

     int test,i,j;

 

 

 

 

     for(j=0;code[j]!='\0';j++)

     {

     if (code[j]=='0')

       {

       stack_push(string,'0');   //if input char = 0 push a 0

       }

     else

       {

       stack_push(string,'1');   //if input  char = 1 push a 1

       }

     for(i=0;i<=number;i++)

       {

       test=strcmp(array[i].code,string);  //see if string on stack is equal

                                      //to string in huffman code

       if(test==0)                         //if yes ouput that symbol

         {

         fputc(array[i].character,out2);

         string[0]='\0';                  //initialize stack

         }

        }

 

     }

 

 

     return(1);

     }

int  encode(nodetype *array,int number)

     {

     int i;

     double count_char=0.0,code_length=0.0;

     char string[256]="";

     char c;

     FILE *out2;

     fclose(in);

     if ((in=fopen("\\input2.txt","rt"))==NULL) //open input file

           {

           printf("Cannot open input file");

           }

     if ((out2=fopen("\\output.txt","a"))==NULL) //open output file

           {

           printf("Cannot open output file");

           }

     fprintf(out2,"\n\nStarting Coding Process\n\n");

     c=fgetc(in);                    //get first symbol

     fprintf(out2,"Encoding String:\n");

     fputc(c,out2);

     while(c!=EOF)                   //output first symbol

       {

 

 

       while(c!='\n')              //code until end of line

          {

          for(i=0;(array[i].character!=c);i++);

          code_length+=strlen(array[i].code);    //calculate lenght of string

          count_char++;                          //increment number of characters coded

          strncat(string,array[i].code,code_length+1); //store string for output

          c=fgetc(in);                    //output character

          fputc(c,out2);

          }

       fprintf(out2,"\n");

       fprintf(out2,"Compressed String:\n %s\n",string);

       fprintf(out2,"\nString has been compressed %1.2lf percent\n",1.0-(code_length/8/count_char));

       fprintf(out2,"\nDecoding String:\n");

       decode(array,string,number,out2);     //decode string just encoded

 

       fprintf(out2,"\n\n\n\n\nEncoding String:\n");

       fprintf(out2," ");

       string[0]='\0';  //initialize string variable

       count_char=0.0;  //initialize count

       code_length=0.0; //intialize length

       c=fgetc(in);    //get next symbol

       fputc(c,out2); //print character

       }

     fclose(in);       //close files

     fclose(out2);

     return(1);

     }

void traverse(nodetype *T, nodetype P[256], char *string)

       {

       int i;

 

       if (T!=NULL)

       {

          if (T->left)

             {

             stack_push(string,'0');      //traverse left push 0

             traverse(T->left,P,string); // go to left child

             stack_pop(string);          //return from left pop 0

             }

 

          if(!(T->left||T->right))

            {

            //fprintf(out,"%x  %s\n",T->character, string);

            i=strlen(string); //calculate length of code

            P[array_pt].code=(char *)calloc(i+1,sizeof(char));//allocate space for code

 

            strcpy(P[array_pt].code,string);   //copy code into node

            P[array_pt].character=T->character;//store char into node

            array_pt--;                        //decrement pointer

            }

 

          if (T->right)

             {

             stack_push(string,'1');     //traverse right push 1

             traverse(T->right,P,string);//traverse right

             stack_pop(string);         //return from right pop1

             }

 

       }

       }

 

void stack_push(char *string,char number)

       {

       char *p;

       p=strchr(string,'\0');//this code pushes a symbol onto stack

       *p=number;

       ++p;

       *p='\0';

       }

void stack_pop(char *string)

       {

       char *p;

       p=strchr(string,'\0');//this code pops a symbol from the stack

       --p;

       *p='\0';

       }

 

 

void huffman(nodetype *array,int number)

    {

       nodetype *left;//this code builds a huffman tree.

       nodetype *right;

       number--;

 

       while(number!=0)

            {

 

            left=(nodetype *)calloc(1,sizeof(nodetype)); //allocate space

            right=(nodetype *)calloc(1,sizeof(nodetype));//allocate space

            left->left=array[number-1].left;             //copy node into left

            left->right=array[number-1].right;           //copy node into left

            right->left=array[number].left;              //copy node into right

            right->right=array[number].right;           // copy node into right

            left->character=array[number-1].character;  //copy char into left

            left->prob=array[number-1].prob;            //copy probability into left

 

            right->character=array[number].character;  //copy char into right

            right->prob=array[number].prob;           //copy probablity into right

            array[number-1].left=left;                   //update pointers

            array[number-1].right=right;

            array[number-1].prob+=array[number].prob;

            number--;

            qsort(array,number+1,sizeof(nodetype),oc_cmp); //sort nodes

            }

 

    }

int oc_cmp(const void *vp, const void *vq) //this is a sorting function for

       {                            //for qsort

       const nodetype *p=vp;

       const nodetype *q=vq;

       int diff;

       diff = (q->prob) - (p->prob);

       return (diff);

       }

int occurence(nodetype *array)

{

 

     char temp;

     int num_char=0,i,flag=1;

     if ((out=fopen("\\output.txt","wt"))==NULL) //open output file

           {

           printf("Cannot open output file");

           }

     if ((in=fopen("\\input.txt","rt"))==NULL)  //open input file

           {

           printf("Cannot open input file");

           }

     temp=fgetc(in);                           //get first symbole

     while(temp!=EOF)                          //do until end of file

       {                                     //search array of symbols

       for(i=0;(i<num_char);i++)

          {

          if (array[i].character==temp)

              {

              flag=0;

              ++array[i].prob;           //if in array update frequency

              }

          }

       if(flag)

              {

              ++num_char;

              array[num_char-1].character=temp;  //add to array

              array[num_char-1].prob++;          //increment probabilty

              }

       flag=1;

       temp=fgetc(in);

       }

 

     fclose(in);

     return (num_char);

 

}

.