Huffman Encoding
My program uses the Huffman algorithm to compress a text string. Instead of using the English word frequencies given in the book, I wrote a program to scan through a text file to obtain these frequencies and build a Huffman code. Then, I encoded sample strings with this code. I did this on two different languages: the symbols in the English languange and the symbols in the C language.
I first scanned through a sample text file to get a nice probability table. I then used the following Huffman algorithm to build a prefix free code. Following is the algorithm used to build the Huffman code:
1. Calcualte the probability for each symbol if not
given and place into an array of tree nodes.
2. Sort the symbols by probability from greatest to
least.
3. Merge the two smallest nodes, add the
probabilities
and place parent tree back into array of nodes.
4. Since, two trees have been merged, decrease the
size of the array of nodes by one.
5. Goto step 2.
This code will build a nice Huffman tree, to build the Huffman code one must use a tree traversal algorithm using a push down stack to keep track of left and right moves. When moving left in the tree, a 0 is pushed onto the stack. When moving right in the tree, a 1 is pushed onto the stack. Upon getting to a leaf, the contents of the stack is printed.
The algorithm is as follows:
1. If the left child exist, go to it and push a
zero. Upon returning from the left child pop a
zero.
2. If both left and right child do not exist, save
the symbol and the contents of the stack to the
symbol table. The content of the stack is the
Huffman code for that symbol.
3. If a right child exists push a one onto the
stack, upon returning from a right child pop a
one off of the stack.
This algorithm can be used to build a nice Huffman code.
To encode a symbol simply look it’s code up in the Huffman code table and write it to the output.
To decode a symbol keep reading in 1’s and 0’s until a match has been found in the Huffman code table.
Limitations
Huffman’s code has two main limitations. First, It cannot see when
there is much repetition of a symbol. For example, in the following
string: aaaaaaaaaaa. One could encode it with an 11 followed by an a.
This would be more efficient than Huffman’s code. Also, if one does not
assume a set of frequencies, a code table must be sent along with the
information. With small files, this code table can be large compared to
the size of the file. For example, a file consisting of the following:
ate, has only three symbols. Sending the entire Huffman table for Ascii
symbols would make this file larger than it’s original size.
Testing of the
Algorithm
I tested my algorithm by building a code table based on the occurrences of Ascii symbols in one of my history reports and my C program. This produced two different Huffman codes. I then took sample text out of these files and compressed it line by line. After compressing a line, I showed the percentage compression. Next, I showed the text decompressed.
First, I show each symbol along with it’s Ascii Hex value and number of occurrences of that symbol. I did this because such characters a newline, tab, and Carriage return either print weirdly or not at all. For example, the first symbol in the following table does not print because it is a tab, it simply moves that line of the code table over by a few space. Carriage returns don’t print at all, so the Ascii hex value helps one to pin down exactly what symbol it is.
Conclusion
Based on the results that I got from my program, It looks like one can get pretty good compression. My code was able to compress strings about 16 to 50 percent. Huffman code is quick, which makes it good for realtime applications. However, for such applications as file compression the programmer can use a code that works slower but also gets better compression. A Huffman code is a good beginning to compression, and I have learned a great deal with this assignment.
Compression of strings using probabilities from one of my History reports.
Character and Ascii hex Value Number of occurrences
9 18
P 50 8
r 72 279
a 61 362
k 6b 32
s 73 264
h 68 302
20 1087
T 54 39
n 6e 347
d 64 222
o 6f 374
' 27 7
B 42 19
e 65 576
y 79 85
u 75 106
j 6a 13
b 62 62
i 69 366
l 6c 169
t 74 419
c 63 105
g 67 82
I 49 40
x 78 13
p 70 97
w 77 104
. 2e 76
m 6d 100
U 55 2
v 76 63
f 66 85
z 7a 2
A 41 10
, 2c 54
a 52
" 22 8
H 48 14
; 3b 5
W 57 12
E 45 5
q 71 3
R 52 4
- 2d 2
C 43 4
Y 59 1
M 4d 2
S 53 3
G 47 1
O 4f 2
F 46 1
J 4a 1
1 31 2
9 39 1
4 34 1
7 37 1
L 4c 2
3 33 2
5 35 2
% 25 4
\ 5c 2
[ 5b 4
] 5d 4
( 28 5
) 29 5
= 3d 2
0 30 1
< 3c 1
+ 2b 2
{ 7b 1
} 7d 1
? 3f 2
Character and Ascii hex Value Code String
e 65 111
r 72 1101
h 68 1100
. 2e 101111
T 54 1011101
I 49 1011100
g 67 101101
f 66 101100
n 6e 1010
l 6c 10011
y 79 100101
p 70 100100
a 61 1000
i 69 0111
o 6f 0110
( 28 0101111111
E 45 0101111110
5 35 01011111011
S 53 01011111010
; 3b 0101111100
W 57 010111101
x 78 010111100
a 0101110
m 6d 010110
w 77 010101
c 63 010100
t 74 0100
u 75 001111
, 2c 0011101
0 30 001110011111
9 39 0011100111101
< 3c 0011100111100
q 71 00111001110
' 27 0011100110
j 6a 001110010
H 48 001110001
C 43 00111000011
O 4f 001110000101
1 31 001110000100
P 50 0011100000
d 64 00110
b 62 0010111
v 76 0010110
k 6b 00101011
" 22 0010101011
+ 2b 001010101011
7 37 0010101010101
{ 7b 0010101010100
z 7a 001010101001
3 33 001010101000
F 46 0010101001111
} 7d 0010101001110
\ 5c 001010100110
= 3d 001010100101
4 34 0010101001001
Y 59 0010101001000
] 5d 00101010001
M 4d 001010100001
U 55 001010100000
L 4c 001010011111
J 4a 0010100111101
G 47 0010100111100
% 25 00101001110
[ 5b 00101001101
R 52 00101001100
9 001010010
B 42 001010001
- 2d 001010000111
? 3f 001010000110
) 29 00101000010
A 41 0010100000
s 73 00100
20 000
Starting Coding Process
Encoding String:
rajesh was here
Compressed String:
1101100000111001011100100110000001010110000010000011001111101111
String has been compressed 47 percent
Decoding String:
rajesh was here
Encoding String:
this report is good
Compressed String:
0100110001110010000011011111001000110110101000000111001000001011010110011000110
String has been compressed 48 percent
Decoding String:
this report is good
Encoding String:
i am writing this report
Compressed String:
0111000100001011000001010111010111010001111010101101000010011000111001000001101111100100011011010100
String has been compressed 48 percent
Decoding String:
i am writing this report
Encoding String:
give me an a
Compressed String:
10110101110010110111000010110111000100010100001000
String has been compressed 48 percent
Decoding String:
give me an a
Compression of strings from my C program
Character and Ascii hex Value Number of occurrences
# 23 4
i 69 240
n 6e 255
c 63 159
l 6c 66
u 75 132
d 64 85
e 65 297
20 1128
< 3c 16
s 73 107
t 74 361
o 6f 201
. 2e 41
h 68 83
> 3e 21
a 323
b 62 57
r 72 327
g 67 69
F 46 7
I 49 3
L 4c 23
E 45 7
* 2a 38
, 2c 85
; 3b 171
y 79 83
p 70 177
f 66 148
_ 5f 32
9 184
{ 7b 47
a 61 205
} 7d 47
( 28 162
) 29 162
m 6d 86
v 76 27
k 6b 10
w 77 18
[ 5b 67
] 5d 67
q 71 7
2 32 12
0 30 59
= 3d 111
" 22 72
5 35 5
6 36 2
- 2d 40
1 31 37
+ 2b 31
% 25 11
x 78 18
3 33 3
\ 5c 28
z 7a 9
j 6a 11
/ 2f 27
N 4e 10
U 55 10
C 43 9
! 21 8
O 4f 4
8 38 2
7 37 1
' 27 24
| 7c 4
& 26 2
T 54 10
P 50 6
Character and Ascii hex Value Code String
e 65 1111
f 66 11101
c 63 11100
a 1101
) 29 11001
( 28 11000
r 72 1011
C 43 101011111
U 55 101011110
k 6b 101011101
T 54 101011100
- 2d 1010110
. 2e 1010101
> 3e 10101001
N 4e 101010001
j 6a 101010000
h 68 101001
y 79 101000
, 2c 100111
d 64 100110
; 3b 10010
p 70 10001
m 6d 100001
L 4c 10000011
% 25 100000101
2 32 100000100
{ 7b 1000000
t 74 0111
9 01101
o 6f 01100
} 7d 0101111
' 27 01011101
I 49 01011100111
3 33 01011100110
P 50 0101110010
q 71 0101110001
7 37 010111000011
& 26 010111000010
# 23 01011100000
s 73 010110
a 61 01010
= 3d 010011
/ 2f 01001011
v 76 01001010
b 62 0100100
i 69 01000
\ 5c 00111111
E 45 0011111011
F 46 0011111010
! 21 0011111001
O 4f 00111110001
| 7c 00111110000
0 30 0011110
+ 2b 00111011
_ 5f 00111010
l 6c 0011100
n 6e 00110
u 75 001011
] 5d 0010101
[ 5b 0010100
g 67 0010011
< 3c 001001011
6 36 001001010111
8 38 001001010110
5 35 00100101010
z 7a 0010010100
w 77 001001001
x 78 001001000
" 22 0010001
1 31 00100001
* 2a 00100000
20 000
Starting Coding Process
Encoding String:
#include <stdio.h>
Compressed String:
010111000000100000110111000011100001011100110111100000100101101011001111001100100001100101010110100110101001
String has been compressed 25 percent
Decoding String:
#include <stdio.h>
Encoding String:
#include <stdlib.h>
Compressed String:
010111000000100000110111000011100001011100110111100000100101101011001111001100011100010000100100101010110100110101001
String has been compressed 23 percent
Decoding String:
#include <stdlib.h>
Encoding String:
#include <string.h>
Compressed String:
01011100000010000011011100001110000101110011011110000010010110101100111101101000001100010011101010110100110101001
String has been compressed 26 percent
Decoding String:
#include <string.h>
Encoding String:
FILE *out,*in;
Compressed String:
0011111010010111001111000001100111110110000010000001100001011011110011100100000010000011010010
String has been compressed 16 percent
Decoding String:
FILE *out,*in;
Encoding String:
typedef struct node *node_ptr;
Compressed String:
01111010001000111111001101111111010000101100111101100101111100011100000110011001001101111000001000000011001100100110111100111010100010111101110010
String has been compressed 39 percent
Decoding String:
typedef struct node *node_ptr;
Source Code
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
FILE *out,*in;
typedef struct node *node_ptr;
typedef struct node //Tree node//
{
char character;
int prob ; //number of occurence of the character
node_ptr left;
node_ptr right;
char *code; //huffman code stored here
} nodetype;
int encode(nodetype *array,int number);
int decode(nodetype *array,char *code,int number,FILE *out2);
void stack_push(char *string,char number);
void stack_pop(char *string);
void binary_write();
void traverse(nodetype *array,nodetype something[], char *string);
void huffman(nodetype *array,int number);
int occurence(nodetype *pointer);
int oc_cmp(const void *vp, const void *vq);
int array_pt;
void main()
{
char string[20]="";
int i,number;
nodetype array[256]={0}; //allocate space for up to 256 characters
number=occurence(array); //count numbers of each symbol in a file
fprintf(out,"Character and Ascii hex Value\tNumber of occurences\n");
for(i=0;(i<=(number-1));i++) //print out character and number of occurence
{
fprintf(out,"%c %2x\t\t\t\t %d\n",array[i].character,array[i].character ,array[i].prob,array[i].code);
}
array_pt=number-1;
qsort(array,number,sizeof(nodetype),oc_cmp); //sort symbols by weight for huffman
huffman(array,number); //build huffman tree
traverse(array,array,string); //print out huffman codes
fprintf(out,"\n\nCharacter and Ascii hex Value\tCode String\n");
for(i=0;(i<=(number-1));i++)
{
fprintf(out,"%c %2x\t\t\t\t %s\n",array[i].character,array[i].character ,array[i].code);
}
fclose(out); //close output file
encode(array,number-1); //encode strings line by line
fclose(in); //close input file
}
int decode(nodetype *array,char *code,int number,FILE *out2)
{
char string[20]="";
int test,i,j;
for(j=0;code[j]!='\0';j++)
{
if (code[j]=='0')
{
stack_push(string,'0'); //if input char = 0 push a 0
}
else
{
stack_push(string,'1'); //if input char = 1 push a 1
}
for(i=0;i<=number;i++)
{
test=strcmp(array[i].code,string); //see if string on stack is equal
//to string in huffman code
if(test==0) //if yes ouput that symbol
{
fputc(array[i].character,out2);
string[0]='\0'; //initialize stack
}
}
}
return(1);
}
int encode(nodetype *array,int number)
{
int i;
double count_char=0.0,code_length=0.0;
char string[256]="";
char c;
FILE *out2;
fclose(in);
if ((in=fopen("\\input2.txt","rt"))==NULL) //open input file
{
printf("Cannot open input file");
}
if ((out2=fopen("\\output.txt","a"))==NULL) //open output file
{
printf("Cannot open output file");
}
fprintf(out2,"\n\nStarting Coding Process\n\n");
c=fgetc(in); //get first symbol
fprintf(out2,"Encoding String:\n");
fputc(c,out2);
while(c!=EOF) //output first symbol
{
while(c!='\n') //code until end of line
{
for(i=0;(array[i].character!=c);i++);
code_length+=strlen(array[i].code); //calculate lenght of string
count_char++; //increment number of characters coded
strncat(string,array[i].code,code_length+1); //store string for output
c=fgetc(in); //output character
fputc(c,out2);
}
fprintf(out2,"\n");
fprintf(out2,"Compressed String:\n %s\n",string);
fprintf(out2,"\nString has been compressed %1.2lf percent\n",1.0-(code_length/8/count_char));
fprintf(out2,"\nDecoding String:\n");
decode(array,string,number,out2); //decode string just encoded
fprintf(out2,"\n\n\n\n\nEncoding String:\n");
fprintf(out2," ");
string[0]='\0'; //initialize string variable
count_char=0.0; //initialize count
code_length=0.0; //intialize length
c=fgetc(in); //get next symbol
fputc(c,out2); //print character
}
fclose(in); //close files
fclose(out2);
return(1);
}
void traverse(nodetype *T, nodetype P[256], char *string)
{
int i;
if (T!=NULL)
{
if (T->left)
{
stack_push(string,'0'); //traverse left push 0
traverse(T->left,P,string); // go to left child
stack_pop(string); //return from left pop 0
}
if(!(T->left||T->right))
{
//fprintf(out,"%x %s\n",T->character, string);
i=strlen(string); //calculate length of code
P[array_pt].code=(char *)calloc(i+1,sizeof(char));//allocate space for code
strcpy(P[array_pt].code,string); //copy code into node
P[array_pt].character=T->character;//store char into node
array_pt--; //decrement pointer
}
if (T->right)
{
stack_push(string,'1'); //traverse right push 1
traverse(T->right,P,string);//traverse right
stack_pop(string); //return from right pop1
}
}
}
void stack_push(char *string,char number)
{
char *p;
p=strchr(string,'\0');//this code pushes a symbol onto stack
*p=number;
++p;
*p='\0';
}
void stack_pop(char *string)
{
char *p;
p=strchr(string,'\0');//this code pops a symbol from the stack
--p;
*p='\0';
}
void huffman(nodetype *array,int number)
{
nodetype *left;//this code builds a huffman tree.
nodetype *right;
number--;
while(number!=0)
{
left=(nodetype *)calloc(1,sizeof(nodetype)); //allocate space
right=(nodetype *)calloc(1,sizeof(nodetype));//allocate space
left->left=array[number-1].left; //copy node into left
left->right=array[number-1].right; //copy node into left
right->left=array[number].left; //copy node into right
right->right=array[number].right; // copy node into right
left->character=array[number-1].character; //copy char into left
left->prob=array[number-1].prob; //copy probability into left
right->character=array[number].character; //copy char into right
right->prob=array[number].prob; //copy probablity into right
array[number-1].left=left; //update pointers
array[number-1].right=right;
array[number-1].prob+=array[number].prob;
number--;
qsort(array,number+1,sizeof(nodetype),oc_cmp); //sort nodes
}
}
int oc_cmp(const void *vp, const void *vq) //this is a sorting function for
{ //for qsort
const nodetype *p=vp;
const nodetype *q=vq;
int diff;
diff = (q->prob) - (p->prob);
return (diff);
}
int occurence(nodetype *array)
{
char temp;
int num_char=0,i,flag=1;
if ((out=fopen("\\output.txt","wt"))==NULL) //open output file
{
printf("Cannot open output file");
}
if ((in=fopen("\\input.txt","rt"))==NULL) //open input file
{
printf("Cannot open input file");
}
temp=fgetc(in); //get first symbole
while(temp!=EOF) //do until end of file
{ //search array of symbols
for(i=0;(i<num_char);i++)
{
if (array[i].character==temp)
{
flag=0;
++array[i].prob; //if in array update frequency
}
}
if(flag)
{
++num_char;
array[num_char-1].character=temp; //add to array
array[num_char-1].prob++; //increment probabilty
}
flag=1;
temp=fgetc(in);
}
fclose(in);
return (num_char);
}
.