To build a Huffman tree, we begin by reading a file. Assuming the name of the file is in args[0], you can set up a Scanner to read it via:
    // Needs java.util.Scanner, java.io.File and java.io.FileNotFoundException imported.
    Scanner scan = null;
    try {
        scan = new Scanner(new File(args[0]));
    } catch (FileNotFoundException fnfe) {
        System.err.println("No such file");
        System.exit(-1);
    }
Next, set up a data structure for counting the occurrences of each character. I used a HashMap named charCounts that maps each Character to an Integer count.
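A minimal sketch of such a structure follows. It is reconstructed from how charCounts is used elsewhere in this lab, so treat the declaration as an assumption rather than the exact original listing. The class name CharCountsSketch and the demo method are made up for illustration.

```java
import java.util.HashMap;

class CharCountsSketch {
    // Build a small map by hand to show the put/get pattern.
    static HashMap<Character, Integer> demo() {
        // Assumed declaration: each character maps to the number of times it was seen.
        HashMap<Character, Integer> charCounts = new HashMap<Character, Integer>();
        charCounts.put('e', 1);                        // first 'e' seen
        charCounts.put('e', charCounts.get('e') + 1);  // second 'e' seen
        charCounts.put('t', 1);
        return charCounts;
    }

    public static void main(String[] args) {
        HashMap<Character, Integer> charCounts = demo();
        System.out.println(charCounts.get('e'));  // count stored for 'e'
        System.out.println(charCounts.get('q'));  // 'q' was never stored, so get returns null
    }
}
```

Because get returns null for a character that has not been stored yet, check containsKey before adding 1 to a count.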
Such a HashMap will enable you to associate an integer with a character. You access the integer associated with character c with charCounts.get(c). You store an integer n to be associated with character c via charCounts.put(c, n).
My strategy was to read the file one line at a time, and use a "for" loop to look at each character in the line individually. Here is my code:
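    while (scan.hasNextLine()) {
        // nextLine() drops the line terminator, so newlines are not counted.
        String cs = scan.nextLine();
        for (int i = 0; i < cs.length(); i++) {
            char c = cs.charAt(i);
            if (charCounts.containsKey(c))
                charCounts.put(c, charCounts.get(c) + 1);  // seen before: add one
            else
                charCounts.put(c, 1);                      // first occurrence
        }
    }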
Next, add each character-count pair to a priority queue as a HuffmanTreeNode. The queue must order nodes by count, so that the least frequent node comes out first.
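One possible way to fill the queue is sketched below. SketchNode is a hypothetical stand-in for HuffmanTreeNode, not the class from this lab. Its (char, int) constructor, its count field and its compareTo ordering are all assumptions, so adapt the calls to whatever your own node class provides.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical stand-in for HuffmanTreeNode; field and constructor shapes are assumptions.
class SketchNode implements Comparable<SketchNode> {
    final char symbol;  // the character this leaf represents
    final int count;    // how many times that character occurred

    SketchNode(char symbol, int count) {
        this.symbol = symbol;
        this.count = count;
    }

    // Order nodes by count so the queue hands back the rarest node first.
    @Override
    public int compareTo(SketchNode other) {
        return Integer.compare(count, other.count);
    }
}

class QueueFillSketch {
    // Wrap every character-count pair in a leaf node and add it to a min-priority queue.
    static PriorityQueue<SketchNode> fill(Map<Character, Integer> charCounts) {
        PriorityQueue<SketchNode> pq = new PriorityQueue<SketchNode>();
        for (Map.Entry<Character, Integer> entry : charCounts.entrySet()) {
            pq.offer(new SketchNode(entry.getKey(), entry.getValue()));
        }
        return pq;
    }

    public static void main(String[] args) {
        HashMap<Character, Integer> charCounts = new HashMap<Character, Integer>();
        charCounts.put('a', 5);
        charCounts.put('b', 2);
        charCounts.put('c', 9);
        PriorityQueue<SketchNode> pq = fill(charCounts);
        System.out.println(pq.peek().symbol);  // the least frequent character, here 'b'
    }
}
```

Java's PriorityQueue is a min-heap under the elements' natural ordering, which is why comparing by count is enough to make poll() return the rarest node.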
Finally, you will execute the Huffman-building algorithm. In English: repeatedly merge the two lowest-frequency nodes into a single new node until just one node is left. That one node is the root of the Huffman tree.
Here is the same idea in Java:
    while (pq.size() > 1) {
        // Remove the two rarest nodes and put their merged parent back in the queue.
        pq.offer(new HuffmanTreeNode(pq.poll(), pq.poll()));
    }
    HuffmanTreeNode theTree = pq.poll();
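To watch the merge loop at work without any tree classes, the sketch below runs the same poll, poll, offer pattern on bare counts. The six counts are invented for illustration and the class name MergeTrace is hypothetical. Each merged value stands for the combined count of a new internal node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

class MergeTrace {
    // Repeatedly merge the two smallest counts; return the merged counts in order.
    static List<Integer> merge(int[] counts) {
        PriorityQueue<Integer> pq = new PriorityQueue<Integer>();
        for (int n : counts) {
            pq.offer(n);
        }
        List<Integer> merged = new ArrayList<Integer>();
        while (pq.size() > 1) {
            int sum = pq.poll() + pq.poll();  // combined count of the two rarest entries
            merged.add(sum);
            pq.offer(sum);                    // the merged entry rejoins the queue
        }
        return merged;
    }

    public static void main(String[] args) {
        // Invented counts for six characters.
        System.out.println(merge(new int[] {5, 9, 12, 13, 16, 45}));
        // prints [14, 25, 30, 55, 100]
    }
}
```

The last merged value, 100, equals the sum of all six counts, which is the count you would expect to find at the root of the tree.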
You can print the Huffman coding table to a file whose name is in args[1] using this code:
    PrintStream huffmanTable = null;
    try {
        huffmanTable = new PrintStream(new File(args[1]));
    } catch (Exception e) {
        // Catching Exception also covers a missing args[1].
        System.err.println("Need a file for the huffman table");
        System.exit(-1);
    }
    // Write one "character : code" line per character seen in the input.
    for (Character chh : charCounts.keySet()) {
        huffmanTable.println(chh + " : " + theTree.lookup(chh));
    }
    huffmanTable.close();
Here is part of my table:

    w : 011000
    ? : 110101000
    g : 000101
    B : 011110101
    b : 0111100
    Q : 01111011111
    , : 011100
    M : 011110100
    & : 000100000000101
    f : 001111
    y : 011101
    T : 11010111
    9 : 011001011101
    x : 0110010110
    e : 1100
    3 : 01100101110000
    ( : 00011100000
    W : 00010001
    h : 11101
    c : 001110
    H : 11010110
    1 : 000111000110
    G : 000111011
    i : 11100
    A : 11010000
    ! : 011001110
    t : 0101
    ] : 11010100100
    N : 011110110
    s : 11111
    X : 01100101110001
    o : 0100
    r : 11011

See how the common letters have short codes? Your file will be similar, and bigger -- with a {0,1} code for each character encountered in your input file. My entire Huffman table is here. Yours will likely differ.
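The lookup call used when printing the table can be pictured as a walk from the root down to a leaf, recording one bit per step. The sketch below is only one way to do it. CodeNode is a hypothetical node type, not the HuffmanTreeNode from this lab, and the choice of 0 for a left branch and 1 for a right branch is an assumption.

```java
// Hypothetical node type; the real HuffmanTreeNode in this lab may be organized differently.
class CodeNode {
    final Character symbol;      // non-null only at leaves
    final CodeNode left, right;  // both null at leaves

    CodeNode(char symbol) {
        this.symbol = symbol;
        this.left = null;
        this.right = null;
    }

    CodeNode(CodeNode left, CodeNode right) {
        this.symbol = null;
        this.left = left;
        this.right = right;
    }

    // Return the 0/1 path from this node down to the leaf holding c, or null if c is absent.
    String lookup(char c) {
        if (symbol != null) {
            return symbol == c ? "" : null;
        }
        String path = left.lookup(c);
        if (path != null) {
            return "0" + path;  // assumed convention: left branch is 0
        }
        path = right.lookup(c);
        return path != null ? "1" + path : null;  // assumed convention: right branch is 1
    }
}

class LookupSketch {
    public static void main(String[] args) {
        // Tiny hand-built tree: 'e' sits one level down, 't' and 'q' two levels down.
        CodeNode tree = new CodeNode(new CodeNode('e'),
                new CodeNode(new CodeNode('t'), new CodeNode('q')));
        System.out.println("e : " + tree.lookup('e'));  // e : 0
        System.out.println("t : " + tree.lookup('t'));  // t : 10
        System.out.println("q : " + tree.lookup('q'));  // q : 11
    }
}
```

A leaf close to the root gets a short path, which matches the table above: frequent characters are merged late and so end up near the top of the tree.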
If you want to decompress files, you need to think about how to store that Huffman tree for later use. A decompression algorithm will need that tree. I'm going to ask you to think about (but not actually program) decompression later in this lab.