My Favorite: Hashtable
From all fundamental data structure, hashtable is my favorite one. I know about this data structure not from college, in fact I never heard about it when I was studying in my college, my first time encounter with she is when I involved in a project that using the Beta version of .Net Framework as the tools.
The Application
After that first encounter, hashtable always come in handy whenever I need it. It was helped me when I was:
1. Developing an excel like pivot table program and print it using GDI+. As far as I can remember, in this project, I use a hashtable to group and summarize the data also to remember the cell boundaries when printing the displayed data in the grid. Well, thanks to hash table this program run faster than MS Excel and can load bigger data (millions of data) than the MS Excel version. This project is a proof that hashtable is good to handle big data.
2. Programming in JavaScript. I think the core point of JavaScript is hashtable. Because every array in javascript is also a hashtable. Every object is a hashtable.
3. Developing IrwanLang. I use hashtable to store list of variable and list of function declared so I can check if a variable/function is already defined or not by just passing the variable/function name to my hashtable friend.
3. Developing many more projects. I can't recall them again because I use hashtables unconsciously.
For me, a hashtable is just a love at the first sight, because I can use string or any kind of object as the index, more appropriately as the key, I don't have to do any calculation, just use it instantly.
In the beginning, I am courius how hashtable do his magic. Well, here is the fundamental theory of why people invented hashtable.
The Theory
Imagine that you have to build a software, a phonebook software, just like the one that you have on your cell phone. When there is someone calling, the software will display the name of the caller. So, the phone number is using as the key. Just forget about the persistent mechanism for now, let say the data is stored in volatile memory only. So the phone must always on, if it turned off, well the data will be gone :)
The requirements above, stated implicitly that the name retrieval function must be fast, so an algorithm with O(1) performance is preferred.
So with which data structure will you solve the problem?
Let's say there is four people that you'll have to add. We'll use Jakarta's phone number format for simplicity.
P1: 5357288
P2: 8849872
P3: 4358999
P4: 5697777
Those numbers are sparse, they are far away from each other. If you use array, then your data structure will look like this:
Figure 1. Phonebook data structure using an array
Well, what do you think with that? Is it a good solution? I don't think so, because it will consume a lot of memory just to store those four values.
Hmm...can we optimize it? Are there any alternatives?
Imagination is More Important Than Knowledge
What about linked list? Yeah, linked list can store those four numbers efficiently but what about its retrieval performance? Linked list is not adequate in terms of retrieval performance. Why?. (You can submit your answers by using the comments box below).
We need another kind of mechanism to achieve O(1) retrieval. Array is fast for retrieval, insertion too. OK, let's optimize our array.
After imagining for a while, I come up with this.
How about if we have this mechanism:
Figure 2. Optimizing the array by using the index calculator
We need to transform those numbers into the appropriate index of the cell in our array.
Hmm...Lets say I have an array with 10 cells. How do I transform those numbers to a valid array index? Think hard....!!!
OK, I think I've found the solution. We can use modulo.
10 mod 3 = 1
50 mod 3 = 2
90 mod 3 = 0
Well, modulo always yield result that is small number. But how can we makesure that the result will not give us index out of bounds error. Lets solve it heuristically. Let use our array size as the divisor.
5357288 mod 10 = 8
8849872 mod 10 = 2
4358999 mod 10 = 9
5697777 mod 10 = 7
Hey!!! I think we are quite sure that with modulo we won't have index out of bounds errors.
Those process is called hashing and the function we use, that is modulo by the array size, is called hash function. But can you help me to make sure our hash function is correct? (You can answer it mathematically or heuristically, I love to see both).
The Implementation
OK, lets implement our algorithm. Well, its your job to implement it :). Share it to the world ya! :)
Is it possible for two or more numbers to produce the same array index? If it yes, how can we solve it? Hmm...one problem solved comes another problem. That's why I love software development :-P
Can you give reason why the retrieval performance is O(1)? How about with its insertion and deletion?
Well, if you have reached to this section. Congratulations!!! You have build your own hashtable.
Conclusion
This is just an introduction to hashtable. Next time I will write more advance topic about hashtable. But from now on, use it. You don't have to write your own hashtable class, unless you really..really need it, because .Net already provided you with one. Especially in JavaScript :)