What You Need to Know About Hash-Based Collections

Hello everyone, Vladislav Rodin here. I currently head the High Load Architect course at OTUS, and I also teach courses on software architecture.

In addition to teaching, as you can see, I also write original material for the OTUS blog on Habr, and I want to devote today's article to the launch of a new stream of the course "Algorithms for Developers".





Introduction


Hash tables (HashMap), along with dynamic arrays, are among the most popular data structures used in production. Interview questions about their purpose, the details of their internal structure, and the related algorithms come up very often. This data structure is a classic and is found not only in Java but in many other programming languages as well.

Formulation of the problem


Let's set a goal: to create a data structure that supports the following operations (a rough Java sketch of this interface follows the list):

  • contains(Element element), check whether an element is present, in O(1)
  • add(Element element), add an element, in O(1)
  • delete(Element element), delete an element, in O(1)
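
Such an interface might be sketched in Java roughly as follows (the interface name and type parameter are illustrative, not taken from any library):

    public interface ConstantTimeSet<Element> {
        boolean contains(Element element); // membership check, expected O(1)
        void add(Element element);         // insertion, expected O(1)
        void delete(Element element);      // removal, expected O(1)
    }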

Array


Let's try to implement these operations on top of an array, the simplest data structure. We agree that a cell is considered empty if it contains null.

Availability check


We have to do a linear search through the array, because the element can potentially be in any cell. Asymptotically, this takes O(n), where n is the size of the array.

Adding


To add an element anywhere, we first need to find an empty cell and then write the element into it. So again we perform a linear search and end up with O(n).

Delete


To delete an element, we first have to find it and then write null into the found cell. Once again, the linear search brings us to O(n).
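
To make this concrete, here is a minimal array-backed sketch in Java (the class name and the fixed capacity are illustrative); note that every operation contains a linear scan, which is exactly what we want to get rid of:

    public class ArraySet<Element> {
        private final Object[] cells;

        public ArraySet(int capacity) {
            this.cells = new Object[capacity]; // null marks an empty cell
        }

        public boolean contains(Element element) {
            for (Object cell : cells) {              // linear search: O(n)
                if (element.equals(cell)) return true;
            }
            return false;
        }

        public void add(Element element) {
            for (int i = 0; i < cells.length; i++) { // find an empty cell: O(n)
                if (cells[i] == null) {
                    cells[i] = element;
                    return;
                }
            }
            throw new IllegalStateException("array is full");
        }

        public void delete(Element element) {
            for (int i = 0; i < cells.length; i++) { // find the element: O(n)
                if (element.equals(cells[i])) {
                    cells[i] = null;
                    return;
                }
            }
        }
    }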

The simplest hash set


Note that in each operation we first searched for the index of the desired cell and only then performed the operation itself, and it is this search that spoils the asymptotics! If we could obtain this index in O(1), the original problem would be solved.

Let's now replace the linear search with the following algorithm: we compute the value of a certain function, a hash function, that maps an object of the class to an integer. Then we map the resulting integer to an index of an array cell (this is quite easy to do, for example, by taking the remainder of dividing this number by the size of the array). If the hash function runs in O(1) (and it is usually written that way), we get the simplest implementation of a hash set. In hash set terminology, an array cell is called a bucket.
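
In Java terms, the index computation might look like the sketch below (real library implementations such as java.util.HashMap use a more elaborate scheme with bit masking and hash spreading):

    final class Buckets {
        // Map an object to an array cell index: O(1), assuming hashCode() is O(1).
        // Math.floorMod avoids negative indices when hashCode() is negative.
        static int indexFor(Object element, int capacity) {
            return Math.floorMod(element.hashCode(), capacity);
        }
    }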

The problems of the simplest implementation of a hash set


No matter how the hash function is written, the number of cells in the array is always limited, while the set of elements we want to store in the data structure is unlimited. After all, we would not bother with the data structure if we only needed to store ten elements known in advance, right? This state of affairs inevitably leads to collisions. A collision is a situation in which different objects, when added, end up in the same array cell.

Two methods have been invented to resolve collisions: the chaining method and the open addressing method.

Chaining method


The chaining method is the simplest way to resolve collisions. Instead of storing elements directly in the array cells, we store a linked list of elements in each cell. Since adding to the head of a list (and it actually doesn't matter which end of the list we add to) takes O(1), we do not ruin the overall asymptotics, and it remains O(1).

This implementation has a problem: if the lists grow very long (in the extreme case, consider a hash function that returns the same constant for every object), then with a fixed array size we get O(m) asymptotics, where m is the number of elements in the set. To avoid this trouble, the notion of a load factor is introduced (it can be, for example, 1.5). If, when adding an element, the ratio of the number of elements in the data structure to the size of the array exceeds the load factor, the following happens: a new array is allocated whose size exceeds the size of the old one (for example, twice as large), and the data structure is rebuilt on the new array.
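
A sketch of a chaining hash set with a load factor and resizing might look like this (the initial capacity of 16, the load factor of 1.5, and the growth factor of 2 are illustrative choices):

    import java.util.LinkedList;

    public class ChainedHashSet<E> {
        private static final double LOAD_FACTOR = 1.5;
        private LinkedList<E>[] buckets = newBuckets(16);
        private int size = 0;

        @SuppressWarnings("unchecked")
        private static <E> LinkedList<E>[] newBuckets(int capacity) {
            return new LinkedList[capacity];
        }

        private static int indexFor(Object element, int capacity) {
            return Math.floorMod(element.hashCode(), capacity);
        }

        public boolean contains(E element) {
            LinkedList<E> chain = buckets[indexFor(element, buckets.length)];
            return chain != null && chain.contains(element); // scans only one chain
        }

        public void add(E element) {
            if (contains(element)) return;
            if ((double) (size + 1) / buckets.length > LOAD_FACTOR) {
                resize(buckets.length * 2);                  // rebuild on a larger array
            }
            int index = indexFor(element, buckets.length);
            if (buckets[index] == null) buckets[index] = new LinkedList<>();
            buckets[index].addFirst(element);                // O(1) insertion into the chain
            size++;
        }

        private void resize(int newCapacity) {
            LinkedList<E>[] old = buckets;
            buckets = newBuckets(newCapacity);
            for (LinkedList<E> chain : old) {
                if (chain == null) continue;
                for (E element : chain) {                    // rehash every element
                    int index = indexFor(element, buckets.length);
                    if (buckets[index] == null) buckets[index] = new LinkedList<>();
                    buckets[index].addFirst(element);
                }
            }
        }
    }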

This method of resolving collisions is used in Java, and the corresponding data structure is called HashSet.

Open addressing method


In this method, the elements themselves are stored in the cells, and in the event of a collision a probe sequence begins: we walk through the cells according to some algorithm in the hope of finding a free one. Different algorithms can be used (linear or quadratic probing, double hashing), each of which has its own problems (for example, the appearance of primary or secondary clusters).
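
Here is a sketch of the simplest probe sequence, linear probing (deletion is deliberately omitted: it requires extra care, such as tombstones, and is beyond this illustration):

    public class LinearProbingSet<E> {
        private final Object[] cells;

        public LinearProbingSet(int capacity) {
            this.cells = new Object[capacity];
        }

        public void add(E element) {
            int start = Math.floorMod(element.hashCode(), cells.length);
            for (int step = 0; step < cells.length; step++) {
                int probe = (start + step) % cells.length;   // linear probe sequence
                if (cells[probe] == null || element.equals(cells[probe])) {
                    cells[probe] = element;
                    return;
                }
            }
            throw new IllegalStateException("table is full; a real implementation would resize");
        }

        public boolean contains(E element) {
            int start = Math.floorMod(element.hashCode(), cells.length);
            for (int step = 0; step < cells.length; step++) {
                int probe = (start + step) % cells.length;
                if (cells[probe] == null) return false;      // empty cell: element is absent
                if (element.equals(cells[probe])) return true;
            }
            return false;
        }
    }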

From a hash set to a hash table (HashMap)


Let's create a data structure that allows adding, deleting, and searching for items by some key, just as fast as a hash set does (that is, in O(1)).

We will use the same data structure as for the hash set, but instead of single elements we will store key-value pairs.

Thus, insertion (put(Key key, Value value)) is carried out as follows: from the Key object we compute the array cell and obtain the bucket number. We then go through the list in that bucket, comparing the key with the keys of the stored pairs. If we find the same key, we simply overwrite the old value; if we don't, we add the pair.

How is an item retrieved by key? By a similar principle: we find the bucket by the key, walk through the pairs, and return the value of the pair whose key equals the key in the request, if such a pair exists, and null otherwise.
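
A sketch of put and get on a chaining hash table might look like this (resizing is omitted for brevity; the capacity of 16 is illustrative):

    import java.util.LinkedList;

    public class SimpleHashMap<K, V> {
        private static class Pair<K, V> {
            final K key;
            V value;
            Pair(K key, V value) { this.key = key; this.value = value; }
        }

        @SuppressWarnings("unchecked")
        private final LinkedList<Pair<K, V>>[] buckets = new LinkedList[16];

        private int indexFor(K key) {
            return Math.floorMod(key.hashCode(), buckets.length);
        }

        public void put(K key, V value) {
            int index = indexFor(key);
            if (buckets[index] == null) buckets[index] = new LinkedList<>();
            for (Pair<K, V> pair : buckets[index]) {
                if (pair.key.equals(key)) {                  // same key: overwrite the old value
                    pair.value = value;
                    return;
                }
            }
            buckets[index].addFirst(new Pair<>(key, value)); // new key: add the pair
        }

        public V get(K key) {
            LinkedList<Pair<K, V>> bucket = buckets[indexFor(key)];
            if (bucket == null) return null;
            for (Pair<K, V> pair : bucket) {
                if (pair.key.equals(key)) return pair.value;
            }
            return null;                                     // no pair with such a key
        }
    }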

This data structure is called a hash table.

Typical interview questions


Q: How are HashSet and HashMap arranged internally? How are the main operations carried out in these collections? How are the equals() and hashCode() methods used?
A: The answers to these questions can be found above.

Q: What is the contract between equals() and hashCode()? Why is it needed?
A: If two objects are equal, they must have equal hashCode values. Therefore, hashCode must be computed from the fields involved in equals. Violating this contract can lead to very interesting effects: if objects are equal but their hashCodes differ, you may fail to get the value for a key you have just added to a HashSet or a HashMap.
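
A tiny illustration of what a broken contract leads to (the class is made up for the example: equals() compares a field, but hashCode() is deliberately not overridden, so equal objects usually land in different buckets):

    import java.util.HashSet;
    import java.util.Objects;
    import java.util.Set;

    class BrokenKey {
        private final String name;
        BrokenKey(String name) { this.name = name; }

        @Override
        public boolean equals(Object o) {
            return o instanceof BrokenKey && Objects.equals(name, ((BrokenKey) o).name);
        }
        // hashCode() is intentionally NOT overridden: identity-based Object.hashCode() is used.
    }

    public class ContractDemo {
        public static void main(String[] args) {
            Set<BrokenKey> set = new HashSet<>();
            set.add(new BrokenKey("id-1"));
            // Almost certainly prints false: an equal object hashes into a different bucket,
            // so the lookup never reaches the stored element.
            System.out.println(set.contains(new BrokenKey("id-1")));
        }
    }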

Conclusion


Interviewers like to ask various scenario questions about these data structures. The solution to any of them can be derived from an understanding of how they work, without any rote memorization.



That's all. If you have read this far, I invite you to a free lesson on the topic "Secrets of Dynamic Programming", where my colleague Evgeny Volosatov will show how to solve an olympiad problem using the ideas of dynamic programming.


