To more easily complete queries, large databases use indexes. However, some databases are too large even for regular indexes to be performant. In those cases, hashing or hash indexing can be a more powerful, convenient alternative.
Here, we’ll cover the basics of hashing, its classifications, as well as situations where it’s the most useful.
DBMS Hash Indexing Explained
In a nutshell, hashing in DBMS is a way to quickly find data on disk without using a traditional index. Instead, the database uses a hash function to calculate the location of a given record.
It usually takes the primary key, finds the hash based on a certain mathematical function, and points to the appropriate bucket where the queried data is.
What Are Hash Buckets?
In hashing, buckets are essentially used to apportion data for lookup or sorting, similar to an array index. It’s where the hashes derived from the hashing function are stored, providing fast access to the required data.
Each bucket is identified via an address; so a bucket with one address holds all index entries for a certain search key, provided that function(search key) = address.
When Is Hashing Preferable to Standard Indexing?
In specific cases, hashing is superior to indexes. More precisely, hashing works better for point lookups and primary key-based lookups.
Hashing is only usable in equality/inequality checks, unlike indexes. But hashing performs better in that regard than B-tree indexes: its cost is equal to the number of pages in the bucket, so it’s much cheaper if there’s no overflow (more on overflow later).
Hash Functions Explained
Hash functions are the means through which a key becomes a hash of length-restricted value. These functions should have the following properties:
- They remain the same for every hash being made
- They’re easy to calculate quickly
- No two hashes should be alike—collision happens when two hashes are the same
- Every unique input causes a unique output
Common Hash Functions
- Division
- Mid-Square
- Folding
- Multiplication
Types of Hashing
Broadly speaking, we divide hashing techniques in two ways: by extendability and collision resolution.
Sorted by extendability, we use the following categorization:
- Static
- Dynamic
By collision resolution, hashing can be:
- Open (separate chaining)
- Closed (open addressing)
Static Vs Dynamic Hashing
In static hashing, hash tables have a fixed size. When a table is filled up, an overflow page is allocated to take in the brimming data. But this overflow page can also reach maximum capacity, after which we add another one. This chain is called the overflow chain.
Static hashing is a fast way to complete insert and delete operations, seeing that it only needs two I/Os, those being read and write. However, you do need to know the size of the hash table in advance. If you fail to optimize the hash table, the overflow chain can get quite long, causing a dip in performance and making scaling more difficult.
To compensate for the shortcomings of static hashing, dynamic hashing takes a different approach. Dynamic hash tables aren’t fixed; instead, they expand to accommodate more data. Conversely, if a dynamic hash table is too empty, it can be resized to save memory usage.
The advantage of dynamic hashing lies in its flexibility: you don’t have to plan its size in advance, and it still maintains an O(1) lookup/insertion performance. On the other hand, dynamic tables change data location as they grow or shrink. This can make performance inconsistent as data jumps around the table.
Open Vs Closed Hashing (Separate Chain Vs. Open Address)
For hashing to be effective, we have to reduce collisions (i.e., two hashes for different inputs being the same) as much as possible.
In closed hashing, we store records within the hash table array itself, and available fields for hashes are found via probing. There should be no more than one key per bucket. In contrast, open hashing entails using separate lists to store hashes (hence the term “separate” chains) with an arbitrary possible number of keys per table.
Why Is Closed Hashing Also Called “Open Address?”
The fact that “closed” hashing and “open” addresses are synonyms sounds confusing at first. But here’s why that’s the case: if the location of a record is independent of its hash value—as is the case in closed hashing—then its address is open by definition; it isn’t fixed.
Closed Hashing and Probing
In closed hashing, to find a place for a new hash entry or look for an existing record within a bucket array, a process called probing is necessary. Through probing, we examine the buckets in a given probe sequence (mainly linear, double-hashing, or quadratic) and look for an unoccupied slot.
Probing is an integral part of specific types of collision resolution. For example, Robin Hood hashing displaces duplicate hashes based on PSL (probe sequence length), displacing those farther from the bucket where the item was originally hashed.
Best Use Cases for Hashing
In the following cases, hashing gives you the most bang for your buck:
- Graphics
- Game boards
- Message digests
- Compiler operation
- Password verification
- Linking file names and paths
Hashing algorithms excel in various scenarios where fast data retrieval, security, and efficiency are essential. Graphics processing, where quick access to pixel data or image properties is crucial, benefits from hashing's speed. Similarly, game boards, particularly in complex video games, rely on hashing to quickly determine the state of the game and make efficient moves.
In cybersecurity, hashing plays a pivotal role in generating message digests, ensuring data integrity, and detecting tampering. Moreover, during compiler operation, hashing helps expedite symbol table lookups and optimize code generation. Password verification is another prime application, enhancing security by storing and comparing password hashes instead of the actual passwords.
Lastly, when linking file names and paths in file systems, hashing enables rapid access and reduces the need for exhaustive searches. In these use cases, hashing offers a powerful solution to streamline operations and enhance both performance and security.
Conclusion
In conclusion, understanding the power of hashing in the world of database management is crucial. It provides an efficient means to access data without the traditional index, making it particularly valuable in cases of extensive databases. Hashing technology's ability to distribute and locate data quickly offers substantial advantages for point lookups and primary key-based searches.
As you explore the possibilities of implementing hashing in your database system, consider innovative solutions like IneryDB.
IneryDB leverages the efficiency of hashing to enhance data retrieval and system performance, contributing to the ever-evolving landscape of decentralized and efficient data management. So, as you navigate the world of database management, remember that hashing plays a pivotal role, and solutions like IneryDB can take your data management to new heights.
Inery•
2 years ago
Our Vision for Metaverse: To Connect a Disconnected World
Changing the dynamics of metaverse from a platform-specific approach to the interoperability of content between different platforms. ...READ MORE
Share
Inery•
1 year ago
Building a Decentralized Future: The Power Of Community In Web3
Community involvement is key to Web3 project success. It fosters adoption, creativity, and more. Discover the impact of community on Web3 initiatives. ...READ MORE
Share
Inery•
7 months ago
Decentralized Storage Solutions: Comparing IneryDB with IPFS and Filecoin
Discover the leading decentralized storage platforms—IneryDB, IPFS, and Filecoin—each offering unique advantages for data management. ...READ MORE
Share
Inery•
2 years ago
Our Vision for Healthcare: Bringing Privacy Back
Value-creation for healthcare data sharing on a decentralized infrastructure. ...READ MORE
Share
Most popular today