Hashing Containers

Hash-based containers provide the fastest average lookup times in the STL, offering O(1) average complexity for insertions, deletions, and searches. Understanding how hashing works and when to use these containers is crucial for building high-performance applications.

What is Hashing?

Hashing is a technique that maps data of arbitrary size to fixed-size values (hash codes). A hash function takes an input and produces a hash value that determines where the data is stored in a hash table.

Basic Hashing Concept

cpp

// Simple hash function for integers
size_t simpleHash(int value) {
    return value % 1000;  // Maps to buckets 0-999
}

// Hash function for strings
size_t stringHash(const std::string& str) {
    size_t hash = 0;
    for (char c : str) {
        hash = hash * 31 + c;  // Simple but effective
    }
    return hash;
}

// Example usage
int num = 42;
size_t hash1 = simpleHash(num);        // hash1 = 42
std::string text = "hello";
size_t hash2 = stringHash(text);       // hash2 = some value

Key Properties of Good Hash Functions:

Deterministic: Same input always produces same output
Uniform Distribution: Outputs spread evenly across range
Fast Computation: Should be quick to calculate
Minimal Collisions: Different inputs should rarely produce same output

How Hash Tables Work

cpp

// Conceptual hash table structure
struct HashTable {
    std::vector<std::list<std::pair<std::string, int>>> buckets;
    size_t size;
    float load_factor;
};

// Simplified example
std::vector<std::list<std::pair<std::string, int>>> hash_table(10);

// Inserting "apple" -> 5
size_t hash = stringHash("apple") % 10;  // hash = 3
hash_table[3].push_back({"apple", 5});

// Looking up "apple"
hash = stringHash("apple") % 10;  // hash = 3
for (const auto& pair : hash_table[3]) {
    if (pair.first == "apple") {
        std::cout << "Found: " << pair.second << std::endl;  // 5
        break;
    }
}

Hash Table Components:

Buckets: Array of containers (usually linked lists)
Hash Function: Maps keys to bucket indices
Collision Resolution: Handles multiple keys mapping to same bucket
Load Factor: Ratio of elements to buckets

std::unordered_map

Basic Usage

cpp

#include <unordered_map>
#include <string>
#include <iostream>

// Create an unordered map
std::unordered_map<std::string, int> scores;

// Insert elements
scores["Alice"] = 95;
scores["Bob"] = 87;
scores["Charlie"] = 92;

// Or use insert method
scores.insert({"David", 89});
scores.emplace("Eve", 91);

// Access elements
std::cout << "Alice's score: " << scores["Alice"] << std::endl;  // 95

// Check if key exists
if (scores.find("Frank") != scores.end()) {
    std::cout << "Frank's score: " << scores["Frank"] << std::endl;
} else {
    std::cout << "Frank not found" << std::endl;
}

// Safe access with default value
int frank_score = scores["Frank"];  // Creates entry with value 0

Iterating Over unordered_map

cpp

std::unordered_map<std::string, int> scores = {
    {"Alice", 95}, {"Bob", 87}, {"Charlie", 92}
};

// Range-based for loop
for (const auto& [name, score] : scores) {
    std::cout << name << ": " << score << std::endl;
}

// Iterator-based loop
for (auto it = scores.begin(); it != scores.end(); ++it) {
    std::cout << it->first << ": " << it->second << std::endl;
}

// Note: Order is not guaranteed!
// Output might be: Charlie: 92, Alice: 95, Bob: 87

Advanced Operations

cpp

std::unordered_map<std::string, int> scores;

// Insert with duplicate handling
auto result = scores.insert({"Alice", 95});
if (!result.second) {
    std::cout << "Alice already exists with score: " << result.first->second << std::endl;
}

// Update existing value
scores["Alice"] = 98;  // Overwrites existing value

// Remove elements
scores.erase("Bob");

// Get size and check emptiness
std::cout << "Size: " << scores.size() << std::endl;
std::cout << "Empty: " << scores.empty() << std::endl;

// Clear all elements
scores.clear();

std::unordered_set

Basic Usage

cpp

#include <unordered_set>
#include <string>

// Create an unordered set
std::unordered_set<std::string> unique_names;

// Insert elements
unique_names.insert("Alice");
unique_names.emplace("Bob");
unique_names.insert("Charlie");

// Check membership
if (unique_names.find("Alice") != unique_names.end()) {
    std::cout << "Alice is in the set" << std::endl;
}

// Count occurrences (0 or 1 for sets)
size_t count = unique_names.count("Bob");  // 1

// Remove elements
unique_names.erase("Charlie");

// Get size
std::cout << "Number of unique names: " << unique_names.size() << std::endl;

Set Operations

cpp

std::unordered_set<int> set1 = {1, 2, 3, 4, 5};
std::unordered_set<int> set2 = {4, 5, 6, 7, 8};

// Check if element exists
bool has_three = set1.contains(3);  // C++20
bool has_six = set2.find(6) != set2.end();  // C++17 and earlier

// Iterate over elements
for (const auto& num : set1) {
    std::cout << num << " ";  // Order not guaranteed
}
std::cout << std::endl;

Performance and Memory

Understanding Load Factor

The load factor is the ratio of the number of elements to the number of buckets:

cpp

std::unordered_map<std::string, int> scores;

// Get current load factor
float current_load = scores.load_factor();
std::cout << "Current load factor: " << current_load << std::endl;

// Get maximum load factor
float max_load = scores.max_load_factor();
std::cout << "Maximum load factor: " << max_load << std::endl;

// Set custom maximum load factor
scores.max_load_factor(0.5f);  // More aggressive rehashing

// Reserve buckets to control load factor
scores.reserve(1000);  // Pre-allocate 1000 buckets

How Rehashing Works

cpp

std::unordered_map<std::string, int> scores;

// Insert elements and watch rehashing
for (int i = 0; i < 1000; ++i) {
    std::string name = "User" + std::to_string(i);
    scores[name] = i;

    // Check when rehashing occurs
    if (i % 100 == 0) {
        std::cout << "Elements: " << scores.size() 
                  << ", Buckets: " << scores.bucket_count()
                  << ", Load Factor: " << scores.load_factor() << std::endl;
    }
}

Rehashing Process:

Trigger: Load factor exceeds maximum load factor
Growth: Number of buckets increases (usually doubles)
Redistribution: All elements are rehashed and redistributed
Cost: O(n) operation, but amortized over many insertions

Time Complexity

Operation	Average Case	Worst Case	Notes
Insertion	O(1)	O(n)	Depends on hash function quality
Lookup	O(1)	O(n)	Collisions can degrade performance
Deletion	O(1)	O(n)	Same as lookup
Iteration	O(n)	O(n)	Must visit all elements

Memory Overhead

cpp

// Compare memory usage
std::vector<std::pair<std::string, int>> vector_storage;
std::unordered_map<std::string, int> hash_storage;

// Add same elements to both
for (int i = 0; i < 1000; ++i) {
    std::string key = "key" + std::to_string(i);
    vector_storage.push_back({key, i});
    hash_storage[key] = i;
}

// Hash containers have overhead:
// - Bucket array
// - Hash table metadata
// - Potential unused buckets
// - Linked list overhead for collisions

Custom Hash Functions

For Custom Types

cpp

struct Person {
    std::string name;
    int age;

    bool operator==(const Person& other) const {
        return name == other.name && age == other.age;
    }
};

// Custom hash function for Person
struct PersonHash {
    size_t operator()(const Person& person) const {
        size_t name_hash = std::hash<std::string>{}(person.name);
        size_t age_hash = std::hash<int>{}(person.age);
        return name_hash ^ (age_hash << 1);  // Combine hashes
    }
};

// Use custom hash function
std::unordered_set<Person, PersonHash> people;
people.insert({"Alice", 25});
people.insert({"Bob", 30});

For Enums

cpp

enum class Color { Red, Green, Blue };

// Hash function for enums
struct ColorHash {
    size_t operator()(Color color) const {
        return static_cast<size_t>(color);
    }
};

std::unordered_map<Color, std::string, ColorHash> color_names = {
    {Color::Red, "Red"},
    {Color::Green, "Green"},
    {Color::Blue, "Blue"}
};

When to Use Hash Containers

Use std::unordered_map/unordered_set when:

Fast lookup is critical: O(1) average access time
Order doesn't matter: Elements can be in any order
Frequent insertions/deletions: Good average performance
Large datasets: Hash tables scale well
Key-based access: Need to find elements by key quickly

Consider alternatives when:

Ordered iteration needed: Use std::map or std::set
Memory is critical: Hash containers have overhead
Predictable performance: Hash tables have variable performance
Small datasets: Overhead might not be worth it

Summary

Hash containers (std::unordered_map and std::unordered_set) provide:

Advantages:

Fast average access: O(1) insertion, lookup, and deletion
Scalable performance: Good performance with large datasets
Flexible key types: Can hash any type with proper hash function

Trade-offs:

No ordering guarantee: Elements are not sorted
Variable performance: Depends on hash function quality and load factor
Memory overhead: Additional storage for buckets and metadata
Rehashing cost: Periodic O(n) rehashing operations

Best Use Cases:

Fast lookups: When you need O(1) average access time
Unordered data: When element order doesn't matter
Large datasets: When the benefits of hashing outweigh overhead
Key-value storage: When you need efficient key-based access

Questions

Q: What is the average time complexity of insertion and lookup in std::unordered_map?

std::unordered_map provides average O(1) insertion and lookup time. This is because hash tables use hashing to distribute elements across buckets, allowing direct access to elements in most cases. However, worst-case can be O(n) if there are many hash collisions.

Q: What happens when the load factor of a hash container exceeds the maximum load factor?

When the load factor exceeds the maximum load factor, the container automatically rehashes by increasing the number of buckets and redistributing all elements. This maintains good performance but can be expensive as it requires rehashing all existing elements.

Q: What is the main difference between std::map and std::unordered_map?

std::map maintains elements in sorted order (using a balanced tree), while std::unordered_map does not guarantee any specific order. This is the fundamental difference - std::map provides ordered iteration, std::unordered_map provides faster average access.

Q: When should you use std::unordered_set over std::set?

Use std::unordered_set when you need fast average O(1) lookup and don't care about the order of elements. std::set provides O(log n) lookup but maintains sorted order. Choose based on whether you need ordering or speed.

Q: What is the purpose of a hash function in hash containers?

The hash function's purpose is to distribute elements evenly across buckets. A good hash function minimizes collisions by spreading elements uniformly across the available buckets, which maintains O(1) average performance for insertions and lookups.

Hashing Containers ​

What is Hashing? ​

Basic Hashing Concept ​

How Hash Tables Work ​

std::unordered_map ​

Basic Usage ​

Iterating Over unordered_map ​

Advanced Operations ​

std::unordered_set ​

Basic Usage ​

Set Operations ​

Performance and Memory ​

Understanding Load Factor ​

How Rehashing Works ​

Time Complexity ​

Memory Overhead ​

Custom Hash Functions ​

For Custom Types ​

For Enums ​

When to Use Hash Containers ​

Use std::unordered_map/unordered_set when: ​

Consider alternatives when: ​

Summary ​

Questions ​

Hashing Containers

What is Hashing?

Basic Hashing Concept

How Hash Tables Work

std::unordered_map

Basic Usage

Iterating Over unordered_map

Advanced Operations

std::unordered_set

Basic Usage

Set Operations

Performance and Memory

Understanding Load Factor

How Rehashing Works

Time Complexity

Memory Overhead

Custom Hash Functions

For Custom Types

For Enums

When to Use Hash Containers

Use std::unordered_map/unordered_set when:

Consider alternatives when:

Summary

Questions