Skip to content

Hashing Containers

Hash-based containers provide the fastest average lookup times in the STL, offering O(1) average complexity for insertions, deletions, and searches. Understanding how hashing works and when to use these containers is crucial for building high-performance applications.

What is Hashing?

Hashing is a technique that maps data of arbitrary size to fixed-size values (hash codes). A hash function takes an input and produces a hash value that determines where the data is stored in a hash table.

Basic Hashing Concept

cpp
// Simple hash function for integers
size_t simpleHash(int value) {
    return value % 1000;  // Maps to buckets 0-999
}

// Hash function for strings
size_t stringHash(const std::string& str) {
    size_t hash = 0;
    for (char c : str) {
        hash = hash * 31 + c;  // Simple but effective
    }
    return hash;
}

// Example usage
int num = 42;
size_t hash1 = simpleHash(num);        // hash1 = 42
std::string text = "hello";
size_t hash2 = stringHash(text);       // hash2 = some value

Key Properties of Good Hash Functions:

  • Deterministic: Same input always produces same output
  • Uniform Distribution: Outputs spread evenly across range
  • Fast Computation: Should be quick to calculate
  • Minimal Collisions: Different inputs should rarely produce same output

How Hash Tables Work

cpp
// Conceptual hash table structure
struct HashTable {
    std::vector<std::list<std::pair<std::string, int>>> buckets;
    size_t size;
    float load_factor;
};

// Simplified example
std::vector<std::list<std::pair<std::string, int>>> hash_table(10);

// Inserting "apple" -> 5
size_t hash = stringHash("apple") % 10;  // hash = 3
hash_table[3].push_back({"apple", 5});

// Looking up "apple"
hash = stringHash("apple") % 10;  // hash = 3
for (const auto& pair : hash_table[3]) {
    if (pair.first == "apple") {
        std::cout << "Found: " << pair.second << std::endl;  // 5
        break;
    }
}

Hash Table Components:

  • Buckets: Array of containers (usually linked lists)
  • Hash Function: Maps keys to bucket indices
  • Collision Resolution: Handles multiple keys mapping to same bucket
  • Load Factor: Ratio of elements to buckets

std::unordered_map

Basic Usage

cpp
#include <unordered_map>
#include <string>
#include <iostream>

// Create an unordered map
std::unordered_map<std::string, int> scores;

// Insert elements
scores["Alice"] = 95;
scores["Bob"] = 87;
scores["Charlie"] = 92;

// Or use insert method
scores.insert({"David", 89});
scores.emplace("Eve", 91);

// Access elements
std::cout << "Alice's score: " << scores["Alice"] << std::endl;  // 95

// Check if key exists
if (scores.find("Frank") != scores.end()) {
    std::cout << "Frank's score: " << scores["Frank"] << std::endl;
} else {
    std::cout << "Frank not found" << std::endl;
}

// Safe access with default value
int frank_score = scores["Frank"];  // Creates entry with value 0

Iterating Over unordered_map

cpp
std::unordered_map<std::string, int> scores = {
    {"Alice", 95}, {"Bob", 87}, {"Charlie", 92}
};

// Range-based for loop
for (const auto& [name, score] : scores) {
    std::cout << name << ": " << score << std::endl;
}

// Iterator-based loop
for (auto it = scores.begin(); it != scores.end(); ++it) {
    std::cout << it->first << ": " << it->second << std::endl;
}

// Note: Order is not guaranteed!
// Output might be: Charlie: 92, Alice: 95, Bob: 87

Advanced Operations

cpp
std::unordered_map<std::string, int> scores;

// Insert with duplicate handling
auto result = scores.insert({"Alice", 95});
if (!result.second) {
    std::cout << "Alice already exists with score: " << result.first->second << std::endl;
}

// Update existing value
scores["Alice"] = 98;  // Overwrites existing value

// Remove elements
scores.erase("Bob");

// Get size and check emptiness
std::cout << "Size: " << scores.size() << std::endl;
std::cout << "Empty: " << scores.empty() << std::endl;

// Clear all elements
scores.clear();

std::unordered_set

Basic Usage

cpp
#include <unordered_set>
#include <string>

// Create an unordered set
std::unordered_set<std::string> unique_names;

// Insert elements
unique_names.insert("Alice");
unique_names.emplace("Bob");
unique_names.insert("Charlie");

// Check membership
if (unique_names.find("Alice") != unique_names.end()) {
    std::cout << "Alice is in the set" << std::endl;
}

// Count occurrences (0 or 1 for sets)
size_t count = unique_names.count("Bob");  // 1

// Remove elements
unique_names.erase("Charlie");

// Get size
std::cout << "Number of unique names: " << unique_names.size() << std::endl;

Set Operations

cpp
std::unordered_set<int> set1 = {1, 2, 3, 4, 5};
std::unordered_set<int> set2 = {4, 5, 6, 7, 8};

// Check if element exists
bool has_three = set1.contains(3);  // C++20
bool has_six = set2.find(6) != set2.end();  // C++17 and earlier

// Iterate over elements
for (const auto& num : set1) {
    std::cout << num << " ";  // Order not guaranteed
}
std::cout << std::endl;

Performance and Memory

Understanding Load Factor

The load factor is the ratio of the number of elements to the number of buckets:

cpp
std::unordered_map<std::string, int> scores;

// Get current load factor
float current_load = scores.load_factor();
std::cout << "Current load factor: " << current_load << std::endl;

// Get maximum load factor
float max_load = scores.max_load_factor();
std::cout << "Maximum load factor: " << max_load << std::endl;

// Set custom maximum load factor
scores.max_load_factor(0.5f);  // More aggressive rehashing

// Reserve buckets to control load factor
scores.reserve(1000);  // Pre-allocate 1000 buckets

How Rehashing Works

cpp
std::unordered_map<std::string, int> scores;

// Insert elements and watch rehashing
for (int i = 0; i < 1000; ++i) {
    std::string name = "User" + std::to_string(i);
    scores[name] = i;

    // Check when rehashing occurs
    if (i % 100 == 0) {
        std::cout << "Elements: " << scores.size() 
                  << ", Buckets: " << scores.bucket_count()
                  << ", Load Factor: " << scores.load_factor() << std::endl;
    }
}

Rehashing Process:

  1. Trigger: Load factor exceeds maximum load factor
  2. Growth: Number of buckets increases (usually doubles)
  3. Redistribution: All elements are rehashed and redistributed
  4. Cost: O(n) operation, but amortized over many insertions

Time Complexity

OperationAverage CaseWorst CaseNotes
InsertionO(1)O(n)Depends on hash function quality
LookupO(1)O(n)Collisions can degrade performance
DeletionO(1)O(n)Same as lookup
IterationO(n)O(n)Must visit all elements

Memory Overhead

cpp
// Compare memory usage
std::vector<std::pair<std::string, int>> vector_storage;
std::unordered_map<std::string, int> hash_storage;

// Add same elements to both
for (int i = 0; i < 1000; ++i) {
    std::string key = "key" + std::to_string(i);
    vector_storage.push_back({key, i});
    hash_storage[key] = i;
}

// Hash containers have overhead:
// - Bucket array
// - Hash table metadata
// - Potential unused buckets
// - Linked list overhead for collisions

Custom Hash Functions

For Custom Types

cpp
struct Person {
    std::string name;
    int age;

    bool operator==(const Person& other) const {
        return name == other.name && age == other.age;
    }
};

// Custom hash function for Person
struct PersonHash {
    size_t operator()(const Person& person) const {
        size_t name_hash = std::hash<std::string>{}(person.name);
        size_t age_hash = std::hash<int>{}(person.age);
        return name_hash ^ (age_hash << 1);  // Combine hashes
    }
};

// Use custom hash function
std::unordered_set<Person, PersonHash> people;
people.insert({"Alice", 25});
people.insert({"Bob", 30});

For Enums

cpp
enum class Color { Red, Green, Blue };

// Hash function for enums
struct ColorHash {
    size_t operator()(Color color) const {
        return static_cast<size_t>(color);
    }
};

std::unordered_map<Color, std::string, ColorHash> color_names = {
    {Color::Red, "Red"},
    {Color::Green, "Green"},
    {Color::Blue, "Blue"}
};

When to Use Hash Containers

Use std::unordered_map/unordered_set when:

  • Fast lookup is critical: O(1) average access time
  • Order doesn't matter: Elements can be in any order
  • Frequent insertions/deletions: Good average performance
  • Large datasets: Hash tables scale well
  • Key-based access: Need to find elements by key quickly

Consider alternatives when:

  • Ordered iteration needed: Use std::map or std::set
  • Memory is critical: Hash containers have overhead
  • Predictable performance: Hash tables have variable performance
  • Small datasets: Overhead might not be worth it

Summary

Hash containers (std::unordered_map and std::unordered_set) provide:

Advantages:

  • Fast average access: O(1) insertion, lookup, and deletion
  • Scalable performance: Good performance with large datasets
  • Flexible key types: Can hash any type with proper hash function

Trade-offs:

  • No ordering guarantee: Elements are not sorted
  • Variable performance: Depends on hash function quality and load factor
  • Memory overhead: Additional storage for buckets and metadata
  • Rehashing cost: Periodic O(n) rehashing operations

Best Use Cases:

  • Fast lookups: When you need O(1) average access time
  • Unordered data: When element order doesn't matter
  • Large datasets: When the benefits of hashing outweigh overhead
  • Key-value storage: When you need efficient key-based access

Questions

Q: What is the average time complexity of insertion and lookup in std::unordered_map?

std::unordered_map provides average O(1) insertion and lookup time. This is because hash tables use hashing to distribute elements across buckets, allowing direct access to elements in most cases. However, worst-case can be O(n) if there are many hash collisions.

Q: What happens when the load factor of a hash container exceeds the maximum load factor?

When the load factor exceeds the maximum load factor, the container automatically rehashes by increasing the number of buckets and redistributing all elements. This maintains good performance but can be expensive as it requires rehashing all existing elements.

Q: What is the main difference between std::map and std::unordered_map?

std::map maintains elements in sorted order (using a balanced tree), while std::unordered_map does not guarantee any specific order. This is the fundamental difference - std::map provides ordered iteration, std::unordered_map provides faster average access.

Q: When should you use std::unordered_set over std::set?

Use std::unordered_set when you need fast average O(1) lookup and don't care about the order of elements. std::set provides O(log n) lookup but maintains sorted order. Choose based on whether you need ordering or speed.

Q: What is the purpose of a hash function in hash containers?

The hash function's purpose is to distribute elements evenly across buckets. A good hash function minimizes collisions by spreading elements uniformly across the available buckets, which maintains O(1) average performance for insertions and lookups.