Reduce UUID memory consumption in pandas
I was working with a dataset of tens of millions of rows that used UUIDs as identifiers. UUIDs are not natively supported in pandas. One option is to convert them to 128-bit integers, but that caused problems because NumPy does not support 128-bit integers. I was also storing temporary files in Parquet, and those were getting quite big.
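For illustration, here is a minimal sketch of the limitation: a UUID is a 128-bit value, but NumPy's widest fixed-size integer dtype is 64 bits, so the raw integer form of a UUID simply does not fit.

import uuid

import numpy as np

u = uuid.uuid4().int  # a UUID as a single 128-bit Python int
try:
    np.array([u], dtype=np.uint64)  # uint64 is NumPy's widest integer dtype
except OverflowError as exc:
    print(exc)  # the 128-bit value does not fit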
To solve this, I came up with a simple trick: compress each UUID to a 64-bit integer. Since UUIDs are 128 bits long, this is obviously not a perfect solution: two different UUIDs can map to the same integer. On the other hand, a 64-bit integer still offers 2^64 possible values, so collisions are very unlikely, and in my case a few collisions would not even matter.
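To put a rough number on "very unlikely": the birthday approximation gives a collision probability of about n^2 / 2^65 for n random 64-bit values. A quick sketch of the arithmetic, assuming fifty million rows (the exact row count is my assumption here):

# Birthday-bound estimate for n random 64-bit values
n = 50_000_000
p_collision = n**2 / 2**65
print(f"{p_collision:.1e}")  # about 6.8e-05, i.e. far below one in ten thousand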
I was not fully sure whether the UUIDs were truly random, so I decided to hash them first to make sure the results are spread evenly across the whole 64-bit space.
The code for the conversion is simple. SHAKE-128 is an extendable-output hash function, so you can ask it for a digest of exactly eight bytes (64 bits):
int.from_bytes(hashlib.shake_128(uuid.UUID(uuid_str).bytes).digest(8), byteorder='big')
A small test script shows decent memory and file size savings.
# /// script
# requires-python = ">=3.12"
# dependencies = ["pandas", "pyarrow"]
# [tool.uv]
# exclude-newer = "2025-01-10T00:00:00Z"
# ///
import hashlib
import os
import uuid

import pandas as pd
# Function to convert UUID string to integer
def uuid_to_int(uuid_str):
    return int.from_bytes(hashlib.shake_128(uuid.UUID(uuid_str).bytes).digest(8), byteorder='big')
# Generate 1 million random UUIDs
def generate_uuid_list(n):
    return [str(uuid.uuid4()) for _ in range(n)]
# Generate DataFrame with UUID strings
num_records = 1_000_000
uuid_list = generate_uuid_list(num_records)
df_string = pd.DataFrame({'uuid': uuid_list})
# Create a copy and convert UUIDs to integers
df_int = df_string.copy()
df_int['uuid'] = df_int['uuid'].apply(uuid_to_int)
# Memory usage comparison
mem_usage_string = df_string.memory_usage(deep=True).sum() / (1024 ** 2) # in MB
mem_usage_int = df_int.memory_usage(deep=True).sum() / (1024 ** 2) # in MB
print(f"Memory usage with UUID strings: {mem_usage_string:.2f} MB")
print(f"Memory usage with converted integers: {mem_usage_int:.2f} MB")
# Save both DataFrames to Parquet
string_path = '/tmp/uuid_string.parquet'
int_path = '/tmp/uuid_int.parquet'
df_string.to_parquet(string_path)
df_int.to_parquet(int_path)
# File size comparison
size_string = os.path.getsize(string_path) / (1024 ** 2) # in MB
size_int = os.path.getsize(int_path) / (1024 ** 2) # in MB
print(f"Parquet file size with UUID strings: {size_string:.2f} MB")
print(f"Parquet file size with converted integers: {size_int:.2f} MB")
> uv run test.py
Memory usage with UUID strings: 81.06 MB
Memory usage with converted integers: 7.63 MB
Parquet file size with UUID strings: 34.48 MB
Parquet file size with converted integers: 7.90 MB
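The integer figure works out to exactly 8 bytes per row, which suggests pandas stored the hashes as a fixed-width unsigned 64-bit column rather than falling back to object dtype (about half of the hashed values exceed the signed int64 range). A quick check along these lines, assuming the df_int and df_string frames from the script above, confirms the dtype and counts collisions in the sample:

print(df_int['uuid'].dtype)                  # expected: uint64, 8 bytes per row
print(df_int['uuid'].duplicated().sum())     # hash collisions in the sample
print(df_string['uuid'].duplicated().sum())  # duplicate source UUIDs, should be 0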