July 4, 2022 1 min read

Using Reference Tables in Citus for Efficient Joins

Azure Citus PostgreSQL Reference Tables Performance

Reference tables in Citus are a powerful feature for handling lookup data that needs to be joined with distributed tables. They’re replicated to all worker nodes, enabling efficient local joins.

What are Reference Tables?

Reference tables are small tables that:

Are replicated in full to every worker node
Can be joined with any distributed table locally
Are ideal for lookup data, configuration, or dimensional data

Creating Reference Tables

-- Lookup tables perfect for reference tables
CREATE TABLE countries (
    id int PRIMARY KEY,
    code char(2),
    name text
);

CREATE TABLE product_categories (
    id int PRIMARY KEY,
    name text,
    description text
);

CREATE TABLE subscription_plans (
    id int PRIMARY KEY,
    name text,
    price decimal,
    features jsonb
);

-- Convert to reference tables
SELECT create_reference_table('countries');
SELECT create_reference_table('product_categories');
SELECT create_reference_table('subscription_plans');

Efficient Joins with Reference Tables

-- Distributed table
CREATE TABLE orders (
    id bigserial,
    tenant_id int,
    product_category_id int,
    country_id int,
    amount decimal,
    created_at timestamptz,
    PRIMARY KEY (tenant_id, id)
);

SELECT create_distributed_table('orders', 'tenant_id');

-- This join executes locally on each worker
SELECT
    o.id,
    c.name as country,
    pc.name as category,
    o.amount
FROM orders o
JOIN countries c ON o.country_id = c.id
JOIN product_categories pc ON o.product_category_id = pc.id
WHERE o.tenant_id = 42;

When to Use Reference Tables

-- Good candidates: small, frequently joined tables
-- Check table size before making it a reference table
SELECT
    relname,
    pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
WHERE relname IN ('countries', 'product_categories', 'subscription_plans');

Updating Reference Tables

Updates to reference tables are automatically propagated to all nodes:

-- This update happens on all nodes atomically
INSERT INTO countries (id, code, name)
VALUES (250, 'FR', 'France')
ON CONFLICT (id) DO UPDATE
SET name = EXCLUDED.name;

-- Verify replication
SELECT run_command_on_workers($$
    SELECT COUNT(*) FROM countries
$$);

Best Practices

Keep reference tables small: Under 100MB is ideal
Use for slowly changing data: Frequent updates impact all nodes
Index appropriately: Indexes are replicated too

-- Create indexes on reference tables
CREATE INDEX idx_countries_code ON countries(code);
CREATE INDEX idx_categories_name ON product_categories(name);

Reference tables eliminate network round-trips for joins, significantly improving query performance in distributed setups.