Project Setup

Learn how to organize and structure your Housekeeper projects for maintainability and team collaboration.

Project Structure

A well-organized Housekeeper project follows this recommended structure:

my-clickhouse-project/
├── housekeeper.yaml          # Project configuration
├── db/
│   ├── main.sql              # Main schema entrypoint
│   ├── config.d/             # ClickHouse configuration files
│   │   ├── _clickhouse.xml   # Generated cluster configuration
│   │   └── custom.xml        # Your custom ClickHouse config
│   ├── migrations/           # Generated migration files
│   │   ├── 20240101120000.sql
│   │   ├── 20240101130000.sql
│   │   └── housekeeper.sum   # Migration integrity checksum
│   └── schemas/              # Modular schema organization
│       ├── _global/          # Global objects (not tied to databases)
│       │   ├── schema.sql    # Global objects import file
│       │   ├── roles/        # Role definitions and grants
│       │   │   ├── admin_role.sql
│       │   │   └── readonly_role.sql
│       │   └── collections/  # Named collection definitions
│       │       ├── api_configs.sql
│       │       └── kafka_configs.sql
│       ├── analytics/
│       │   ├── schema.sql    # Database definition
│       │   ├── tables/       # Table definitions
│       │   │   ├── events.sql
│       │   │   ├── users.sql
│       │   │   └── products.sql
│       │   ├── dictionaries/ # Dictionary definitions
│       │   │   ├── countries.sql
│       │   │   └── categories.sql
│       │   └── views/        # View definitions
│       │       ├── daily_stats.sql
│       │       └── user_summary.sql
│       └── logs/
│           ├── schema.sql
│           └── tables/
│               └── access_logs.sql
├── environments/             # Environment-specific configs
│   ├── development.yaml
│   ├── staging.yaml
│   └── production.yaml
└── README.md                 # Project documentation

Configuration File

The housekeeper.yaml file defines your project configuration:

# ClickHouse Configuration
clickhouse:
  version: "25.7"             # Version for Docker containers
  config_dir: "db/config.d"   # ClickHouse configuration directory
  cluster: "my_cluster"       # Default cluster name for ON CLUSTER operations

# Schema Configuration  
entrypoint: db/main.sql       # Main schema file with import directives
dir: db/migrations            # Directory for generated migration files

Schema Import System

The import system allows you to organize complex schemas into manageable, modular files.

Main Schema File

Your db/main.sql serves as the entrypoint:

-- Main schema file
-- This file coordinates all schema imports

-- Import database definitions first
-- housekeeper:import schemas/analytics/schema.sql
-- housekeeper:import schemas/logs/schema.sql

-- Import table definitions
-- housekeeper:import schemas/analytics/tables/users.sql
-- housekeeper:import schemas/analytics/tables/events.sql
-- housekeeper:import schemas/analytics/tables/products.sql

-- Import dictionaries
-- housekeeper:import schemas/analytics/dictionaries/countries.sql
-- housekeeper:import schemas/analytics/dictionaries/categories.sql

-- Import views (depends on tables and dictionaries)
-- housekeeper:import schemas/analytics/views/daily_stats.sql
-- housekeeper:import schemas/analytics/views/user_summary.sql

Database Schema Files

db/schemas/analytics/schema.sql:

-- Analytics database definition
CREATE DATABASE analytics 
ON CLUSTER my_cluster 
ENGINE = Atomic 
COMMENT 'Analytics and user behavior database';

Table Definition Files

db/schemas/analytics/tables/users.sql:

-- User profiles table
CREATE TABLE analytics.users ON CLUSTER my_cluster (
    id UInt64,
    email String,
    name String,
    created_at DateTime DEFAULT now(),
    updated_at DateTime DEFAULT now(),
    status LowCardinality(String) DEFAULT 'active',
    metadata Map(String, String) DEFAULT map(),

    -- Materialized columns for analytics
    signup_month UInt32 MATERIALIZED toYYYYMM(created_at),
    email_domain String MATERIALIZED splitByChar('@', email)[2]
) 
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY id
PARTITION BY signup_month
SETTINGS index_granularity = 8192;

db/schemas/analytics/tables/events.sql:

-- User event tracking table
CREATE TABLE analytics.events ON CLUSTER my_cluster (
    id UUID DEFAULT generateUUIDv4(),
    user_id UInt64,
    event_type LowCardinality(String),
    timestamp DateTime DEFAULT now(),
    session_id String,
    properties Map(String, String) DEFAULT map(),

    -- Derived columns
    date Date MATERIALIZED toDate(timestamp),
    hour UInt8 MATERIALIZED toHour(timestamp)
)
ENGINE = MergeTree()
PARTITION BY (toYYYYMM(timestamp), event_type)
ORDER BY (timestamp, user_id, event_type)
TTL timestamp + INTERVAL 365 DAY
SETTINGS index_granularity = 8192;

Dictionary Definition Files

db/schemas/analytics/dictionaries/countries.sql:

-- Country lookup dictionary
CREATE DICTIONARY analytics.countries_dict ON CLUSTER my_cluster (
    country_code String,
    country_name String,
    continent String,
    population UInt64 DEFAULT 0
)
PRIMARY KEY country_code
SOURCE(HTTP(
    url 'https://api.example.com/countries'
    format 'JSONEachRow'
    headers(header(name 'Authorization' value 'Bearer YOUR_TOKEN'))
))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 3600 MAX 7200)
COMMENT 'Country reference data from external API';
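
Once defined, the dictionary can be referenced from tables and views with dictGet. A minimal, illustrative lookup is shown below; note that complex-key dictionaries take the key as a tuple:

-- Illustrative lookup against the dictionary (key passed as a tuple)
SELECT dictGet('analytics.countries_dict', 'country_name', tuple('US')) AS country_name;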

View Definition Files

db/schemas/analytics/views/daily_stats.sql:

-- Daily analytics summary view
CREATE MATERIALIZED VIEW analytics.daily_stats ON CLUSTER my_cluster
ENGINE = MergeTree() 
ORDER BY (date, event_type)
POPULATE
AS SELECT 
    date,
    event_type,
    count() as event_count,
    uniq(user_id) as unique_users,
    uniq(session_id) as unique_sessions,
    countIf(user_id = 0) as anonymous_events,
    quantile(0.5)(length(properties['page_url'])) as median_url_length
FROM analytics.events
WHERE date >= today() - INTERVAL 90 DAY
GROUP BY date, event_type;
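
Global Object Files

Objects that are not tied to a single database, such as roles and named collections, live under db/schemas/_global/. The files below are a hypothetical sketch of what the global import file and a role definition might contain; adapt the names and grants to your environment, and import schemas/_global/schema.sql from db/main.sql so these files are picked up. Named collections under collections/ follow the same one-object-per-file pattern.

db/schemas/_global/schema.sql:

-- Global objects: roles and named collections
-- housekeeper:import roles/admin_role.sql
-- housekeeper:import roles/readonly_role.sql
-- housekeeper:import collections/kafka_configs.sql

db/schemas/_global/roles/readonly_role.sql:

-- Read-only role for analysts
CREATE ROLE IF NOT EXISTS readonly_role ON CLUSTER my_cluster;
GRANT ON CLUSTER my_cluster SELECT ON analytics.* TO readonly_role;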

Import Resolution

The import system works as follows:

  1. Relative Paths: Import paths are resolved relative to the importing file
  2. Recursive Processing: Imported files can import other files
  3. Ordering: Imports are processed in the order they appear
  4. Single Output: All imports are combined into a single SQL stream

Example Import Resolution

Given this structure:

db/
├── main.sql
└── schemas/
    └── analytics/
        ├── schema.sql
        └── tables/
            └── users.sql

In db/main.sql:

-- housekeeper:import schemas/analytics/schema.sql

In db/schemas/analytics/schema.sql:

CREATE DATABASE analytics ENGINE = Atomic;
-- housekeeper:import tables/users.sql

The path tables/users.sql is resolved relative to db/schemas/analytics/schema.sql, resulting in db/schemas/analytics/tables/users.sql.

Environment-Specific Configuration

For different environments, create separate configuration files:

environments/development.yaml

clickhouse:
  version: "25.7"
  config_dir: "db/config.d"
  cluster: "dev_cluster"

entrypoint: db/main.sql
dir: db/migrations

environments/production.yaml

clickhouse:
  version: "25.7"
  config_dir: "db/config.d"
  cluster: "production_cluster"

entrypoint: db/main.sql
dir: db/migrations

Use environment-specific configs:

# Development
housekeeper diff --config environments/development.yaml

# Production  
housekeeper diff --config environments/production.yaml

Best Practices

File Organization

  1. One Object Per File: Each table, dictionary, or view in its own file
  2. Logical Grouping: Group related objects in directories
  3. Descriptive Names: Use clear, descriptive filenames
  4. Consistent Structure: Maintain consistent directory structure across projects

Import Guidelines

  1. Dependency Order: Import dependencies before dependents (see the ordering sketch after this list)
  2. Database First: Always import database definitions first
  3. Tables Before Views: Import tables before views that depend on them
  4. Dictionaries Before Tables: Import dictionaries before tables that reference them
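
A minimal db/main.sql ordering that follows these guidelines might look like this (paths are illustrative):

-- Databases first
-- housekeeper:import schemas/analytics/schema.sql

-- Dictionaries referenced by tables
-- housekeeper:import schemas/analytics/dictionaries/countries.sql

-- Tables before the views that read from them
-- housekeeper:import schemas/analytics/tables/events.sql

-- Views last
-- housekeeper:import schemas/analytics/views/daily_stats.sql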

Schema Design

  1. Use Comments: Document your schema with SQL comments
  2. Cluster Awareness: Include ON CLUSTER clauses for distributed deployments
  3. Performance Optimization: Choose appropriate partition keys and ordering
  4. Data Lifecycle: Use TTL for data retention policies

Version Control

  1. Track Everything: Commit all schema files, configurations, and migrations
  2. Meaningful Commits: Use descriptive commit messages for schema changes
  3. Branch Strategy: Use feature branches for schema changes
  4. Code Reviews: Review schema changes like application code

Bootstrapping from Existing Databases

When starting with an existing ClickHouse database, Housekeeper provides a complete bootstrap workflow:

Step 1: Bootstrap the Project

# Initialize project structure
housekeeper init

# Extract schema from existing database
housekeeper bootstrap --url localhost:9000

This creates a fully organized project structure with:

  - Separate files for each database object
  - Proper import directives
  - Organized directory hierarchy

Step 2: Create Initial Snapshot

Since you're starting with an existing database, you need to create an initial snapshot:

# Create bootstrap snapshot from project schema
housekeeper snapshot --bootstrap --description "Initial database state"

This creates a snapshot file that represents your current database state. The --bootstrap flag tells the command to use the compiled project schema instead of existing migrations (which don't exist yet).

Step 3: Continue with Normal Workflow

After bootstrapping and creating the initial snapshot, you can use the normal development workflow:

# Make changes to schema files
# ...

# Generate migrations as usual
housekeeper diff

# Apply migrations
housekeeper migrate --url localhost:9000

Migration Management

Migration Integrity

Housekeeper generates a housekeeper.sum file that tracks migration integrity:

h1:TotalHashOfAllMigrations=
20240101120000.sql h1:HashOfMigration1=
20240101130000.sql h1:ChainedHashWithPrevious=

This lets Housekeeper detect unauthorized or accidental modification of migration files and ensures consistency across environments.
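
The exact hashing scheme is internal to Housekeeper, but the chaining idea can be sketched in shell: each entry's hash incorporates the previous one, so modifying an early migration invalidates every later checksum. The snippet below is illustrative only, not Housekeeper's actual algorithm:

# Illustrative only: chained checksums over migration files
prev=""
for f in db/migrations/*.sql; do
  prev=$(printf '%s' "$prev" | cat - "$f" | sha256sum | awk '{print $1}')
  echo "$(basename "$f") h1:$prev"
done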

Migration Workflow

  1. Make Schema Changes: Edit your schema files
  2. Generate Migration: Run housekeeper diff
  3. Review Migration: Examine the generated SQL
  4. Test Migration: Apply to development environment
  5. Commit Changes: Commit schema files and migration together
  6. Deploy: Apply migration to staging/production (a typical command sequence is sketched below)
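
Assuming the project layout and environment configs from earlier sections, a typical pass through this workflow looks like the following (hosts are placeholders):

# 1-2. Edit schema files, then generate a migration
housekeeper diff --config environments/development.yaml

# 3-4. Review the generated SQL in db/migrations/, then apply it to development
housekeeper migrate --url localhost:9000

# 5. Commit the schema changes, the new migration, and housekeeper.sum together
git add db/ && git commit -m "Add events retention TTL"

# 6. Apply to staging/production after review
housekeeper migrate --url prod-clickhouse:9000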

Next Steps