Project Setup¶
Learn how to organize and structure your Housekeeper projects for maintainability and team collaboration.
Project Structure¶
A well-organized Housekeeper project follows this recommended structure:
my-clickhouse-project/
├── housekeeper.yaml              # Project configuration
├── db/
│   ├── main.sql                  # Main schema entrypoint
│   ├── config.d/                 # ClickHouse configuration files
│   │   ├── _clickhouse.xml       # Generated cluster configuration
│   │   └── custom.xml            # Your custom ClickHouse config
│   ├── migrations/               # Generated migration files
│   │   ├── 20240101120000.sql
│   │   ├── 20240101130000.sql
│   │   └── housekeeper.sum       # Migration integrity checksum
│   └── schemas/                  # Modular schema organization
│       ├── _global/              # Global objects (not tied to databases)
│       │   ├── schema.sql        # Global objects import file
│       │   ├── roles/            # Role definitions and grants
│       │   │   ├── admin_role.sql
│       │   │   └── readonly_role.sql
│       │   └── collections/      # Named collection definitions
│       │       ├── api_configs.sql
│       │       └── kafka_configs.sql
│       ├── analytics/
│       │   ├── schema.sql        # Database definition
│       │   ├── tables/           # Table definitions
│       │   │   ├── events.sql
│       │   │   ├── users.sql
│       │   │   └── products.sql
│       │   ├── dictionaries/     # Dictionary definitions
│       │   │   ├── countries.sql
│       │   │   └── categories.sql
│       │   └── views/            # View definitions
│       │       ├── daily_stats.sql
│       │       └── user_summary.sql
│       └── logs/
│           ├── schema.sql
│           └── tables/
│               └── access_logs.sql
├── environments/                 # Environment-specific configs
│   ├── development.yaml
│   ├── staging.yaml
│   └── production.yaml
└── README.md                     # Project documentation
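Each leaf file under db/schemas/ holds a single object definition. As a minimal sketch, the global files db/schemas/_global/roles/readonly_role.sql and db/schemas/_global/collections/kafka_configs.sql might contain definitions along these lines (role name, grants, and collection keys are illustrative):
-- db/schemas/_global/roles/readonly_role.sql
CREATE ROLE readonly_role ON CLUSTER my_cluster;
GRANT SELECT ON analytics.* TO readonly_role;
-- db/schemas/_global/collections/kafka_configs.sql
CREATE NAMED COLLECTION kafka_configs ON CLUSTER my_cluster AS
    kafka_broker_list = 'kafka:9092',
    kafka_format = 'JSONEachRow';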
Configuration File¶
The housekeeper.yaml file defines your project configuration:
# ClickHouse Configuration
clickhouse:
  version: "25.7"             # Version for Docker containers
  config_dir: "db/config.d"   # ClickHouse configuration directory
  cluster: "my_cluster"       # Default cluster name for ON CLUSTER operations

# Schema Configuration
entrypoint: db/main.sql       # Main schema file with import directives
dir: db/migrations            # Directory for generated migration files
Schema Import System¶
The import system allows you to organize complex schemas into manageable, modular files.
Main Schema File¶
Your db/main.sql serves as the entrypoint:
-- Main schema file
-- This file coordinates all schema imports
-- Import database definitions first
-- housekeeper:import schemas/analytics/schema.sql
-- housekeeper:import schemas/logs/schema.sql
-- Import table definitions
-- housekeeper:import schemas/analytics/tables/users.sql
-- housekeeper:import schemas/analytics/tables/events.sql
-- housekeeper:import schemas/analytics/tables/products.sql
-- Import dictionaries
-- housekeeper:import schemas/analytics/dictionaries/countries.sql
-- housekeeper:import schemas/analytics/dictionaries/categories.sql
-- Import views (depends on tables and dictionaries)
-- housekeeper:import schemas/analytics/views/daily_stats.sql
-- housekeeper:import schemas/analytics/views/user_summary.sql
Database Schema Files¶
db/schemas/analytics/schema.sql:
-- Analytics database definition
CREATE DATABASE analytics
ON CLUSTER my_cluster
ENGINE = Atomic
COMMENT 'Analytics and user behavior database';
Table Definition Files¶
db/schemas/analytics/tables/users.sql:
-- User profiles table
CREATE TABLE analytics.users ON CLUSTER my_cluster (
    id UInt64,
    email String,
    name String,
    created_at DateTime DEFAULT now(),
    updated_at DateTime DEFAULT now(),
    status LowCardinality(String) DEFAULT 'active',
    metadata Map(String, String) DEFAULT map(),
    -- Materialized columns for analytics
    signup_month UInt32 MATERIALIZED toYYYYMM(created_at),
    email_domain String MATERIALIZED splitByChar('@', email)[2]
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY id
PARTITION BY signup_month
SETTINGS index_granularity = 8192;
db/schemas/analytics/tables/events.sql:
-- User event tracking table
CREATE TABLE analytics.events ON CLUSTER my_cluster (
    id UUID DEFAULT generateUUIDv4(),
    user_id UInt64,
    event_type LowCardinality(String),
    timestamp DateTime DEFAULT now(),
    session_id String,
    properties Map(String, String) DEFAULT map(),
    -- Derived columns
    date Date MATERIALIZED toDate(timestamp),
    hour UInt8 MATERIALIZED toHour(timestamp)
)
ENGINE = MergeTree()
PARTITION BY (toYYYYMM(timestamp), event_type)
ORDER BY (timestamp, user_id, event_type)
TTL timestamp + INTERVAL 365 DAY
SETTINGS index_granularity = 8192;
Dictionary Definition Files¶
db/schemas/analytics/dictionaries/countries.sql:
-- Country lookup dictionary
CREATE DICTIONARY analytics.countries_dict ON CLUSTER my_cluster (
    country_code String,
    country_name String,
    continent String,
    population UInt64 DEFAULT 0
)
PRIMARY KEY country_code
SOURCE(HTTP(
    url 'https://api.example.com/countries'
    format 'JSONEachRow'
    headers(header(name 'Authorization' value 'Bearer YOUR_TOKEN'))
))
LAYOUT(HASHED())
LIFETIME(MIN 3600 MAX 7200)
COMMENT 'Country reference data from external API';
View Definition Files¶
db/schemas/analytics/views/daily_stats.sql:
-- Daily analytics summary view
CREATE MATERIALIZED VIEW analytics.daily_stats ON CLUSTER my_cluster
ENGINE = MergeTree()
ORDER BY (date, event_type)
POPULATE
AS SELECT
    date,
    event_type,
    count() AS event_count,
    uniq(user_id) AS unique_users,
    uniq(session_id) AS unique_sessions,
    countIf(user_id = 0) AS anonymous_events,
    quantile(0.5)(length(properties['page_url'])) AS median_url_length
FROM analytics.events
WHERE date >= today() - INTERVAL 90 DAY
GROUP BY date, event_type;
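A regular (non-materialized) view follows the same pattern. For instance, db/schemas/analytics/views/user_summary.sql could hold something like the following sketch (the exact columns are illustrative):
-- Per-user activity summary view
CREATE VIEW analytics.user_summary ON CLUSTER my_cluster
AS SELECT
    user_id,
    count() AS total_events,
    uniq(session_id) AS total_sessions,
    min(timestamp) AS first_seen,
    max(timestamp) AS last_seen
FROM analytics.events
GROUP BY user_id;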
Import Resolution¶
The import system works as follows:
- Relative Paths: Import paths are resolved relative to the importing file
- Recursive Processing: Imported files can import other files
- Ordering: Imports are processed in the order they appear
- Single Output: All imports are combined into a single SQL stream
Example Import Resolution¶
Given this structure:
db/
├── main.sql
└── schemas/
    └── analytics/
        ├── schema.sql
        └── tables/
            └── users.sql
In db/main.sql:
-- housekeeper:import schemas/analytics/schema.sql
In db/schemas/analytics/schema.sql:
-- housekeeper:import tables/users.sql
The path tables/users.sql is resolved relative to db/schemas/analytics/schema.sql, resulting in db/schemas/analytics/tables/users.sql.
Environment-Specific Configuration¶
For different environments, create separate configuration files:
environments/development.yaml¶
clickhouse:
  version: "25.7"
  config_dir: "db/config.d"
  cluster: "dev_cluster"

entrypoint: db/main.sql
dir: db/migrations
environments/production.yaml¶
clickhouse:
  version: "25.7"
  config_dir: "db/config.d"
  cluster: "production_cluster"

entrypoint: db/main.sql
dir: db/migrations
Use environment-specific configs:
# Development
housekeeper diff --config environments/development.yaml
# Production
housekeeper diff --config environments/production.yaml
Best Practices¶
File Organization¶
- One Object Per File: Each table, dictionary, or view in its own file
- Logical Grouping: Group related objects in directories
- Descriptive Names: Use clear, descriptive filenames
- Consistent Structure: Maintain consistent directory structure across projects
Import Guidelines¶
- Dependency Order: Import dependencies before dependents (see the ordering sketch after this list)
- Database First: Always import database definitions first
- Tables Before Views: Import tables before views that depend on them
- Dictionaries Before Tables: Import dictionaries before tables that reference them
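Put together, an import file that follows these guidelines might order its directives like this (paths are illustrative):
-- Databases first
-- housekeeper:import schemas/analytics/schema.sql
-- Dictionaries before the tables that reference them
-- housekeeper:import schemas/analytics/dictionaries/countries.sql
-- Tables before the views that read from them
-- housekeeper:import schemas/analytics/tables/events.sql
-- Views last
-- housekeeper:import schemas/analytics/views/daily_stats.sql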
Schema Design¶
- Use Comments: Document your schema with SQL comments
- Cluster Awareness: Include ON CLUSTER clauses for distributed deployments
- Performance Optimization: Choose appropriate partition keys and ordering
- Data Lifecycle: Use TTL for data retention policies (see the sketch after this list)
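A minimal sketch tying these points together (table name, columns, and retention period are illustrative):
-- Page view tracking with explicit lifecycle settings
CREATE TABLE analytics.page_views ON CLUSTER my_cluster (
    user_id UInt64,
    url String,
    viewed_at DateTime DEFAULT now()
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(viewed_at)  -- monthly partitions for efficient pruning
ORDER BY (viewed_at, user_id)     -- ordering matches the most common query filters
TTL viewed_at + INTERVAL 180 DAY  -- retention policy enforced by ClickHouse
COMMENT 'Illustrative table combining the schema design practices above';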
Version Control¶
- Track Everything: Commit all schema files, configurations, and migrations
- Meaningful Commits: Use descriptive commit messages for schema changes
- Branch Strategy: Use feature branches for schema changes
- Code Reviews: Review schema changes like application code
Bootstrapping from Existing Databases¶
When starting with an existing ClickHouse database, Housekeeper provides a complete bootstrap workflow:
Step 1: Bootstrap the Project¶
# Initialize project structure
housekeeper init
# Extract schema from existing database
housekeeper bootstrap --url localhost:9000
This creates a fully organized project structure with:
- Separate files for each database object
- Proper import directives
- Organized directory hierarchy
Step 2: Create Initial Snapshot¶
Since you're starting with an existing database, you need to create an initial snapshot:
# Create bootstrap snapshot from project schema
housekeeper snapshot --bootstrap --description "Initial database state"
This creates a snapshot file that represents your current database state. The --bootstrap flag tells the command to use the compiled project schema instead of existing migrations (which don't exist yet).
Step 3: Continue with Normal Workflow¶
After bootstrapping and creating the initial snapshot, you can use the normal development workflow:
# Make changes to schema files
# ...
# Generate migrations as usual
housekeeper diff
# Apply migrations
housekeeper migrate --url localhost:9000
Migration Management¶
Migration Integrity¶
Housekeeper generates a housekeeper.sum file that tracks migration integrity:
h1:TotalHashOfAllMigrations=
20240101120000.sql h1:HashOfMigration1=
20240101130000.sql h1:ChainedHashWithPrevious=
This prevents unauthorized migration modifications and ensures consistency across environments.
Migration Workflow¶
- Make Schema Changes: Edit your schema files
- Generate Migration: Run housekeeper diff
- Review Migration: Examine the generated SQL (see the sketch after this list)
- Test Migration: Apply to development environment
- Commit Changes: Commit schema files and migration together
- Deploy: Apply migration to staging/production
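As an illustration of the review step, a generated migration file such as db/migrations/20240101130000.sql typically contains plain ClickHouse DDL; the exact statements depend on your schema change, but a hypothetical example looks like:
-- Hypothetical migration adding a column to analytics.users
ALTER TABLE analytics.users ON CLUSTER my_cluster
    ADD COLUMN last_login DateTime DEFAULT now();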
Next Steps¶
- Schema Management - Learn best practices for writing schemas
- Migration Process - Understand the migration workflow
- Cluster Management - Configure for distributed ClickHouse
- Examples - See a complete example project