Migration Process¶
Learn how Housekeeper generates, manages, and applies ClickHouse schema migrations.
Overview¶
Housekeeper's migration system compares your desired schema (defined in your schema files) with the current state of your ClickHouse database and generates the SQL statements needed to transform the current state into the target state.
📝 Migration Tracking
Housekeeper automatically manages migration tracking infrastructure. When you run your first migration against any ClickHouse instance, Housekeeper creates a housekeeper.revisions table to track which migrations have been applied. This happens automatically - no manual setup required.
Connection Configuration¶
Before running migrations, configure your ClickHouse connection. The recommended approach is using the HOUSEKEEPER_DATABASE_URL environment variable:
# Set connection URL once for all commands
export HOUSEKEEPER_DATABASE_URL="localhost:9000"
# Now all commands will use this connection automatically
housekeeper migrate
housekeeper status
housekeeper schema dump
Alternatively, use the --url flag with each command:
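# Pass the connection explicitly on each invocation
housekeeper migrate --url localhost:9000
housekeeper status --url localhost:9000
housekeeper schema dump --url localhost:9000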
How Migrations Work¶
1. Development Server Workflow¶
Housekeeper uses a development workflow with Docker containers:
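# Start a local ClickHouse instance in Docker
housekeeper dev up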
This process:
- Starts a ClickHouse Docker container
- Applies all existing migrations from db/migrations/
- Provides connection details for the running server
2. Schema Compilation¶
When comparing schemas, Housekeeper compiles your target schema:
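# Compile schema files into a single schema
housekeeper schema compile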
The compilation process:
- Processes all -- housekeeper:import directives recursively
- Resolves relative paths from each file's location
- Combines all SQL into a single output with proper ordering
- Validates all DDL syntax with Housekeeper's SQL parser
3. Intelligent Comparison¶
The diff command compares your target schema with the current database state:
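# Compare the compiled schema against the development server
housekeeper diff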
The comparison algorithm:
- Current State: Reads schema from the running development server
- Target State: Compiles your schema files
- Object Detection: Identifies new, modified, and removed objects
- Rename Detection: Recognizes when objects are renamed (avoiding DROP+CREATE)
- Property Analysis: Compares all properties of each object type
- Dependency Resolution: Understands relationships between objects
4. Migration Generation¶
Based on the comparison, Housekeeper generates optimal migration strategies:
- Databases: CREATE, ALTER (comments), RENAME, DROP
- Tables: CREATE, ALTER (columns), RENAME, DROP
- Dictionaries: CREATE OR REPLACE (dictionaries can't be altered)
- Views: CREATE OR REPLACE for regular views, DROP+CREATE for materialized views
Migration Strategies¶
Database Operations¶
Creating Databases¶
When a database exists in the target schema but not in the current state:
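-- A database present only in the target schema becomes a CREATE
CREATE DATABASE analytics ENGINE = Atomic COMMENT 'Analytics database';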
Modifying Database Comments¶
When database properties change:
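-- Illustrative: a changed comment becomes an ALTER (exact output may vary)
ALTER DATABASE analytics MODIFY COMMENT 'Updated analytics database';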
Renaming Databases¶
When a database has identical properties but a different name:
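-- Matching properties under a different name become a RENAME
RENAME DATABASE old_analytics TO analytics;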
Table Operations¶
Creating Tables¶
CREATE TABLE analytics.events (
    id UUID DEFAULT generateUUIDv4(),
    timestamp DateTime,
    event_type String
) ENGINE = MergeTree() ORDER BY timestamp;
Altering Tables¶
For standard table engines, Housekeeper generates precise ALTER statements:
-- Add new column
ALTER TABLE analytics.events ADD COLUMN user_id UInt64;
-- Modify column type
ALTER TABLE analytics.events MODIFY COLUMN event_type LowCardinality(String);
-- Drop column
ALTER TABLE analytics.events DROP COLUMN old_column;
Integration Engine Tables¶
For integration engines (Kafka, MySQL, PostgreSQL, etc.), Housekeeper automatically uses a DROP+CREATE strategy:
-- Integration engines require recreation for modifications
DROP TABLE integration.kafka_events;
CREATE TABLE integration.kafka_events (...) ENGINE = Kafka(...);
Dictionary Operations¶
Dictionaries use CREATE OR REPLACE for all modifications since ClickHouse doesn't support ALTER DICTIONARY:
-- Any dictionary change becomes CREATE OR REPLACE
CREATE OR REPLACE DICTIONARY analytics.users_dict (
    id UInt64 IS_OBJECT_ID,
    name String INJECTIVE
) PRIMARY KEY id
SOURCE(HTTP(url 'http://api.example.com/users'))
LAYOUT(HASHED())
LIFETIME(3600);
View Operations¶
Regular Views¶
Use CREATE OR REPLACE for modifications:
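-- Illustrative: the changed view definition is replaced in place
CREATE OR REPLACE VIEW analytics.daily_summary AS
SELECT date, count() AS events
FROM analytics.events
GROUP BY date;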
Materialized Views¶
Use DROP+CREATE for query modifications (more reliable than ALTER TABLE MODIFY QUERY):
-- Drop existing materialized view
DROP TABLE analytics.mv_daily_stats;
-- Recreate with new query
CREATE MATERIALIZED VIEW analytics.mv_daily_stats
ENGINE = MergeTree() ORDER BY date
AS SELECT date, count(), sum(amount) FROM events GROUP BY date;
Migration Ordering¶
Housekeeper ensures proper operation ordering to handle dependencies:
UP Migrations (Create)¶
- Databases - Create databases first
- Tables - Create tables that other objects depend on
- Dictionaries - Create dictionaries after source tables
- Views - Create views last (depend on tables and dictionaries)
Within each type:
1. CREATE - Create new objects
2. ALTER/REPLACE - Modify existing objects
3. RENAME - Rename objects
4. DROP - Remove objects
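For example, an UP migration that introduces a database, a table, and a dependent view is ordered like this (illustrative):
-- 1. Database first
CREATE DATABASE analytics ENGINE = Atomic;
-- 2. Tables next
CREATE TABLE analytics.events (...) ENGINE = MergeTree() ORDER BY timestamp;
-- 3. Views last, since they depend on tables
CREATE VIEW analytics.daily_summary AS SELECT ... FROM analytics.events;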
DOWN Migrations (Destroy)¶
Reverse order of UP migrations for safe teardown.
Rename Detection¶
Housekeeper includes intelligent rename detection to avoid unnecessary DROP+CREATE operations:
How Rename Detection Works¶
- Property Comparison: Compares all properties except names
- Exact Match: Properties must match exactly (except name/database)
- Generate RENAME: Creates efficient RENAME statements instead of DROP+CREATE
Example Rename Detection¶
Current Schema:
CREATE DATABASE old_analytics ENGINE = Atomic COMMENT 'Analytics DB';
CREATE TABLE old_analytics.user_events (...) ENGINE = MergeTree() ORDER BY timestamp;
Target Schema:
CREATE DATABASE analytics ENGINE = Atomic COMMENT 'Analytics DB';
CREATE TABLE analytics.events (...) ENGINE = MergeTree() ORDER BY timestamp;
Generated Migration:
-- Efficient renames instead of DROP+CREATE
RENAME DATABASE old_analytics TO analytics;
RENAME TABLE analytics.user_events TO analytics.events;
Migration Files¶
File Naming¶
Migration files use UTC timestamps for consistent ordering:
- Format: yyyyMMddHHmmss.sql
- Example: 20240806143022.sql
- Snapshots: yyyyMMddHHmmss_snapshot.sql
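A migrations directory that mixes an early snapshot with later incremental migrations might look like this (illustrative layout):
db/migrations/
├── 20240801120000_snapshot.sql
├── 20240806143022.sql
└── housekeeper.sum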
File Structure¶
Each migration file includes:
-- Schema migration generated at 2024-08-06 14:30:22 UTC
-- Down migration: swap current and target schemas and regenerate
-- Create database 'analytics'
CREATE DATABASE analytics ENGINE = Atomic COMMENT 'Analytics database';
-- Create table 'analytics.events'
CREATE TABLE analytics.events (
    id UUID DEFAULT generateUUIDv4(),
    timestamp DateTime,
    event_type String
) ENGINE = MergeTree() ORDER BY timestamp;
Snapshot Consolidation¶
Over time, you may accumulate many migration files. Housekeeper provides a snapshot feature to consolidate migrations:
# Consolidate all existing migrations into a single snapshot
housekeeper snapshot --description "Q4 2024 consolidation"
# This will:
# 1. Create a timestamped snapshot file with all migration content
# 2. Remove the individual migration files that were consolidated
# 3. Update the migration sum file
Benefits of snapshot consolidation:
- Reduced Clutter: Fewer migration files to manage
- Faster Setup: New environments apply one snapshot instead of many migrations
- Preserved History: The snapshot contains all historical changes
- Clean Baseline: Easier to understand the current schema state
Migration Integrity¶
Housekeeper generates a housekeeper.sum file for integrity checking:
h1:TotalHashOfAllMigrations=
20240101120000.sql h1:HashOfMigration1=
20240101130000.sql h1:ChainedHashWithPrevious=
This ensures:
- Tamper Detection: Unauthorized changes are detected
- Consistency: Same migrations across environments
- Chained Verification: Each migration builds on the previous
Automatic Partial Progress Tracking¶
Housekeeper automatically tracks the progress of migration execution at the statement level, enabling seamless recovery from failures without manual intervention.
How Partial Progress Works¶
When applying migrations, Housekeeper:
- Statement-Level Tracking: Records progress after each successful statement
- Hash Validation: Stores cryptographic hashes of each statement for integrity
- Automatic Resume: Automatically detects and resumes from failure points
- No User Action Required: No flags or commands needed - happens automatically
Migration Execution Tracking¶
Each migration execution is tracked in the housekeeper.revisions table:
-- Example revision record
INSERT INTO housekeeper.revisions (
    version,             -- Migration version (e.g., '20240101120000_create_users')
    executed_at,         -- UTC timestamp of execution
    execution_time_ms,   -- Total execution time
    kind,                -- 'migration' or 'snapshot'
    error,               -- Error message if failed (NULL if successful)
    applied,             -- Number of statements successfully applied
    total,               -- Total statements in migration
    hash,                -- h1 hash of entire migration
    partial_hashes,      -- Array of h1 hashes for each statement
    housekeeper_version  -- Version of Housekeeper that ran the migration
);
Automatic Recovery Examples¶
Partial Failure Scenario¶
If a migration fails partway through:
-- Migration: 20240101120000_setup_analytics.sql (5 statements total)
CREATE DATABASE analytics; -- ✅ Statement 1: Success
CREATE TABLE analytics.events (...); -- ✅ Statement 2: Success
CREATE TABLE analytics.users (...); -- ✅ Statement 3: Success
CREATE DICTIONARY analytics.locations (...); -- ❌ Statement 4: Failed (network error)
CREATE VIEW analytics.summary (...); -- ⏸ Statement 5: Not executed
Revision Record Created:
- applied: 3 (statements 1-3 completed)
- total: 5 (migration has 5 statements)
- error: "Network timeout connecting to dictionary source"
- partial_hashes: ["h1:stmt1hash=", "h1:stmt2hash=", "h1:stmt3hash=", ...]
Automatic Resume¶
When you run migrations again:
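# Re-run the same command - Housekeeper resumes automatically
housekeeper migrate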
Housekeeper automatically:
- Detects Partial State: Finds a revision with applied < total
- Validates Integrity: Confirms statements 1-3 haven't changed using hashes
- Shows Progress: Displays what will be resumed
- Resumes Execution: Starts from statement 4 (the failed statement)
Found 1 partially applied migration(s) that will be resumed:
⚠️ 20240101120000_setup_analytics: 3/5 statements applied
Last error: Network timeout connecting to dictionary source
Will resume with 2 remaining statement(s)
Migration execution results:
⚠️ 20240101120000_setup_analytics resumed from statement 4 in 1.234s (2/2 remaining statements)
Safety Features¶
Hash Validation¶
Before resuming, Housekeeper validates migration file integrity:
# ❌ If migration file was modified after partial execution
Error: statement 2 hash mismatch: migration file may have been modified
since partial execution (expected h1:abc123=, got h1:def456=)
# ✅ If migration file is unchanged, resume proceeds safely
Found partial migration, resuming from statement 4...
Statement Count Validation¶
# ❌ If statements were added/removed from migration file
Error: migration statement count changed: expected 5 statements,
found 7 in revision
# This prevents resuming with a different migration file
Development Workflow¶
Development Cycle¶
The typical development workflow is:
# 1. Start development server (applies existing migrations)
housekeeper dev up
# 2. Make schema changes in your files
# Edit db/main.sql or imported files
# 3. Generate migration from changes
housekeeper diff
# 4. Restart server to apply new migration
housekeeper dev down
housekeeper dev up
Working with Existing Databases¶
When starting with an existing ClickHouse database, Housekeeper provides a complete bootstrap workflow:
Step 1: Extract Schema from Existing Database¶
# Bootstrap project from existing database
housekeeper bootstrap --url localhost:9000
# This creates initial project structure with current schema
Step 2: Create Initial Snapshot (New Feature)¶
After bootstrapping, you need to create an initial snapshot to establish a baseline for future migrations. This solves the chicken-and-egg problem where you have schema files but no migrations yet:
# Create bootstrap snapshot from project schema (no migrations required)
housekeeper snapshot --bootstrap --description "Initial database baseline"
# This creates a snapshot file representing the current database state
# Future migrations will be generated against this baseline
The --bootstrap flag tells the snapshot command to use the compiled project schema instead of existing migrations. This is essential when starting with an existing database because:
- You have schema files (from the bootstrap command) but no migrations yet
- The snapshot becomes your "migration zero", representing the current database state
- Future diff commands will generate migrations against this baseline
Excluding Databases¶
When working with existing databases, you can exclude certain databases from the extraction process:
# Extract schema while ignoring test databases
housekeeper schema dump --url localhost:9000 \
--ignore-databases testing_db \
--ignore-databases temp_analytics
# Or configure in housekeeper.yaml for permanent exclusion
clickhouse:
  ignore_databases:
    - testing_db
    - temp_analytics
This is useful when you have test or temporary databases that shouldn't be part of your managed schema.
Validation and Safety¶
Pre-Migration Validation¶
Before generating migrations, Housekeeper validates:
- Schema Syntax: All SQL must parse correctly
- Forbidden Operations: Prevents unsupported operations
- Dependency Check: Ensures proper object dependencies
Forbidden Operations¶
Some operations require manual intervention:
# These operations will return validation errors:
# ❌ Engine changes
# Current: ENGINE = MergeTree()
# Target: ENGINE = ReplacingMergeTree()
# Error: engine type changes not supported
# ❌ Cluster changes
# Current: CREATE TABLE users (...);
# Target: CREATE TABLE users (...) ON CLUSTER prod;
# Error: cluster configuration changes not supported
# ❌ System object modifications
# Error: system object modifications not supported
Automatic Handling¶
Some operations are automatically handled with optimal strategies:
# ✅ Integration engine modifications
# Automatically uses DROP+CREATE strategy
# ✅ Materialized view query changes
# Automatically uses DROP+CREATE strategy
# ✅ Dictionary modifications
# Automatically uses CREATE OR REPLACE strategy
Best Practices¶
Development Workflow¶
- Make Schema Changes: Edit your schema files
- Generate Migration: Run housekeeper diff
- Review Output: Examine the generated SQL carefully
- Test in Development: Apply to development environment first
- Commit Together: Commit schema files and migration together
- Deploy Systematically: Apply to staging, then production
Migration Safety¶
- Backup First: Always backup before applying migrations to production
- Test Thoroughly: Test migrations in non-production environments
- Monitor Application: Watch for application errors after migration
- Have Rollback Plan: Prepare rollback procedures for critical changes
Team Collaboration¶
- Code Reviews: Review migrations like application code
- Sequential Migrations: Avoid parallel schema changes
- Communication: Coordinate schema changes with team
- Documentation: Document complex migrations and their purpose
Troubleshooting¶
Common Issues¶
Migration Generation Fails¶
# Check schema syntax
housekeeper schema compile
# Validate connection
housekeeper schema dump --url localhost:9000
Migration Application Fails¶
# Check ClickHouse logs
docker logs clickhouse-container
# Check development server status
housekeeper dev up --help
Forbidden Operation Errors¶
# These require manual intervention:
# 1. Engine changes: Manually DROP+CREATE
# 2. Cluster changes: Manually recreate objects
# 3. System modifications: Not allowed
Recovery Procedures¶
Corrupted Migration State¶
# Regenerate sum file
housekeeper rehash
# Compare current database with expected schema
housekeeper diff
Failed Migration¶
Housekeeper automatically handles partial migration failures:
# ✅ Simply run migrate again - automatic resume
housekeeper migrate --url localhost:9000
# ✅ Check migration status to see partial progress
housekeeper status --url localhost:9000 --verbose
# ✅ Use dry-run to see what would be resumed
housekeeper migrate --url localhost:9000 --dry-run
If you need to manually investigate:
# Check what was applied using schema dump
housekeeper schema dump --url localhost:9000
# Check revision table for detailed failure info
SELECT version, applied, total, error, executed_at
FROM housekeeper.revisions
WHERE error IS NOT NULL
ORDER BY executed_at DESC;
Next Steps¶
- Schema Management - Learn best practices for schema design
- Configuration - Configure Housekeeper for your environment
- Cluster Management - Handle distributed ClickHouse deployments
- Troubleshooting - Solve common migration issues