Inspector Query Costs¶

This document explains what database queries the inspector runs and how to estimate the cost based on your table structure.

Configuration¶

Control which queries run via dataface.yml or default_config.yml:

# In your project's dataface.yml
inspector:
  collect_top_values: true       # Default: true (smart filtering)
  top_values_limit: 10           # Default: 10
  top_values_max_cardinality: 1000  # Skip columns with more distinct values

  # Types to SKIP for top_values (numeric, date, uuid, etc.)
  top_values_skip_types:
    - INTEGER
    - BIGINT
    - FLOAT
    - DOUBLE
    - DATE
    - TIMESTAMP
    - UUID
    # ... see default_config.yml for full list

  sample_size: 100               # Default: 100

  # Deep-profile flags (all off by default — each fires extra per-column SQL)
  deep_profile: false            # Master switch: enables all three below
  collect_histogram_bins: false  # Per numeric column: binned distribution query
  collect_date_distribution: false  # Per date/timestamp column: bucket count query
  collect_enum_values: false     # Per low-cardinality column: full DISTINCT query
  enum_cardinality_threshold: 20 # Max distinct count for enum collection

Deep-profile flags are off by default. Each adds one SQL query per qualifying column — which can be expensive for wide tables. Enable them in dataface.yml when you want richer inspect.json output.

deep_profile: true enables all three collectors (histogram, date distribution, enum values) regardless of the individual flags. To enable a subset, leave deep_profile: false and set only the flags you want:

inspector:
  collect_histogram_bins: true    # histograms only
  collect_date_distribution: true # plus date distributions
  # collect_enum_values stays false

Or access programmatically:

from dataface.core.compile.config import get_config

config = get_config().inspector
print(config.collect_top_values)  # True
print(config.deep_profile)        # False
print(config.collect_histogram_bins)  # False

Query Summary¶

Query Type	Count	Default	Config Setting
Schema	1	✅ On	Always runs
Stats scan	1-N	✅ On	Always runs
Sample data	1	✅ On	`sample_size`
Top values	C	✅ On	`collect_top_values`
Enum values	E	❌ Off	`collect_enum_values` or `deep_profile`
Histogram bins	N_num	❌ Off	`collect_histogram_bins` or `deep_profile`
Date distributions	N_date	❌ Off	`collect_date_distribution` or `deep_profile`

Default (no deep profile): Total queries = 2 + ceil(C / max_cols) + C

With deep_profile: Total queries = 2 + ceil(C / max_cols) + C + E + N_num + N_date

Where: - C = total number of columns - E = number of enum-like columns (distinct count ≤ 20) - N_num = number of numeric columns - N_date = number of date/timestamp columns - max_cols = column limit per stats query (varies by database)

Detailed Breakdown¶

1. Schema Query (1 query)¶

-- Get column names and types
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'your_table'

Cost: Very fast, reads metadata only.

2. Stats Scan (1+ queries)¶

Single table scan computing aggregate stats for all columns:

SELECT
  COUNT(*) as total_rows,
  -- Per column (repeated for each column):
  COUNT(col1) as col1__non_null_count,
  COUNT(DISTINCT col1) as col1__distinct_count,
  MIN(col1) as col1__min,
  MAX(col1) as col1__max,
  AVG(col1) as col1__mean,        -- numeric only
  STDDEV(col1) as col1__stddev,   -- numeric only
  -- ... more columns
FROM your_table

Cost: Full table scan. For wide tables, may be split into multiple queries.

Column limits per query (to avoid SELECT column limits):

Database	Max Columns per Query
DuckDB	1400
BigQuery	1400
MySQL	500
SQL Server	500
PostgreSQL	200
Redshift	200
Snowflake	140
Oracle	140

Example: A PostgreSQL table with 450 columns = ceil(450/200) = 3 stats queries

3. Sample Data Query (1 query)¶

SELECT * FROM your_table LIMIT 100

Cost: Very fast, reads 100 rows for semantic type detection.

4. Top Values Queries (C queries)¶

One query per column to get value frequencies:

SELECT col1, COUNT(*) as frequency
FROM your_table
GROUP BY col1
ORDER BY frequency DESC
LIMIT 10

Cost: Each requires a GROUP BY scan. This is the most expensive part for tables with many columns.

5. Enum Values Queries (E queries) — off by default¶

Only runs when collect_enum_values: true or deep_profile: true, for columns detected as enum-like (distinct count ≤ enum_cardinality_threshold, default 20):

SELECT DISTINCT col1 FROM your_table

Cost: Fast since these columns have low cardinality.

6. Histogram Bins (N_numeric queries) — off by default¶

Only runs when collect_histogram_bins: true or deep_profile: true, for numeric columns:

SELECT
  MIN(val) + (MAX(val) - MIN(val)) / 10 * bucket AS bin_start,
  MIN(val) + (MAX(val) - MIN(val)) / 10 * (bucket + 1) AS bin_end,
  COUNT(*) AS count
FROM your_table, generate_series(0, 9) AS t(bucket)
WHERE val IS NOT NULL
GROUP BY bucket ORDER BY bucket

Cost: One full column scan per numeric column.

7. Date Distributions (N_date queries) — off by default¶

Only runs when collect_date_distribution: true or deep_profile: true, for date/timestamp columns:

SELECT CAST(DATE_TRUNC('month', dt) AS DATE) AS bucket, COUNT(*) AS count
FROM your_table
WHERE dt IS NOT NULL
GROUP BY DATE_TRUNC('month', dt)
ORDER BY bucket
LIMIT 1000

Cost: One full column scan per temporal column. Granularity (day/week/month/year) is auto-selected based on the column's date range.

Cost Calculator¶

Formula¶

Default (no deep profile):

Total Queries = 1 (schema)
              + ceil(C / max_cols) (stats scans)
              + 1 (sample)
              + C (top values, if collect_top_values: true)

With deep_profile (or individual deep flags):

Total Queries = above
              + E (enum values, if collect_enum_values or deep_profile)
              + N_num (histogram bins, if collect_histogram_bins or deep_profile)
              + N_date (date distributions, if collect_date_distribution or deep_profile)

Examples¶

Small Table: 10 columns, PostgreSQL, no deep profile¶

Queries = 1 + ceil(10/200) + 1 + 10
        = 1 + 1 + 1 + 10
        = 13 queries

Small Table: 10 columns, 2 numeric, 1 date, 2 enums, PostgreSQL, deep_profile: true¶

Queries = 1 + ceil(10/200) + 1 + 10 + 2 + 2 + 1
        = 1 + 1 + 1 + 10 + 2 + 2 + 1
        = 18 queries

Wide Table: 500 columns, PostgreSQL, no deep profile¶

Queries = 1 + ceil(500/200) + 1 + 500
        = 1 + 3 + 1 + 500
        = 505 queries

Performance Tips¶

Wide tables are expensive - The top_values queries (one per column) dominate cost for tables with many columns.
DuckDB is most efficient - Higher column limits mean fewer stats queries, plus fast in-memory execution.
Deep profile adds per-column queries - Enable deep_profile selectively. For large tables, prefer collect_histogram_bins: true over deep_profile: true if you only need numeric distributions.
Consider column filtering - Use top_values_skip_types to exclude types you don't need top values for.