Inspector Query Costs¶
This document explains what database queries the inspector runs and how to estimate the cost based on your table structure.
Configuration¶
Control which queries run via dataface.yml or default_config.yml:
# In your project's dataface.yml inspector: collect_top_values: true # Default: true (smart filtering) top_values_limit: 10 # Default: 10 top_values_max_cardinality: 1000 # Skip columns with more distinct values # Types to SKIP for top_values (numeric, date, uuid, etc.) top_values_skip_types: - INTEGER - BIGINT - FLOAT - DOUBLE - DATE - TIMESTAMP - UUID # ... see default_config.yml for full list sample_size: 100 # Default: 100 # Deep-profile flags (all off by default — each fires extra per-column SQL) deep_profile: false # Master switch: enables all three below collect_histogram_bins: false # Per numeric column: binned distribution query collect_date_distribution: false # Per date/timestamp column: bucket count query collect_enum_values: false # Per low-cardinality column: full DISTINCT query enum_cardinality_threshold: 20 # Max distinct count for enum collection
Deep-profile flags are off by default. Each adds one SQL query per qualifying
column — which can be expensive for wide tables. Enable them in dataface.yml
when you want richer inspect.json output.
deep_profile: true enables all three collectors (histogram, date distribution,
enum values) regardless of the individual flags. To enable a subset, leave
deep_profile: false and set only the flags you want:
inspector: collect_histogram_bins: true # histograms only collect_date_distribution: true # plus date distributions # collect_enum_values stays false
Or access programmatically:
from dataface.core.compile.config import get_config
config = get_config().inspector
print(config.collect_top_values) # True
print(config.deep_profile) # False
print(config.collect_histogram_bins) # False
Query Summary¶
| Query Type | Count | Default | Config Setting |
|---|---|---|---|
| Schema | 1 | ✅ On | Always runs |
| Stats scan | 1-N | ✅ On | Always runs |
| Sample data | 1 | ✅ On | sample_size |
| Top values | C | ✅ On | collect_top_values |
| Enum values | E | ❌ Off | collect_enum_values or deep_profile |
| Histogram bins | N_num | ❌ Off | collect_histogram_bins or deep_profile |
| Date distributions | N_date | ❌ Off | collect_date_distribution or deep_profile |
Default (no deep profile): Total queries = 2 + ceil(C / max_cols) + C
With deep_profile: Total queries = 2 + ceil(C / max_cols) + C + E + N_num + N_date
Where: - C = total number of columns - E = number of enum-like columns (distinct count ≤ 20) - N_num = number of numeric columns - N_date = number of date/timestamp columns - max_cols = column limit per stats query (varies by database)
Detailed Breakdown¶
1. Schema Query (1 query)¶
-- Get column names and types
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'your_table'
Cost: Very fast, reads metadata only.
2. Stats Scan (1+ queries)¶
Single table scan computing aggregate stats for all columns:
SELECT
COUNT(*) as total_rows,
-- Per column (repeated for each column):
COUNT(col1) as col1__non_null_count,
COUNT(DISTINCT col1) as col1__distinct_count,
MIN(col1) as col1__min,
MAX(col1) as col1__max,
AVG(col1) as col1__mean, -- numeric only
STDDEV(col1) as col1__stddev, -- numeric only
-- ... more columns
FROM your_table
Cost: Full table scan. For wide tables, may be split into multiple queries.
Column limits per query (to avoid SELECT column limits):
| Database | Max Columns per Query |
|---|---|
| DuckDB | 1400 |
| BigQuery | 1400 |
| MySQL | 500 |
| SQL Server | 500 |
| PostgreSQL | 200 |
| Redshift | 200 |
| Snowflake | 140 |
| Oracle | 140 |
Example: A PostgreSQL table with 450 columns = ceil(450/200) = 3 stats queries
3. Sample Data Query (1 query)¶
Cost: Very fast, reads 100 rows for semantic type detection.
4. Top Values Queries (C queries)¶
One query per column to get value frequencies:
Cost: Each requires a GROUP BY scan. This is the most expensive part for tables with many columns.
5. Enum Values Queries (E queries) — off by default¶
Only runs when collect_enum_values: true or deep_profile: true, for columns
detected as enum-like (distinct count ≤ enum_cardinality_threshold, default 20):
Cost: Fast since these columns have low cardinality.
6. Histogram Bins (N_numeric queries) — off by default¶
Only runs when collect_histogram_bins: true or deep_profile: true, for numeric columns:
SELECT
MIN(val) + (MAX(val) - MIN(val)) / 10 * bucket AS bin_start,
MIN(val) + (MAX(val) - MIN(val)) / 10 * (bucket + 1) AS bin_end,
COUNT(*) AS count
FROM your_table, generate_series(0, 9) AS t(bucket)
WHERE val IS NOT NULL
GROUP BY bucket ORDER BY bucket
Cost: One full column scan per numeric column.
7. Date Distributions (N_date queries) — off by default¶
Only runs when collect_date_distribution: true or deep_profile: true, for date/timestamp columns:
SELECT CAST(DATE_TRUNC('month', dt) AS DATE) AS bucket, COUNT(*) AS count
FROM your_table
WHERE dt IS NOT NULL
GROUP BY DATE_TRUNC('month', dt)
ORDER BY bucket
LIMIT 1000
Cost: One full column scan per temporal column. Granularity (day/week/month/year) is auto-selected based on the column's date range.
Cost Calculator¶
Formula¶
Default (no deep profile):
Total Queries = 1 (schema)
+ ceil(C / max_cols) (stats scans)
+ 1 (sample)
+ C (top values, if collect_top_values: true)
With deep_profile (or individual deep flags):
Total Queries = above
+ E (enum values, if collect_enum_values or deep_profile)
+ N_num (histogram bins, if collect_histogram_bins or deep_profile)
+ N_date (date distributions, if collect_date_distribution or deep_profile)
Examples¶
Small Table: 10 columns, PostgreSQL, no deep profile¶
Small Table: 10 columns, 2 numeric, 1 date, 2 enums, PostgreSQL, deep_profile: true¶
Wide Table: 500 columns, PostgreSQL, no deep profile¶
Performance Tips¶
-
Wide tables are expensive - The top_values queries (one per column) dominate cost for tables with many columns.
-
DuckDB is most efficient - Higher column limits mean fewer stats queries, plus fast in-memory execution.
-
Deep profile adds per-column queries - Enable
deep_profileselectively. For large tables, prefercollect_histogram_bins: trueoverdeep_profile: trueif you only need numeric distributions. -
Consider column filtering - Use
top_values_skip_typesto exclude types you don't need top values for.