Skip to content

Fanout Detection

Fanout happens when a join multiplies rows beyond the intended grain, causing aggregates like SUM and COUNT to silently return inflated results. Dataface detects fanout risk deterministically from profiled table statistics and SQL structure analysis. No LLM guessing is involved in the core detection pipeline.

This page describes what the system detects, how risk is scored, and what to do when a warning fires.

How it works

Fanout detection builds on the inspector context stack. The inspector profiles tables and caches the results in target/super_schema.json. Three modules layer on top of raw profiles to produce fanout signals:

  1. Grain detection infers what entity each row represents.
  2. Join multiplicity classifies relationship cardinality and estimates row multiplication.
  3. Fanout risk scoring combines multiplicity and aggregation context into a severity level.

A fourth module, the query validator, analyzes SQL structure (via AST parsing) to detect fanout-prone patterns at compile time.

flowchart LR
    A["Table profiles<br/>(super_schema.json)"] --> B["Grain detection"]
    A --> C["Join multiplicity"]
    C --> D["Fanout risk scoring"]
    D --> E["Compile-time warnings"]
    A --> F["Query validator<br/>(SQL AST)"]
    D -.->|relationship context| F

Grain detection

Grain detection answers: what does one row represent? The detector examines profiled column metadata without running additional queries.

Algorithm priority

Priority Signal Confidence Example
1 Explicit primary key (single) 1.00 id column with PK role
2 Explicit primary key (composite) 0.95 (order_id, line_num)
3 Single high-uniqueness column (>= 99%) 0.90 user_email with 99.5% uniqueness
4 Best match among multiple unique columns 0.85 id preferred over email by naming
5 FK/identifier + time column 0.75 (customer_id, created_at)
6 Two FK/identifier columns 0.70 (customer_id, product_id)

When confidence is below 0.70 or no candidates are found, the detector returns nothing rather than guessing.

Grain in practice

Grain metadata appears in:

  • Schema context — AI agents see grain labels like one row per order_id in formatted schema summaries.
  • MCP schema tool — the schema(source=X, schema=Y, table=T) response includes grain for profiled tables.
  • Fanout risk scoring — grain informs whether an aggregate is at the right level.

Join multiplicity

Join multiplicity classifies the cardinality of a relationship between two tables using the uniqueness ratios of join columns from cached profiles.

Classification

Left uniqueness Right uniqueness Multiplicity Fanout factor
>= 99% >= 99% one-to-one 1.0 (no multiplication)
>= 99% < 99% one-to-many right_rows / left_rows
< 99% >= 99% many-to-one 1.0 (dimension lookup)
< 99% < 99% many-to-many max(rows) / min(rows) approx

Join profile output

Each detected relationship is enriched with a join_profile:

{
  "multiplicity": "one-to-many",
  "fanout_factor": 3.2,
  "left_coverage": 1.0,
  "right_coverage": 0.98
}
  • fanout_factor is the estimated average row multiplication. 1.0 means no fanout.
  • coverage shows the fraction of non-null join keys on each side. Low coverage means many rows will not match.

Risk levels

Fanout risk combines multiplicity and aggregation context into five severity tiers:

Level When What to do
none 1:1 or N:1 joins Safe. No action needed.
low 1:N join without aggregation Verify this is a detail query, not an aggregate.
medium 1:N join with aggregation (moderate fanout) Pre-aggregate the many-side to the join key grain before joining.
high 1:N join with aggregation and fanout > 10x, or N:M without aggregation Pre-aggregate before joining.
critical N:M join with aggregation Almost certainly inflates results. Use a bridge table or pre-aggregate both sides.

Compile-time warnings fire for high and critical levels when both tables of a risky relationship appear in the same dashboard query.

Query validator

The query validator parses SQL with SQLGlot and detects structural patterns that indicate likely query bugs. It does not execute queries.

Diagnostic codes

Code Severity What it detects
missing_join_predicate error Comma joins or CROSS JOINs without explicit predicates (cartesian products)
fanout_risk warning/error Aggregation over joined tables where measures may be inflated
reaggregation warning Outer aggregate wrapping an already aggregate-derived column (e.g. SUM of a SUM)
parse_error error SQL that cannot be parsed

Fanout risk detection

The validator flags fanout_risk when all of these are true:

  1. The query has joins (explicit or comma-join).
  2. The query has aggregate functions in SELECT or HAVING.
  3. Any of:
    • Aggregate functions reference columns from 2+ distinct tables.
    • COUNT(*) is used (inflated by row multiplication).
    • Unqualified columns appear in aggregates with 2+ tables in scope.

Severity calibration with relationship context

When relationship metadata is available, the validator calibrates severity:

Multiplicity Confidence Fanout Severity
one-to-one >= 75% any info
many-to-one >= 75% any info
one-to-many >= 75% <= 10x warning
one-to-many >= 75% > 10x error
many-to-many >= 75% any error
any < 75% any warning

Without relationship context, the default severity is warning.

Reaggregation detection

The validator traces aggregate lineage through CTEs and subqueries. If an outer query applies SUM, AVG, or another aggregate to a column that was already produced by an aggregate in an inner scope, it emits a reaggregation warning.

Example:

WITH daily AS (
  SELECT date, SUM(amount) AS total
  FROM orders GROUP BY date
)
SELECT SUM(total) FROM daily
-- reaggregation: SUM(total) re-aggregates an already aggregate-derived column

Detection coverage and known limits

The system covers approximately 90% of fanout patterns in typical analytics SQL:

Detected:

  • All standard aggregate functions (SUM, AVG, COUNT, MIN, MAX, STDDEV, etc.)
  • All join types (INNER, LEFT, RIGHT, FULL OUTER, self-joins, multi-table)
  • SQL patterns including CTEs, subqueries, window functions, UNION
  • Edge cases: NATURAL JOIN (not flagged as cross join), single-table aggregation after row-multiplying join, chained reaggregation

Not detectable without runtime data:

  • Data-dependent fanout direction (borderline uniqueness ratios)
  • Conditional joins with highly selective WHERE filters
  • Dynamic or parameterized SQL
  • Composite key joins without profile data for all key columns
  • Post-join filtering that eliminates duplicates
  • Cross-database joins
  • Views with unknown internal join structure

These gaps are inherent to static analysis. For critical queries, consider execution-time grain checks (comparing COUNT(*) to COUNT(DISTINCT grain_key)).

AI agent integration

AI agents receive fanout context through two channels:

  1. Schema contextformat_table_context() includes grain labels and confidence scores in the formatted table summary that agents see.
  2. MCP schema tool — the schema(source=X, schema=Y) response includes relationship metadata with join_profile and fanout_risk for each detected edge.

This gives agents enough information to:

  • Avoid writing queries that join at the wrong grain.
  • Self-diagnose when a query validator warning fires.
  • Recommend pre-aggregation when constructing joins between tables with known fanout risk.