Fanout Detection¶

Fanout happens when a join multiplies rows beyond the intended grain, causing aggregates like SUM and COUNT to silently return inflated results. Dataface detects fanout risk deterministically from profiled table statistics and SQL structure analysis. No LLM guessing is involved in the core detection pipeline.

This page describes what the system detects, how risk is scored, and what to do when a warning fires.

How it works¶

Fanout detection builds on the inspector context stack. The inspector profiles tables and caches the results in target/super_schema.json. Three modules layer on top of raw profiles to produce fanout signals:

Grain detection infers what entity each row represents.
Join multiplicity classifies relationship cardinality and estimates row multiplication.
Fanout risk scoring combines multiplicity and aggregation context into a severity level.

A fourth module, the query validator, analyzes SQL structure (via AST parsing) to detect fanout-prone patterns at compile time.

flowchart LR
    A["Table profiles<br/>(super_schema.json)"] --> B["Grain detection"]
    A --> C["Join multiplicity"]
    C --> D["Fanout risk scoring"]
    D --> E["Compile-time warnings"]
    A --> F["Query validator<br/>(SQL AST)"]
    D -.->|relationship context| F

Grain detection¶

Grain detection answers: what does one row represent? The detector examines profiled column metadata without running additional queries.

Algorithm priority¶

Priority	Signal	Confidence	Example
1	Explicit primary key (single)	1.00	`id` column with PK role
2	Explicit primary key (composite)	0.95	`(order_id, line_num)`
3	Single high-uniqueness column (>= 99%)	0.90	`user_email` with 99.5% uniqueness
4	Best match among multiple unique columns	0.85	`id` preferred over `email` by naming
5	FK/identifier + time column	0.75	`(customer_id, created_at)`
6	Two FK/identifier columns	0.70	`(customer_id, product_id)`

When confidence is below 0.70 or no candidates are found, the detector returns nothing rather than guessing.

Grain in practice¶

Grain metadata appears in:

Schema context — AI agents see grain labels like one row per order_id in formatted schema summaries.
MCP schema tool — the schema(source=X, schema=Y, table=T) response includes grain for profiled tables.
Fanout risk scoring — grain informs whether an aggregate is at the right level.

Join multiplicity¶

Join multiplicity classifies the cardinality of a relationship between two tables using the uniqueness ratios of join columns from cached profiles.

Classification¶

Left uniqueness	Right uniqueness	Multiplicity	Fanout factor
>= 99%	>= 99%	one-to-one	1.0 (no multiplication)
>= 99%	< 99%	one-to-many	right_rows / left_rows
< 99%	>= 99%	many-to-one	1.0 (dimension lookup)
< 99%	< 99%	many-to-many	max(rows) / min(rows) approx

Join profile output¶

Each detected relationship is enriched with a join_profile:

{
  "multiplicity": "one-to-many",
  "fanout_factor": 3.2,
  "left_coverage": 1.0,
  "right_coverage": 0.98
}

fanout_factor is the estimated average row multiplication. 1.0 means no fanout.
coverage shows the fraction of non-null join keys on each side. Low coverage means many rows will not match.

Risk levels¶

Fanout risk combines multiplicity and aggregation context into five severity tiers:

Level	When	What to do
none	1:1 or N:1 joins	Safe. No action needed.
low	1:N join without aggregation	Verify this is a detail query, not an aggregate.
medium	1:N join with aggregation (moderate fanout)	Pre-aggregate the many-side to the join key grain before joining.
high	1:N join with aggregation and fanout > 10x, or N:M without aggregation	Pre-aggregate before joining.
critical	N:M join with aggregation	Almost certainly inflates results. Use a bridge table or pre-aggregate both sides.

Compile-time warnings fire for high and critical levels when both tables of a risky relationship appear in the same dashboard query.

Query validator¶

The query validator parses SQL with SQLGlot and detects structural patterns that indicate likely query bugs. It does not execute queries.

Diagnostic codes¶

Code	Severity	What it detects
`missing_join_predicate`	error	Comma joins or CROSS JOINs without explicit predicates (cartesian products)
`fanout_risk`	warning/error	Aggregation over joined tables where measures may be inflated
`reaggregation`	warning	Outer aggregate wrapping an already aggregate-derived column (e.g. SUM of a SUM)
`parse_error`	error	SQL that cannot be parsed

Fanout risk detection¶

The validator flags fanout_risk when all of these are true:

The query has joins (explicit or comma-join).
The query has aggregate functions in SELECT or HAVING.
Any of:
- Aggregate functions reference columns from 2+ distinct tables.
- COUNT(*) is used (inflated by row multiplication).
- Unqualified columns appear in aggregates with 2+ tables in scope.

Severity calibration with relationship context¶

When relationship metadata is available, the validator calibrates severity:

Multiplicity	Confidence	Fanout	Severity
one-to-one	>= 75%	any	info
many-to-one	>= 75%	any	info
one-to-many	>= 75%	<= 10x	warning
one-to-many	>= 75%	> 10x	error
many-to-many	>= 75%	any	error
any	< 75%	any	warning

Without relationship context, the default severity is warning.

Reaggregation detection¶

The validator traces aggregate lineage through CTEs and subqueries. If an outer query applies SUM, AVG, or another aggregate to a column that was already produced by an aggregate in an inner scope, it emits a reaggregation warning.

Example:

WITH daily AS (
  SELECT date, SUM(amount) AS total
  FROM orders GROUP BY date
)
SELECT SUM(total) FROM daily
-- reaggregation: SUM(total) re-aggregates an already aggregate-derived column

Detection coverage and known limits¶

The system covers approximately 90% of fanout patterns in typical analytics SQL:

Detected:

All standard aggregate functions (SUM, AVG, COUNT, MIN, MAX, STDDEV, etc.)
All join types (INNER, LEFT, RIGHT, FULL OUTER, self-joins, multi-table)
SQL patterns including CTEs, subqueries, window functions, UNION
Edge cases: NATURAL JOIN (not flagged as cross join), single-table aggregation after row-multiplying join, chained reaggregation

Not detectable without runtime data:

Data-dependent fanout direction (borderline uniqueness ratios)
Conditional joins with highly selective WHERE filters
Dynamic or parameterized SQL
Composite key joins without profile data for all key columns
Post-join filtering that eliminates duplicates
Cross-database joins
Views with unknown internal join structure

These gaps are inherent to static analysis. For critical queries, consider execution-time grain checks (comparing COUNT(*) to COUNT(DISTINCT grain_key)).

AI agent integration¶

AI agents receive fanout context through two channels:

Schema context — format_table_context() includes grain labels and confidence scores in the formatted table summary that agents see.
MCP schema tool — the schema(source=X, schema=Y) response includes relationship metadata with join_profile and fanout_risk for each detected edge.

This gives agents enough information to:

Avoid writing queries that join at the wrong grain.
Self-diagnose when a query validator warning fires.
Recommend pre-aggregation when constructing joins between tables with known fanout risk.

Inspector overview
Inspector context stack — how profiling data flows through layers
Running the inspector