Fanout Detection¶
Fanout happens when a join multiplies rows beyond the intended grain, causing
aggregates like SUM and COUNT to silently return inflated results. Dataface
detects fanout risk deterministically from profiled table statistics and SQL
structure analysis. No LLM guessing is involved in the core detection pipeline.
This page describes what the system detects, how risk is scored, and what to do when a warning fires.
How it works¶
Fanout detection builds on the inspector context stack. The
inspector profiles tables and caches the results in target/super_schema.json. Three
modules layer on top of raw profiles to produce fanout signals:
- Grain detection infers what entity each row represents.
- Join multiplicity classifies relationship cardinality and estimates row multiplication.
- Fanout risk scoring combines multiplicity and aggregation context into a severity level.
A fourth module, the query validator, analyzes SQL structure (via AST parsing) to detect fanout-prone patterns at compile time.
flowchart LR
A["Table profiles<br/>(super_schema.json)"] --> B["Grain detection"]
A --> C["Join multiplicity"]
C --> D["Fanout risk scoring"]
D --> E["Compile-time warnings"]
A --> F["Query validator<br/>(SQL AST)"]
D -.->|relationship context| F
Grain detection¶
Grain detection answers: what does one row represent? The detector examines profiled column metadata without running additional queries.
Algorithm priority¶
| Priority | Signal | Confidence | Example |
|---|---|---|---|
| 1 | Explicit primary key (single) | 1.00 | id column with PK role |
| 2 | Explicit primary key (composite) | 0.95 | (order_id, line_num) |
| 3 | Single high-uniqueness column (>= 99%) | 0.90 | user_email with 99.5% uniqueness |
| 4 | Best match among multiple unique columns | 0.85 | id preferred over email by naming |
| 5 | FK/identifier + time column | 0.75 | (customer_id, created_at) |
| 6 | Two FK/identifier columns | 0.70 | (customer_id, product_id) |
When confidence is below 0.70 or no candidates are found, the detector returns nothing rather than guessing.
Grain in practice¶
Grain metadata appears in:
- Schema context — AI agents see grain labels like
one row per order_idin formatted schema summaries. - MCP schema tool — the
schema(source=X, schema=Y, table=T)response includes grain for profiled tables. - Fanout risk scoring — grain informs whether an aggregate is at the right level.
Join multiplicity¶
Join multiplicity classifies the cardinality of a relationship between two tables using the uniqueness ratios of join columns from cached profiles.
Classification¶
| Left uniqueness | Right uniqueness | Multiplicity | Fanout factor |
|---|---|---|---|
| >= 99% | >= 99% | one-to-one | 1.0 (no multiplication) |
| >= 99% | < 99% | one-to-many | right_rows / left_rows |
| < 99% | >= 99% | many-to-one | 1.0 (dimension lookup) |
| < 99% | < 99% | many-to-many | max(rows) / min(rows) approx |
Join profile output¶
Each detected relationship is enriched with a join_profile:
{
"multiplicity": "one-to-many",
"fanout_factor": 3.2,
"left_coverage": 1.0,
"right_coverage": 0.98
}
- fanout_factor is the estimated average row multiplication. 1.0 means no fanout.
- coverage shows the fraction of non-null join keys on each side. Low coverage means many rows will not match.
Risk levels¶
Fanout risk combines multiplicity and aggregation context into five severity tiers:
| Level | When | What to do |
|---|---|---|
| none | 1:1 or N:1 joins | Safe. No action needed. |
| low | 1:N join without aggregation | Verify this is a detail query, not an aggregate. |
| medium | 1:N join with aggregation (moderate fanout) | Pre-aggregate the many-side to the join key grain before joining. |
| high | 1:N join with aggregation and fanout > 10x, or N:M without aggregation | Pre-aggregate before joining. |
| critical | N:M join with aggregation | Almost certainly inflates results. Use a bridge table or pre-aggregate both sides. |
Compile-time warnings fire for high and critical levels when both tables of a risky relationship appear in the same dashboard query.
Query validator¶
The query validator parses SQL with SQLGlot and detects structural patterns that indicate likely query bugs. It does not execute queries.
Diagnostic codes¶
| Code | Severity | What it detects |
|---|---|---|
missing_join_predicate |
error | Comma joins or CROSS JOINs without explicit predicates (cartesian products) |
fanout_risk |
warning/error | Aggregation over joined tables where measures may be inflated |
reaggregation |
warning | Outer aggregate wrapping an already aggregate-derived column (e.g. SUM of a SUM) |
parse_error |
error | SQL that cannot be parsed |
Fanout risk detection¶
The validator flags fanout_risk when all of these are true:
- The query has joins (explicit or comma-join).
- The query has aggregate functions in SELECT or HAVING.
- Any of:
- Aggregate functions reference columns from 2+ distinct tables.
COUNT(*)is used (inflated by row multiplication).- Unqualified columns appear in aggregates with 2+ tables in scope.
Severity calibration with relationship context¶
When relationship metadata is available, the validator calibrates severity:
| Multiplicity | Confidence | Fanout | Severity |
|---|---|---|---|
| one-to-one | >= 75% | any | info |
| many-to-one | >= 75% | any | info |
| one-to-many | >= 75% | <= 10x | warning |
| one-to-many | >= 75% | > 10x | error |
| many-to-many | >= 75% | any | error |
| any | < 75% | any | warning |
Without relationship context, the default severity is warning.
Reaggregation detection¶
The validator traces aggregate lineage through CTEs and subqueries. If an outer
query applies SUM, AVG, or another aggregate to a column that was already
produced by an aggregate in an inner scope, it emits a reaggregation warning.
Example:
WITH daily AS (
SELECT date, SUM(amount) AS total
FROM orders GROUP BY date
)
SELECT SUM(total) FROM daily
-- reaggregation: SUM(total) re-aggregates an already aggregate-derived column
Detection coverage and known limits¶
The system covers approximately 90% of fanout patterns in typical analytics SQL:
Detected:
- All standard aggregate functions (SUM, AVG, COUNT, MIN, MAX, STDDEV, etc.)
- All join types (INNER, LEFT, RIGHT, FULL OUTER, self-joins, multi-table)
- SQL patterns including CTEs, subqueries, window functions, UNION
- Edge cases: NATURAL JOIN (not flagged as cross join), single-table aggregation after row-multiplying join, chained reaggregation
Not detectable without runtime data:
- Data-dependent fanout direction (borderline uniqueness ratios)
- Conditional joins with highly selective WHERE filters
- Dynamic or parameterized SQL
- Composite key joins without profile data for all key columns
- Post-join filtering that eliminates duplicates
- Cross-database joins
- Views with unknown internal join structure
These gaps are inherent to static analysis. For critical queries, consider
execution-time grain checks (comparing COUNT(*) to
COUNT(DISTINCT grain_key)).
AI agent integration¶
AI agents receive fanout context through two channels:
- Schema context —
format_table_context()includes grain labels and confidence scores in the formatted table summary that agents see. - MCP schema tool — the
schema(source=X, schema=Y)response includes relationship metadata withjoin_profileandfanout_riskfor each detected edge.
This gives agents enough information to:
- Avoid writing queries that join at the wrong grain.
- Self-diagnose when a query validator warning fires.
- Recommend pre-aggregation when constructing joins between tables with known fanout risk.
Related pages¶
- Inspector overview
- Inspector context stack — how profiling data flows through layers
- Running the inspector