Metrics
ACP uses a layered system of metrics to quantify agreement between models. Every aspect of consensus quality is measured and reported: from the foundational pairwise agreement Cᵢⱼ, through the aggregate D-score and Harmony score, to the criticality-weighted H_total that folds in oracle verification.
Pairwise Agreement: Cᵢⱼ
The foundation of all ACP metrics is the pairwise agreement score Cᵢⱼ, which measures the degree of agreement between any two models i and j. It is a weighted combination of three independent measurements:
Cᵢⱼ = 0.30 × similarity + 0.40 × nli_score + 0.30 × fact_overlap
Components
| Component | Weight | Range | Description |
|---|---|---|---|
| Semantic similarity | 0.30 | [0, 1] | Cosine similarity between response embeddings (768-dimensional vectors). Measures how similar the responses are in meaning, regardless of wording. |
| NLI score | 0.40 | [0, 1] | Natural Language Inference score. Entailment = 1.0, contradiction = 0.0, neutral = 0.5. Measures whether one response logically follows from the other. |
| Fact overlap | 0.30 | [0, 1] | Proportion of factual claims that appear in both responses. Measures concrete factual agreement independent of framing. |
The NLI score receives the highest weight (0.40) because it captures logical entailment -- the strongest form of agreement. Two responses can have high semantic similarity (similar topics) or high fact overlap (same facts mentioned) while still contradicting each other in their conclusions. The NLI score catches this.
```python
def compute_c_ij(response_i: str, response_j: str) -> float:
    """Compute pairwise agreement between two model responses."""
    # Semantic similarity via embeddings
    emb_i = embed(response_i)  # 768-dim vector
    emb_j = embed(response_j)
    similarity = cosine_similarity(emb_i, emb_j)

    # Natural Language Inference
    nli_result = nli_model(response_i, response_j)
    nli_score = {
        "entailment": 1.0,
        "contradiction": 0.0,
        "neutral": 0.5,
    }[nli_result]

    # Fact overlap
    facts_i = extract_facts(response_i)
    facts_j = extract_facts(response_j)
    if len(facts_i | facts_j) == 0:
        fact_overlap = 1.0  # no facts = no disagreement
    else:
        fact_overlap = len(facts_i & facts_j) / len(facts_i | facts_j)

    return 0.30 * similarity + 0.40 * nli_score + 0.30 * fact_overlap
```
Weight Rationale
The weights (0.30, 0.40, 0.30) were determined empirically across benchmark consensus tasks. The NLI component receives the highest weight because logical entailment is the most reliable indicator of genuine agreement. Semantic similarity captures surface-level alignment, while fact overlap provides a concrete, verifiable dimension of agreement.
D-Score (Divergence)
The D-score measures overall divergence between all participating models. It is defined as the complement of the mean pairwise agreement:
D = 1 - mean(Cᵢⱼ)
The D-score ranges from 0 (perfect consensus) to 1 (total disagreement). It provides a single number that captures the overall state of the consensus process.
| D-Score Range | Interpretation | Action |
|---|---|---|
| D = 0.00 | Perfect consensus | Return result immediately |
| D ≤ 0.20 | High confidence | Consensus likely reached; verify criteria |
| D ≤ 0.40 | Moderate confidence | Continue spiral; convergence progressing |
| D ≤ 0.60 | Low confidence | Significant disagreement; spiral needed |
| D > 0.60 | No consensus | Major divergence; full spiral required |
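The D-score itself is a one-line aggregation over the pairwise scores. A minimal sketch, assuming the Cᵢⱼ values have already been computed (the pair-keyed dict shape is illustrative, not part of the ACP API):

```python
def compute_d_score(c_scores: dict[tuple[str, str], float]) -> float:
    """D = 1 - mean(C_ij) over all model pairs."""
    if not c_scores:
        return 0.0  # no pairs to disagree (single-model edge case)
    return 1.0 - sum(c_scores.values()) / len(c_scores)

# Three models yield three pairs; mean agreement 0.8 gives D = 0.2
pairwise = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.7}
d = compute_d_score(pairwise)
```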
Convergence Behavior
During the Axiom Spiral, D decreases geometrically at each loop:
D(n) = D(0) × (1/φ)ⁿ
The consensus criterion requires D < 0.05. This is guaranteed to be reached within 7 loops for any initial D, since (1/φ)⁷ ≈ 0.034, meaning at most 3.4% of the initial divergence remains. See φ-Convergence for the complete mathematical treatment.
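Under that geometric model, the loop count needed for any starting divergence follows directly. A sketch (the function name is illustrative):

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio, ~1.618

def loops_to_consensus(d0: float, threshold: float = 0.05) -> int:
    """Smallest n such that d0 * (1/PHI)**n drops below threshold."""
    if d0 < threshold:
        return 0
    return math.ceil(math.log(d0 / threshold) / math.log(PHI))

loops_to_consensus(1.0)  # worst case: 7 loops
loops_to_consensus(0.2)  # moderate divergence: 3 loops
```

The worst case d0 = 1.0 reproduces the 7-loop bound quoted above, since log(1.0 / 0.05) / log(φ) ≈ 6.23.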
H_models (Inter-Model Harmony)
The inter-model Harmony score H_models is the mean of all pairwise agreement scores. It is the complement of the D-score:
H_models = mean(Cᵢⱼ) = 1 - D
H_models captures how well the models agree with each other, without considering external verification. It ranges from 0 (complete dissonance) to 1 (perfect unison).
```python
def compute_h_models(responses: dict[str, str]) -> float:
    """Compute inter-model harmony from all pairwise agreements."""
    models = list(responses.keys())
    n = len(models)
    if n < 2:
        return 1.0  # single model = perfect self-agreement
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += compute_c_ij(responses[models[i]], responses[models[j]])
            pairs += 1
    return total / pairs  # = mean(C_ij)
```
H_oracle (Oracle Verification)
The oracle Harmony score H_oracle measures the accuracy of verifiable claims in the consensus answer against external ground truth. It is computed as the ratio of verified claims to total verifiable claims:
H_oracle = verified_claims / total_verifiable_claims
Oracles are external verification systems that can independently confirm or deny specific claims. The ACP system supports multiple oracle types:
| Oracle Type | Verifies | Examples |
|---|---|---|
| Mathematical | Arithmetic and algebraic claims | 2 + 2 = 4, sqrt(144) = 12, primality tests |
| Cryptographic | Hash computations | SHA-256 verification, checksum validation |
| Astronomical | Celestial computations | Planet positions, eclipse dates, orbital parameters |
| Physical | Physical constants and formulas | Speed of light, gravitational constant |
| Factual | Verifiable facts | Dates, measurements, definitions |
Oracle Independence
Oracles operate independently of the AI models. They provide ground truth for verifiable claims, ensuring that consensus does not merely mean "all models agree on the same wrong answer." The oracle dimension adds an objective verification layer that prevents collective hallucination.
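As an illustration, a toy mathematical oracle can be sketched as follows. The OracleResults shape mirrors the verified/total_verifiable fields used by compute_h_total; the rest is hypothetical, with each claim reduced to a (claimed value, independently computed value) pair:

```python
from dataclasses import dataclass

@dataclass
class OracleResults:
    verified: int
    total_verifiable: int

def run_math_oracle(claims: list[tuple[float, float]]) -> OracleResults:
    """Verify each (claimed_value, independently_computed_value) pair."""
    verified = sum(1 for claimed, truth in claims if abs(claimed - truth) < 1e-9)
    return OracleResults(verified=verified, total_verifiable=len(claims))

# Claims: 2 + 2 = 4 (true), sqrt(144) = 12 (true), 3 * 3 = 10 (false)
results = run_math_oracle([(4.0, 2 + 2), (12.0, 144 ** 0.5), (10.0, 3 * 3)])
# H_oracle = results.verified / results.total_verifiable = 2/3
```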
H_total (Total Harmony)
The total Harmony score combines inter-model agreement with oracle verification, weighted by task criticality:
H_total = α × H_models + β × H_oracle
The weights α and β are determined by the criticality level of the task. Higher-criticality tasks place more weight on oracle verification (objective truth) while lower-criticality tasks place more weight on inter-model agreement (collective judgment).
Criticality Weights
| Criticality | α (H_models) | β (H_oracle) | Use Case |
|---|---|---|---|
| High | 0.4 | 0.6 | Medical, financial, safety-critical decisions. Oracle verification is prioritized because errors have severe consequences. |
| Medium | 0.5 | 0.5 | General factual questions, technical documentation. Equal weight to model agreement and oracle verification. |
| Low | 0.7 | 0.3 | Creative tasks, opinion-based queries, brainstorming. Model agreement is prioritized because objective verification is less applicable. |
```python
def compute_h_total(
    responses: dict[str, str],
    oracle_results: OracleResults,
    criticality: str = "medium",
) -> float:
    """Compute total harmony with criticality-weighted oracle scores."""
    weights = {
        "high": {"alpha": 0.4, "beta": 0.6},
        "medium": {"alpha": 0.5, "beta": 0.5},
        "low": {"alpha": 0.7, "beta": 0.3},
    }
    alpha = weights[criticality]["alpha"]
    beta = weights[criticality]["beta"]
    h_models = compute_h_models(responses)
    if oracle_results.total_verifiable == 0:
        h_oracle = 1.0  # no verifiable claims: oracle dimension is vacuous
    else:
        h_oracle = oracle_results.verified / oracle_results.total_verifiable
    return alpha * h_models + beta * h_oracle
```
When Oracles Are Unavailable
If no verifiable claims are present in the responses (e.g., for purely opinion-based queries), H_oracle defaults to 1.0 and H_total reduces to α × H_models + β × 1.0. In practice, this means H_total ≥ β even with zero inter-model agreement, so the consensus criteria (H > 0.90) still require substantial model agreement.
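A quick check of that floor behavior, using the high-criticality weights from the table above (the helper function is illustrative):

```python
def h_total_no_oracle(h_models: float, alpha: float, beta: float) -> float:
    """H_total when H_oracle defaults to 1.0 (no verifiable claims)."""
    return alpha * h_models + beta * 1.0

# High criticality (alpha=0.4, beta=0.6): total disagreement still floors at beta
h_total_no_oracle(0.0, 0.4, 0.6)   # 0.6, well below the H > 0.90 criterion
# Clearing H > 0.90 requires H_models > 0.75
h_total_no_oracle(0.76, 0.4, 0.6)  # just above 0.90
```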
Musical Interval Mappings
ACP maps numeric agreement scores to musical intervals, providing an intuitive vocabulary for describing consensus quality. These mappings are used in logging, dashboards, and API responses.
| Interval | H Range | Ratio | Meaning |
|---|---|---|---|
| Unison | H ≥ 0.95 | 1:1 | Full consensus. All models agree on substance and framing. |
| Octave | H ≥ 0.85 | 2:1 | Agreement at different levels of abstraction. Same truth, different expression. |
| Fifth | H ≥ 0.75 | 3:2 | Strong agreement with minor nuances. Core consensus is solid. |
| Fourth | H ≥ 0.60 | 4:3 | Stable agreement. Models converging but details still in flux. |
| Third | H ≥ 0.40 | 5:4 | Partial agreement. Models share common ground but differ on key points. |
| Tritone | H < 0.40 | 45:32 | Dissonance. Fundamental disagreement requiring full spiral resolution. |
Usage in the API
The musical interval is included in API responses alongside the numeric scores, providing both precise metrics and intuitive classification:
```python
def classify_interval(h: float) -> str:
    """Map a harmony score to its musical interval."""
    if h >= 0.95:
        return "unison"
    elif h >= 0.85:
        return "octave"
    elif h >= 0.75:
        return "fifth"
    elif h >= 0.60:
        return "fourth"
    elif h >= 0.40:
        return "third"
    else:
        return "tritone"

# Example API response:
# {
#     "consensus": "...",
#     "d_score": 0.08,
#     "harmony": 0.92,
#     "interval": "octave",
#     "positions": { ... }
# }
```
Metrics Summary
The full metrics pipeline flows from pairwise measurements up through aggregate scores:
| Metric | Formula | Range | Purpose |
|---|---|---|---|
| Cᵢⱼ | 0.30×sim + 0.40×nli + 0.30×facts | [0, 1] | Pairwise agreement between two models |
| D | 1 - mean(Cᵢⱼ) | [0, 1] | Overall divergence (0 = consensus) |
| H_models | mean(Cᵢⱼ) = 1 - D | [0, 1] | Inter-model harmony |
| H_oracle | verified / total_verifiable | [0, 1] | Oracle verification score |
| H_total | α×H_models + β×H_oracle | [0, 1] | Combined harmony with criticality weighting |
Metric Dependencies
The metrics build on each other: Cᵢⱼ feeds into both D and H_models, which combine with H_oracle to produce H_total. The consensus criteria check H, all Cᵢⱼ values, and D simultaneously -- a system must pass all three dimensions to declare consensus.
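That dependency chain can be condensed into a single aggregation step. A sketch assuming the pairwise Cᵢⱼ scores and oracle counts are already in hand (the function and its return shape are illustrative):

```python
def consensus_metrics(
    c_scores: list[float],
    verified: int,
    total_verifiable: int,
    alpha: float = 0.5,
    beta: float = 0.5,
) -> dict[str, float]:
    """Roll pairwise agreements and oracle counts up into D, H_models, H_oracle, H_total."""
    h_models = sum(c_scores) / len(c_scores)  # mean(C_ij)
    h_oracle = verified / total_verifiable if total_verifiable else 1.0
    return {
        "d_score": 1.0 - h_models,            # D = 1 - mean(C_ij)
        "h_models": h_models,
        "h_oracle": h_oracle,
        "h_total": alpha * h_models + beta * h_oracle,
    }

m = consensus_metrics([0.95, 0.93, 0.91], verified=9, total_verifiable=10)
# h_models = 0.93, d_score = 0.07, h_oracle = 0.9, h_total = 0.915
```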