Metrics
ACP uses a layered system of metrics to quantify agreement between models. Every aspect of consensus quality is measured and reported: from the foundational pairwise agreement Cᵢⱼ, through the aggregate D-score and Harmony score, to the criticality-weighted H_total that folds in oracle verification.
Pairwise Agreement: Cᵢⱼ
The foundation of all ACP metrics is the pairwise agreement score Cᵢⱼ, which measures the degree of agreement between any two models i and j. It is a weighted combination of three independent measurements:
Cᵢⱼ = 0.30 × similarity + 0.40 × nli_score + 0.30 × fact_overlap
Components
| Component | Weight | Range | Description |
|---|---|---|---|
| Semantic similarity | 0.30 | [0, 1] | Cosine similarity between response embeddings (768-dimensional vectors). Measures how similar the responses are in meaning, regardless of wording. |
| NLI score | 0.40 | [0, 1] | Natural Language Inference score. Entailment = 1.0, contradiction = 0.0, neutral = 0.5. Measures whether one response logically follows from the other. |
| Fact overlap | 0.30 | [0, 1] | Proportion of factual claims that appear in both responses. Measures concrete factual agreement independent of framing. |
The NLI score receives the highest weight (0.40) because it captures logical entailment -- the strongest form of agreement. Two responses can have high semantic similarity (similar topics) or high fact overlap (same facts mentioned) while still contradicting each other in their conclusions. The NLI score catches this.
```python
def compute_c_ij(response_i: str, response_j: str) -> float:
    """Compute pairwise agreement between two model responses."""
    # Semantic similarity via embeddings
    emb_i = embed(response_i)  # 768-dim vector
    emb_j = embed(response_j)
    similarity = cosine_similarity(emb_i, emb_j)

    # Natural Language Inference
    nli_result = nli_model(response_i, response_j)
    nli_score = {
        "entailment": 1.0,
        "contradiction": 0.0,
        "neutral": 0.5,
    }[nli_result]

    # Fact overlap
    facts_i = extract_facts(response_i)
    facts_j = extract_facts(response_j)
    if len(facts_i | facts_j) == 0:
        fact_overlap = 1.0  # no facts = no disagreement
    else:
        fact_overlap = len(facts_i & facts_j) / len(facts_i | facts_j)

    return 0.30 * similarity + 0.40 * nli_score + 0.30 * fact_overlap
```
Weight Rationale
The weights (0.30, 0.40, 0.30) were determined empirically across benchmark consensus tasks. The NLI component receives the highest weight because logical entailment is the most reliable indicator of genuine agreement. Semantic similarity captures surface-level alignment, while fact overlap provides a concrete, verifiable dimension of agreement.
D-Score (Divergence)
The D-score measures overall divergence between all participating models. It is defined as the complement of the mean pairwise agreement:
D = 1 - mean(Cᵢⱼ)
The D-score ranges from 0 (perfect consensus) to 1 (total disagreement). It provides a single number that captures the overall state of the consensus process.
| D-Score Range | Interpretation | Action |
|---|---|---|
| D = 0.00 | Perfect consensus | Return result immediately |
| D ≤ 0.20 | High confidence | Consensus likely reached; verify criteria |
| D ≤ 0.40 | Moderate confidence | Continue spiral; convergence progressing |
| D ≤ 0.60 | Low confidence | Significant disagreement; spiral needed |
| D > 0.60 | No consensus | Major divergence; full spiral required |
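The D-score itself is a one-line aggregation over the pairwise scores. A minimal sketch, assuming the Cᵢⱼ values have already been computed (the pair-keyed dict shape is illustrative, not part of the ACP API):

```python
def compute_d_score(c_scores: dict[tuple[str, str], float]) -> float:
    """D = 1 - mean(C_ij) over all model pairs."""
    if not c_scores:
        return 0.0  # no pairs to disagree (single-model edge case)
    return 1.0 - sum(c_scores.values()) / len(c_scores)

# Three models yield three pairs; mean agreement 0.8 gives D = 0.2
pairwise = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.7}
d = compute_d_score(pairwise)
```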
Convergence Behavior
During the Axiom Spiral, D decreases geometrically at each loop:
D(n) = D(0) × (1/φ)ⁿ
The consensus criterion requires D < 0.05. This is guaranteed to be reached within 7 loops for any initial D, since (1/φ)⁷ ≈ 0.034, meaning at most 3.4% of the initial divergence remains. See φ-Convergence for the complete mathematical treatment.
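Under that geometric model, the loop count needed for any starting divergence follows directly. A sketch (the function name is illustrative):

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio, ~1.618

def loops_to_consensus(d0: float, threshold: float = 0.05) -> int:
    """Smallest n such that d0 * (1/PHI)**n drops below threshold."""
    if d0 < threshold:
        return 0
    return math.ceil(math.log(d0 / threshold) / math.log(PHI))

loops_to_consensus(1.0)  # worst case: 7 loops
loops_to_consensus(0.2)  # moderate divergence: 3 loops
```

The worst case d0 = 1.0 reproduces the 7-loop bound quoted above, since log(1.0 / 0.05) / log(φ) ≈ 6.23.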
H_models (Inter-Model Harmony)
The inter-model Harmony score H_models is the mean of all pairwise agreement scores. It is the complement of the D-score:
H_models = mean(Cᵢⱼ) = 1 - D
H_models captures how well the models agree with each other, without considering external verification. It ranges from 0 (complete dissonance) to 1 (perfect unison).
```python
def compute_h_models(responses: dict[str, str]) -> float:
    """Compute inter-model harmony from all pairwise agreements."""
    models = list(responses.keys())
    n = len(models)
    if n < 2:
        return 1.0  # single model = perfect self-agreement
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += compute_c_ij(responses[models[i]], responses[models[j]])
            pairs += 1
    return total / pairs  # = mean(C_ij)
```
H_oracle (Oracle Verification)
The oracle Harmony score H_oracle measures the accuracy of verifiable claims in the consensus answer against external ground truth. It is computed as the ratio of verified claims to total verifiable claims:
H_oracle = verified_claims / total_verifiable_claims
Oracles are external verification systems that can independently confirm or deny specific claims. The ACP system supports multiple oracle types:
| Oracle Type | Verifies | Examples |
|---|---|---|
| Mathematical | Arithmetic and algebraic claims | 2 + 2 = 4, sqrt(144) = 12, primality tests |
| Cryptographic | Hash computations | SHA-256 verification, checksum validation |
| Astronomical | Celestial computations | Planet positions, eclipse dates, orbital parameters |
| Physical | Physical constants and formulas | Speed of light, gravitational constant |
| Factual | Verifiable facts | Dates, measurements, definitions |
Oracle Independence
Oracles operate independently of the AI models. They provide ground truth for verifiable claims, ensuring that consensus does not merely mean "all models agree on the same wrong answer." The oracle dimension adds an objective verification layer that prevents collective hallucination.
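As an illustration, a toy mathematical oracle can be sketched as follows. The OracleResults shape mirrors the verified/total_verifiable fields used by compute_h_total; the rest is hypothetical, with each claim reduced to a (claimed value, independently computed value) pair:

```python
from dataclasses import dataclass

@dataclass
class OracleResults:
    verified: int
    total_verifiable: int

def run_math_oracle(claims: list[tuple[float, float]]) -> OracleResults:
    """Verify each (claimed_value, independently_computed_value) pair."""
    verified = sum(1 for claimed, truth in claims if abs(claimed - truth) < 1e-9)
    return OracleResults(verified=verified, total_verifiable=len(claims))

# Claims: 2 + 2 = 4 (true), sqrt(144) = 12 (true), 3 * 3 = 10 (false)
results = run_math_oracle([(4.0, 2 + 2), (12.0, 144 ** 0.5), (10.0, 3 * 3)])
# H_oracle = results.verified / results.total_verifiable = 2/3
```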
H_total (Total Harmony)
The total Harmony score combines inter-model agreement with oracle verification, weighted by task criticality:
H_total = α × H_models + β × H_oracle
The weights α and β are determined by the criticality level of the task. Higher-criticality tasks place more weight on oracle verification (objective truth) while lower-criticality tasks place more weight on inter-model agreement (collective judgment).
Criticality Weights
| Criticality | α (H_models) | β (H_oracle) | Use Case |
|---|---|---|---|
| High | 0.4 | 0.6 | Medical, financial, safety-critical decisions. Oracle verification is prioritized because errors have severe consequences. |
| Medium | 0.5 | 0.5 | General factual questions, technical documentation. Equal weight to model agreement and oracle verification. |
| Low | 0.7 | 0.3 | Creative tasks, opinion-based queries, brainstorming. Model agreement is prioritized because objective verification is less applicable. |
```python
def compute_h_total(
    responses: dict[str, str],
    oracle_results: OracleResults,
    criticality: str = "medium",
) -> float:
    """Compute total harmony with criticality-weighted oracle scores."""
    weights = {
        "high": {"alpha": 0.4, "beta": 0.6},
        "medium": {"alpha": 0.5, "beta": 0.5},
        "low": {"alpha": 0.7, "beta": 0.3},
    }
    alpha = weights[criticality]["alpha"]
    beta = weights[criticality]["beta"]
    h_models = compute_h_models(responses)
    if oracle_results.total_verifiable == 0:
        h_oracle = 1.0  # no verifiable claims: oracle dimension is vacuous
    else:
        h_oracle = oracle_results.verified / oracle_results.total_verifiable
    return alpha * h_models + beta * h_oracle
```
When Oracles Are Unavailable
If no verifiable claims are present in the responses (e.g., for purely opinion-based queries), H_oracle defaults to 1.0 and H_total reduces to α × H_models + β × 1.0. In practice, this means H_total ≥ β even with zero inter-model agreement, so the consensus criteria (H > 0.90) still require substantial model agreement.
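A quick check of that floor behavior, using the high-criticality weights from the table above (the helper function is illustrative):

```python
def h_total_no_oracle(h_models: float, alpha: float, beta: float) -> float:
    """H_total when H_oracle defaults to 1.0 (no verifiable claims)."""
    return alpha * h_models + beta * 1.0

# High criticality (alpha=0.4, beta=0.6): total disagreement still floors at beta
h_total_no_oracle(0.0, 0.4, 0.6)   # 0.6, well below the H > 0.90 criterion
# Clearing H > 0.90 requires H_models > 0.75
h_total_no_oracle(0.76, 0.4, 0.6)  # just above 0.90
```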
Musical Interval Mappings
ACP maps numeric agreement scores to musical intervals, providing an intuitive vocabulary for describing consensus quality. These mappings are used in logging, dashboards, and API responses.
| Interval | H Range | Ratio | Meaning |
|---|---|---|---|
| Unison | H ≥ 0.95 | 1:1 | Full consensus. All models agree on substance and framing. |
| Octave | H ≥ 0.85 | 2:1 | Agreement at different levels of abstraction. Same truth, different expression. |
| Fifth | H ≥ 0.75 | 3:2 | Strong agreement with minor nuances. Core consensus is solid. |
| Fourth | H ≥ 0.60 | 4:3 | Stable agreement. Models converging but details still in flux. |
| Third | H ≥ 0.40 | 5:4 | Partial agreement. Models share common ground but differ on key points. |
| Tritone | H < 0.40 | 45:32 | Dissonance. Fundamental disagreement requiring full spiral resolution. |
Usage in the API
The musical interval is included in API responses alongside the numeric scores, providing both precise metrics and intuitive classification:
```python
def classify_interval(h: float) -> str:
    """Map a harmony score to its musical interval."""
    if h >= 0.95:
        return "unison"
    elif h >= 0.85:
        return "octave"
    elif h >= 0.75:
        return "fifth"
    elif h >= 0.60:
        return "fourth"
    elif h >= 0.40:
        return "third"
    else:
        return "tritone"

# Example API response:
# {
#     "consensus": "...",
#     "d_score": 0.08,
#     "harmony": 0.92,
#     "interval": "octave",
#     "positions": { ... }
# }
```
Metrics Summary
The full metrics pipeline flows from pairwise measurements up through aggregate scores:
| Metric | Formula | Range | Purpose |
|---|---|---|---|
| Cᵢⱼ | 0.30×sim + 0.40×nli + 0.30×facts | [0, 1] | Pairwise agreement between two models |
| D | 1 - mean(Cᵢⱼ) | [0, 1] | Overall divergence (0 = consensus) |
| H_models | mean(Cᵢⱼ) = 1 - D | [0, 1] | Inter-model harmony |
| H_oracle | verified / total_verifiable | [0, 1] | Oracle verification score |
| H_total | α×H_models + β×H_oracle | [0, 1] | Combined harmony with criticality weighting |
Metric Dependencies
The metrics build on each other: Cᵢⱼ feeds into both D and H_models, which combine with H_oracle to produce H_total. The consensus criteria check H, all Cᵢⱼ values, and D simultaneously -- a system must pass all three dimensions to declare consensus.
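That dependency chain can be condensed into a single aggregation step. A sketch assuming the pairwise Cᵢⱼ scores and oracle counts are already in hand (the function and its return shape are illustrative):

```python
def consensus_metrics(
    c_scores: list[float],
    verified: int,
    total_verifiable: int,
    alpha: float = 0.5,
    beta: float = 0.5,
) -> dict[str, float]:
    """Roll pairwise agreements and oracle counts up into D, H_models, H_oracle, H_total."""
    h_models = sum(c_scores) / len(c_scores)  # mean(C_ij)
    h_oracle = verified / total_verifiable if total_verifiable else 1.0
    return {
        "d_score": 1.0 - h_models,            # D = 1 - mean(C_ij)
        "h_models": h_models,
        "h_oracle": h_oracle,
        "h_total": alpha * h_models + beta * h_oracle,
    }

m = consensus_metrics([0.95, 0.93, 0.91], verified=9, total_verifiable=10)
# h_models = 0.93, d_score = 0.07, h_oracle = 0.9, h_total = 0.915
```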