April 30, 2026

QA Harness fs.read Fixture Compares Mock Provider-Plan Args Instead of Codex Runtime Args

The runtime parity harness treats the mock provider's planned args as runtime truth, which produces a false tool-call-shape drift report between the Pi and Codex cells.

🔍 Symptoms

Observable Test Fixture Behavior

The runtime-tool-fs-read QA fixture reports a tool-call-shape drift between Pi and Codex runtime cells:

drift=tool-call-shape
details=tool call 2 differs (read/29687c90343f2a246f50d1a0a60b29c3f7340e1dc79a8a0ddd65e702a2667f7c vs read/462521a229a053d20c4c8121cecce65e885c7d2b0f94347c1d4922445a701263)

Cell-Level Evidence

Both test cells report passing scenario-level checks, yet the runtime parity capture shows divergent planned arguments:

pi:    read failure planned args: {"__qaFailureMode":"denied-input"}
codex: read failure planned args: {"path":"QA_KICKOFF_TASK.md"}
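The drift detail pairs a tool name with a digest, which suggests each call is reduced to a fingerprint of its arguments before the cells are compared. The following is a minimal sketch of that idea, not the harness's actual code: `canonicalize` and `shapeKey` are hypothetical helpers, and a key-sorted JSON string stands in for the real digest.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted so that key order never affects the result.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((item) => canonicalize(item)).join(",")}]`;
  const entries = Object.keys(value)
    .sort()
    .map((key) => `${JSON.stringify(key)}:${canonicalize(value[key])}`);
  return `{${entries.join(",")}}`;
}

// Hypothetical fingerprint: tool name plus canonical args (stand-in for name/digest).
function shapeKey(toolName: string, args: Json): string {
  return `${toolName}/${canonicalize(args)}`;
}

// The two planned-args payloads reported for the failure path.
const piKey = shapeKey("read", { __qaFailureMode: "denied-input" });
const codexKey = shapeKey("read", { path: "QA_KICKOFF_TASK.md" });

console.log(piKey === codexKey); // false
```

Any fingerprint built this way differs whenever the planned args differ, so the reported drift says nothing by itself about what either runtime executed.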

Harness Output Artifacts

The fixture generates the following proof artifacts:

.artifacts/qa-e2e/runtime-tool-fs-read-proof/qa-suite-summary.json
.artifacts/qa-e2e/runtime-tool-fs-read-proof/qa-suite-report.md
.artifacts/qa-e2e/tool-coverage-phase2-runtime.md

Misleading Interpretation

Without proper context, the evidence appears to show:

Codex replaying happy-path fs.read arguments on a failure-path runtime call
A genuine runtime argument rewriting bug
Regression in failure-path handling

🧠 Root Cause

Core Fixture Design Flaw

The QA fixture conflates two distinct data sources:

Provider-Plan Args: Arguments generated by the mock provider during the planning phase
Runtime Tool-Call Args: Arguments actually passed to the tool during Codex runtime execution

Failure Sequence

The fixture injects failure behavior by passing __qaFailureMode directly as a tool argument:

// Fixture configuration (INCORRECT)
failure_path: {
  tool: "fs.read",
  args: {
    "__qaFailureMode": "denied-input"  // ← This is harness metadata, not valid tool args
  }
}

The harness then incorrectly treats the mock provider’s planned arguments as if they were verified runtime tool-call arguments. This creates a false comparison:

// What the harness does (FLAWED)
mock_provider.getPlannedArgs() → compared against → expected_runtime_args

// What it should do
codex_runtime.getActualToolCallArgs() → compared against → expected_runtime_args

Architectural Inconsistency

The mock provider’s /debug/requests endpoint exposes planned arguments, but these are planning artifacts, not execution records:

They record what the mock model intended to call
They do not necessarily match what the runtime executed
They are especially unreliable for failure-path scenarios, where normalization may occur
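One way to keep plan data from being mistaken for runtime data is to tag every args record with its origin at the type level. The sketch below is an assumption about how that could look, not the harness's real types: `Sourced`, `fromProviderPlan`, `fromRuntime`, and `runtimeArgsMatch` are hypothetical names, and the args values are illustrative rather than captured output.

```typescript
type Args = Record<string, unknown>;
type Source = "provider-plan" | "runtime-tool-call";

// Args tagged with the place they were captured from.
interface Sourced<S extends Source> {
  readonly source: S;
  readonly args: Args;
}

const fromProviderPlan = (args: Args): Sourced<"provider-plan"> => ({ source: "provider-plan", args });
const fromRuntime = (args: Args): Sourced<"runtime-tool-call"> => ({ source: "runtime-tool-call", args });

// Accepts runtime-sourced args only; passing a provider plan fails to compile.
// Note: JSON.stringify is key-order sensitive, which is acceptable for a sketch.
function runtimeArgsMatch(
  pi: Sourced<"runtime-tool-call">,
  codex: Sourced<"runtime-tool-call">
): boolean {
  return JSON.stringify(pi.args) === JSON.stringify(codex.args);
}

const piRuntime = fromRuntime({ path: "QA_KICKOFF_TASK.md" });
const codexRuntime = fromRuntime({ path: "QA_KICKOFF_TASK.md" });
const codexPlan = fromProviderPlan({ path: "QA_KICKOFF_TASK.md" });

console.log(runtimeArgsMatch(piRuntime, codexRuntime)); // true
console.log(codexPlan.source); // "provider-plan"
// runtimeArgsMatch(piRuntime, codexPlan); // rejected by the type checker: wrong source
```

With this shape, data read from /debug/requests can still be logged, but it cannot reach the parity comparison without an explicit and visible conversion.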

The Actual Protocol Mismatch

The true difference between Pi and Codex in this scenario is the mock/native protocol handling around read, not argument rewriting. The fixture fails to isolate:

Provider plan generation (mock behavior)
Runtime tool execution (actual behavior)
Harness fault injection (fixture behavior)

🛠️ Step-by-Step Fix

Phase 1: Fix Fixture Architecture

Before (Flawed Configuration):

// runtime-tool-fs-read.fixture.ts
export const fsReadFailurePath = {
  scenario: "qa-kickoff-task",
  tools: ["fs.read"],
  cells: ["pi", "codex"],
  paths: {
    happy: {
      args: { path: "QA_KICKOFF_TASK.md" }
    },
    failure: {
      args: {
        "__qaFailureMode": "denied-input"  // WRONG: harness metadata as tool args
      }
    }
  }
};

After (Corrected Configuration):

// runtime-tool-fs-read.fixture.ts
export const fsReadFailurePath = {
  scenario: "qa-kickoff-task",
  tools: ["fs.read"],
  cells: ["pi", "codex"],
  paths: {
    happy: {
      args: { path: "QA_KICKOFF_TASK.md" }
    },
    failure: {
      args: { path: "QA_KICKOFF_TASK.md" },  // Valid tool-shaped args
      harnessFault: {
        type: "fs.read.denied-input",
        injectionPoint: "provider-plan"
      }
    }
  }
};
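A fixture loader can reject the flawed configuration before any cell runs. The sketch below assumes that harness-only control keys share the `__qa` prefix seen in `__qaFailureMode`; that prefix rule and the `findHarnessKeys` and `assertToolShapedArgs` helpers are hypothetical, not part of the existing harness.

```typescript
type Args = Record<string, unknown>;

// Assumption: harness-only control keys start with "__qa", as __qaFailureMode does.
const HARNESS_KEY_PREFIX = "__qa";

// Returns the keys that look like harness metadata rather than tool arguments.
function findHarnessKeys(args: Args): string[] {
  return Object.keys(args).filter((key) => key.indexOf(HARNESS_KEY_PREFIX) === 0);
}

// Throws when a fixture path smuggles harness metadata into tool args.
function assertToolShapedArgs(pathName: string, args: Args): void {
  const offending = findHarnessKeys(args);
  if (offending.length > 0) {
    throw new Error(`${pathName}: harness metadata used as tool args: ${offending.join(", ")}`);
  }
}

// Corrected failure path: accepted.
assertToolShapedArgs("failure", { path: "QA_KICKOFF_TASK.md" });

// Flawed failure path: rejected.
try {
  assertToolShapedArgs("failure", { __qaFailureMode: "denied-input" });
} catch (error) {
  console.log((error as Error).message);
  // failure: harness metadata used as tool args: __qaFailureMode
}
```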

Phase 2: Separate Runtime Capture from Provider Plan

// runtime-tool-fs-read.fixture.ts
export const fsReadFailurePath = {
  scenario: "qa-kickoff-task",
  tools: ["fs.read"],
  cells: ["pi", "codex"],
  runtimeCapture: {
    enabled: true,
    capturePoints: ["tool-call", "tool-result"],
    fields: ["args", "function", "callId"]
  },
  paths: {
    failure: {
      args: { path: "QA_KICKOFF_TASK.md" },
      harnessFault: {
        type: "fs.read.denied-input",
        // Inject at provider level, not as tool args
        injectionPoint: "provider-response",
        effect: "deny-read-permission"
      }
    }
  }
};
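The `runtimeCapture` block can be read as a filter over runtime events: keep only the configured capture points, and from each event keep only the configured fields. A minimal sketch under that reading follows; the `RuntimeEvent` shape and the `captureEvents` function are assumptions made for illustration, and the sample events are invented.

```typescript
type CapturePoint = "tool-call" | "tool-result";
type Field = "args" | "function" | "callId";

interface CaptureConfig {
  enabled: boolean;
  capturePoints: CapturePoint[];
  fields: Field[];
}

// Hypothetical event emitted by the runtime instrumentation.
interface RuntimeEvent {
  point: CapturePoint;
  args: Record<string, unknown>;
  function: string;
  callId: string;
}

// Keeps only the configured capture points and, per event, only the configured fields.
function captureEvents(config: CaptureConfig, events: RuntimeEvent[]): Record<string, unknown>[] {
  if (!config.enabled) return [];
  return events
    .filter((event) => config.capturePoints.indexOf(event.point) !== -1)
    .map((event) => {
      const record: Record<string, unknown> = { point: event.point };
      for (const field of config.fields) record[field] = event[field];
      return record;
    });
}

// Narrower than the fixture's config, to make the filtering visible.
const config: CaptureConfig = { enabled: true, capturePoints: ["tool-call"], fields: ["args", "callId"] };
const events: RuntimeEvent[] = [
  { point: "tool-call", args: { path: "QA_KICKOFF_TASK.md" }, function: "read", callId: "call-2" },
  { point: "tool-result", args: { path: "QA_KICKOFF_TASK.md" }, function: "read", callId: "call-2" },
];

const captured = captureEvents(config, events);
console.log(captured.length); // 1
```

Nothing in this path touches the provider plan, which is the point of Phase 2: the captured records come from runtime events alone.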

Phase 3: Update Comparison Logic

// runtime-parity-comparator.ts
interface RuntimeToolCall {
  cell: "pi" | "codex";
  toolName: string;
  callId: string;
  runtimeArgs: Record<string, unknown>;  // Actual runtime args
  providerPlannedArgs?: Record<string, unknown>;  // Separate from runtime
}

function compareToolCalls(
  piCall: RuntimeToolCall,
  codexCall: RuntimeToolCall,
  options: { comparePlanned: boolean; compareRuntime: boolean }
) {
  const results = [];

  if (options.compareRuntime) {
    // Compare ACTUAL runtime arguments (not planned)
    results.push(compareArgs(
      piCall.runtimeArgs,
      codexCall.runtimeArgs,
      { source: "runtime-tool-call" }
    ));
  }

  if (options.comparePlanned) {
    // Log provider plans separately for diagnostics
    results.push({
      type: "provider-plan-diagnostic",
      pi: piCall.providerPlannedArgs,
      codex: codexCall.providerPlannedArgs,
      note: "Provider plans may differ; diagnostic only"
    });
  }

  return results;
}
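The comparator calls a `compareArgs` helper that is not shown. One plausible shape for it, offered as a sketch and not as the harness's implementation, reports which top-level keys differ and records the data source alongside the result. The `ArgsComparison` result type is hypothetical.

```typescript
type Args = Record<string, unknown>;

// Hypothetical result shape for a single args comparison.
interface ArgsComparison {
  type: "args-comparison";
  source: string;
  equal: boolean;
  differingKeys: string[];
}

// Compares top-level keys; values are compared by JSON serialization, so nested
// objects with the same content but different key order count as different.
function compareArgs(left: Args, right: Args, options: { source: string }): ArgsComparison {
  const keys = Object.keys(left).concat(Object.keys(right));
  const differingKeys: string[] = [];
  for (const key of keys) {
    if (differingKeys.indexOf(key) !== -1) continue; // already recorded
    if (JSON.stringify(left[key]) !== JSON.stringify(right[key])) differingKeys.push(key);
  }
  differingKeys.sort();
  return {
    type: "args-comparison",
    source: options.source,
    equal: differingKeys.length === 0,
    differingKeys,
  };
}

// The planned args from the two cells, labeled with the source they came from.
const planDiff = compareArgs(
  { __qaFailureMode: "denied-input" },
  { path: "QA_KICKOFF_TASK.md" },
  { source: "provider-plan" }
);
console.log(planDiff.equal, planDiff.differingKeys); // false [ '__qaFailureMode', 'path' ]
```

Carrying `source` in the result keeps a provider-plan difference from being reported later as runtime drift.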

Phase 4: Add Live/Native Proof Gate

// runtime-tool-fs-read.fixture.ts
export const fsReadFailurePath = {
  // ... existing config ...
  
  proofGate: {
    requireLiveProof: true,  // Gate on native execution proof
    fallbackToMock: false,   // Reject mock-only evidence
    nativeProviders: ["openai", "azure-openai"]
  }
};
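How the gate evaluates evidence is not spelled out above. The sketch below assumes one reasonable reading: passing evidence from a listed native provider always satisfies the gate, and mock-only evidence is rejected when live proof is required and no fallback is allowed. The `Evidence` type and `gateAccepts` function are hypothetical.

```typescript
interface ProofGate {
  requireLiveProof: boolean;
  fallbackToMock: boolean;
  nativeProviders: string[];
}

// Hypothetical evidence record: which provider produced a run and whether it passed.
interface Evidence {
  provider: string;
  passed: boolean;
}

// Assumed semantics: native passing evidence always counts; mock evidence counts
// only when live proof is not required or the mock fallback is allowed.
function gateAccepts(gate: ProofGate, evidence: Evidence[]): boolean {
  const passing = evidence.filter((item) => item.passed);
  const hasLiveProof = passing.some((item) => gate.nativeProviders.indexOf(item.provider) !== -1);
  if (hasLiveProof) return true;
  if (gate.requireLiveProof && !gate.fallbackToMock) return false;
  return passing.length > 0;
}

const gate: ProofGate = {
  requireLiveProof: true,
  fallbackToMock: false,
  nativeProviders: ["openai", "azure-openai"],
};

console.log(gateAccepts(gate, [{ provider: "mock", passed: true }])); // false
console.log(gateAccepts(gate, [{ provider: "openai", passed: true }])); // true
```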

🧪 Verification

1. Verify Fixture Correctness

# Check that fixture no longer uses __qaFailureMode as tool args
grep -r "__qaFailureMode" .artifacts/qa-e2e/runtime-tool-fs-read-proof/
# Expected: No matches in runtime-args fields, only in harnessFault injection config

# Verify args are tool-shaped
cat .artifacts/qa-e2e/runtime-tool-fs-read-proof/fixture-config.json | jq '.paths.failure.args'
# Expected: { "path": "QA_KICKOFF_TASK.md" }

2. Verify Runtime Capture Isolation

# Verify runtime tool-call args are captured separately from provider plans
cat .artifacts/qa-e2e/runtime-tool-fs-read-proof/qa-suite-summary.json | jq '
  .cells[].toolCalls[] | {
    runtimeArgs: .args,
    providerPlannedArgs: .providerPlannedArgs,
    source: .metadata.captureSource
  }
'
# Expected: runtimeArgs from runtime-tool-call, providerPlannedArgs from provider-plan diagnostic

3. Run Runtime Parity Test

# Execute the fixed fixture
npm run test:qa:runtime-parity -- --fixture=runtime-tool-fs-read

# Expected output
# - No tool-call-shape drift reported for failure path
# - Provider plan diagnostic shows Pi vs Codex differences (informational only)
# - Runtime tool-call args match between cells (or documented as intentional)

4. Verify No False Positives

# Check that the fixture no longer flags the happy-path args as "drift"
grep -A5 "drift=tool-call-shape" .artifacts/qa-e2e/runtime-tool-fs-read-proof/qa-suite-report.md
# Expected: No matches, or matches with clear explanation that drift is expected

5. Cross-Reference with Native Provider (If Available)

# If native Codex execution is available, verify the fix doesn't mask real issues
npm run test:qa:runtime-parity -- --fixture=runtime-tool-fs-read --provider=native-openai

# Verify runtime args are correctly captured from native execution
cat .artifacts/qa-e2e/runtime-tool-fs-read-proof/native-execution.json | jq '.toolCalls[1].args'

⚠️ Common Pitfalls

1. Mock Provider vs Runtime Execution Conflation

Trap: Treating provider plan outputs as verified runtime behavior.

Prevention: Always use runtime instrumentation to capture actual tool-call arguments. Provider plans are planning artifacts, not execution proof.

// ❌ WRONG
const plannedArgs = mockProvider.getLastPlan().args;
verifyToolCall(plannedArgs);

// ✅ CORRECT
const runtimeArgs = await runtimeInstrument.captureToolCall(toolId);
verifyToolCall(runtimeArgs);

2. Harness Metadata Injection as Tool Arguments

Trap: Using internal harness controls (like __qaFailureMode) as valid tool arguments.

Prevention: Separate fault injection from tool arguments:

// ❌ WRONG
{ tool: "fs.read", args: { "__qaFailureMode": "denied-input" } }

// ✅ CORRECT
{
  tool: "fs.read",
  args: { path: "valid/path.md" },
  harnessFault: { type: "fs.read.denied-input", target: "permission" }
}

3. Scenario-Level Pass Masking Tool-Level Failures

Trap: Relying on scenario-level test passes to validate tool-level behavior.

Prevention: Make per-tool shape checks the primary validation gate, and treat a scenario-level pass only as a precondition:

// Scenario pass is necessary but not sufficient
if (scenario.passed) {
  for (const tool of scenario.tools) {
    verifyToolCallShape(tool, expectedShape);  // Required validation
  }
}

4. macOS/Docker Environment Differences

Trap: Mock providers may behave differently in containerized vs native environments.

Prevention: Run parity checks across all target environments:

# Verify mock behavior consistency
npm run test:qa:mock-consistency -- --fixture=runtime-tool-fs-read --env=native
npm run test:qa:mock-consistency -- --fixture=runtime-tool-fs-read --env=docker

5. Provider Plan Caching

Trap: Cached provider plans may not reflect current fixture configuration.

Prevention: Implement plan invalidation on fixture changes:

// Before each test run
await mockProvider.clearPlanCache();
await mockProvider.reloadFixture(fixtureConfig);

TRACKING #80171 — Parent issue for runtime parity harness validation
TRACKING #80173 — Phase 2 per-tool fixture implementation tracking

Similar Patterns in Other Fixtures

fs.write failure fixture — May exhibit same mock-plan vs runtime-args conflation
exec.run failure fixture — Similar pattern with command injection metadata
http.request failure fixture — Header injection vs actual request args

Diagnostic Error Codes

HARNESS-PLAN-VS-RUNTIME-MISMATCH — Indicates provider plan differs from runtime execution
HARNESS-INVALID-TOOL-ARGS — Indicates fixture using non-tool-shaped arguments
RUNTIME-PARITY-TOOL-SHAPE-DRIFT — Indicates actual tool-call shape differs between runtimes
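The three codes map onto the three layers this issue had conflated. The sketch below shows one way a harness might derive them from a per-cell capture. The conditions are inferred from the code descriptions only, the `CellCapture` shape and `diagnose` function are hypothetical, and the `runtimeArgs` values are invented for illustration because the source does not record the actual runtime args.

```typescript
type Args = Record<string, unknown>;
type DiagnosticCode =
  | "HARNESS-PLAN-VS-RUNTIME-MISMATCH"
  | "HARNESS-INVALID-TOOL-ARGS"
  | "RUNTIME-PARITY-TOOL-SHAPE-DRIFT";

// Hypothetical per-cell capture with the three layers kept apart.
interface CellCapture {
  fixtureArgs: Args;
  providerPlannedArgs: Args;
  runtimeArgs: Args;
}

const same = (a: Args, b: Args): boolean => JSON.stringify(a) === JSON.stringify(b);

function diagnose(pi: CellCapture, codex: CellCapture): DiagnosticCode[] {
  const codes: DiagnosticCode[] = [];
  const cells = [pi, codex];
  // Fixture layer: harness metadata (assumed "__qa" prefix) used as tool args.
  if (cells.some((cell) => Object.keys(cell.fixtureArgs).some((key) => key.indexOf("__qa") === 0))) {
    codes.push("HARNESS-INVALID-TOOL-ARGS");
  }
  // Provider layer: the plan does not match what the runtime executed.
  if (cells.some((cell) => !same(cell.providerPlannedArgs, cell.runtimeArgs))) {
    codes.push("HARNESS-PLAN-VS-RUNTIME-MISMATCH");
  }
  // Runtime layer: the only condition that indicates real parity drift.
  if (!same(pi.runtimeArgs, codex.runtimeArgs)) {
    codes.push("RUNTIME-PARITY-TOOL-SHAPE-DRIFT");
  }
  return codes;
}

const flawedFixtureArgs = { __qaFailureMode: "denied-input" };
const codes = diagnose(
  { fixtureArgs: flawedFixtureArgs, providerPlannedArgs: flawedFixtureArgs, runtimeArgs: { path: "QA_KICKOFF_TASK.md" } },
  { fixtureArgs: flawedFixtureArgs, providerPlannedArgs: { path: "QA_KICKOFF_TASK.md" }, runtimeArgs: { path: "QA_KICKOFF_TASK.md" } }
);
console.log(codes); // [ 'HARNESS-INVALID-TOOL-ARGS', 'HARNESS-PLAN-VS-RUNTIME-MISMATCH' ]
```

Under these assumed inputs the two harness codes fire while the runtime drift code does not, which is the distinction the original report missed.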

Historical Context

This issue exemplifies the class of errors that the runtime parity harness was designed to catch. The harness detected a real difference between the cells, but it attributed that difference to runtime tool calls when it came from mock provider plans, an incorrect assumption about data source fidelity. Similar issues have been documented in:

Provider plan caching leading to stale comparison baselines
Fixture configuration drift between test suite versions
Inconsistent runtime instrumentation coverage across Pi and Codex cells