QA Harness fs.read Fixture Compares Mock Provider-Plan Args Instead of Codex Runtime Args
The runtime parity harness incorrectly treats the mock provider’s planned args as runtime truth, causing false drift detection between Pi and Codex tool-call shapes.
Symptoms
Observable Test Fixture Behavior
The runtime-tool-fs-read QA fixture reports a tool-call-shape drift between Pi and Codex runtime cells:
drift=tool-call-shape
details=tool call 2 differs (read/29687c90343f2a246f50d1a0a60b29c3f7340e1dc79a8a0ddd65e702a2667f7c vs read/462521a229a053d20c4c8121cecce65e885c7d2b0f94347c1d4922445a701263)
Cell-Level Evidence
Both test cells report passing scenario-level checks, yet the runtime parity capture shows divergent planned arguments:
pi: read failure planned args: {"__qaFailureMode":"denied-input"}
codex: read failure planned args: {"path":"QA_KICKOFF_TASK.md"}
Harness Output Artifacts
The fixture generates the following proof artifacts:
- .artifacts/qa-e2e/runtime-tool-fs-read-proof/qa-suite-summary.json
- .artifacts/qa-e2e/runtime-tool-fs-read-proof/qa-suite-report.md
- .artifacts/qa-e2e/tool-coverage-phase2-runtime.md
Misleading Interpretation
Without proper context, the evidence appears to show:
- Codex replaying happy-path fs.read arguments on a failure-path runtime call
- A genuine runtime argument rewriting bug
- A regression in failure-path handling
Root Cause
Core Fixture Design Flaw
The QA fixture conflates two distinct data sources:
- Provider-Plan Args: Arguments generated by the mock provider during planning phase
- Runtime Tool-Call Args: Actual arguments passed during Codex runtime execution
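One way to keep these two sources apart is to tag every captured argument set with its provenance, so that comparison code can refuse plan-sourced data outright. The following is a minimal sketch under that assumption; `ArgSource`, `CapturedArgs`, and `runtimeArgsOrThrow` are hypothetical names and are not part of the harness.

```typescript
// Hypothetical types: none of these names come from the harness itself.
type ArgSource = "provider-plan" | "runtime-tool-call";

interface CapturedArgs {
  source: ArgSource;
  tool: string;
  args: Record<string, unknown>;
}

// Return the args only when they were captured from an actual runtime tool call.
function runtimeArgsOrThrow(captured: CapturedArgs): Record<string, unknown> {
  if (captured.source !== "runtime-tool-call") {
    throw new Error(
      `refusing to treat ${captured.source} args for ${captured.tool} as runtime truth`
    );
  }
  return captured.args;
}

// The two argument sets reported by the parity capture, tagged by origin.
const planned: CapturedArgs = {
  source: "provider-plan",
  tool: "fs.read",
  args: { __qaFailureMode: "denied-input" },
};
const executed: CapturedArgs = {
  source: "runtime-tool-call",
  tool: "fs.read",
  args: { path: "QA_KICKOFF_TASK.md" },
};
```

With this shape, passing `planned` to `runtimeArgsOrThrow` fails loudly instead of silently feeding plan data into a runtime comparison.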
Failure Sequence
The fixture injects failure behavior by using __qaFailureMode as direct tool arguments:
// Fixture configuration (INCORRECT)
failure_path: {
  tool: "fs.read",
  args: {
    "__qaFailureMode": "denied-input" // This is harness metadata, not valid tool args
  }
}
The harness then incorrectly treats the mock provider’s planned arguments as if they were verified runtime tool-call arguments. This creates a false comparison:
// What the harness does (FLAWED)
mock_provider.getPlannedArgs() → compared against → expected_runtime_args
// What it should do
codex_runtime.getActualToolCallArgs() → compared against → expected_runtime_args
Architectural Inconsistency
The mock provider’s /debug/requests endpoint exposes planned arguments, but these represent:
- What the mock model intended to call
- Not necessarily what the actual runtime executed
- Especially unreliable for failure-path scenarios where normalization may occur
The Actual Protocol Mismatch
The true difference between Pi and Codex in this scenario is the mock/native protocol handling around read, not argument rewriting. The fixture fails to isolate:
- Provider plan generation (mock behavior)
- Runtime tool execution (actual behavior)
- Harness fault injection (fixture behavior)
Step-by-Step Fix
Phase 1: Fix Fixture Architecture
Before (Flawed Configuration):
// runtime-tool-fs-read.fixture.ts
export const fsReadFailurePath = {
  scenario: "qa-kickoff-task",
  tools: ["fs.read"],
  cells: ["pi", "codex"],
  paths: {
    happy: {
      args: { path: "QA_KICKOFF_TASK.md" }
    },
    failure: {
      args: {
        "__qaFailureMode": "denied-input" // WRONG: harness metadata as tool args
      }
    }
  }
};
After (Corrected Configuration):
// runtime-tool-fs-read.fixture.ts
export const fsReadFailurePath = {
  scenario: "qa-kickoff-task",
  tools: ["fs.read"],
  cells: ["pi", "codex"],
  paths: {
    happy: {
      args: { path: "QA_KICKOFF_TASK.md" }
    },
    failure: {
      args: { path: "QA_KICKOFF_TASK.md" }, // Valid tool-shaped args
      harnessFault: {
        type: "fs.read.denied-input",
        injectionPoint: "provider-plan"
      }
    }
  }
};
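A fixture loader could guard against harness metadata reappearing in tool args with a check along the following lines. This is a sketch only: the `__qa` prefix rule is an assumption inferred from the single `__qaFailureMode` key in this fixture, and `findHarnessKeys` and `assertToolShapedArgs` are hypothetical helpers.

```typescript
// Assumption: harness-only keys share the "__qa" prefix seen in __qaFailureMode.
const HARNESS_KEY_PREFIX = "__qa";

// List every key in the args object that looks like harness metadata.
function findHarnessKeys(args: Record<string, unknown>): string[] {
  return Object.keys(args).filter((key) => key.startsWith(HARNESS_KEY_PREFIX));
}

// Throw when a fixture path smuggles harness controls into tool arguments.
function assertToolShapedArgs(tool: string, args: Record<string, unknown>): void {
  const leaked = findHarnessKeys(args);
  if (leaked.length > 0) {
    throw new Error(`${tool}: harness metadata used as tool args: ${leaked.join(", ")}`);
  }
}
```

Running such a check when a fixture is loaded would reject the flawed configuration before any cell executes, rather than leaving it to surface later as a shape drift.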
Phase 2: Separate Runtime Capture from Provider Plan
// runtime-tool-fs-read.fixture.ts
export const fsReadFailurePath = {
  scenario: "qa-kickoff-task",
  tools: ["fs.read"],
  cells: ["pi", "codex"],
  runtimeCapture: {
    enabled: true,
    capturePoints: ["tool-call", "tool-result"],
    fields: ["args", "function", "callId"]
  },
  paths: {
    failure: {
      args: { path: "QA_KICKOFF_TASK.md" },
      harnessFault: {
        type: "fs.read.denied-input",
        // Inject at provider level, not as tool args
        injectionPoint: "provider-response",
        effect: "deny-read-permission"
      }
    }
  }
};
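The property that matters in this configuration is that the fault changes the outcome of the call while the arguments stay untouched. The sketch below illustrates that idea with a stand-in executor. `HarnessFault` mirrors the fixture fields above, but `ToolResult`, `ToolExecutor`, and `withHarnessFault` are hypothetical, and where the real harness applies the fault is not specified here.

```typescript
// Field names follow the fixture above; everything else is assumed.
interface HarnessFault {
  type: string;
  injectionPoint: "provider-response";
  effect: "deny-read-permission";
}

interface ToolResult {
  ok: boolean;
  error?: string;
  content?: string;
}

type ToolExecutor = (args: Record<string, unknown>) => ToolResult;

// Wrap an executor so that a fault alters the result, never the arguments.
function withHarnessFault(execute: ToolExecutor, fault?: HarnessFault): ToolExecutor {
  return (args) => {
    if (fault !== undefined && fault.effect === "deny-read-permission") {
      return { ok: false, error: `denied by harness fault ${fault.type}` };
    }
    return execute(args);
  };
}
```

Because the wrapper never mutates `args`, a runtime capture taken at the tool-call point records the same `{ path: "QA_KICKOFF_TASK.md" }` on both the happy and failure paths.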
Phase 3: Update Comparison Logic
// runtime-parity-comparator.ts
interface RuntimeToolCall {
  cell: "pi" | "codex";
  toolName: string;
  callId: string;
  runtimeArgs: Record<string, unknown>; // Actual runtime args
  providerPlannedArgs?: Record<string, unknown>; // Separate from runtime
}
function compareToolCalls(
  piCall: RuntimeToolCall,
  codexCall: RuntimeToolCall,
  options: { comparePlanned: boolean; compareRuntime: boolean }
) {
  const results: unknown[] = [];
  if (options.compareRuntime) {
    // Compare ACTUAL runtime arguments (not planned)
    results.push(compareArgs(
      piCall.runtimeArgs,
      codexCall.runtimeArgs,
      { source: "runtime-tool-call" }
    ));
  }
  if (options.comparePlanned) {
    // Log provider plans separately for diagnostics
    results.push({
      type: "provider-plan-diagnostic",
      pi: piCall.providerPlannedArgs,
      codex: codexCall.providerPlannedArgs,
      note: "Provider plans may differ; diagnostic only"
    });
  }
  return results;
}
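The comparator above calls `compareArgs`, which is not shown. A minimal sketch of what it might look like follows, assuming a key-order-insensitive structural comparison is what is wanted. The `ArgComparison` result shape and the `canonicalize` helper are assumptions, not the harness's actual implementation.

```typescript
// Assumed result shape; the real compareArgs return type is not documented here.
interface ArgComparison {
  type: "args-comparison";
  source: string;
  equal: boolean;
}

// Serialize with sorted object keys so key order alone never registers as drift.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const record = value as Record<string, unknown>;
    const body = Object.keys(record)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(record[key])}`)
      .join(",");
    return `{${body}}`;
  }
  // JSON.stringify yields undefined for values such as undefined or functions.
  const text = JSON.stringify(value);
  return text === undefined ? "null" : text;
}

function compareArgs(
  left: Record<string, unknown>,
  right: Record<string, unknown>,
  meta: { source: string }
): ArgComparison {
  return {
    type: "args-comparison",
    source: meta.source,
    equal: canonicalize(left) === canonicalize(right),
  };
}
```

Carrying the `source` label through to the result keeps every reported difference attributable to either the runtime capture or the provider plan.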
Phase 4: Add Live/Native Proof Gate
// runtime-tool-fs-read.fixture.ts
export const fsReadFailurePath = {
  // ... existing config ...
  proofGate: {
    requireLiveProof: true, // Gate on native execution proof
    fallbackToMock: false, // Reject mock-only evidence
    nativeProviders: ["openai", "azure-openai"]
  }
};
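How such a gate might be evaluated can be sketched as follows. The `ProofGate` fields mirror the configuration above, while the `Evidence` shape and the `passesProofGate` function are hypothetical and only illustrate the intended semantics: mock-only evidence fails when live proof is required and no fallback is allowed.

```typescript
// Field names follow the proofGate block above.
interface ProofGate {
  requireLiveProof: boolean;
  fallbackToMock: boolean;
  nativeProviders: string[];
}

// Assumed evidence record: which provider produced it and whether it was a live run.
interface Evidence {
  provider: string;
  live: boolean;
}

function passesProofGate(gate: ProofGate, evidence: Evidence[]): boolean {
  // Live evidence from a listed native provider always satisfies the gate.
  const hasNative = evidence.some(
    (item) => item.live && gate.nativeProviders.includes(item.provider)
  );
  if (hasNative) {
    return true;
  }
  // Live proof required and no mock fallback: remaining evidence is rejected.
  if (gate.requireLiveProof && !gate.fallbackToMock) {
    return false;
  }
  // Otherwise any captured evidence, including mock-only, is accepted.
  return evidence.length > 0;
}
```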
Verification
1. Verify Fixture Correctness
# Check that fixture no longer uses __qaFailureMode as tool args
grep -r "__qaFailureMode" .artifacts/qa-e2e/runtime-tool-fs-read-proof/
# Expected: No matches in runtime-args fields, only in harnessFault injection config
# Verify args are tool-shaped
cat .artifacts/qa-e2e/runtime-tool-fs-read-proof/fixture-config.json | jq '.paths.failure.args'
# Expected: { "path": "QA_KICKOFF_TASK.md" }
2. Verify Runtime Capture Isolation
# Verify runtime tool-call args are captured separately from provider plans
cat .artifacts/qa-e2e/runtime-tool-fs-read-proof/qa-suite-summary.json | jq '
  .cells[].toolCalls[] | {
    runtimeArgs: .args,
    providerPlannedArgs: .providerPlannedArgs,
    source: .metadata.captureSource
  }
'
# Expected: runtimeArgs from runtime-tool-call, providerPlannedArgs from provider-plan diagnostic
3. Run Runtime Parity Test
# Execute the fixed fixture
npm run test:qa:runtime-parity -- --fixture=runtime-tool-fs-read
# Expected output
# - No tool-call-shape drift reported for failure path
# - Provider plan diagnostic shows Pi vs Codex differences (informational only)
# - Runtime tool-call args match between cells (or documented as intentional)
4. Verify No False Positives
# Check that happy-path-shaped args on the failure-path call are no longer flagged as "drift"
grep -A5 "drift=tool-call-shape" .artifacts/qa-e2e/runtime-tool-fs-read-proof/qa-suite-report.md
# Expected: No matches, or matches with clear explanation that drift is expected
5. Cross-Reference with Native Provider (If Available)
# If native Codex execution is available, verify the fix doesn't mask real issues
npm run test:qa:runtime-parity -- --fixture=runtime-tool-fs-read --provider=native-openai
# Verify runtime args are correctly captured from native execution
cat .artifacts/qa-e2e/runtime-tool-fs-read-proof/native-execution.json | jq '.toolCalls[1].args'
Common Pitfalls
1. Mock Provider vs Runtime Execution Conflation
Trap: Treating provider plan outputs as verified runtime behavior.
Prevention: Always use runtime instrumentation to capture actual tool-call arguments. Provider plans are planning artifacts, not execution proof.
// WRONG
const plannedArgs = mockProvider.getLastPlan().args;
verifyToolCall(plannedArgs);
// CORRECT
const runtimeArgs = await runtimeInstrument.captureToolCall(toolId);
verifyToolCall(runtimeArgs);
2. Harness Metadata Injection as Tool Arguments
Trap: Using internal harness controls (like __qaFailureMode) as valid tool arguments.
Prevention: Separate fault injection from tool arguments:
// WRONG
{ tool: "fs.read", args: { "__qaFailureMode": "denied" } }
// CORRECT
{
  tool: "fs.read",
  args: { path: "valid/path.md" },
  harnessFault: { type: "fs.read.denied", target: "permission" }
}
3. Scenario-Level Pass Masking Tool-Level Failures
Trap: Relying on scenario-level test passes to validate tool-level behavior.
Prevention: Implement per-tool granularity checks as the primary validation gate:
// Scenario pass is necessary but not sufficient
if (scenario.passed) {
  for (const tool of scenario.tools) {
    verifyToolCallShape(tool, expectedShape); // Required validation
  }
}
4. macOS/Docker Environment Differences
Trap: Mock providers may behave differently in containerized vs native environments.
Prevention: Run parity checks across all target environments:
# Verify mock behavior consistency
npm run test:qa:mock-consistency -- --fixture=runtime-tool-fs-read --env=native
npm run test:qa:mock-consistency -- --fixture=runtime-tool-fs-read --env=docker
5. Provider Plan Caching
Trap: Cached provider plans may not reflect current fixture configuration.
Prevention: Implement plan invalidation on fixture changes:
// Before each test run
await mockProvider.clearPlanCache();
await mockProvider.reloadFixture(fixtureConfig);
Related Errors
Related Harness Issues
- TRACKING #80171: Parent issue for runtime parity harness validation
- TRACKING #80173: Phase 2 per-tool fixture implementation tracking
Similar Patterns in Other Fixtures
- fs.write failure fixture: may exhibit the same mock-plan vs runtime-args conflation
- exec.run failure fixture: similar pattern with command injection metadata
- http.request failure fixture: header injection vs actual request args
Diagnostic Error Codes
- HARNESS-PLAN-VS-RUNTIME-MISMATCH: indicates the provider plan differs from runtime execution
- HARNESS-INVALID-TOOL-ARGS: indicates the fixture is using non-tool-shaped arguments
- RUNTIME-PARITY-TOOL-SHAPE-DRIFT: indicates the actual tool-call shape differs between runtimes
Historical Context
This issue exemplifies the class of errors that the runtime parity harness was designed to catch. The harness correctly identified a signal, but the signal’s interpretation relied on incorrect assumptions about data source fidelity. Similar issues have been documented in:
- Provider plan caching leading to stale comparison baselines
- Fixture configuration drift between test suite versions
- Inconsistent runtime instrumentation coverage across Pi and Codex cells