A controlled sample of 18 cells × 7 repeats of the same request. Goal: confirm that “faster and cheaper” is not a property of the model but a side effect of non-deterministic routing and a broken size parameter. The underlying model is genuine OpenAI gpt-image-2 (C2PA signature).
Key metrics grouped by quality (low / medium / high) and split by request type — this is where the tiers diverge.
1024×1024 was requested every single time. Correct / wrong split by the backend identified from the C2PA signature.
claim_generator = gpt-image, urn:c2pa). In this test, this is the path that ignores the requested size and returns the wrong resolution on a cheap token tier.
softwareAgent = Azure OpenAI ImageGen. In this test, it always returned the correct 1024×1024 with honest token billing.
Distribution of actual resolutions (all requested 1024×1024).
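The backend split and the resolution distribution described above can be tallied roughly as follows. This is a minimal sketch, not the code in report.py: the record layout (`manifest`, `width`, `height`), the helper names, and the prefix match on `claim_generator` are assumptions made for illustration, and the sample records are made-up placeholders rather than measured results.

```python
from collections import Counter

REQUESTED = (1024, 1024)  # the size requested in every call of this test


def classify_backend(manifest: dict) -> str:
    """Label a response using the two C2PA fields quoted in this report.

    The matching rules are assumptions, not a C2PA-defined procedure.
    """
    if "Azure OpenAI ImageGen" in str(manifest.get("softwareAgent", "")):
        return "azure"
    if str(manifest.get("claim_generator", "")).startswith("gpt-image"):
        return "direct"
    return "unknown"


def size_split(records):
    """Return (correct/wrong counts per backend, counts per actual resolution)."""
    split = Counter()
    resolutions = Counter()
    for rec in records:
        backend = classify_backend(rec["manifest"])
        actual = (rec["width"], rec["height"])
        verdict = "correct" if actual == REQUESTED else "wrong"
        split[(backend, verdict)] += 1
        resolutions[f"{actual[0]}x{actual[1]}"] += 1
    return split, resolutions


# Placeholder records (hypothetical values, not taken from results.jsonl).
sample = [
    {"manifest": {"softwareAgent": "Azure OpenAI ImageGen"},
     "width": 1024, "height": 1024},
    {"manifest": {"claim_generator": "gpt-image"},
     "width": 1536, "height": 1024},
]
split, resolutions = size_split(sample)
```

Keying the counter on a (backend, verdict) pair keeps the correct/wrong split and the backend split in one pass, which is all the two charts need.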
Cost min–max spread within a cell (7 identical repeats). Green dot = cheapest run, red dot = most expensive.
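The per-cell spread can be computed along these lines. This is a sketch under assumptions: the field names `cell` and `cost`, the function name, and the sample values (arbitrary units) are hypothetical and are not taken from results.csv.

```python
from collections import defaultdict


def cost_spread(rows):
    """Group costs by cell and report min, max, max/min ratio and run count."""
    by_cell = defaultdict(list)
    for row in rows:
        by_cell[row["cell"]].append(row["cost"])

    summary = {}
    for cell, costs in by_cell.items():
        lo, hi = min(costs), max(costs)
        summary[cell] = {
            "min": lo,                          # cheapest run (green dot)
            "max": hi,                          # most expensive run (red dot)
            "ratio": hi / lo if lo else None,   # None when the cheapest run cost 0
            "n": len(costs),
        }
    return summary


# Placeholder rows in arbitrary cost units (not measured data).
rows = [
    {"cell": "A", "cost": 2},
    {"cell": "A", "cost": 5},
    {"cell": "A", "cost": 3},
    {"cell": "B", "cost": 4},
]
spread = cost_spread(rows)
```

A ratio well above 1 within a cell means identical requests were billed differently, which is the effect the chart is meant to expose.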
Each dot = 1 request. Note the separated clusters: the cheap, fast “direct” path (wrong size) vs the honest “Azure” path.
The edits come out correctly: the problem is the size and the billing, not the quality.
results.jsonl, results.csv, full responses in responses/, images in outputs/. Scripts: run_debug.py, report.py.