Thanks for the question David! Each regression found during testing went through an initial triage with the API team. We first identified scope--how many users, what features, and how quality of experience might be impacted. Critical issues required deallocating the experiment before further investigation. We then would try to build a minimal replication of the data issue across our old Falcor and new GraphQL APIs. With a basic replication, we used Netflix observability tooling such as Edgar to track the request upstream to determine root cause. As you mention this did identify latent bugs in our existing APIs.
Doing this all in a timely fashion was challenging. Some triages took on the order of weeks and spanned many upstream / downstream teams. Having the Falcor-to-GraphQL migration prioritized across the Netflix product helped to keep the project on track while these issues were resolved.