Rubber Duck uses a second model from a different AI family to evaluate the primary agent’s plans, question assumptions, and raise concerns.
GitHub has introduced an experimental Rubber Duck mode in the GitHub Copilot CLI. The latest addition to the AI-powered coding tool uses a second model from a different AI family to provide a second opinion before enacting the agent’s plan.
The new feature was announced April 6. Introduced in experimental mode, Rubber Duck leverages a second model from a different AI family to act as an independent reviewer, assessing plans and work at the moments where feedback matters most, according to GitHub. Rubber Duck is a focused review agent, powered by a model from a complementary family to a primary Copilot session. The job of Rubber Duck is to check the agent’s work and present a short, focused list of high-value concerns including details the primary agent may have missed, assumptions worth questioning, and edge cases to consider.
Developers can use/experimentalin the Copilot CLI to access Rubber Duck alongside other experimental features.
Evaluating Rubber Duck on SWE-Bench Pro, a benchmark of real-world coding problems drawn from open-source repositories, GitHub found that Claude Sonnet 4.6 paired with Rubber Duck running GPT-5.4 achieved a resolution rate approaching Claude Opus 4.6 running alone, closing 74.7% of the performance gap between Sonnet and Opus. GitHub said Rubber Duck tends to help more with difficult problems, ones that span three-plus files and would normally take 70-plus steps. On these problems, Sonnet plus Rubber Duck scores 3.8% higher than the Sonnet baseline and 4.8% higher on the hardest problems identified across three trials.
GitHub cited these examples of the kinds of problems Rubber Duck finds:
- Architectural catch (OpenLibrary/async scheduler): Rubber Duck caught that the proposed scheduler would start and immediately exit, running zero jobs—and that even if fixed, one of the scheduled tasks was itself an infinite loop.
- One-liner bug (OpenLibrary/Solr): Rubber Duck caught a loop that silently overwrote the same
dictkey on every iteration. Three of four Solr facet categories were being dropped from every search query, with no error thrown. - Cross-file conflict (NodeBB/email confirmation): Rubber Duck caught three files that all read from a Redis key which the new code stopped writing. The confirmation UI and cleanup paths would have been silently broken on deploy.


