OpenAI’s 48-Hour Blitz Retakes AI Lead from Google, Anthropic
OpenAI shipped GPT-5.5 and ChatGPT Images 2.0 within 48 hours, ending a three-way tie at the top and crushing Nano Banana 2 by a record 242 Arena points.
In the span of 48 hours, OpenAI quietly rewrote every leaderboard that mattered: ChatGPT Images 2.0 opened a 242-point gap over Google’s Nano Banana 2 on LM Arena — the widest #1-to-#2 spread ever recorded — and GPT-5.5, codenamed “Spud,” followed two days later to claim the top spot on the Artificial Analysis Intelligence Index and edge out Anthropic’s unreleased Mythos on agentic coding.
The back-to-back launches ended a three-way tie at the top of the intelligence charts between Anthropic, Google, and OpenAI, and they buried Google’s image-generation momentum from the Nano Banana era. They also arrived exactly one week after Anthropic stunned the industry by refusing to release its most capable model on safety grounds, an inversion that OpenAI’s leadership clearly relished.
The 242-Point Knockout
ChatGPT Images 2.0 went live on Tuesday, April 21, built on a model OpenAI calls gpt-image-2. It is the company’s first image model with native reasoning — it can plan, search the web for references, and verify its own output before committing a pixel — and the numbers that came out of LM Arena the next morning were not subtle.
On the April 22 snapshot of the Text-to-Image Arena leaderboard, gpt-image-2 (medium) opened at 1,512. Google’s gemini-3.1-flash-image-preview — the model the world has been calling Nano Banana 2 — sat at 1,270. Arena called the 242-point spread the largest gap between #1 and #2 ever recorded on any of its image boards. Typical model upgrades move the needle 30 to 60 points. This was a reset.
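For scale, Arena scores are Elo-style ratings (LM Arena fits a Bradley-Terry model and reports it on the familiar 400-point Elo scale), so the spread translates into an expected head-to-head win rate. A back-of-envelope sketch, using the April 22 ratings above and the standard Elo formula:

```python
# What a 242-point Arena gap means in practice: Arena ratings are
# Elo-style, so the expected pairwise win rate follows the Elo formula.

def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B in a pairwise vote."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

p = elo_win_probability(1512, 1270)  # gpt-image-2 vs. Nano Banana 2
print(f"Implied head-to-head preference rate: {p:.1%}")  # ~80.1%
```

In plain terms, on a random prompt, Arena voters would be expected to prefer the OpenAI image roughly four times out of five.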
The model also swept all seven text-to-image sub-categories, took #1 on single-image edit (1,513), and led multi-image edit (1,464). Nano Banana 2 — the same model that briefly pushed Gemini to the top of the App Store last fall and brought Google 10 million new Gemini users — slid to second in every category it had dominated.
What actually changed under the hood is the reasoning layer. OpenAI’s announcement post describes two modes: Instant, which behaves like a conventional generator, and Thinking, which routes the prompt through the same O-series reasoning stack that powers GPT-5.5 before a single pixel is drawn. Thinking mode can browse the web for reference material, generate up to eight consistent images from a single prompt, and double-check its own outputs — the image equivalent of chain-of-thought.
The capability jump shows up in the places image models have always been weakest: dense text rendering, multilingual scripts, and information-rich layouts. Images 2.0 is the first OpenAI image model that renders Japanese, Korean, Chinese, Hindi, and Bengali text cleanly, and it handles dense magazine-cover compositions — with real-looking headlines, deck copy, volume numbers, and barcodes — that would have collapsed into typographic soup last year.
The Barcode Test
The viral stress-test of the week involved barcodes. Independent reviewers asked the model to generate five business-book covers for titles like Good to Great and The Intelligent Investor, each with a functional back-of-book barcode and a valid ISBN — the model had to invent the numbers itself. Three of the five generated barcodes scanned successfully on a standard iPhone camera, and the codes for the real titles mapped to the actual published editions. In one case the tester blacked out the printed ISBN digits and the scanner still pulled up the correct book from the barcode alone.
That is a strange and slightly unnerving capability. Barcodes are not really images — they are encoded data that happens to be rendered visually. A model that can generate a correct, scannable EAN-13 for the real printed edition of a specific book is doing something closer to retrieval than to rendering. For product teams building packaging mockups, it turns a multi-step pipeline into a single prompt.
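To see why, look at what a back-of-book barcode actually encodes. It is an EAN-13: twelve data digits (for books, the ISBN-13) plus a check digit computed from them, and a scanner rejects anything whose checksum fails. A minimal sketch of that validation; the ISBN below is the published ISBN-13 of the revised edition of The Intelligent Investor:

```python
# An EAN-13 barcode is just 13 digits: 12 data digits plus a check digit
# computed with alternating 1x/3x weights, mod 10. A scanner rejects any
# code whose check digit doesn't match, so a "scannable" generated
# barcode means the model got the arithmetic right, not just the stripes.

def ean13_check_digit(first_12: str) -> int:
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first_12))
    return (10 - total % 10) % 10

def is_valid_ean13(code: str) -> bool:
    return len(code) == 13 and code.isdigit() and int(code[-1]) == ean13_check_digit(code[:12])

print(is_valid_ean13("9780060555665"))  # True: The Intelligent Investor (rev. ed.)
```

A checksum-consistent fake would already be impressive; reproducing the identifier of a specific published edition is the retrieval part.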
GPT-5.5 “Spud”: A Narrow but Real Coding Lead
Two days after the image launch, on Thursday, April 23, OpenAI published its Introducing GPT-5.5 post and flipped the model live inside ChatGPT and Codex within ten minutes. Internally the model had been known as Spud — OpenAI’s research group has always given base models deliberately unglamorous codenames (Strawberry, Orion, now Spud) to keep premature buzz out of the codebase.
The significance of that codename is easy to miss. Every 5.1, 5.2, 5.3, and 5.4 release was a post-training layer on top of the same GPT-5 skeleton. GPT-5.5 is the first fully retrained base model OpenAI has shipped since GPT-4.5 — a new architecture, a new pretraining corpus, and an agentic training objective built in from the bottom up rather than bolted on.
Terminal-Bench and the Mythos Comparison
The headline benchmark result is on Terminal-Bench 2.0, which measures how well a model can plan, execute, and recover from multi-step command-line workflows — the closest thing the industry has to an agentic-coding eval. GPT-5.5 scored 82.7%. GPT-5.4 scored 75.1%. Claude Opus 4.7, the model Anthropic released two weeks ago, scored 69.4%.
The more charged comparison is against Mythos, the Anthropic model that has not been released publicly at all. Mythos scored 82.0% on the same evaluation. In other words, GPT-5.5 — a model any Plus, Pro, Business, or Enterprise user can load in ChatGPT right now — beats, by seven-tenths of a point, the model Anthropic has spent the last month describing as too dangerous to ship.
Other numbers from the launch:
SWE-Bench Pro: 58.6% — measuring real-world GitHub issue resolution. Claude Opus 4.7 scores higher here at 64.3%, though OpenAI has publicly disputed the methodology, citing signs of training-set memorization in Anthropic’s reported runs.
FrontierMath Tier 4: 35.4% — roughly double Opus 4.7’s score on the hardest public math benchmark.
GDPval: 84.9% — OpenAI’s evaluation of model output vs. human professionals across 44 occupations, from paralegal work to financial modeling. The model matches or beats the human baseline in the vast majority of comparisons.
Computer-use: 78.7% — measuring how well the model navigates GUIs, fills forms, and drives desktop applications.
The Artificial Analysis Scoreboard Has a Standalone Leader Again
The Artificial Analysis Intelligence Index — a composite of ten hard benchmarks covering reasoning, knowledge, science, coding, and agents — had, until this week, a three-way tie at the top: Claude Opus 4.7, Gemini 3.1 Pro, and the then-reigning GPT-5.4 were all locked at a score of 57.
GPT-5.5 extra-high now sits at 60. It is the first time in months the index has had a clean, standalone leader. For the first time since Google’s Gemini 3.1 run in early 2026, OpenAI’s consumer-grade flagship is objectively the smartest general-purpose model measured.
A caveat worth noting: on the AA-Omniscience factual-accuracy benchmark, GPT-5.5 extra-high hit the highest accuracy score on record at 57% — but also the highest hallucination rate at 86%. By comparison, Opus 4.7’s hallucination rate is 36% and Gemini 3.1 Pro’s is 50%. The model reaches for answers more aggressively, and when it reaches wrong, it reaches wrong more confidently.
The Price Doubled
OpenAI also made the single biggest per-token price increase of the GPT-5.x era. Standard GPT-5.5 runs $5 per million input tokens and $30 per million output tokens in the API — exactly double GPT-5.4’s $2.50 and $15. GPT-5.5 Pro lands at $30 input and $180 output, which makes it more expensive than Claude Opus.
OpenAI’s argument is that the model completes the same tasks in meaningfully fewer tokens, so effective cost per finished task stays roughly flat. That claim holds up in their internal agentic-coding evals, where 5.5 uses about half the tokens of 5.4 for the same completion. For users of ChatGPT Plus or Pro, this detail is invisible. For anyone running production agent workloads on the API, it is the entire story.
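The arithmetic is easy to check. A rough sketch under OpenAI’s own assumption; the token counts are invented round numbers, and only the halving ratio comes from their reported evals:

```python
# Sanity check of "cost per finished task stays flat": prices doubled,
# but (per OpenAI's internal evals) token usage roughly halved.

PRICES = {  # USD per million tokens (input, output), from the launch post
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.5": (5.00, 30.00),
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Hypothetical agentic run: GPT-5.4 burns 400k input / 100k output tokens;
# GPT-5.5 finishes the same task in half the tokens at double the price.
print(task_cost("gpt-5.4", 400_000, 100_000))  # 2.5
print(task_cost("gpt-5.5", 200_000, 50_000))   # 2.5
```

If the halving holds on your workload, per-task cost is unchanged; if your tasks don’t compress, you are paying exactly double.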
The API became available on April 24, a day after the ChatGPT rollout. Cursor, Windsurf, and other third-party coding clients have not yet exposed the model as of this writing — integrations typically lag the API by a few days.
The Mythos Shadow
The backdrop to this entire week is Anthropic’s awkward month. On April 10, Anthropic announced Mythos — an AI model so capable at identifying software vulnerabilities, the company claimed, that releasing it to the public would be reckless. Instead Anthropic created Project Glasswing, a restricted preview for roughly 40 organizations including Apple, Amazon, Google, Microsoft, Cisco, CrowdStrike, JPMorgan, and Nvidia, plus the NSA and the Commerce Department’s Center for AI Standards and Innovation.
Mozilla used a Mythos preview to identify and patch 271 vulnerabilities in Firefox. Anthropic reported the model found a 27-year-old security flaw in OpenBSD, one of the most aggressively audited operating systems ever written.
Then, on April 21 — the same day ChatGPT Images 2.0 shipped — Anthropic disclosed that a small group of unauthorized users had gained access to Mythos. The group, operating out of a private Discord, had correctly guessed the model’s online endpoint based on Anthropic’s usual URL conventions, then got in through a third-party vendor environment where a contractor in the group had legitimate credentials. Anthropic says its own systems were not breached.
The episode punctured the central premise of Glasswing. When a security researcher publicly warned that Mythos access had almost certainly spread across thousands of employees at the 40 partner companies, the statement drew agreement rather than pushback.
Altman’s Bomb-Shelter Line
On the Core Memory podcast hosted by Ashlee Vance the same week, Sam Altman — sitting alongside Greg Brockman — delivered a line that has been quoted in every subsequent story about the Mythos leak:
“It is clearly incredible marketing to say, ‘We have built a bomb. We were about to drop it on your head. We will sell you a bomb shelter for $100 million to run across all your stuff — but only if we pick you as a customer.’”
Altman did not name Anthropic. He did not need to. He went on to argue that fear-based AI marketing is a strategy aimed at justifying a smaller, more controlled circle of AI access: “There are people in the world who, for a long time, have wanted to keep AI in the hands of a smaller group of people. You could justify that in a lot of different ways, and some of it’s real.”
The subtext is that GPT-5.5 — released on Thursday, at consumer-grade pricing, to anyone with a ChatGPT subscription — scores higher on the coding-agent benchmark that Anthropic used to justify locking Mythos away.
The Rest of the Week
Under almost any other schedule, each of the remaining launches this week would have earned its own news cycle. Anthropic released Claude Design, a vision-model-powered collaborative design surface running on Opus 4.7 that produces slide decks, animated prototypes, and wireframes from a single prompt. Google DeepMind shipped Deep Research Max, an autonomous research agent now leading most deep-research benchmarks. Alibaba dropped two Qwen 3.6 models — the proprietary Max Preview and the open-weights 27B variant. Moonshot released Kimi K2.6, an open-weights coding model that beats Opus 4.6 and GPT-5.4 extra-high on DeepSearch and Humanity’s Last Exam.
OpenAI itself also quietly open-weighted a PII Privacy Filter — a small, local-inference model designed to redact personal information from unstructured text before it ever leaves a user’s machine — and rolled out a free ChatGPT for Clinicians offering for verified medical professionals in the U.S.
Microsoft shipped Copilot agentic actions across Word, Excel, and PowerPoint. Claude moved into Microsoft Word natively for Pro and Max users. X launched Grok-powered custom topic timelines. None of it broke through the news cycle that GPT-5.5 and Images 2.0 consumed between them.
The Bottom Line
For most of the last nine months, OpenAI has been on the defensive. Gemini owned the consumer mindshare conversation through Nano Banana. Anthropic owned the developer conversation through Opus 4.5 and 4.6. The company’s own post-Thanksgiving “code red” memo — pushing the team to close the consumer gap — leaked in January.
What happened this week is, strategically, exactly what a code red looks like when it works. Kill the products that aren’t landing (Sora, the short-form video app, was shuttered a month ago), pivot the ones that are (Codex toward enterprise), and then release your strongest base-model retraining and your strongest image model within 48 hours of each other, right as your biggest competitor is explaining to reporters how its own frontier model leaked.
The lead is real but not uncontested. Gemini 3.5 Pro is reportedly weeks away. Anthropic’s public pipeline post-Mythos is less clear — the company has strong reasons, commercial and reputational, to ship something before the end of Q2. And the hallucination-rate caveat on GPT-5.5 is not trivial; in any workflow that rewards calibration over confidence, the model will fail loudly.
But for the first time in a long time, if you ask the question “which lab is leading the frontier right now,” the answer is unambiguous. OpenAI has the top model on the intelligence index, the top model on the image-gen leaderboard, and a pricing structure that — despite the doubled API rates — still sits inside what most developers will pay. The race is not over, but this week it is no longer tied.