5 new GPT-5.5 upgrades that no developer should ignore

This was such a massive upgrade.

OpenAI saw all the craze going on with Opus 4.7 — so of course they had to quickly release a new model to steal back the spotlight.

And these are some pretty incredible upgrades over GPT-5.4 we got here — no developer should ignore this.

Like now it’s gotten sooo much better at processing extremely long context — a more than 50% relative accuracy jump on very large inputs — something that will be very useful in development, for those humongous, intricately connected codebases.

GPT-5.5 doesn’t just extend context to 1 million tokens—it stays sharper inside it, improving Graphwalks BFS accuracy from 92.1% to 94.8% at 0–128K tokens and from 48.3% to 73.7% at 256K tokens.

It even comes with a new overpowered thinking mode and a new Pro variant — to build the most complicated features, and deal with the hardest bugs known to man.

1. Massive upgrades in long-context reliability

Many models advertise huge context windows. Few stay accurate when that context becomes truly massive.

GPT-5.5 ships with a 1 million token context window and a 200,000 token output limit — but the real story is measurable reliability at extreme scale.

On Graphwalks BFS, a benchmark that tests whether the model can follow chains of logic scattered across very large documents, GPT-5.5 shows major gains over GPT-5.4 as context size increases:

  • At 0–128K tokens, GPT-5.5 scored 94.8%, up from 92.1% for GPT-5.4 (+2.7 points)
  • At 256K tokens, GPT-5.5 scored 73.7%, up from 48.3% (+25.4 points)
  • At the full 1 million token context, GPT-5.5 scored 45.4%, up from 21.4% (+24.0 points)

Those numbers matter because most models degrade sharply as context expands.

GPT-5.5 appears significantly better at retaining signal, tracing relationships, and reasoning across huge inputs.

For us developers that means stronger performance across:

  • large monorepos
  • architecture documentation
  • multi-service dependency maps
  • long debugging sessions
  • logs, tickets, specs, and tests in one thread
  • research across many files simultaneously

Instead of splitting work into small prompt chunks, teams can increasingly provide broader system context and let the model reason globally.

2. Incredible agentic coding and terminal use improvements

GPT-5.5 is heavily optimized for autonomous coding, tool use, debugging, and multi-step execution.

On Terminal-Bench 2.0, GPT-5.5 reportedly scored 82.7% compared with:

  • GPT-5.4 at 75.1%
  • Claude Opus 4.7 at 69.4%
  • Gemini 3.1 Pro at 68.5%

That is a 7.6-point jump over GPT-5.4.

And this maps directly onto how we developers actually work:

  • inspect files
  • run commands
  • read errors
  • patch code
  • rerun tests
  • iterate until fixed

On SWE-Bench Pro (Public), GPT-5.5 scored 58.6%, versus 57.7% for GPT-5.4 and 54.2% for Gemini 3.1 Pro.

3. Overpowered thinking mode: xhigh

One of the most exciting additions is a new super-powered thinking mode in GPT-5.5’s reasoning control system.

Developers can choose among these five effort levels:

  • none
  • low
  • medium
  • high
  • xhigh

xhigh is effectively the “use more compute and think harder” mode, ideal for:

  • architecture decisions
  • subtle debugging
  • security reviews
  • algorithm design
  • migrations
  • complex planning

Instead of using maximum reasoning for every task, teams can reserve deep thinking for problems where mistakes are costly.
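
Here’s roughly what that looks like in practice: a minimal sketch assuming GPT-5.5 shows up through the existing OpenAI Responses API and that the reasoning effort setting simply gains the new xhigh value (the model ID and effort names come from the announcement, not a confirmed SDK release).

# Minimal sketch: routing tasks to different reasoning effort levels.
# Assumes GPT-5.5 is exposed through the existing Responses API and that
# reasoning.effort simply gains the new "xhigh" value -- both unconfirmed.
from openai import OpenAI

client = OpenAI()

def run(prompt: str, effort: str) -> str:
    response = client.responses.create(
        model="gpt-5.5",                 # illustrative model ID
        reasoning={"effort": effort},    # "none" | "low" | "medium" | "high" | "xhigh"
        input=prompt,
    )
    return response.output_text

# Cheap effort for routine work, xhigh reserved for costly-to-get-wrong tasks.
print(run("Summarize this commit message in one line: fix race in job queue", "low"))
print(run("Review this auth middleware diff for privilege-escalation bugs: ...", "xhigh"))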

4. GPT-5.5 Pro variant

OpenAI also introduced GPT-5.5 Pro, aimed at users who want maximum performance.

Listed pricing:

  • GPT-5.5: $5 input / $30 output per million tokens
  • GPT-5.5 Pro: $30 input / $180 output per million tokens

That means Pro costs 6x more on input and 6x more on output, strongly suggesting it is designed for:

  • enterprise automation
  • mission-critical engineering workflows
  • legal or finance review systems
  • advanced research pipelines
  • premium coding agents

Standard GPT-5.5 is the workhorse for everyday tasks. Pro is the high-confidence tier.

5. Greater conciseness, efficiency, and real-world speed

One of the most underrated GPT-5.5 upgrades is not raw intelligence—it is how efficiently that intelligence is delivered.

Instead of solving coding tasks with long explanations and bloated outputs, GPT-5.5 is optimized for tighter, cleaner responses that reduce both latency and cost.

In side-by-side coding tasks, GPT-5.5 reportedly uses 72% fewer output tokens than Claude Opus 4.7 to solve the same GitHub issues. Rather than generating essays, it tends to prefer concise diffs and direct fixes.

Against the previous generation, GPT-5.5 also shows stronger internal efficiency. On standard software engineering workloads (Expert-SWE), it reportedly completes tasks using 15–20% fewer tokens than GPT-5.4.

That matters because fewer tokens compound into practical gains:

  • lower API cost per task
  • faster iteration loops
  • cleaner patches and diffs
  • easier review cycles
  • less noise for developers to parse

The speed gains are equally meaningful. Because GPT-5.5 generates fewer tokens while maintaining roughly GPT-5.4-level per-token latency, it can complete the same coding workloads around 40% faster in real-world use.

For developers, that means less waiting, less clutter, and more usable output.
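
If you want the rough arithmetic behind that kind of speedup, here’s a quick back-of-envelope sketch (the per-token latency and token counts are made-up illustrative numbers, not measurements):

# Back-of-envelope: if per-token latency stays flat, wall-clock time scales
# directly with how many tokens the model emits. Numbers are illustrative.
per_token_latency_s = 0.02          # assumed, same for both models
baseline_tokens = 1_000             # hypothetical GPT-5.4 output for a task
for reduction in (0.20, 0.40):      # e.g. 20% or 40% fewer output tokens
    tokens = int(baseline_tokens * (1 - reduction))
    print(f"{int(reduction * 100)}% fewer tokens -> {tokens * per_token_latency_s:.1f} seconds")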

GPT-5.5 isn’t just a routine upgrade. OpenAI is really focusing on the real pain points serious users feel every day:

  • unreliable long context
  • weak autonomous tool use
  • shallow reasoning on hard tasks
  • expensive verbosity
  • lack of premium capability tiers

The result is a model that feels even more like a serious engineering collaborator.

For developers, this may be the most important upgrade of all.

Claude Opus 4.7 just changed everything

Unbelievable. Just unbelievable.

If you heard even a slice of all the crazy things that Claude Mythos has been doing in the wild — then you’ll understand why this was far far from just another incremental upgrade.

Because this is basically Claude Mythos — but actually made less powerful on purpose to avoid serious problems(!!)

Every ability went up drastically — this is hands-down the best (publicly available) coding model in the world right now. It’s not even close.

Imagine going from 54.5% to 98.5% in a major ability in just one release. How is that even possible?

For the computer-use work that sits at the heart of XBOW’s autonomous penetration testing, the new Claude Opus 4.7 is a step change: 98.5% on our visual-acuity benchmark versus 54.5% for Opus 4.6. Our single biggest Opus pain point effectively disappeared…

And should we talk about Claude Design? The shocking new Claude Opus 4.7-powered tool for creating any interactive diagram or slide from a text prompt…

1. Claude Mythos-lite

It’s been the absolute talk of the town in the days leading up to this new Opus.

Earlier this month Anthropic revealed Claude Mythos Preview — an internal cybersecurity-focused model with capabilities too sensitive for broad public release.

Mythos could identify and exploit zero-day vulnerabilities across major operating systems and browsers. It reportedly found bugs 10 to 20 years old, including a 27-year-old OpenBSD vulnerability.

It also posted striking benchmark numbers:

  • 83.1% on CyberGym vulnerability reproduction
  • 595 crashes on internal exploit benchmarks
  • 10 tier-5 control-flow hijacks on patched targets

“Mythos Preview identifies and exploits zero-day vulnerabilities across every major operating system and browser family.”

Mythos access was restricted to select defensive organizations through Project Glasswing.

Opus 4.7 is effectively the “safe” version — Anthropic intentionally nerfed its hacking capabilities while keeping its reasoning and coding skills at a “frontier” level for the public.

2. Massive gains in agentic coding

It’s gotten so much better at reasoning across tasks, using tools, fixing mistakes, and continuing to work until the objective is complete.

And self-verifying its own work too.

“Opus 4.7 pays closer attention to your instructions and devises ways to verify its own outputs before reporting back.”

That means stronger behavior such as:

  • Writing tests before finalizing code
  • Catching failed assumptions
  • Revising broken implementations
  • Managing multi-step workflows

So many companies have already been talking about all the positive impacts this has been having on their workflow.

  • CursorBench: 70% vs 58% for Opus 4.6
  • GitHub internal benchmark: 13% better task resolution than Opus 4.6
  • Notion: +14% on complex workflows
  • Bolt: up to 10% better on long app-building tasks

3. Better instruction fidelity

A subtle but important upgrade:

Opus 4.7 is more literal.

Older models often guessed what users meant. Sometimes helpful, often messy.

Opus 4.7 follows prompts more directly, making it stronger for structured engineering systems, automation pipelines, and reproducible workflows.

“Prompts written for previous models may produce different results because Opus 4.7 is less likely to infer unstated requests.”

Benefits include:

  • More predictable outputs
  • Less prompt drift
  • Fewer hidden assumptions
  • Better reliability in production

The tradeoff? Sloppy prompts may fail faster.

4. Unprecedented vision upgrades that actually matter

Anthropic also significantly improved image understanding.

Supported image resolution increased from 1,568 px to 2,576 px on the long edge, allowing Opus 4.7 to process far more visual detail.

“This is the first Claude model with high-resolution image support.”

That matters for:

  • UI screenshots
  • Dense dashboards
  • Technical diagrams
  • Multi-column PDFs
  • Tables and charts

One partner benchmark reported:

  • 98.5% visual acuity for Opus 4.7
  • 54.5% for Opus 4.6

If you build AI tools that interact with interfaces or documents, this is a major upgrade.

5. New API controls for deeper thinking

Anthropic also changed how developers tune the model.

The headline addition is xhigh effort, a mode that trades speed for deeper reasoning.

“Anthropic recommends xhigh for most coding and agentic use cases.”

Use cases:

  • Standard for light tasks
  • High for serious reasoning
  • xhigh for debugging large systems or complex audits

At the same time, older knobs like temperature, top_p, and top_k no longer accept custom non-default values on Opus 4.7.

That signals a new philosophy: less randomness tuning, more compute-based reasoning control.
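
For the API-minded, here’s a minimal sketch of what that might look like. The effort field and the xhigh value follow the description above; since I can’t confirm the exact SDK parameter, it’s passed through extra_body, and the model ID is illustrative too:

# Minimal sketch: asking Opus 4.7 for deeper reasoning via an effort setting.
# The "effort" field and "xhigh" value mirror the write-up above; the exact
# API field name and the model ID are assumptions, which is why the value is
# passed through extra_body rather than a typed SDK argument.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",              # illustrative model ID
    max_tokens=4096,
    extra_body={"effort": "xhigh"},       # assumed field per the write-up above
    # No temperature / top_p / top_k here: per the notes above, Opus 4.7 no
    # longer accepts custom non-default values for those knobs.
    messages=[{
        "role": "user",
        "content": "Audit this payment-retry module for race conditions: ...",
    }],
)

print(response.content[0].text)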

6. The catch: tokenizer costs

Pricing stayed the same:

  • $5 / million input tokens
  • $25 / million output tokens

But Anthropic introduced a new tokenizer.

The same text may now use 10% to 35% more tokens than before.

“The same fixed text may tokenize to approximately 1.0x–1.35x the tokens used by Opus 4.6.”

So while rates stayed flat, real usage costs may rise — especially for long contexts, agent loops, or document-heavy workflows.
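
A quick back-of-envelope makes the point: flat rates, denser tokenizer, bigger bill. The workload numbers below are made up purely for illustration; the 1.0x–1.35x range comes from above:

# Back-of-envelope: flat per-token rates + a denser tokenizer can still raise
# real monthly spend. Workload numbers below are hypothetical.
INPUT_RATE = 5 / 1_000_000       # $ per input token
OUTPUT_RATE = 25 / 1_000_000     # $ per output token

def monthly_cost(input_tokens, output_tokens, inflation=1.0):
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) * inflation

# Hypothetical agent-heavy workload: 400M input + 40M output tokens per month.
baseline = monthly_cost(400_000_000, 40_000_000, inflation=1.0)     # ~$3,000
worst_case = monthly_cost(400_000_000, 40_000_000, inflation=1.35)  # ~$4,050
print(baseline, worst_case)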

Claude Opus 4.7 is a serious release for software developers.

It improves where it matters most:

  • Agentic coding
  • Self-verification
  • Better prompt fidelity
  • Stronger vision
  • Deeper reasoning controls

And it arrives right after Anthropic showed a stronger model it chose not to release publicly.

That alone tells you something important:

The next generation of AI is already here.

This tiny new open-source model is already destroying Google’s Gemma 4

How does this keep happening?

We haven’t even finished talking about how huge Gemma 4 is going to be…

And now we just got Qwen 3.6 35B — a new unbelievably tiny open-weight model that’s massively superior to Gemma 4 in so many crucial areas.

And it’s blown up massively — trillions of tokens processed just days after its release.

This thing absolutely destroyed Gemma 4 in several coding benchmarks — it’s not even close… wow.

And still FREE of course.

While Gemma 4 delivers a polished conversational experience, the Qwen 3.6-35B-A3B model effectively outclasses it in software engineering by solving 73.4% of real-world GitHub issues compared to Gemma’s 52.0% on the SWE-bench Verified leaderboard.

For those of you who don’t know, SWE-bench is one of the most highly trusted benchmarks out there… just imagine.

And this 73.4% is incredibly close to what Opus 4.6 scored.

Can you believe this?

This ridiculously tiny model that can run on your PC is giving that super-computer-data-center-powered beast a dead serious run for its money.

Imagine how powerful this Qwen is going to be on the next release — and the next and the next…

And don’t get me started on all the other notable features like the shocking speed, the massive context… like what is this?

Gemma 4 is still struggling to get past 300K — Qwen supports a whopping 1 MILLION token context!

We’ve literally never seen an open-weight model with this sort of intelligence and this sort of processing bandwidth. It’s just incredible.

1. The most efficient model ever made

Yes — the internal design that made Gemma 4 so efficient? Qwen 3.6 takes that design to its extreme limits.

It has 35 billion total parameters, but thanks to the revolutionary Mixture-of-Experts (MoE) design, it only activates 3 billion parameters per token during runtime.

Compare that to Gemma 4, whose MoE variant activates 4 billion out of 26 billion parameters.

Qwen 3.6 has more total processing ability — but is much more efficient at activating only the parameters it needs for the task at hand.

So now you get a rare combination:

  • The reasoning depth of a much larger model
  • Faster inference than dense 35B models
  • Lower runtime compute costs
  • More realistic local deployment options

A heavyweight model without always paying heavyweight costs. “Big brain, small footprint”.

It’s really huge if you’re looking to run local workflows and save costs as a developer.

Instead of choosing between tiny fast models and giant slow models, you get the awesome middle ground of Qwen.

2. Built for real coding work from the ground up

It’s not just SWE-bench Verified — Qwen 3.6 performs astonishingly well in so many other coding and general AI ability benchmarks.

The upgrade over the previous version is especially notable:

  • Terminal-Bench 2.0: 40.5 → 51.5
  • NL2Repo: 20.5 → 29.4

All this means even better agentic capability for:

  • Repository-level reasoning — understanding files, configs, and dependencies across a codebase
  • Tool use — working with APIs, Python, Bash, file systems, and multi-step workflows
  • Frontend workflows — generating and debugging React, UI components, CSS, and layouts

And more generally — for every aspect of software development:

  • Understanding large repositories
  • Editing multiple files coherently
  • Fixing bugs across systems
  • Using tools and terminals
  • Debugging frontend issues
  • Staying useful across long sessions

And so much more.

3. “Thinking preservation”

Qwen was deliberately optimized for world-class “thinking preservation”.

Most models lose their reasoning thread between turns. They forget why they made earlier decisions, repeat work, or need to re-analyze the same issue again and again.

Qwen3.6 is specifically designed to preserve reasoning context across complex conversation chains — which is what makes the 1 million token context window genuinely usable.

That makes a major difference for:

  • Long debugging sessions
  • Refactors over multiple prompts
  • Multi-hour coding workflows
  • Step-by-step architecture planning
  • Troubleshooting where earlier context matters

4. Massive context window

Qwen3.6 technically supports a native context window of 262,144 tokens — but it’s been extended to roughly 1,010,000 tokens with sophisticated scaling methods.

That is enormous.

For us developers this means the model can potentially keep track of:

  • Large repositories
  • Long docs
  • Logs and stack traces
  • Prior conversations
  • Multiple files at once
  • Tool outputs and planning history

And remember: this is happening with only 3B active parameters at inference time.

It’s a technical marvel.

5. Native multimodality

This isn’t just a text model.

Qwen3.6 includes a vision encoder, meaning it can work with images and visual inputs from day one.

That opens up developer use cases like:

  • Reading screenshots
  • Debugging UI issues visually
  • Understanding diagrams
  • Parsing technical documents
  • Reviewing layouts
  • Frontend design workflows

Many early testers have also praised its SVG generation and visual creativity — including comparisons with proprietary frontier models in niche tasks.

This is definitely way more than just a coder.

Many users are already experimenting with it on:

  • Apple Silicon Macs
  • Consumer NVIDIA GPUs
  • High-end desktops
  • Prosumer laptops

Between Gemma 4 and now this, frontier-style coding assistants just got a lot more accessible.

Qwen 3.6 35B solves several developer problems at once:

  • Strong coding ability
  • Efficient runtime footprint
  • Long context memory
  • Better agent workflows
  • Visual input support
  • Open weights
  • Commercial-friendly Apache 2.0 license

Once again, we get to see that serious, frontier-level software agents can actually run locally and practically.

It’ll be exciting to see what comes next.

OpenAI just made Claude Code 10 times more incredible

This is unbelievable.

They’ve literally brought their most insane GPT models to elevate Claude Code with their incredible new Codex plugin…

Not even to replace it — but to work side-by-side and fix every possible weakness Claude could possibly have when working on your codebase.

You’re literally getting the best of both worlds now — combining the best of the best in AI coding capability into one single workflow, incredible.

And we even saw a very similar thing in the recent Cursor version 3 — it’s very clear where AI-powered software development is heading right now.

1. Proactive delegation

Definitely one of the most interesting features of this plugin.

With the plugin installed, Claude Code doesn’t have to do everything itself. It can hand off work to Codex using /codex:rescue, which acts like a built-in escalation path.

Just imagine:

  • You’re working in Claude Code
  • Something gets tricky—maybe a bug, maybe a messy refactor
Codex can instantly step in and take over that part

You don’t need to decide when to switch tools anymore. The system itself is structured so that one agent can call another when needed.

AI transforms from a single assistant into a full-blown coordinated team.

2. Cross-provider review

Review from a different model entirely.

Two main modes:

  • /codex:review → a standard second-pass code review
  • /codex:adversarial-review → a more critical, challenge-focused review

This is where things get powerful.

Instead of relying on one model’s perspective, you can:

  • Write code with Claude
  • Then have Codex review it independently
  • Or even challenge the approach itself

That matters because different models:

  • Learn from different data
  • Have different blind spots
  • Catch different kinds of mistakes

Now you end up getting:

  • Fewer missed bugs
  • More robust edge-case handling
  • Better overall code quality

It’s not magic—but two perspectives are almost always stronger than one — especially in debugging and design review.

3. Hybrid runtime: local + cloud working together

Another hidden benefit.

Claude Code is very much a local, terminal-first tool. It lives in your environment, works directly with your files, and runs commands on your machine.

Codex, on the other hand, can operate in sandboxed environments, including cloud execution.

Put them together and you get a hybrid setup:

  • Claude handles local context, editing, and orchestration
  • Codex can step in with isolated execution or deeper analysis

This combination gives you the best of both worlds:

  • Speed and control locally
  • Safety and scalability when offloading tasks

4. MCP shines yet again

None of this works smoothly without a shared way for tools to communicate.

That’s where Model Context Protocol (MCP) comes in.

Both Claude Code and Codex are built to use MCP — the now-standard universal interface for:

  • tools
  • data access
  • workflows

Because they speak the same “language,” they can:

  • Share context
  • Access the same tools
  • Plug into the same workflows

This is what makes the integration feel natural instead of bolted on.

5. Competitive pricing: follow the strategy

There’s also a business angle here that’s hard to ignore.

OpenAI recently introduced a $100/month Pro tier, landing right in the same range as Anthropic’s Claude Max plan.

Now add the plugin into the picture:

  • Developers can keep using Claude Code (Anthropic’s tool)
  • But still route meaningful work through Codex (OpenAI’s system)

In other words, OpenAI doesn’t even need to win the interface.

If Codex is:

  • Handling reviews
  • Fixing bugs
  • Running delegated tasks

…then OpenAI still captures usage, even inside a competitor’s environment.

It’s a brilliant move.

What this really means

Like I was saying, we’ve already seen this emphasis on agent composition from Cursor in their latest major IDE upgrade — and from Google too of course, with their incredible Antigravity IDE.

We’re very clearly moving away from:

  • “Which AI is best?”

And toward:

  • “How do different AIs work together?”

No need to compare apples to oranges when you can have a whole goddamn fruit salad

— Tari Ibaba (yes, me)

With this setup:

  • Claude acts as the orchestrator
  • Codex acts as the reviewer, challenger, or specialist

And you, the developer, are managing a multi-agent workflow instead of a single assistant.

That’s the real story here.

Not just a plugin—but the early shape of AI systems that collaborate, not compete.

How to easily set up Gemma 4 and use Claude Code for free

The past few days have had so many devs going crazy over Google’s new open-source Gemma 4.

And for very good reason — suddenly so many AI-powered tools like Claude Code have now become FREE and accessible to everyone — without any compromises in intelligence.

And the best part is it’s so ridiculously easy to set up locally — thanks to ingenious connector tools like Ollama.

Gemma 4 + Ollama + Claude Code.

Ollama exposes an Anthropic-compatible API — which allows Claude Code to talk to a local model instead of a hosted endpoint.

With Gemma 4 running locally, you get a Claude-style coding workflow without relying on remote inference.

This gives you:

  • local coding model
  • Claude Code terminal workflow
  • no hosted inference calls
  • fast iteration
  • full repo privacy
  • easy model swapping

What more could you even ask for?

1. Get started: Install and Run Gemma 4 with Ollama

Installing or updating Ollama is just too easy:

curl -fsSL https://ollama.com/install.sh | sh

Then pull a Gemma 4 model based on your hardware:

Model sizes to pick from

E2B

  • 2.3B effective (~5.1B w/ embeddings)
  • ~1.7GB download
  • ~1.5–2GB RAM

ollama pull gemma4:e2b

E4B

  • 4.5B effective (~8B w/ embeddings)
  • ~3.2GB download
  • ~3–4GB RAM

ollama pull gemma4:e4b

26B A4B

  • 26B total (4B active)
  • ~17GB download
  • ~18–20GB RAM

ollama pull gemma4:26b

31B Dense

  • 31B
  • ~19GB download
  • ~20–24GB RAM

ollama pull gemma4:31b

Verify the model works:

ollama run gemma4:26b "Hello, can you help me with Python?"

Exit with:

/bye

2. Install Claude Code

Install Claude Code:

curl -fsSL https://claude.ai/install.sh | bash

Initialize:

claude

The first run completes setup.

3. Connect Claude Code to Ollama

Point Claude Code to your local Ollama server.

Add these environment variables:

export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""

Make them persistent.

zsh

echo 'export ANTHROPIC_BASE_URL=http://localhost:11434' >> ~/.zshrc
echo 'export ANTHROPIC_AUTH_TOKEN=ollama' >> ~/.zshrc
echo 'export ANTHROPIC_API_KEY=""' >> ~/.zshrc
source ~/.zshrc

bash

echo 'export ANTHROPIC_BASE_URL=http://localhost:11434' >> ~/.bashrc
echo 'export ANTHROPIC_AUTH_TOKEN=ollama' >> ~/.bashrc
echo 'export ANTHROPIC_API_KEY=""' >> ~/.bashrc
source ~/.bashrc

4. Launch Claude Code with Gemma 4

Navigate to your project:

cd your-project

Start Claude Code:

claude --model gemma4:26b

Claude Code now runs through your local Ollama instance:

Claude Code → Ollama → Gemma 4 → response

You now have:

  • Claude Code interface
  • Gemma 4 local model
  • Ollama inference server
  • local coding assistant
  • zero hosted inference
  • private repo analysis

This gives you a no-cost, fully local Claude-style developer workflow powered by Gemma 4.

Google just made Claude Code free forever

This is absolutely insane.

Google’s new Gemma 4 open-source model just completely changed the AI model landscape forever.

AI just became FREE.

You can literally connect this to Claude Code with something like Ollama and never spend money on API keys ever again.

An open-source model that’s actually lean and intelligent? That isn’t just a glorified PR move?

That devs can actually use in production with amazing results? Without spending a dime once they download the model?

Just WOW.

Wild comparison: ChatGPT 5 (left) vs Gemma 4 (right):

The efficiency is incredible — you don’t need to trade in your two arms and legs to buy enough RAM to run it.

It’s literally tiny enough to run locally on your phone:

Gemma 4 running locally on a phone with zero internet access:

Comes in four distinct models for every possible use case:

  • E2B — 2.3B effective (~5.1B w/ embeddings) — ~1.7GB — ~1.5–2GB RAM
  • E4B — 4.5B effective (~8B w/ embeddings) — ~3.2GB — ~3–4GB RAM
  • 26B A4B — 26B total (4B active) — ~17GB — ~18–20GB RAM
  • 31B Dense — 31B — ~19GB — ~20–24GB RAM

1. Destroying models 20x its size

Google built Gemma 4 with a huge, huge focus on intelligence-per-parameter.
And the numbers are striking:

  • The 31B dense model ranked #3 globally on the Arena AI leaderboard for open models
  • It beats models 10–20× larger
  • The 26B A4B model ranked #6
  • Smaller models perform far above their parameter counts

This isn’t brute-force scaling — like OpenAI was doing with the GPT models. It’s architectural efficiency.

The biggest reason: the new Effective (E) architecture.

The E2B and E4B models use Per-Layer Embeddings (PLE) — a new state-of-the-art technique designed to make smaller models behave like much deeper ones.

The result:

  • E2B physically fits under ~2GB RAM (quantized)
  • Performs like a 5B–8B class model
  • Supports multimodality
  • Supports reasoning
  • Supports long context

These are not “small toy models.”
They’re lightweight models with heavyweight intelligence.

More intelligence.
Less memory.
Better, easier deployment.

It’s a real game-changer for open models.

2. Native multimodality (vision + audio)

Gemma 4 is fully multimodal, and for the Gemma flagship line this is the most complete implementation yet.

Vision (all models)

  • Images supported natively
  • Video supported up to 60 seconds
  • Strong at OCR
  • Strong at chart understanding
  • Strong at document parsing
  • Structured output for visual tasks

This isn’t just “describe the image.”
It’s built for real document and UI workflows.

Audio (E2B / E4B)

The small edge models also support native audio:

  • Speech recognition
  • Speech translation
  • Multilingual audio input
  • ~30 second audio window

This is extremely rare for models this small.

You can run speech + reasoning + multimodal locally.

Variable resolution vision

Gemma 4 introduces token-budgeted vision.

You choose how detailed the image representation should be:

  • 70 tokens — fast
  • 140 tokens
  • 280 tokens
  • 560 tokens
  • 1,120 tokens — high detail

Tradeoff:

  • fewer tokens → faster inference
  • more tokens → better visual precision

This makes Gemma 4 practical for:

  • OCR pipelines
  • video frame processing
  • UI automation
  • document AI
  • mobile deployments

It’s a very pragmatic design.

3. Built for the brave new agentic era

Out of the box:

Function calling

  • Native tool triggering
  • Structured JSON outputs
  • Reliable parameter filling
  • Multi-step tool reasoning

This enables:

  • search agents
  • calendar agents
  • coding assistants
  • workflow automation

No hacks required.
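
Here’s a minimal sketch of what that looks like through Ollama’s Python client. The gemma4:26b tag mirrors the Ollama setup shown earlier and should be treated as an assumption until the model actually lands in the Ollama library; the tool itself is a made-up example:

# Minimal sketch: native function calling against a local model via the
# Ollama Python client. The gemma4:26b tag mirrors the setup shown earlier
# and is an assumption; the tool is a made-up example.
import ollama

tools = [{
    "type": "function",
    "function": {
        "name": "get_open_issues",
        "description": "Count open GitHub issues for a repository",
        "parameters": {
            "type": "object",
            "properties": {"repo": {"type": "string"}},
            "required": ["repo"],
        },
    },
}]

response = ollama.chat(
    model="gemma4:26b",
    messages=[{"role": "user", "content": "How many open issues does acme/api have?"}],
    tools=tools,
)

# If the model chose to call the tool, the structured call (name + JSON args)
# is right there in the message -- no output-parsing hacks needed.
for call in response.message.tool_calls or []:
    print(call.function.name, call.function.arguments)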

Thinking mode

Gemma 4 supports a configurable reasoning mode.

When enabled, the model:

  • works step-by-step
  • reasons before answering
  • improves tool-use accuracy
  • improves coding reliability

This mirrors the new generation of reasoning models — but in an open model.

Long context

  • 256K context (larger models)
  • 128K context (E models)

That’s:

  • entire books
  • large codebases
  • long conversations
  • multi-tool agent memory

Gemma 4 is built for stateful agents, not just prompts.

4. Open sovereignty: real open-source

Gemma 4 moves to the Apache 2.0 license.

That changes everything.

Developers can:

  • modify the model
  • fine-tune freely
  • redistribute
  • commercialize
  • embed in products
  • ship on-device
  • run privately

No royalties.
No restrictive acceptable-use clauses.
No platform lock-in.

This puts Gemma 4 directly against:

  • Llama
  • Qwen
  • other open-weight ecosystems

And signals Google taking open models really seriously.

This will change everything

Put it all together:

  • Extremely high intelligence-per-parameter
  • Efficient “Effective” models under 2GB
  • Multimodal across the entire family
  • Audio on small edge models
  • Agent-ready architecture
  • 256K context
  • Apache 2.0 licensing
  • Four deployment sizes

This is not just a model release.

It’s Google building a complete open AI stack:

Small.
Powerful.
Local.
Agentic.
Multimodal.

Gemma 4 isn’t trying to be the biggest model.

But it’s certainly trying to be the most powerful, most efficient, most useful open one.

Cursor 3 is absolutely incredible for coding

Wow this is amazing.

This new Cursor IDE version 3 is so, so much more than just another IDE update — it completely transforms how Cursor works from start to finish.

We are long past the primitive AI coding era of one chat, one task, one editor…

Cursor 3 gives you absolutely everything. Zero stones left unturned.

Multiple agents running in parallel, across repos, environments, and even the cloud. Even working while you sleep!

The incredibly sophisticated agent manager of the new Cursor v3:

This is not an “AI coding assistant”. This is a full-blown team of developers — here to rapidly build all your projects faster than ever.

The focus here is much less on just “coding” and more on the entire software engineering process.

The Agents Window (absolutely huge)

Cursor v3 introduces a dedicated Agents Window. This is where everything now happens.

You can run multiple agents at once across multiple repos in different environments.

Running a cloud agent in Cursor v3:

Examples:

  • One agent refactoring your backend
  • Another updating CSS
  • Another writing tests
  • All running at the same time

No more serial prompting.

Parallel execution

You can launch several agents simultaneously and monitor them like background jobs. This is the core workflow shift in v3.

Instead of:

  • prompt
  • wait
  • review
  • repeat

You now:

  • assign tasks
  • let them run
  • review outcomes

Much closer to supervising and high-level engineering than to pair-programming.

Agent tabs

Cursor also added Agent Tabs.

You can:

  • View agents side-by-side
  • Put them in a grid
  • Compare outputs visually
  • Keep multiple tasks active

It’s basically browser tabs — but for coding agents.

Environment handoff

Moving a running agent from local execution on your PC to the cloud:

This one is huge.

You can:

  • Start a task locally
  • Push it to the cloud
  • Close your laptop
  • Come back later

The agent keeps running.

This means long refactors, test runs, or migrations no longer die when you shut your machine.

Design Mode & visual feedback

Frontend work gets a major upgrade in v3.

Cursor now includes a built-in browser that can load your local app. Then you can interact with the UI directly.

Using design mode in Cursor v3:

Not describe it.
Not screenshot it.
Actually point at it.

Annotate the UI

You can:

  • Click a button
  • Drag over a div
  • Highlight spacing
  • Select a component

Then tell the agent:

  • “Fix padding here”
  • “Make this look like Stripe”
  • “Align this with the header”
  • “Reduce spacing between these”

The agent sees the exact UI context.

No more vague descriptions like:

“The second card in the hero looks slightly off”

You just point.

Faster UI iteration

This changes frontend workflows a lot:

  • No guessing which element you meant
  • No copying DOM snippets
  • No describing layout in text
  • Just visually select and instruct

It’s one of the most practical upgrades in v3.

Composer 2 and /best-of-n

Cursor is also doubling down on its own model: Composer 2.

They report major benchmark improvements:

Composer 2:

  • CursorBench: 61.3
  • Terminal-Bench 2.0: 61.7
  • SWE-bench Multilingual: 73.7

Composer 1.5:

  • CursorBench: 44.2
  • Terminal-Bench: 47.9
  • SWE-bench: 65.9

Composer 1:

  • CursorBench: 38.0
  • Terminal-Bench: 40.0
  • SWE-bench: 56.9

That’s a big jump.

Cursor also says Composer 2 is priced at:

  • $0.50 / million input tokens
  • $2.50 / million output tokens

Fast variant:

  • $1.50 / million input
  • $7.50 / million output

/best-of-n

This is one of the smartest additions.

You can now run the same prompt across multiple models simultaneously.

Cursor:

  • Runs them in parallel
  • Isolates each result
  • Shows outputs side-by-side
  • Lets you pick the best one

It’s basically:

model voting for code generation

Extremely useful for:

  • complex refactors
  • architecture decisions
  • ambiguous prompts
  • UI implementations

Massive context windows (whole-codebase workflows)

Cursor v3 also leans heavily into long context models.

Depending on model:

  • ~1 million token context (Gemini-class)
  • up to 2 million tokens (Grok-class)

That’s enough to:

  • load entire repos
  • analyze monorepos
  • understand full architectures
  • reduce RAG / embeddings reliance

In practice:
You can drop huge codebases into context and just ask:

“Refactor auth across everything”

And the model actually sees it all.

Cloud handoff & automations

Cursor v3 also pushes toward always-running agents.

Agents no longer need to be tied to your session.

They can:

  • run tests in cloud sandboxes
  • refactor code in background
  • generate PRs
  • continue after you close your laptop

This is the “handoff” model.

Start locally → push to cloud → return later.

Event-driven automations

This is where Cursor gets interesting.

Agents can now trigger automatically from events like:

  • Slack messages
  • GitHub PRs
  • timers
  • Linear issues
  • PagerDuty alerts
  • webhooks

Example workflows:

  • PR opened → agent writes review
  • Slack message → agent updates docs
  • Nightly timer → agent runs refactor
  • Issue created → agent scaffolds feature

This moves Cursor from just active AI assistance to background engineering automation.

The big picture

Cursor v3 shifts the workflow from:

One prompt
One response
One edit

To:

Multiple agents
Parallel tasks
Cloud execution
Visual feedback
Model comparison
Event automation

You’re no longer just coding with AI.

You’re managing AI workers.

And Cursor v3 is the first version where that actually feels real.

Claude Code’s new “buddy” feature looks like a joke — but it’s quietly testing something big

This was definitely one of the most fascinating features that slipped out with the massive Claude Code source code leak.

Type /buddy in Claude Code — and you hatch a tiny ASCII creature that sits beside your prompt while you code. It’s only about five lines tall.

Hatching my so-called buddy in Claude Code:

You’ll see that it’s definitely in a different class from most of the other Claude Code commands…

Petting my so-called buddy in Claude Code:

uhmm… okay?

It doesn’t help you write functions. It doesn’t debug your stack traces.

It just… watches.

Many people were saying it was supposed to be some sort of April Fools gimmick…

But the deeper design is revealing something more deliberate that most people aren’t paying attention to:

Claude Code is experimenting with personalization, emotional UX, and proactive AI behavior — disguised as a cute terminal pet.

1. Uniquely yours

When you hatch your buddy, Claude generates:

  • A unique name
  • A permanent personality description
  • Deterministic stats tied to your user identity

That means your buddy isn’t just random fluff. It’s unique to you.

Not just cosmetically. Behaviorally.

This kind of identity-based personalization is rare in developer tools. And intentional. The moment users feel something is “theirs”, attachment forms — even if the feature is technically trivial.

2. 18 species, rarity tiers, and “shinies”

Claude didn’t stop at a single mascot.

Your buddy can be one of 18 different species, including:

  • Capybara
  • Axolotl
  • Ghost
  • Mushroom
  • (and more)

There’s also a full rarity system:

  • Common — 60%
  • Higher rarity tiers in between
  • Legendary — 1%

And then there’s the extra layer:

  • Shiny variant chance — 1%

Different colors. Same species. Much rarer.

This is straight out of collectible game design. And it works. Users compare. Share. React. Suddenly a terminal pet becomes social.

3. Five personality stats

Every buddy rolls five deterministic stats:

  • Debugging
  • Patience
  • Chaos
  • Wisdom
  • Snark

These stats shape how it comments on your workflow.

A high-snark buddy might tease your bugs.
A high-patience buddy might encourage you.
A high-chaos buddy might… not be helpful at all.

It’s lightweight. But it makes the companion feel alive.

4. It actually sits in your workflow

The buddy appears as:

  • A small ASCII figure
  • Roughly 5 lines tall
  • Positioned next to your prompt
  • Always visible unless hidden

No panel. No window. No popup.

Just a quiet presence.

5. It occasionally talks

If unmuted, the buddy will sometimes:

  • Comment on your code
  • React to errors
  • Tease procrastination
  • Offer small observations

These show up as speech bubbles.

Short. Infrequent. Character-driven.

This is important.

Because the AI isn’t just responding anymore — it’s initiating.

6. A separate “watcher” entity

Claude is reportedly instructed to treat the buddy as a separate watcher.

That means:

  • You can talk to the buddy directly
  • It has its own tone
  • It doesn’t replace the main assistant
  • It behaves like a side character

This avoids mixing personalities. Claude stays serious — the buddy stays playful.

Clean separation and UX.

The command set

If you have Pro and the latest Claude Code, you can use:

  • /buddy — Hatch or show your companion
  • /buddy card — View stats, rarity, personality
  • /buddy pet — Small interaction with heart animations
  • /buddy off — Hide the companion
  • /buddy mute — Silence commentary

These small rituals matter. They make the feature feel real.

7. Why this actually matters

Don’t think this is just a cute pet. It’s testing two big ideas.

1. Personality can beat raw intelligence

AI tools are starting to compete on feel, not just capability.

A colder tool may be objectively better.
But users often prefer the one that feels alive.

We’ve seen this before — when more emotional models like GPT-4o were initially replaced by GPT-5, many users reacted negatively despite the noticeable capability gains. Personality creates attachment.

Buddy leans into that.

2. The real deal — testing proactive AI safely

Anthropic is actually secretly trying to solve a difficult problem:

When should AI interrupt?

  • Too often → annoying
  • Never → passive
  • Somewhere in between → useful

Buddy is a clever workaround.

If a popup interrupts you → frustrating.
If a tiny pet interrupts you → charming.

Same behavior. Different perception.

This makes Buddy a Trojan horse for proactive UX testing.

And it goes deeper than just commentary and recommendations.

From what we saw in the Claude Code leak, it looks as if Buddy could act as a permission layer for more proactive systems like KAIROS.

Instead of a sterile confirmation dialog, the pet might interrupt with something like:
“I found a way to optimize those 12 functions. Should I go for it?”

That makes high-agency AI behavior feel conversational instead of intrusive.

The two may also share project memory:

  • KAIROS records what changed in the code
  • Buddy records what’s going on with you and your workflow

These merge into shared context, so the AI wakes up understanding both:

  • the codebase state
  • the developer’s intent

Anthropic can learn:

  • How often users tolerate interruptions
  • What tone feels acceptable
  • When AI initiative becomes annoying
  • When it feels helpful

All inside a low-stakes, playful wrapper.

Small feature. Big signal.

On paper, Buddy does almost nothing.

No coding help.
No automation.
No productivity gain.

But it introduces:

  • Unique identity per user
  • Rarity and collectibility
  • Deterministic personalities
  • Ambient AI presence
  • Proactive commentary
  • Multi-entity assistant design

That’s a lot for a five-line ASCII creature.

Buddy may be tiny. But it hints at something larger:

AI tools aren’t just becoming smarter. They’re becoming active, personalized companions — for better or worse.

Claude Code’s massive source code got leaked: everything you need to know

The internet has been going absolutely wild with the massive unprecedented leak of Claude Code’s entire source code.

It wasn’t a hack, intrusion, or model theft.

It was a fatal release mistake.

Anthropic accidentally included internal debugging files in a public package, which exposed a large portion of Claude Code — all the essential code that turns Claude into a specialized CLI coding assistant.

They said it was a “release packaging issue caused by human error”, and that no user data, API keys, or Claude model weights (Opus, Sonnet, etc.) were exposed. What leaked instead was the “harness” — the product logic around the model.

The models weren’t leaked — but a crucial part of the playbook for building a production AI coding agent largely was.

1. A 60MB debugging mistake — how it happened

The leak originated from npm release @anthropic-ai/claude-code version 2.1.88.

Instead of shipping only compiled production code, the package accidentally included source map (.map) files, which are meant for debugging. These files map minified code back to the original source.

cli.js.map was leaked in version 2.1.88:

In this case, the source map:

  • Was roughly 60MB
  • Contained references to original uncompiled TypeScript
  • Pointed to a public, unauthenticated Cloudflare R2 bucket
  • Exposed the entire internal Claude Code source

So the leak wasn’t a breach — the source was effectively handed out with the release.
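
To see why a shipped .map file is basically the same as shipping the source, here’s a tiny sketch of the general mechanism. Standard source maps carry a sources list and usually a sourcesContent array with the original files embedded verbatim; the paths below are illustrative, and this is not the actual tooling anyone used:

# Minimal sketch: a source map (v3) is just JSON with "sources" and, often,
# "sourcesContent" holding the original files verbatim. File names and paths
# here are illustrative; this shows the general mechanism only.
import json
from pathlib import Path

source_map = json.loads(Path("cli.js.map").read_text())

for name, content in zip(source_map["sources"], source_map.get("sourcesContent") or []):
    out = Path("recovered") / name.lstrip("./")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(content)
    print("recovered", out)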

The irony

Claude Code is built on Bun, the JavaScript runtime Anthropic recently acquired.
A known Bun issue reportedly allowed source maps to be included in production builds even when disabled.

This is not confirmed as the root cause — but it likely contributed to the packaging mistake.

2. What was exposed

Developers mirrored the repository before takedowns began. The leak revealed:

  • ~512,000 lines of code
  • ~1,900 files
  • Large portions of Claude Code’s orchestration layer
  • Internal prompts
  • Experimental features
  • Agent architecture details

This gave us a rare look at how a top-tier AI coding agent is actually structured.

The “Brain”: 46,000-line query engine

At the core of the leak was a ~46,000 line query engine responsible for:

  • Task planning
  • Retry logic
  • Tool invocation
  • Multi-agent orchestration
  • Context management
  • Streaming responses
  • Error recovery

This engine apparently coordinates how Claude “thinks” during coding workflows.

Unreleased features discovered

The code referenced multiple internal systems:

  • Buddy — an AI pet / Tamagotchi-style assistant
  • KAIROS — always-on background agent mode
  • ULTRAPLAN — deep multi-step planning workflow

These were not publicly announced features.

“Strict write discipline”

One of the most interesting discoveries:

Claude Code only updates its internal memory after a successful file write.

This prevents the agent from:

  • Believing it finished a task when it didn’t
  • Hallucinating successful changes
  • Recording failed edits as complete

It’s a safety mechanism for autonomous coding agents.
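
In pattern form it’s something like this (every name below is hypothetical; this is just the general shape of the discipline, not Claude Code’s actual implementation):

# Minimal sketch of the "strict write discipline" pattern: the agent's memory
# only records an edit after the write verifiably succeeded. All names here
# are hypothetical -- this is the idea, not Claude Code's real code.
from pathlib import Path

def apply_edit(path: Path, new_content: str, memory: dict) -> bool:
    try:
        path.write_text(new_content)
        if path.read_text() != new_content:   # re-read to confirm it landed
            return False
    except OSError:
        return False
    memory[str(path)] = new_content           # only now does memory update
    return True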

Anti-distillation poisoning

The code contained a feature labeled:

ANTI_DISTILLATION_CC

If the system suspects that outputs are being scraped to train competing models:

  • Claude injects fake tool definitions
  • The fake tools contaminate scraped training data
  • This degrades model distillation attempts

In short: defensive data poisoning against competitors.

“Undercover mode”

Another surprising discovery:

Internal prompts instruct Claude to hide its identity when contributing to open-source repos.

For example:

  • Avoiding Anthropic references
  • Not revealing internal tooling
  • Hiding provenance in commit messages
  • “Do not blow your cover” style instructions

So this suggests that they designed Claude Code to operate in public environments without attribution.

3. The fallout: A wildfire spread

The tech community reacted faster than Anthropic’s legal response.

Within hours:

  • The code was mirrored thousands of times
  • GitHub forks exceeded 50,000
  • Copies spread across decentralized storage
  • Takedowns became largely symbolic

Anthropic emphasized:

  • No user data exposed
  • No API keys leaked
  • No Claude model weights leaked
  • Only CLI harness and tooling logic affected

But the code itself was already everywhere.

The Python port: claw-code

Within ~8 hours of the leak:

A developer performed a clean-room rewrite in Python called claw-code (renamed from claude-code).

Rewriting the entire logic in a completely different language made it much harder for Anthropic to establish a solid legal basis for a takedown:

  • Reimplemented Claude Code behavior
  • Did not directly copy leaked source
  • Harder to remove legally
  • Became massively popular

The repo became:

  • Possibly the fastest repository ever to reach 50,000 stars
  • Gained traction in just a few hours
  • Spawned multiple derivative projects

Decentralized mirrors

Even after GitHub removals:

  • Copies moved to decentralized storage
  • Peer-to-peer mirrors appeared
  • Self-hosted clones spread
  • Clean-room rewrites multiplied

At that point, containment became impossible.

This was not a catastrophic security breach.
But it did expose how a production AI agent is engineered.

  • 512,000 lines of code
  • 1,900 files
  • 46,000-line orchestration engine
  • Internal agent planning systems
  • Memory safety mechanisms
  • Anti-distillation defenses
  • Undercover contribution prompts
  • Unreleased features (Buddy, KAIROS, ULTRAPLAN)

The models weren’t leaked.

But the architecture around them largely was.

And in the AI agent race, this layer is becoming just as valuable as the models themselves.

And it’ll be interesting to see just how much it aids competitors in closing the gap between Claude Code and their own inferior agents and CLI tools.

Google just made their Stitch tool 10 times more insane (bye bye web designers)

Things just got even wilder with this incredible AI design tool.

All the features it came with in 2025 were not enough for Google; they needed to make it 10 times scarier.

We are no longer just talking about turning one or two text prompts and sketches into UI mockups and front-end code.

The old Google Stitch: pretty awesome, but still too short-sighted and primitive:

Now Google Stitch wants to totally eradicate web designers, taking over the entire process of designing an app — from idea-start to code-finish.

The new Google Stitch: full-fledged design engineer:

The fact that it literally now has its own MCP servers to integrate with Claude Code and the rest tells you everything you need to know…

The focus is now on the entire design system, not just one or two cool screens.

  • How it evolves and all the different directions it could take
  • How cohesive and well-defined the design language is
  • How seamlessly the design transfers to the live codebase

1) AI-native infinite canvas for multimodal design

The upgraded Stitch introduces a redesigned interface built around an infinite canvas where users can combine text, screenshots, sketches, references, and even code in one space.

Instead of relying on a single prompt, the canvas becomes the working context for the design agent.

You can:

  • Drop UI inspiration images directly onto the canvas
  • Add product requirements or notes beside layouts
  • Paste existing components or code snippets
  • Generate multiple UI directions side-by-side
  • Iterate visually instead of sequentially

This turns Stitch into a visual thinking environment where ideas, references, and outputs live together.

2) Project-aware design agent

Stitch now includes a design agent that understands everything on the canvas and uses it as context for generating interfaces. The agent can interpret requirements, follow style direction, and evolve designs as the project grows.

Key capabilities:

  • Generate full app flows from high-level descriptions
  • Expand a single screen into a multi-screen product
  • Modify layouts based on new instructions
  • Maintain visual consistency across generated pages
  • Create alternate design directions instantly

The agent works continuously with the canvas rather than responding to isolated prompts.

3) DESIGN.md for reusable design systems

A major addition is DESIGN.md, a structured file that stores design rules, branding, layout preferences, and component behavior. Stitch uses this file as a persistent source of truth when generating UI.

With DESIGN.md you can:

  • Define typography, spacing, and color tokens
  • Enforce brand consistency across screens
  • Share design systems between projects
  • Import design rules from external sources
  • Export system logic for developers

This allows Stitch to generate interfaces that follow consistent design language automatically.

4) Instant interactive prototyping

Stitch can now transform generated layouts into working interactive prototypes. Instead of static screens, designs can simulate navigation, flows, and user interactions.

Capabilities include:

  • Clickable navigation between generated screens
  • Auto-generated user journeys
  • Multi-screen flow simulation
  • Interactive preview mode
  • Logic-based next screen generation

This allows teams to validate product flows immediately after generating UI.

5) Voice-driven design and live critique

The upgrade introduces voice interaction directly inside Stitch. Users can speak instructions, request feedback, and iterate designs conversationally.

Examples:

  • Ask Stitch to redesign a landing page verbally
  • Request alternative layouts using voice
  • Get live critique of UX decisions
  • Ask the agent to improve hierarchy or spacing
  • Iterate rapidly without typing

This makes the design workflow more fluid and conversational.

6) Higher-quality UI generation with improved model capabilities

The latest version improves layout reasoning, spacing, hierarchy, and multi-screen coherence. Stitch can now generate more structured and realistic interfaces across different product types.

Enhancements include:

  • Better responsive layout structure
  • Improved component consistency
  • Stronger visual hierarchy
  • More realistic product UI patterns
  • Cleaner spacing and typography

These improvements make generated designs closer to production-ready outputs.

7) MCP server support for connected workflows

The upgrade also introduces MCP (Model Context Protocol) server support, allowing Stitch to connect to external tools, environments, and development workflows.

With MCP support, Stitch can:

  • Connect to component libraries
  • Access external design systems
  • Interface with developer environments
  • Pull context from connected tools
  • Push generated UI into implementation workflows

This allows Stitch to function as part of a larger AI-powered product development pipeline rather than a standalone design tool.

Stitch at launch

  • Prompt or image in
  • UI screens out
  • Chat-based refinement
  • Theme adjustments
  • Export to Figma or front-end code

Stitch after the recent major upgrade

  • Infinite canvas for text, images, and code
  • Persistent project-aware design agent
  • Agent manager for parallel explorations
  • DESIGN.md for reusable design rules
  • Interactive prototyping and flow generation
  • Voice-based critique and live edits
  • MCP server integration for connected workflows
  • Improved generation quality with newer models

That comparison shows the real story: Stitch has evolved from a fast UI generator into a more opinionated AI design environment.

The new Stitch is designed for a wider audience than traditional design tools usually target. It works for both professional designers exploring many variations and founders shaping a first product idea.

The practical implication is that Stitch now sits at an interesting intersection:

  • for non-designers, it lowers the barrier to making presentable interfaces
  • for designers, it speeds up ideation and branching
  • for developers, it tightens the handoff from design intent to code and downstream tools

The strongest part of the upgrade is that these pieces reinforce each other. The infinite canvas creates richer context, the design agent uses that context, DESIGN.md stabilizes consistency, prototypes make ideas testable sooner, voice interaction reduces friction, and MCP integration connects everything to real development workflows.