featured

Google’s new AI image generator just changed everything

Wow this is huge.

Google just released a massive upgrade to their image generation model — and this thing is on a whole different level.

Nano Banana 2 pushes AI image generation way beyond novelty and closer to something we can actually use in production and as a daily driver in everyday life.

Created with Nano Banana 2 — Infographic comparing cloud types

It’s not just about spitting out unbelievable or ultra-realistic images this time.

It’s about cost-effective speed, consistency, accuracy, and flexibility: the traits that make an image generation model usable in real-world software development, and the traits creative teams actually need.

1. Pro-level quality at Flash speed

Nano Banana 2 gives you high-fidelity images in seconds (typically 10–15s) while improving overall visual quality.

Created with Nano Banana 2 — a misty panoramic aerial shot of a verdant valley

What’s improved:

  • More vibrant, dynamic lighting
  • Richer textures and sharper detail
  • Cleaner handling of complex scenes
  • Faster iteration without major quality loss

Why it matters:
You no longer have to choose between speed and polish. The model is built for rapid concepting, quick revisions, and high-quality drafts that are often close to final output.

2. 🌐 Google Search grounding

Localizing an image in Nano Banana 2

One of the biggest upgrades is Google Search grounding.

Nano Banana 2 can:

  • Pull real-time visual references from Google Search
  • Verify landmarks, people, and products
  • Use up-to-date visual information before generating

Why this is significant:

  • Reduces guesswork in recognizable subjects
  • Improves factual accuracy
  • Makes the model more viable for commercial and educational use

Instead of approximating a famous building or product from memory, the model can check current references — a major step toward reliable AI visuals.

3. 🎭 Subject consistency

Created with Nano Banana 2 — an image with several characters

Consistency has long been a weak point in image generation. Nano Banana 2 addresses that directly.

It can maintain:

  • Up to 5 characters
  • Up to 14 objects
  • Across multiple images in a sequence

Created with Nano Banana 2 — an image with several characters

What this enables:

  • Storyboarding
  • Comic strip creation
  • Branded character campaigns
  • Multi-frame marketing concepts

Characters keep their appearance. Objects stay recognizable. Visual identity becomes more stable across iterations.

4. 📝 Precision text rendering

Created with Nano Banana 2 — an infographic depicting the water cycle

Text inside AI images was notoriously unreliable just a few years ago.

The first Nano Banana made serious improvements here, and v2 takes it even further.

It can handle:

  • Complex labels and signage
  • Clean typographic layouts
  • Infographics and diagrams
  • Structured text blocks

It also supports:

  • In-image translation
  • Instant localization of text within graphics

Practical benefit:
You can generate posters, packaging mockups, charts, menus, and educational graphics without rebuilding all text manually in a separate design tool.

5. 📐 Flexible specs

Nano Banana 2 supports a wide range of resolutions and aspect ratios.

Resolution range:

  • 512px
  • 1K
  • 2K
  • 4K

Native aspect ratios:

  • 16:9 (widescreen)
  • 9:16 (vertical/social)
  • 21:9 (cinematic)
  • 8:1 (panoramic)

Why this matters:
Modern content lives everywhere — social feeds, websites, presentations, digital signage. This flexibility means assets can be generated in the correct format from the start.

Bottom line

Nano Banana 2 isn’t just about stunning or realistic images. It combines:

  • ⚡ Fast generation
  • 🎨 Higher visual fidelity
  • 🌍 Real-time search grounding
  • 🔁 Stronger multi-image consistency
  • 📝 Accurate in-image text
  • 📏 Flexible output specs

The result is a model designed not just to wow and amaze — but to integrate into real creative workflows.

If these capabilities hold up at scale, Nano Banana 2 could become one of Google’s most practically useful AI image tools to date.

5 genius tricks to make Claude go 10x crazy (amateur vs pro devs)

Claude Code gets unbelievably powerful when you stop treating it like just a “coding assistant”.

And start treating it like a full-fledged operating system for your engineering workflow:

Standards, reusable playbooks, parallel execution, deep codebase interrogation, and tool chains that run end-to-end.

1) Implement team-wide coding standards (and make them stick)

Most teams have standards, but they’re scattered across docs, half-remembered conventions, and PR comments.

Claude Code gives you a single place to encode “how we build software here”: a root CLAUDE.md file Claude reads at the start of every session.

What belongs in it:

  • Non-negotiables (error handling, logging, security rules)
  • Architecture map (module boundaries, “this package owns X”)
  • Golden paths (preferred patterns for DB work, retries, input validation)
  • PR checklist (tests required, docs updates, performance/security checks)
  • Commands (how to run lint/typecheck/tests/migrations so Claude can verify its own work)

Pro move: keep it short and strict. If CLAUDE.md turns into a wiki, it becomes background noise. Treat it like a contract.
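As a concrete sketch, a lean CLAUDE.md might look like this. The specific rules, paths, and commands below are illustrative assumptions for a hypothetical TypeScript project, not a template from Anthropic:

```markdown
# CLAUDE.md

## Non-negotiables
- Never swallow errors; log with context and re-throw.
- All user input is validated at the API boundary.

## Architecture
- `api/` owns HTTP handlers; `core/` owns business logic; `db/` owns queries.
- Packages never import "upward" (core must not import api).

## Golden paths
- DB access goes through the repository layer; no raw SQL in handlers.
- Retries use the shared `withRetry` helper, max 3 attempts.

## Verify your work
- Lint: `npm run lint`  Typecheck: `npm run typecheck`  Tests: `npm test`
- A task is not done until all three pass.
```

Notice how every section is short enough to act as a checklist rather than documentation; that is what keeps it enforceable.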

2) Extend capabilities with Skills

A Skill is a reusable playbook that turns “how we do X” into something you can invoke consistently. Not more prompting — repeatable procedures.

The point is to make Claude behave like your team’s best engineer on their best day, every day.

How to build one (fast, practical):

  • Define when to use it (and when not to)
  • Specify required inputs (paths, module names, constraints)
  • Write the method as steps (search → analyze → implement → verify)
  • Define the output contract (diff + tests + summary, or checklist + findings)
  • Add quality gates (lint/typecheck/tests must pass before “done”)

Skills worth building first:

  • /review-pr: runs your checklist the same way every time
  • /add-tests: generates tests in your preferred style with coverage expectations
  • /refactor-module: your “safe refactor” procedure, including guardrails

If you do nothing else, build a review Skill. Consistency is as important as raw model intelligence.
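To make that concrete, here is a sketch of a review Skill as a SKILL.md file with short frontmatter, the format Claude Code Skills use. The checklist items and commands are illustrative, not a canonical example:

```markdown
---
name: review-pr
description: Run the team PR review checklist against the current diff
---

# When to use
Invoke on any branch with local or pushed changes before opening a PR.
Do not use for generated files or vendored code.

# Method
1. Collect the diff (e.g. `git diff main...HEAD`).
2. Check each change against the PR checklist in CLAUDE.md.
3. Run lint, typecheck, and tests; record any failures.

# Output contract
- A findings list grouped by severity (blocker / warning / nit)
- A pass/fail verdict: "done" only if all quality gates pass
```

The output contract is the part most people skip, and it is what makes results comparable from run to run.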

3) Get things done 10× faster with Claude Code Agent Teams

Most people run one Claude session and ask it to do everything sequentially.

Pros run Agent Teams: multiple Claude sessions in parallel, each working in its own context, with a lead session coordinating tasks and synthesizing results.

Where it shines:

  • Refactors across many packages (split by directory ownership)
  • Cross-cutting changes (API + UI + tests + docs)
  • Big bug hunts (repro agent, tracing agent, fix+tests agent)

The prompt pattern:

  1. define the outcome
  2. define the split strategy
  3. define a no-collisions rule

Example:
“Create an agent team for this web application. Split work by packages (api/, web/, shared/). Each teammate proposes a minimal diff plus tests. Lead delivers a single integrated patch and summary.”

You’re basically turning Claude into a mini org chart: parallel workers + one integrator.

4) Deep codebase interrogation

Most developers search codebases manually: grep for names, chase string literals, click through files until they “feel close.” That’s slow, and it misses the subtle stuff: duplicated checks, hidden bypasses, and patterns that drifted over time.

Pros use Claude Code like a superintelligent code archaeologist: not “find the file,” but “reconstruct the system.”

What amateurs do:
“Find where we handle user authentication.”

What pros command:
“Analyze our entire codebase and identify all authentication-related logic: direct implementations, helper functions, middleware, hooks, and hardcoded auth checks scattered throughout components. Map relationships between these implementations, identify inconsistencies, and flag potential security vulnerabilities or duplication.”

Why this works:

  • It finds semantic equivalents, not just keywords
  • It builds a map (entry points → flows → dependencies)
  • It surfaces drift (multiple token parsers, mismatched role logic)
  • It finds risk (client-only enforcement, missing server checks)

Ask for a structured output:

  • Auth Map (flows + entry points)
  • Inconsistencies (what differs and why it’s risky)
  • Smells/Vulns (missing checks, unsafe fallbacks, duplication)
  • Unification plan (what to centralize, what to delete, how to migrate)

That’s the difference: amateurs “search.” Pros run investigations.

5) Build custom MCP server chains (autonomous pipelines, not “one tool”)

Most people set up one MCP server and call it a day. Pros chain multiple MCP servers into an orchestration network that can run multi-step operations: analysis → changes → tests → deploy → verification → promotion.

Amateurs add just one single server, like “database.”

Pros orchestrate a set like:

  • codeAnalysis (find issues, map affected surfaces)
  • testRunner (targeted tests + suite gating)
  • securityScanner (dependency + pattern scanning)
  • deploymentPipeline (staging deploy, promotion, rollback)

The real unlock is one-shot execution with pre-approved permissions — not reckless “no prompts,” but deliberate guardrails:

  • least-privilege scopes
  • explicit allowlists
  • hard stop-conditions
  • mandatory gates (tests/scans must pass)
  • audit trail (commits, summaries, artifacts)

What amateurs ask:
“Scan for vulnerabilities.”

What pros command (single cascade prompt):
“Analyze our codebase for security vulnerabilities, apply safe fixes, run automated tests, update vulnerable dependencies, commit changes with documentation, deploy to staging, scan the deployed version, and if everything passes, deploy to production with rollback strategies ready.”

Wrap that into a Skill and you stop “asking Claude to help” — you start running pipelines.
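As a sketch, a project-level `.mcp.json` wiring up a chain like this might look as follows. The `mcpServers` structure is how Claude Code declares project MCP servers, but the package names and paths here are placeholders, not real published servers:

```json
{
  "mcpServers": {
    "codeAnalysis": {
      "command": "npx",
      "args": ["-y", "@acme/code-analysis-mcp"]
    },
    "testRunner": {
      "command": "node",
      "args": ["./tools/test-runner-mcp.js"]
    },
    "securityScanner": {
      "command": "npx",
      "args": ["-y", "@acme/security-scanner-mcp"]
    },
    "deploymentPipeline": {
      "command": "npx",
      "args": ["-y", "@acme/deploy-mcp"]
    }
  }
}
```

The chaining itself happens in the prompt (or Skill) that tells Claude which server to consult at each stage, with the permission allowlist deciding what it may run unattended.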

This new Claude Code upgrade just changed everything

Wow I’ve never seen Windsurf or Copilot do something this incredible.

But Claude Code is now going way beyond just code generation for us. This is on a whole different level. This is total, end-to-end software engineering. It’s all coming together.

Not just writing code based on your intent — but doing everything to intelligently make sure every single line of code, whether written by you or by the model itself, actually matches that intent.

Just look at what it did here with Claude Code Desktop — we told it to launch the app and make sure everything is right — the checkout flow, the mobile responsiveness, the dark mode…

Not only did Claude Code autonomously run all the flows — it caught critical runtime errors along the way and fixed them all.

The best most other coding tools can do is to fix the syntax errors they make while generating code — but what Claude Code is doing here is light years more sophisticated and advanced.

And these kinds of runtime errors can be so tricky, because many of them only occur in very specific flows and usage patterns. The app runs successfully and you think everything is fine, not realizing serious flaws are on their way to production.

And this is just 1 of all the latest upgrades Claude Code just received within the past few days.

We just got Opus and Sonnet 4.6 for higher quality code and superior intelligence — now we are getting even more amazing new features to level up the entire software development process with that intelligence.

1. Built-in local code review

You can now run a “Review code” action on your local changes before pushing anything.

Claude analyzes your diff and leaves comments directly in the desktop diff view. It flags risky changes, missing edge cases, inconsistent patterns, or potential regressions.

Think of it as a pre-PR quality pass.

It’s not replacing human review, but it’s extremely useful for catching the “obvious in hindsight” mistakes before they ever reach your team.

2. Visual debugging — with autonomous self-correction

Claude can now spin up your local development server and see your running app directly in the desktop interface.

It doesn’t just read logs — it uses its vision capabilities to look at what’s actually rendered.

That means it can:

  • Identify layout issues
  • Notice broken spacing or alignment
  • Catch visual regressions
  • Flag components that don’t behave correctly in dark mode

You can literally say something like, “Make sure the dark mode works well,” and Claude can visually inspect the UI, identify contrast issues, spacing inconsistencies, or styling mistakes — and then fix them.

That’s a big step up from traditional AI coding workflows, where you had to describe what the UI looked like and what was wrong with it. Now Claude can see the output itself and self-correct.

It feels much closer to working with a human who can glance at your screen and say, “Yeah, that modal padding is off.”

3. Catching runtime errors — not just syntax mistakes

Syntax errors are the easy part.

What about:

  • Runtime errors that only appear after a button click?
  • State bugs that show up after a specific user flow?
  • Crashes triggered by edge-case inputs?
  • Logic errors that technically run but produce wrong results?

This is where Claude Code Desktop’s preview loop becomes powerful.

Because it can run your app, monitor logs, and interact with it, Claude can catch runtime errors — not just compilation issues. Even more importantly, it can test usage flows that surface bugs you wouldn’t catch from static analysis alone.

Instead of just fixing what won’t compile, Claude can:

  • Trigger flows
  • Observe failures
  • Trace stack errors
  • Patch logic
  • Re-run and verify the fix

That’s a much more comprehensive testing-and-repair loop than simply cleaning up red squiggly lines in an editor.

4. PR monitoring and optional auto-merge

Once your changes are pushed to GitHub, Claude can monitor the PR lifecycle inside the desktop app.

You can:

  • Track CI status
  • Let Claude attempt fixes if CI fails
  • Enable optional auto-merge once checks pass

This is where Claude starts handling workflow glue. Instead of babysitting a PR and refreshing checks, you can move on to something else while Claude watches it.

If CI breaks, it can try to fix the issue. If everything passes and you’ve enabled it, it can merge automatically.

That’s not just coding assistance — that’s delivery assistance.

5. Sessions that move with you

Claude Code sessions can now flow between CLI, desktop, and web. Start in one environment, continue in another, without losing context.

It sounds small, but not having to re-explain your project every time you switch surfaces removes friction fast.

We’re moving beyond “AI that helps you type code” toward “AI that helps you validate and ship working software.”

The real question isn’t whether Claude can generate a component anymore.

It’s whether you’re ready to let it run your app, test your flows, fix your runtime bugs, and quietly merge your PR while you work on the next thing.

Gemini 3.1 Pro is an absolute game changer

I guess it was too soon to call this 4.0 — but don’t let the 3.1 fool you.

This was way more than just a minor upgrade.

This was one of the biggest capability jumps we’ve seen in a while — especially if you care about reasoning, research, and actually shipping well-built, high-quality work.

Everyone has been talking about 1 particular unbelievable improvement with this new update.

Imagine going from scoring 31.1% on a reasoning test… to 77.1% and being the absolute best on that same test just a few months later. That is exactly what Gemini 3.1 just shocked the world with.

That’s a relative improvement of well over 100%.

And this is abstract reasoning we’re talking about, not memorization or “glorified autocomplete”. It had to solve problems with completely new logic patterns: problems it had never seen before, or anything like them.

This is huge.

And this makes the 1 million context window it has even more lethal for coding and every other use case we can think of.

It’s vastly superior to its predecessor in every way. The graphics and SVG generation are so good — which is also a huge win for web developers.

1. Web browsing got dramatically better: 59.2% → 85.9%

This one is just as important.

On BrowseComp — a benchmark that measures how well a model can use web tools and navigate information — Gemini 3.1 Pro jumped from 59.2% to 85.9% — overtaking all Claude models, including the recently released Sonnet 4.6.

That’s huge.

The difference between those two numbers isn’t cosmetic. It’s the difference between:

  • Surface-level summaries vs. actual synthesis
  • Grabbing the first answer vs. cross-checking sources
  • Losing context across tabs vs. maintaining a clear research thread

If you use AI for research, competitive analysis, trend tracking, sourcing stats, or building content from multiple references, this upgrade matters a lot.

Better browsing doesn’t just mean “it can search.” It means it’s better at deciding what to search for, what to ignore, and how to combine findings into something coherent.

That’s a big shift.

2. This reasoning upgrade is not a joke

And neither was the test that measured it.

On ARC-AGI-2 — a standard benchmark designed to test abstract reasoning (not pattern regurgitation, but actual problem-solving) — Gemini jumped from 31.1% to 77.1%.

That’s not incremental improvement. That’s a different class of performance.

What does that mean in real life?

It means:

  • Fewer moments where the model “almost” understands your problem but misses a key constraint.
  • Better step-by-step thinking when tasks require multiple logical hops.
  • Stronger performance on planning, debugging, and structured workflows.
  • More reliable outputs when you’re building agents or automation.

If you’ve ever felt like an AI model lost the thread halfway through a complex task — this is the kind of upgrade that directly addresses that frustration.

3. Expanded output limits (aka: it can finally finish the job)

One of the most powerful upgrades — this model can now generate more output tokens than ever in a single go.

Gemini 3.1 Pro supports:

  • Up to ~1 million tokens of input context
  • Up to 65,536 tokens of output

In practical terms?

You can feed it massive documents, long threads, multi-file codebases, research dumps — and it doesn’t immediately choke.

And when it generates output, it doesn’t stop halfway through a spec or give you a half-written guide that needs three “continue” prompts.

For developers, creators, educators, founders, and product teams, this means you can:

  • Generate full-length documentation
  • Draft detailed product requirement docs
  • Create structured courses or long-form content
  • Produce complex code scaffolds in one go

The difference between “smart” and “usable” is often just output capacity. This pushes it firmly into usable territory.

4. Native SVG and creative coding

This part is honestly fun — and useful.

Gemini 3.1 Pro can generate native SVG animations directly from text prompts.

Not screenshots. Not image files. Actual, editable, website-ready SVG code.

Why does that matter?

Because SVG is:

  • Scalable (perfect at any resolution)
  • Lightweight
  • Editable
  • Animatable
  • Easy to embed into websites and apps

That means you can prompt:

“Create an animated SVG of a pulsing network graph with gradient nodes.”

And get code you can drop straight into a project.
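For a sense of what that output looks like, here is a minimal hand-written animated SVG in the same spirit (a single pulsing gradient node; this is an illustration, not actual model output):

```xml
<svg xmlns="http://www.w3.org/2000/svg" width="200" height="200" viewBox="0 0 200 200">
  <defs>
    <linearGradient id="grad">
      <stop offset="0%" stop-color="#4f8ef7"/>
      <stop offset="100%" stop-color="#a14ff7"/>
    </linearGradient>
  </defs>
  <!-- one "network node" that pulses by animating its radius -->
  <circle cx="100" cy="100" r="20" fill="url(#grad)">
    <animate attributeName="r" values="20;30;20" dur="2s" repeatCount="indefinite"/>
  </circle>
</svg>
```

Because it is plain markup, you can edit the colors, timing, and geometry directly — no re-prompting needed for small tweaks.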

For designers, indie hackers, frontend devs, educators, or anyone building interactive content, this opens up a new workflow:

Prompt → tweak → ship.

It’s creative coding without the blank-page paralysis.

And it hints at something bigger: AI models that don’t just generate text or images — they generate real artifacts you can deploy.

Gemini 3.1 Pro is not just “a bit smarter”.

It’s:

  • Dramatically better at abstract reasoning
  • Dramatically better at tool-based research
  • Capable of handling much larger context and outputs
  • More useful for real creative and technical production

If you build things, research things, or create things, this version is meaningfully different from what came before.

And if this trajectory continues, we’re moving from “AI that assists” toward “AI that actually executes complex workflows with you.”

Claude Sonnet 4.6 is absolutely insane

Wow I’ve never seen Sonnet do something like this before. This is huge.

You absolutely cannot ignore this.

I don’t even need to compare it to GPT or Gemini or whatever.

Claude Sonnet is actually no longer trying to be a nice little tradeoff between intelligence and cost.

This new Claude Sonnet is here to be a MASSIVE CHALLENGER to its big brother Claude Opus.

And from the numbers I’m seeing, it has made dangerous progress toward achieving that with this new 4.6 update.

It decimated the previous version of Claude Opus (4.5) in basically every metric, came incredibly close to the current Opus version, and even beat this latest Opus in notable areas.

Literally 2nd position in the biggest AI benchmarks out there — and guess the one model that stopped it from gaining top spot?

It’s gotten so much better at automating actions on your computer now (Computer Use).

1 MILLION token context — trust me this is not a model you want to mess around with.

With Sonnet 4.6, Claude will handle all your real-world, production AI workloads — especially coding and tool use — without the higher cost of Opus.

1. Essential coding upgrade — that we will all feel

Sonnet 4.6 scored 79.6% on SWE-bench Verified, extremely close to Opus 4.6’s ~80.8%, showing near-flagship coding performance at lower cost.

And not just benchmarks. Sonnet 4.6 is here to work with us in real workflows:

  • Understanding large repos
  • Editing across multiple files
  • Avoiding unnecessary rewrites
  • Following existing structure instead of “overengineering”

In Anthropic’s own testing, developers preferred Sonnet 4.6 over Sonnet 4.5 about 70% of the time in Claude Code, citing better context reading and less duplication/overengineering.

2. Unbelievable Computer Use gains

Anthropic has been massively pushing Computer Use lately: AI models controlling our software like we would, carrying out complex actions for us by clicking, typing, and navigating interfaces along the way.

With 4.6, that capability improved significantly.

Sonnet 4.6 achieved 72.5% on OSWorld-Verified, dramatically up from Sonnet 4.5’s ~61.4% and nearly matching Opus 4.6’s ~72.7%, which demonstrates near-parity in practical interface interaction tasks.

Sonnet 4.6 now performs nearly on par with Opus in Computer Use.

That’s a big deal because computer-use tasks are messy. They require:

  • Reading dynamic UI elements
  • Recovering from small mistakes
  • Planning multi-step actions

It’s not perfect, but it’s much closer to “practical assistant” than previous versions.

3. 1M is serious business

The new 1 million token context window means you can easily:

  • Load an entire workspace spanning multiple codebases
  • Drop in several long contracts
  • Analyze huge research dumps
  • Work across extended conversation history

More importantly, Anthropic emphasizes that 4.6 isn’t just ingesting that volume — it’s designed to reason across it.

For anyone doing knowledge-heavy work, that’s where things get interesting.

4. Built for agentic and terminal workflows — notable upgrades

Sonnet 4.6 posted 59.1% on Terminal-Bench 2.0, a notable improvement over Sonnet 4.5’s ~51.0% and closer to Opus 4.6’s ~62.7%, underscoring progress in complex, multi-step coding tasks.

Sonnet 4.6 feels very optimized for agents — the kind that:

  1. Plan
  2. Call tools
  3. Execute steps
  4. Reflect
  5. Iterate

Sonnet 4.6 scored 91.7% (retail) and 97.9% (telecom) on τ²-bench agentic tool use, a clear improvement over Sonnet 4.5’s 86.2% retail and essentially on par with Opus 4.6’s 91.9% retail and 99.3% telecom results.

Benchmarks around tool use (like τ²-bench) suggest strong reliability when interacting with structured tools and APIs.

If you’re building workflows that involve repeated tool calls and feedback loops, cost-to-performance matters. And this is where Sonnet 4.6 seems carefully positioned.

5. Safety and prompt injection resistance

When models start browsing or using tools, prompt injection becomes a serious concern.

Sonnet 4.6 significantly improves resistance to malicious or hidden instructions compared to 4.5, performing similarly to Opus 4.6 in safety evaluations.

In other words: it’s better at ignoring sketchy instructions embedded in web pages or documents.

That matters a lot for autonomous or semi-autonomous systems.

6. Pricing stays the same

This is one the biggest deals in this release.

Sonnet 4.6 is far better than Sonnet 4.5, yet the pricing remains the same:

  • $3 per million input tokens
  • $15 per million output tokens
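To see what those rates mean per request, here is a quick back-of-the-envelope calculation (the token counts are made-up example numbers, not a benchmark):

```python
# Sonnet 4.6 list pricing, per the figures above
INPUT_PER_M = 3.00    # USD per 1M input tokens
OUTPUT_PER_M = 15.00  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single request at these rates."""
    return (input_tokens / 1_000_000) * INPUT_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PER_M

# Example: a large-context coding request
cost = request_cost(input_tokens=200_000, output_tokens=8_000)
print(f"${cost:.2f}")  # 200K in + 8K out -> $0.72
```

At these prices, even requests that use a large slice of the context window stay well under a dollar, which is what makes Sonnet viable as a daily driver.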

Sending the clear message:

Opus-level reliability in many workflows — without Opus-level cost.

When should you use it?

Choose Sonnet 4.6 if you want:

  • A daily-driver model for coding
  • A strong agent backbone
  • Large context handling
  • Reliable tool usage
  • Production deployment without premium-tier costs

Choose Opus 4.6 if:

  • The reasoning task is extremely complex
  • Precision is mission-critical
  • You’re doing heavy multi-agent orchestration

For most teams, Sonnet 4.6 is likely to become the default.

Anthropic seems to be collapsing the gap between “mid-tier” and “frontier.”

Instead of forcing users to upgrade to Opus for serious work, they’re making Sonnet strong enough to handle most of it.

If 4.5 felt like a capable assistant, then 4.6 feels more like a dependable coworker — especially for developers.

And that might be the real story here.

This new open-source model just became a major challenger to Claude Opus

Yet another unbelievable open-source coding model just got unleashed into the world.

Imagine a model that’s just as intelligent as Claude Opus — but 13 times cheaper!

But no need for you to imagine anymore — because this is exactly what the new MiniMax M2.5 is:

  • Unbelievably cheap — yet still incredibly smart
  • Blazing fast
  • Open-source with open weights

No wonder I’ve been seeing so many developers going crazy about it.

I’m seeing experiments showing that you can run complex agentic tasks continuously for as low as $1 for every hour of output.

This is going to be massive for all those long, multi-step workflows where models need to plan, browse, write code, revise outputs, and loop until something works.

Aggressive pricing

This is the single biggest reason for all the buzz right now:

  • Performance in real coding and agent workflows comparable to Claude Opus–class models
  • Roughly 13× cheaper in practical usage scenarios
  • Cost low enough that you stop optimizing prompts purely to save money

This is going to make a real difference.

Most agent systems fail economically before they fail technically:

  • In development: repeated tool calls and retries multiply costs quickly.
  • In production: costs add up fast as a growing number of users repeatedly try out multiple AI-powered features

M2.5’s pricing is designed to remove that constraint.

Standard pricing:

  • $0.30 per 1M input tokens
  • $1.20 per 1M output tokens

At rates like these, running long reasoning loops or persistent background agents becomes financially realistic for much larger workloads.

The benchmark that made engineers pay attention

  • 80.2% on SWE-Bench Verified

This is it.

SWE-Bench Verified is widely considered one of the more meaningful coding benchmarks because it measures whether a model can actually resolve real GitHub issues under constrained evaluation conditions.

A high score here signals something specific:

  • Strong code understanding
  • Ability to follow multi-step debugging processes
  • Reliability in structured environments

In other words, it suggests the model can do more than generate code — it can fix existing systems.

Ridiculously fast

Raw intelligence is only part of agent performance. Speed determines whether iteration is usable.

M2.5’s high-speed variant runs at approximately:

  • 100 tokens per second

That level of throughput changes how agents behave in practice:

  • Faster plan → execute → verify cycles
  • Less waiting between iterations
  • Higher tolerance for multi-pass refinement
  • Better human experience when supervising agents

Many agent workflows involve dozens of internal steps. When each step is fast, experimentation becomes normal instead of frustrating.
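A quick sanity check on the “$1 per hour” figure, using only the throughput and pricing numbers quoted above:

```python
# Figures quoted above for MiniMax M2.5's high-speed variant
OUTPUT_TOK_PER_SEC = 100   # generation throughput
PRICE_OUT_PER_M = 1.20     # USD per 1M output tokens

# One hour of continuous generation
output_tokens = OUTPUT_TOK_PER_SEC * 3600              # 360,000 tokens
output_cost = output_tokens / 1_000_000 * PRICE_OUT_PER_M

print(f"{output_tokens} output tokens -> ${output_cost:.2f}/hour")
```

That works out to roughly $0.43 per hour for output alone, which is consistent with the “as low as $1 per hour” figure once input tokens from repeated tool calls and context re-reads are added.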

Open weights: control instead of dependency

Another major part of the story is that M2.5 is not just an API product.

MiniMax released:

  • Open weights
  • A permissive modified-MIT style license
  • Support for local deployment stacks

This matters for companies building internal tooling because it allows:

  • Local inference for sensitive data
  • Predictable costs at scale
  • Custom infrastructure integration
  • Reduced vendor lock-in

The combination of strong performance and deployability makes M2.5 particularly attractive for engineering teams building long-lived internal agents.

GLM-5 is absolutely incredible for coding (7 new features)

Woah this is huge.

China’s Z.ai just released their brand new GLM-5 model and it’s absolutely incredible. I hope Windsurf adds support for this ASAP…

This is not just a “coding” model. This is full-blown software engineering.

They designed it from the ground up to build highly complex systems and intricate dev workflows.

Record-low hallucinations — from 90% in the previous version… down to 34% in GLM-5, thanks to a groundbreaking approach to training the model.

For example, if I ask the model a question it doesn’t know the answer to, it’s more likely to just tell me it doesn’t know, instead of inventing garbage on the fly like I sometimes see from GPT and the rest.

And it’s open-source with open weights (!)

Let’s check out all the amazing features in this release.

1. Agent-first behavior (designed to stay on task)

GLM-5 is positioned around what we developers call agent workflows — situations where the model has to plan, execute, check results, and continue working toward a goal instead of responding once and stopping.

The main improvement here isn’t personality or creativity. It’s consistency. The model is tuned to maintain context and direction over longer sequences of actions, which is essential if you want AI to handle real workflows instead of isolated prompts.

2. A true coding-focused model

Software engineering is one of the first and foremost priorities of this new model.

GLM-5 is optimized for working across larger codebases and longer development tasks rather than generating small snippets.

In practice this means keeping track of project structure, following constraints across files, and iterating toward working solutions. Improvements in coding usually signal broader gains in reasoning and planning — since programming requires precision and structured thinking.

3. Very large context window (so it can hold more of the problem at once)

GLM-5 supports an extremely long context length of 200,000 tokens — allowing large amounts of text, documentation, or code to stay visible to the model at once.

This matters more than it sounds. Instead of feeding information piece by piece, developers can provide entire specifications or large repositories in one session. That reduces fragmentation and makes long-running tasks far more stable.

4. Production-ready tool use

Another major focus is making the model usable inside real applications. GLM-5 includes features aimed at integration rather than conversation alone, such as:

  • function calling for external tools or APIs
  • structured outputs for predictable formatting
  • streaming responses
  • context caching for efficiency
  • different reasoning modes for complex tasks

These features make it easier to embed the model into systems where it needs to coordinate with software rather than simply generate text.
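The function-calling piece of that list usually works the same way across providers: the model emits a JSON description of the call, and your code dispatches it. A toy local sketch of that dispatch step (the tool name and JSON shape here are illustrative assumptions, not GLM-5's actual wire format):

```python
import json

# Tools the model is allowed to call; a real app would describe these to
# the model with JSON-schema-style specs in the request.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def dispatch(tool_call_json: str):
    """Execute a model-emitted tool call such as
    {"name": "get_weather", "arguments": {"city": "Berlin"}}."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return {"error": f"unknown tool {call['name']}"}
    return fn(**call["arguments"])
```

The result then goes back to the model as a tool message, which is what lets it coordinate with software rather than just generate text.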

5. The “slime” framework (the training story behind the behavior)

One of the more interesting additions sits behind the scenes. The slime framework is an open reinforcement-learning post-training system designed to make large-scale training more efficient.

Its purpose is to improve how models learn from feedback during long or complex interactions.

Instead of only learning from static examples, the model can be refined through iterative training setups that resemble real workflows. That kind of training infrastructure is closely tied to improvements in stability and long-task performance.

In simple terms, slime helps train models to behave better over time, not just answer individual questions well.

6. Efficient long-context architecture

GLM-5 also uses newer attention techniques designed to keep long-context performance manageable in terms of compute cost. Long context is useful only if it remains practical to run, so part of the engineering effort goes into maintaining efficiency while scaling capability.

This reflects a broader trend in AI development: smarter architecture choices instead of only increasing size.

7. Hardware and ecosystem implications

Another reason GLM-5 has drawn attention is that it was developed with deployment in mind on domestically produced AI chips. That makes it notable beyond technical capability, since it signals growing independence from the traditional hardware stack that has dominated AI training and inference.

GLM-5 isn’t mainly about sounding smarter in conversation.

Its significance comes from where it points the industry next: models designed to manage complexity over time. Long context, structured tool use, reinforcement learning infrastructure like slime, and strong coding ability all serve the same goal — making AI systems that can carry work from start to finish rather than stopping at the first response.

The new GPT-5.3 Codex is amazing for coding

I’ve been seeing a lot of demos of the new GPT-5.3 Codex and they’ve been looking really exciting.

I see Codex definitely heading in a good direction if they keep this up.

Big improvements and brand new features too.

Check out this awesome 3D racing game it made all on its own:

All it took was a one-shot prompt and a few follow-up prompts to create something this comprehensive.

Using the develop web game skill and preselected, generic follow-up prompts like “fix the bug” or “improve the game”, GPT‑5.3-Codex iterated on the games autonomously over millions of tokens. 

— OpenAI

And here’s a new feature: I can now talk to Codex as it’s responding to me — without having to wait for it to completely finish.

Like for those times when I want to add an extra detail to the prompt without having to start over and lose all the progress.

5 big improvements in GPT-5.3-Codex for software developers

1. Long-running workflows (new in Codex)

One of the most important additions is the ability to guide the model while it’s working. Instead of waiting for a finished result, developers can interact with GPT-5.3-Codex in real time — asking questions, changing direction, or refining goals mid-task.

The model provides ongoing updates about what it’s doing and why, allowing users to:

  • Adjust approaches as work progresses
  • Clarify intent without restarting tasks
  • Stay involved in decisions while execution continues

This makes the experience feel more like collaborating with a teammate than issuing commands to a tool.

2. Stronger real-world coding performance

We’ve moved past just caring about whether the code from these AI models is syntactically correct.

GPT-5.3-Codex does better on realistic engineering tasks — the kind that involve messy repositories, incomplete documentation, and multiple moving parts.

SWE-Lancer IC Diamond (advanced engineering tasks)

  • GPT-5.3-Codex: 81.4%
  • GPT-5.2-Codex: 76.0%
  • GPT-5.2: 74.6%

So we’re talking:

  • Better understanding of existing projects
  • More reliable debugging
  • Fewer breaks when working across large codebases

3. Faster execution

With this update, we are also now running GPT‑5.3-Codex 25% faster for Codex users, thanks to improvements in our infrastructure and inference stack, resulting in faster interactions and faster results.

— OpenAI

Speed matters when AI is performing multi-step work.

GPT-5.3-Codex runs noticeably faster in agent-style workflows, which helps reduce the back-and-forth cycle between writing code, testing, and fixing errors.

For us developers, that translates into:

  • Shorter iteration cycles
  • Faster experimentation
  • Less waiting during long tasks

4. Coding plus reasoning

GPT-5.3-Codex isn’t limited to writing code. It combines programming ability with stronger reasoning, allowing it to help with:

  • Documentation
  • Architecture discussions
  • Code explanations
  • System-level decisions

The result is a broader role in the engineering workflow, not just the coding phase.

5. Better interaction with tools and environments

Modern development doesn’t happen in isolation and this new model reflects that.

It’s improved at working with terminals, commands, and development tools, meaning it can interpret outputs and adjust its approach instead of generating code blindly.
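That interpret-and-adjust loop is easy to picture as code: run a command, inspect the result, and feed failures back for another attempt. A minimal sketch, with an illustrative retry policy of my own (nothing here is Codex's actual implementation):

```python
import subprocess

def run_and_check(cmd: list[str], max_attempts: int = 3):
    """Run a command, capture its output, and retry on failure.

    A real agent would feed the captured stderr back into the model
    between attempts so it can revise the code or the command.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return {"ok": True, "attempt": attempt, "stdout": result.stdout}
    return {"ok": False, "attempt": max_attempts, "stderr": result.stderr}
```

The key design point is that the exit code and stderr become inputs to the next generation step, rather than the model generating code blindly.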

AI builds AI now

One of the more talked-about aspects of the release is that earlier versions of GPT-5.3-Codex were reportedly used internally during development. The system helped diagnose issues, debug workflows, and improve processes along the way.

This just shows us how AI tools are increasingly speeding up the creation of newer AI systems — a feedback loop that could accelerate progress across the industry.

Why this matters

The release comes during an intense race to build more capable AI coding systems. But the competition is no longer just about who generates the best code. The real question is now:

  • Which system helps teams ship software faster?
  • Which reduces complexity instead of adding to it?
  • Which fits naturally into real development workflows?

GPT-5.3-Codex is clearly here to answer these types of questions.

If earlier tools made developers faster, systems like GPT-5.3-Codex aim to change how development itself is organized — moving toward a model where AI handles execution while humans focus on direction and decision-making.

Whether that becomes the standard approach remains to be seen. But one thing is clear: AI coding tools are no longer just assistants. They’re becoming active participants in building software.

My productivity just went up 10x with Claude Code Swarms

Wow this is going to be such an impactful feature for us developers.

It does exactly what I thought it would do — especially once I saw this:

This is your very own team of intelligent coding agents — officially called Agent Teams.

❌ Before:

Only one assistant doing everything — still far better than manual coding, but I can only make one change at a time.

How about when I have a frustratingly stubborn bug that my assistant is still trying to resolve? And I need to start working on a new feature?

I either have to stop the bug-fixing or I have to waaaait — no matter how long I’m stuck.

✅ But now — with the new Claude Code Agent Swarms or Teams feature:

I can spin up multiple AI teammates working in parallel — all coordinated by a lead agent that organizes the work and combines the results.

Literally an army of developers to take on any batch of tasks — we’ve really come a long way ever since Copilot stepped onto the scene in 2022.

So the lead agent breaks down the task and creates multiple teammates — each running as its own independent Claude Code session.

Every teammate has its own context, its own workflow, and its own focus.

They can:

  • Work on different parts of a task at the same time
  • Share progress through a common task list
  • Message each other directly when coordination is needed

The lead agent’s job is to keep everything moving and present a final result back to you.

This is different from earlier “subagent” approaches where everything still flowed through a single session. Here, the agents are genuinely separate workers.
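A toy version of that shared-task-list coordination: each "teammate" is its own worker pulling from a common queue, and the lead merges the results. This is a deliberately simplified thread-based sketch (real teammates are separate Claude Code sessions, not threads):

```python
import queue
import threading

def run_team(tasks, num_teammates: int = 3):
    """Distribute independent tasks across workers that share a task
    queue, then have the 'lead' combine everything into one report."""
    todo = queue.Queue()
    for t in tasks:
        todo.put(t)
    results, lock = [], threading.Lock()

    def teammate():
        while True:
            try:
                task = todo.get_nowait()
            except queue.Empty:
                return  # no work left for this teammate
            outcome = f"done: {task}"  # stand-in for a real agent session
            with lock:
                results.append(outcome)

    workers = [threading.Thread(target=teammate) for _ in range(num_teammates)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sorted(results)  # the lead presents a single merged result
```

The pattern captures why this only pays off when tasks are independent: the queue hands out work with no ordering guarantees, so tightly coupled tasks would step on each other.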

It makes sense when you look at how people actually work in large codebases

Most real engineering work isn’t one long linear task. It’s investigation, experimentation, implementation, testing, and review happening at the same time.

Agent Teams is designed for that kind of work. Instead of forcing one agent to juggle multiple threads, the system spreads the work across several agents that can move independently.

It also reflects a broader trend across developer tools. AI is moving away from chat-style assistance toward systems that can manage longer workflows with less constant supervision.

Speed is the biggest deal here

For example:

  • One teammate reviews performance issues while another checks security risks.
  • Different agents explore competing debugging theories at the same time.
  • Frontend, backend, and testing tasks can be split up instead of handled sequentially.

You can still step in at any point. If one teammate goes in the wrong direction, you can jump into that session and redirect it without interrupting the others.

The lead agent handles coordination, which means you spend less time managing context and more time making decisions.

Features that make it practical

A few design choices make the system usable beyond demos.

You can view teammates either inside one terminal session or in split panes so you can watch them work side by side. That makes it easier to understand what’s happening instead of waiting for a final summary.

There’s also a “delegate mode,” where the lead agent focuses purely on coordination and doesn’t start coding itself. That sounds small, but it helps prevent chaos when multiple agents are involved.

Another useful option is plan approval. A teammate can be required to propose a plan first, stay in read-only mode, and only proceed once the lead approves it. This adds a safety layer for larger or riskier changes.
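That plan-approval flow is essentially a small state machine: a teammate stays read-only until the lead signs off, and proposing a new plan resets the approval. A minimal sketch under my own class and method names (this is not Claude Code's API):

```python
class Teammate:
    """A worker that must get its plan approved before it may edit files."""

    def __init__(self, name: str):
        self.name = name
        self.plan = None
        self.approved = False

    def propose(self, plan: str) -> str:
        self.plan = plan
        self.approved = False  # any new plan resets approval
        return plan

    def edit(self, path: str) -> str:
        if not self.approved:
            raise PermissionError(f"{self.name} is read-only until approved")
        return f"{self.name} edited {path}"


class Lead:
    """The coordinator that gates risky changes behind plan review."""

    def approve(self, teammate: Teammate) -> None:
        if teammate.plan is None:
            raise ValueError("nothing to approve")
        teammate.approved = True
```

The safety layer is simply that edits are impossible in the proposal state, which is what makes it useful for larger or riskier changes.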

Where it works best (and where it doesn’t)

Agent Teams works well when tasks can be separated cleanly. Code reviews, research-heavy work, and debugging are good examples because multiple agents can investigate different angles without getting in each other’s way.

It’s less effective when work is tightly coupled. If several agents need to edit the same files or follow a strict sequence, the overhead of coordination can outweigh the benefits.

In those cases, a single agent is usually faster.

The trade-offs

The biggest downside is cost and complexity. Each teammate is a full Claude Code session, so token usage increases quickly as teams grow.

There are also some rough edges. Because the feature is experimental, things like session recovery and task synchronization aren’t perfect yet. It works best when you’re actively supervising rather than letting it run unattended for long periods.

My final thoughts

Agent Teams points toward a different future for AI development tools. The goal isn’t just to make a single assistant smarter. It’s to make software work itself more parallel and organized, with AI handling coordination as much as implementation.

Right now the most immediate value is in helping us understand and navigate large codebases faster. Multiple agents can explore different areas at once and surface conclusions much sooner than a single assistant could.

I think this could become the default way people build software.

Claude Opus 4.6 is amazing: massive 1 million context window, 2x longer outputs, and much more

And don’t let the “4.6” fool you — this was far from just another incremental upgrade.

Packed with brand-new features that keep transforming Claude into a more reliable partner for the most complex tasks.

From 200K context window in Claude Opus 4.5 → 1 million in Opus 4.6.

Much longer output lengths…

A breakthrough collaborative mode between several agents…

New Adaptive Thinking…

It’s not just about focusing on smarter answers or better benchmarks this time — Opus 4.6 is designed around long, multi-step tasks — that usually require planning, context, and consistency across a lot of information.

Large coding projects, research-heavy reports, and business workflows where details matter and mistakes compound quickly.

Opus 4.6 took the lead among all frontier models on Humanity’s Last Exam, attained the highest score on Terminal-Bench 2.0 for agentic coding, and outperformed OpenAI’s GPT-5.2 by 144 Elo points on GDPval-AA (real-world knowledge work tasks).

Bigger is better…

The new massive 1 million token context window lets Claude Opus 4.6 read and keep track of enormous amounts of information at once.

Entire codebases, long document collections, or messy project notes can stay in scope without the model forgetting earlier details.

This makes it much less likely to lose track of critical context mid-conversation or mid-reasoning in ways that compromise its accuracy.

Longer outputs for real work

Anthropic also doubled the maximum output length to 128K tokens.

Claude Opus 4.6 now has more room to produce full drafts, long analyses, or large blocks of code without constantly stopping midway.

And this removes a major annoyance for developers and teams — breaking tasks into smaller chunks just to fit output limits. It’s another signal that Anthropic wants Opus to handle full workflows rather than isolated prompts.

Smarter effort, less micromanaging

❌ Reasoning budget

✅ Reasoning effort

Claude has started following the Gemini approach now, with Opus 4.6.

Earlier versions let developers set strict reasoning budgets. Opus 4.6 instead takes a more adaptive approach: you guide how much effort the model should spend and let it decide how to allocate that internally.

Making interactions feel less like tuning a machine and more like assigning a task to a capable assistant: explain the goal, set expectations, and let it figure out how much thinking is required.

New Agent Teams

Definitely one of the most notable additions.

With this we can let multiple different agents work simultaneously on different parts of a task while coordinating toward a shared goal.

Instead of a single model handling everything sequentially, work can be divided across specialized agents — for example, planning, implementation, and review — mirroring how human engineering teams operate and improving efficiency on complex projects.

Same positioning — higher expectations

Interestingly, Anthropic hasn’t dramatically changed pricing or positioning. Opus remains the premium, high-capability model in the Claude lineup, aimed at users who need reliability more than speed or cost efficiency.

That also raises expectations. As models get better at handling longer tasks, users expect fewer errors, fewer retries, and outputs that are closer to finished work on the first attempt.

The bigger story behind Opus 4.6 is where AI development is heading. The race is no longer just about intelligence in short bursts. It’s about endurance — how well a model can stay coherent, organized, and useful over extended work sessions.

The challenge now isn’t whether models can generate content, but whether they can reliably carry projects across the finish line.

And that’s a much harder problem — but also a much more valuable one to solve.