AI Weekly · 6 min read

AI Weekly #2: Agent Deployments & Inference Research

ChatGPT powers Expedia's trip planning, Box agents automate workflows. Google's hybrid cascades, Qwen's faster models, and more


This week’s AI updates for you (8-14 Sep 2025):

  • The newest enterprise AI deployments. Use them as inspiration for your own business, to spot market needs you could serve, or simply to stay ahead of the AI curve.
  • New AI research and development results. These require some technical background, but they help you keep up with frontier movements in AI. If a term is unfamiliar, a quick search will turn up an explanation; I only use popular terms.

Enterprise AI

Expedia launches ChatGPT-powered travel planning

Expedia built a conversational trip-planning feature inside its mobile app using OpenAI’s ChatGPT. Users can ask questions about destinations, hotels and activities and get personalised recommendations. This simplifies the shopping journey: hotels mentioned in the chat are automatically saved to a trip list, so travellers can compare prices, check availability and book in a few clicks. For Expedia, it increases customer engagement and cross-sell opportunities. Expedia.

Box introduces AI agents across content management

Box announced three new agents:

  • Extract: to pull structured data from documents,
  • Automate: to build complex workflows through prompts, and
  • Apps: to embed AI-assisted dashboards into popular platforms.

All are built on Box’s data platform. These agents reduce manual data entry, turn unstructured files into usable data, and let business users build their own analytics tools and automated processes. Box.

Microsoft adds role-based Copilot agents for sales, service and finance

Microsoft’s September 2025 update brings role-specific Copilot solutions to sales, customer-service and finance teams, integrating CRM and ERP data directly into Outlook, Teams and Excel.

  • Salespeople can retrieve deal pipelines, craft closing plans and update records via chat.
  • Service agents can summarise cases and draft responses.
  • Finance teams gain tools for reconciliation, variance analysis and data preparation.

These assistants reduce context-switching, improve decision-making and boost productivity across business teams. Microsoft.

Sam’s Club equips frontline managers with generative-AI tools

Walmart’s Sam’s Club rolled out enterprise-grade ChatGPT tools to club managers, letting them query sales data, flag local products and identify seasonal patterns in minutes rather than hours. This frees managers from repetitive tasks, prepares them for AI-driven careers, and improves member experiences through faster decisions and personalised service. Walmart.

Claude can create and edit real Office files in chat

Anthropic added a file-creation feature to Claude that produces real Excel sheets, PowerPoint decks, Word docs, and PDFs directly from a prompt in chat or the desktop app. It runs code in a private sandbox to build tables, charts, and models from your data, then lets you download the files. It is rolling out as a preview to Max, Team, and Enterprise plans, with Pro to follow. It cuts tool-hopping and speeds up reporting and planning for business users. Anthropic.

Coinbase expands the agent payment ecosystem with x402 Bazaar

Coinbase launched x402 Bazaar, a marketplace where AI agents can automatically find and pay for the services they need (e.g. data feeds, translation, or image generation). Instead of developers manually setting up each API connection and payment method, AI agents can now browse available services, use them instantly, and pay per request with stablecoins. This makes building AI apps much faster, since developers don’t need to integrate each service manually, while service providers get a new way to sell their APIs with instant payment. Coinbase.

Google introduces AI advertising suite at Think Week 2025

Google’s Think Week introduced a bundle of AI advertising tools:

  • AI Max for Search: a one-click option that automates bid optimisation and provides text guidelines
  • Asset Studio: uses Imagen 4 to generate high-quality images and videos with style references and shareable previews
  • New Demand Gen and Merchant Center features: optimised omnichannel campaigns, local offers and partnerships.

These tools help retailers run holiday campaigns with less effort by automating creative production, campaign budgeting and loyalty-program targeting. For Google, it helps improve ad performance and customer retention. Google.

Adobe makes Agent Orchestrator and AI agents generally available

Adobe’s Experience Platform offers Agent Orchestrator, a reasoning engine that interprets user intent and activates the right AI agent for tasks. Six domain agents are included: Audience, Journey, Experimentation, Data Insights, Site Optimization and Product Support. These tools help marketers deliver personalised campaigns at scale with human controls. Adobe.

Research AI

Speculative cascades by Google Research

Google researchers proposed speculative cascades, which combine two LLM inference techniques: cascading and speculative decoding.

  • Cascading: chains together models of different sizes; a small model handles easy queries and defers only hard cases to a larger model.
  • Speculative decoding: uses a small model to generate a batch of tentative tokens; a larger model accepts or rejects them, which reduces the number of expensive forward passes.

Speculative cascades merge these ideas: a rule (via a confidence threshold) decides when to trust the drafter and when to defer to the large model.

  • If the small model is confident, its draft is accepted directly (the cascade case).
  • If the small model is not confident, the large model verifies the draft (the speculative-decoding case).

The method improves cascade latency (the large model is already running speculative decoding), though it may use more tokens than a plain cascade. It is model-agnostic and can be applied wherever a smaller model’s output distribution approximates a larger model’s, making it suitable for fast responses without much quality loss. Google Research.
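As a toy sketch of the deferral rule, one generation step could look like the following. The model stubs, confidence scores, and threshold are invented stand-ins, not Google’s implementation:

```python
import random

random.seed(0)

# Hypothetical stand-ins for the two models: the drafter returns a token
# plus a confidence score; the verifier accepts or rejects drafts.
def small_model_draft(prefix):
    token = prefix[-1] + 1 if prefix else 0  # pretend "next token"
    confidence = random.random()
    return token, confidence

def large_model_accepts(prefix, token):
    return random.random() < 0.8  # verifier accepts most drafts

def large_model_token(prefix):
    return prefix[-1] + 1 if prefix else 0

CONFIDENCE_THRESHOLD = 0.5  # the rule's deferral threshold (illustrative)

def speculative_cascade_step(prefix):
    """One generation step combining cascading and speculative decoding."""
    token, confidence = small_model_draft(prefix)
    if confidence >= CONFIDENCE_THRESHOLD:
        # Cascade case: trust the drafter outright, skip the large model.
        return token, "cascade"
    # Speculative-decoding case: ask the large model to verify the draft.
    if large_model_accepts(prefix, token):
        return token, "accepted"
    # Rejected draft: fall back to the large model's own token.
    return large_model_token(prefix), "resampled"

prefix = [0]
for _ in range(5):
    token, path = speculative_cascade_step(prefix)
    prefix.append(token)
```

The key design point is that a confident draft never touches the large model at all, which is where the latency win over plain speculative decoding comes from.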

Disaggregated inference by PyTorch

LLMs have two distinct operations during inference:

  • Prefill: the first forward pass over the whole input prompt to produce the first token. Mostly compute-bound. It runs once per request.
  • Decode: the step-by-step generation of the remaining tokens. Mostly memory-bound and KV-cache-bound. This takes most of the time.

Each needs different resources: prefill wants strong compute; decode wants fast memory access. Normally both run on the same server. The PyTorch team proposes separating them, each on a server specialised for what it needs.

This allows prefill and decode to scale independently and run concurrently, improving latency and throughput because decode servers no longer sit idle waiting for prefill and can be provisioned separately. PyTorch.
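A back-of-the-envelope model shows why the pipeline helps. All the numbers below are illustrative, not from the PyTorch post:

```python
# Toy latency model of colocated vs disaggregated serving.
PREFILL_MS = 200      # one compute-bound pass over the full prompt
DECODE_STEP_MS = 10   # one memory-bound step per generated token
NUM_TOKENS = 50
NUM_REQUESTS = 4

def colocated_total_ms():
    # One server handles prefill then decode for each request in turn,
    # so the next prefill waits for the previous decode to finish.
    return NUM_REQUESTS * (PREFILL_MS + NUM_TOKENS * DECODE_STEP_MS)

def disaggregated_total_ms():
    # Prefill and decode servers form a two-stage pipeline: while request i
    # decodes, request i+1 is already prefilling on the other server pool.
    decode_ms = NUM_TOKENS * DECODE_STEP_MS
    # Pipeline completion time: one pass through both stages, plus the
    # bottleneck stage repeated for the remaining requests.
    return PREFILL_MS + decode_ms + (NUM_REQUESTS - 1) * max(PREFILL_MS, decode_ms)
```

With these numbers the colocated setup takes 2800 ms for four requests versus 2200 ms disaggregated, and the gap widens as the two stages are provisioned independently.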

Qwen3-Next-80B model by Qwen

Qwen introduced Qwen3-Next-80B, an 80B-parameter model with only 3B active parameters. The architecture uses a hybrid attention mechanism:

  • 3 of every 4 blocks use Gated DeltaNet, a linear-attention variant,
  • 1 of every 4 blocks uses full gated self-attention.

This reduces computational cost while maintaining precision. Qwen3-Next also uses an ultra-sparse MoE with 512 experts, of which only 11 are active per token. The model also uses multi-token prediction, which generates multiple tokens in parallel and so speeds up inference. Together, these help the model run 10x faster than comparable dense models. Qwen.
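The layer pattern can be sketched as follows. The 12-layer count is illustrative (the real model stacks many such 4-layer groups), and the function name is mine:

```python
# Sketch of the 3:1 hybrid attention pattern described above.
def hybrid_layout(num_layers):
    """Attention type per layer: 3 Gated DeltaNet (linear-attention) blocks
    followed by 1 full gated self-attention block, repeating."""
    return ["full_attention" if (i + 1) % 4 == 0 else "gated_deltanet"
            for i in range(num_layers)]

layout = hybrid_layout(12)
# Only every 4th layer pays the quadratic full-attention cost.
full_share = layout.count("full_attention") / len(layout)  # 0.25

# Ultra-sparse MoE: only a small fraction of experts fire per token.
TOTAL_EXPERTS, ACTIVE_EXPERTS = 512, 11
moe_active_share = ACTIVE_EXPERTS / TOTAL_EXPERTS  # ~2%
```

Both ratios point the same way: most of the per-token compute (attention and FFN alike) is skipped, which is what makes the 80B model behave like a much smaller one at inference time.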

Fine-tuning GPT-OSS with quantization-aware training - NVIDIA

OpenAI released gpt-oss-120B last month, an open-source 120B-parameter MoE model. It is trained at a low 4-bit precision for efficient inference, but that makes fine-tuning difficult due to instability. The NVIDIA team proposes a two-stage fine-tuning recipe:

  • supervised fine-tuning (SFT) at 16-bit precision, then
  • quantization-aware training (QAT) at 4-bit precision.

The results show that upcasting from 4-bit to 16-bit improves training stability, and the model can then be cast back down to 4-bit for inference without losing quality. NVIDIA.
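The two stages can be illustrated with a minimal pure-Python stand-in. This is my own toy, not NVIDIA’s code: the 4-bit grid is a simple symmetric int4 quantizer, and the straight-through trick (forward with quantized weights, update the full-precision copy) is what QAT generally means:

```python
def quantize_4bit(w, scale=1.0):
    """Snap a weight to a symmetric 4-bit grid (levels -7..7 times scale/7)."""
    levels = 7
    step = scale / levels
    q = round(w / step)
    q = max(-levels, min(levels, q))  # clamp to the int4 range
    return q * step

def train_step(w, grad, lr, qat):
    # Stage 1 (SFT): forward and update both use the 16-bit weight.
    # Stage 2 (QAT): forward uses the quantized weight, but the gradient
    # is still applied to the full-precision master copy.
    effective = quantize_4bit(w) if qat else w
    _ = effective  # a real forward pass would consume `effective`
    return w - lr * grad

w = 0.33
w = train_step(w, grad=0.1, lr=0.5, qat=False)  # SFT stage (16-bit)
w = train_step(w, grad=0.1, lr=0.5, qat=True)   # QAT stage (4-bit forward)
deployed = quantize_4bit(w)  # downcast back to 4-bit for inference
```

The QAT stage lets the model adapt to its own quantization error before deployment, which is why the final 4-bit downcast costs little quality.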
