## News / Update
The week brought major platform and infrastructure moves alongside industry milestones and outages. IBM integrated Groq’s LPU inference into watsonx, reporting large speed and cost improvements for enterprise AI. A widespread AWS outage knocked major apps offline, Perplexity among them, prompting clarifications that such incidents aren’t AI-related but reflect long-standing cloud fragility. Hardware and systems news stayed hot: Modular broadened support to seven GPU architectures and rapidly set records on AMD’s MI355 series, while AMD’s latest efficiency gains narrowed the gap with NVIDIA. China’s push on alternative chipmaking paths (like SSMB, nanoimprint, and multi-beam e-beam) signaled fresh competition to EUV lithography. Robotics advanced with Unitree’s taller, more lifelike H2 and a new hip design aimed at better mobility. Research and data releases continued: Hugging Face now hosts the 308GB CommonForms VLM dataset, and NVIDIA previewed a lighter, faster RL training approach (QeRL). In biotech, NewLimit raised $45M with backing from Eli Lilly to pursue AI-driven longevity research. Academia and community events remained active with Oxford’s OATML PhD applications, a packed SchmidhuberAI event in Zurich, and a Deepgram open house in San Francisco. Meanwhile, Microsoft’s new Mexico data center drew local backlash over power and water strain.
## New Tools
A wave of open-source and developer tooling landed. Krea released a 14B real-time video generation model with code and report under Apache 2.0, delivering long-form video at double-digit FPS on a single accelerator. The PDF/OCR stack leveled up across the board: FinePdfs open-sourced its code, datasets, and new XGB-OCR model, while DeepSeek launched a high-accuracy OCR model on Hugging Face that reads handwriting, supports 100+ languages, compresses visual context dramatically, and is optimized for high-throughput pipelines. For AI development workflows, dstack introduced a UI to spin up GPU dev environments into VS Code/Cursor, Cline rolled out an enterprise edition with governance, and TabbyAPI added tensor parallelism for faster inference. The ecosystem also saw tools for safer and stronger models: ByteDance released ReSA, a large safety dataset built with answer-then-check synthesis; TerminalBench arrived as a new coding-agent benchmark; and SFResearch’s ProgSearch pipeline generates increasingly complex, long-horizon tasks for agent research. Agent frameworks evolved too, with Dexter adding deep-agent capabilities and LangChain integrating with MCP for human-in-the-loop checkpoints. W&B’s Weave Playground simplified prompt testing in one workspace. Quantization became easier with GPTQ now built into Keras 3 across major backends. A practical “grok-4-fast” model emerged as a cheap, fast option for agentic data analysis. SciSpace launched a detector targeting AI-written research papers.
## LLMs
Model competition and research accelerated across capability, scale, and cost. Leaderboards shifted with Claude Sonnet 4.5 and GLM 4.6 rising in web dev tasks, while Baseten claimed the fastest GLM 4.6 serving and broader GLM updates loomed (Air imminent, GLM-5 rumored with massive context). Alibaba expanded Qwen3 with a trillion-parameter MoE language model and an open-weight vision-language model boasting up to a million-token window and strong multimodal benchmarks. DeepSeek V3.1 combined open-source access with aggressive pricing and demonstrated standout performance in live, real-money trading benchmarks. Those results remain sensitive to prompts, which underscores how real-world performance can diverge from static tests. Safety and reasoning research advanced with new misalignment classifiers, ByteDance’s safety dataset, and CaRT, a method for teaching models when to stop gathering information and act. Additional signals point to fast-evolving capabilities: rumors of Gemini 3 Pro’s strong reasoning and musicianship, Kimi K2’s speed and accuracy gains, new agent benchmarks like TerminalBench, and claims of more robust agent frameworks that reduce hallucinations. Together, these developments suggest a bifurcation: frontier proprietary models push scale and multimodality, while open systems compete on cost, speed, and specialized agent performance.
## Features
Existing products gained meaningful capabilities that tighten workflows and expand creative control. Claude Code added a safer sandbox mode in its CLI for finer-grained permissions, then arrived on the web and iOS, with its Skills system continuing to differentiate coding automation. Video creation tools took a leap: Google’s Veo 3.1 climbed to the top of video leaderboards, introduced start-to-end frame transitions for smoother shots, and delivered object removal with seamless scene reconstruction—bringing advanced VFX within reach. Sora 2 refined its moderation to cut false positives after user feedback. Gemini Apps enabled prompts to tap live Google Maps data for richer, location-aware answers. Google AI Studio redesigned project and API key management for clearer, multi-project workflows. On the serving side, TabbyAPI’s new tensor parallelism boosts inference throughput. Enterprise developers gained stricter guardrails and integrations via Cline’s new governance features.
## Tutorials & Guides
Foundational learning resources multiplied. A Latent Space podcast episode and companion “Open Model Pretraining Masterclass” organized current best practices and research highlights for training open models heading into 2025. Stanford published a fully open, step-by-step blueprint for building language models, offering a rare end-to-end reference. Hugging Face launched a comprehensive robotics course spanning classical control, real-world RL, generative methods, and generalist policies. Hands-on guides included a fast path to applying GPTQ quantization directly in Keras 3 across JAX, TensorFlow, and PyTorch, and a practical text-to-SQL demo showing how to wire open-source models and orchestration to answer complex database questions.
## Showcases & Demos
Demonstrations highlighted how fast creative and embodied AI are moving. Real-time and high-fidelity video generation stole the spotlight: Krea’s open 14B model streamed long-form outputs in real time; Veo 3.1 delivered cinematic transitions and surgical object erasure; Sora 2 showed one-prompt, scene-level generation; and Kandinsky’s latest produced crisp 5-second clips with longer versions teased. A striking decade-long before-and-after comparison underscored how far image and video synthesis have come. On devices, a Glif-based agent blended AI with live footage for on-the-go Hollywood-style effects. Robotics displays ranged from Unitree’s taller, more lifelike H2 to “Lucy,” a robot purportedly designed and trained end-to-end by AI. Novel applications included models geolocating a photo from the image alone and Grok tackling advanced mathematical conjectures, both signaling broader reasoning potential beyond chat.
## Discussions & Ideas
The community wrestled with productivity, safety, and evaluation realities. Multiple analyses argued that AI-generated code hasn’t accelerated software delivery due to human review bottlenecks and model “coding personalities,” even as the barrier from idea to app keeps dropping. In finance, leaders suggested AI agents won’t replace analysts but will expand their scope. The “AI Operating System” concept gained traction as a unifying layer for intelligent applications. Safety conversations sharpened: concerns about LLMs as insider threats, calls to validate model judges against humans, and critiques of academia’s reliance on VLM evaluators that can flip opinions with minor prompt changes. Methodology debates flared over papers repackaging known techniques like context distillation without attribution. Karpathy’s commentary resurfaced core themes—RL breakthroughs as pivotal for AGI and a reframing of what modern AI systems really are—while others warned AI safety activism can drift from technical reality. Broader industry discourse touched on media miscalculations in AI cost analyses and celebrated how much cleaner and more professional open-source ML codebases have become.