Alibaba Releases Qwen3.7-Plus: Outperforms GPT-5.4 in Screen Understanding, Independently Develops App in 11 Hours, Integrating "See, Think, Write, Act"!

"One model that can see, think, write code, and act." According to Alibaba's official introduction, the Hybrid-Agent system built on Qwen3.7-Plus ran stably for over 11 hours, automatically completing the full R&D cycle of an English vocabulary learning app and independently replicating a stock market application. The model scored 79 in screen understanding, surpassing GPT-5.4 and Gemini-3.1 Pro

Just a day after MiniMax M3 made waves, Alibaba's Qwen has released another terrifyingly powerful new "monster."

On June 2, the Alibaba Cloud Tongyi Qianwen team officially announced the release of Qwen3.7-Plus on X. It is a multimodal Agent model, which the team describes as "unifying vision and language into an integrated intelligent agent foundation."

The team summarized its product positioning in one sentence: "One model that can see, think, write code, and act."

Building apps and replicating a stock application are the two headline demonstrations. According to the Qwen official blog, the Hybrid-Agent system built on Qwen3.7-Plus ran stably for over 11 hours and automatically completed the full R&D cycle of an English vocabulary learning app. The same system also independently completed a high-fidelity replication of the native macOS Stocks application. On screen understanding, the model's ScreenSpot Pro score of 79 exceeds those of GPT-5.4 and Gemini 3.1 Pro.

The timing of Qwen's release is notable. Just the day before, MiniMax launched its new flagship open-source model M3, claiming top-tier programming capabilities, a 1M ultra-long context, and native multimodality all at once. With both companies shipping major releases within the same week, the open-source competition among China's large models is becoming increasingly fierce.

The pricing for Qwen3.7-Plus is: Input $0.4/million tokens, Output $1.6/million tokens.
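
To make the pricing concrete, the arithmetic can be sketched in a few lines of Python. The two rates are the ones quoted above; the helper name and the example token counts are illustrative assumptions, and the sketch ignores any tiered pricing, caching discounts, or free quotas that a real bill might include.

```python
from decimal import Decimal

# Rates quoted in the announcement, in USD per one million tokens.
INPUT_USD_PER_MILLION = Decimal("0.4")
OUTPUT_USD_PER_MILLION = Decimal("1.6")
MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the cost of a workload from its token counts (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    input_cost = INPUT_USD_PER_MILLION * input_tokens / MILLION
    output_cost = OUTPUT_USD_PER_MILLION * output_tokens / MILLION
    return input_cost + output_cost


if __name__ == "__main__":
    # Illustrative workload: 2M input tokens and 500K output tokens.
    print(estimate_cost_usd(2_000_000, 500_000))
```

Under these rates, output tokens cost four times as much as input tokens, so long generated outputs dominate the bill much faster than long prompts do.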

"See, Think, Write, Act" Integrated: One Model to View Screens, Write Code, and Operate Apps

The core highlight of Qwen3.7-Plus is truly connecting visual understanding with task execution.

According to the official blog, the model can "perceive real-world scenarios, read screens and operate GUIs, generate code based on visual references, and navigate mobile applications end-to-end," seamlessly integrating GUI and CLI interactions within a single agent loop.

There are two key terms here: GUI and CLI. GUI refers to graphical user interfaces, such as web buttons, mobile app menus, and desktop software windows. CLI refers to command-line interfaces, such as the terminal windows engineers use to install dependencies, run tests, and deploy services.

Simply put: It doesn't just "understand images," but can understand your phone screen or computer interface, then click, input, and navigate on its own to complete tasks.

For example, it can read the screen to understand which button to click in a mobile app or webpage interface; it can look at a design mockup and generate SVG, web pages, or frontend prototypes; it can also run code in the command line, check for errors, and modify the code.
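
The idea of handling GUI and CLI steps inside one agent loop can be illustrated with a toy Python sketch. Nothing here is Qwen's actual API: the action types, the `Environment` class, and the scripted policy that stands in for the model are all hypothetical, and the "GUI" only records clicks instead of driving a real screen.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Union


@dataclass
class GuiAction:
    """A click on a named on-screen element (hypothetical action type)."""
    target: str


@dataclass
class CliAction:
    """A command to run in a terminal (hypothetical action type)."""
    argv: List[str]


Action = Union[GuiAction, CliAction]


def run_subprocess(argv: List[str]) -> str:
    """Run a command and return its trimmed standard output."""
    done = subprocess.run(argv, capture_output=True, text=True, check=False)
    return done.stdout.strip()


@dataclass
class Environment:
    """Toy environment: records clicks and delegates commands to a runner."""
    run_command: Callable[[List[str]], str] = run_subprocess
    clicked: List[str] = field(default_factory=list)

    def step(self, action: Action) -> str:
        if isinstance(action, GuiAction):
            self.clicked.append(action.target)
            return f"clicked {action.target}"
        if isinstance(action, CliAction):
            return self.run_command(action.argv)
        raise TypeError(f"unsupported action: {action!r}")


def scripted_policy(observation: str) -> Optional[Action]:
    """Stand-in for the model: pick the next action from the last observation."""
    if observation == "start":
        return GuiAction("Run tests")
    if observation.startswith("clicked"):
        return CliAction([sys.executable, "-c", "print('2 passed')"])
    return None  # Nothing left to do.


def run_agent(policy, env: Environment, max_steps: int = 10) -> List[str]:
    """One loop for both action kinds: observe, decide, act, repeat."""
    observation = "start"
    history = []
    for _ in range(max_steps):
        action = policy(observation)
        if action is None:
            break
        observation = env.step(action)
        history.append(observation)
    return history


if __name__ == "__main__":
    print(run_agent(scripted_policy, Environment()))
```

The point of the sketch is the shape of the loop: the same decision step can emit either a click or a shell command, and both kinds of result flow back as the next observation.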

Running for 11 Consecutive Hours to Develop an English Vocabulary Learning App

As for what Qwen3.7-Plus can actually do, the team provided several product-oriented demonstrations.

The Qwen official blog stated that the Hybrid-Agent system built on Qwen3.7-Plus ran stably for over 11 hours, automatically completing the R&D cycle of an English vocabulary learning app.

Details include: generating over 10,000 lines of code, triggering over 1,000 Agent calls, covering requirement document generation, automatic code writing, automated installation and deployment, test case creation, GUI automated testing, parallel testing across multiple scenarios, automatic updates of product descriptions, and version iterations.

The key point of this case is not "how much code was written," but rather the length of the workflow. A real software task often does not end with generating code once; it also requires installation, running, testing, bug fixing, and re-verification. The official demonstration aims to emphasize this long-process capability.
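
The generate, run, test, fix, and re-verify cycle described here can be sketched as a small Python loop. This is an illustration under stated assumptions, not the Hybrid-Agent implementation: `fake_model` is a stub that returns a buggy function first and a corrected one once it receives feedback, and `verify` is a hypothetical single-test checker.

```python
from typing import Callable, Optional, Tuple

# Two canned "model outputs": the first attempt has a bug, the second fixes it.
BUGGY = "def add(a, b):\n    return a - b\n"
FIXED = "def add(a, b):\n    return a + b\n"


def fake_model(feedback: Optional[str]) -> str:
    """Stub for a code-writing model: revise only after seeing a failure."""
    return BUGGY if feedback is None else FIXED


def verify(source: str) -> Optional[str]:
    """Load the candidate code and test it; return an error message or None."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # "Install and run" the generated code.
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def build_until_green(
    model: Callable[[Optional[str]], str], max_rounds: int = 5
) -> Tuple[str, int]:
    """Generate, verify, and feed failures back until the test passes."""
    feedback = None
    for round_number in range(1, max_rounds + 1):
        source = model(feedback)
        feedback = verify(source)
        if feedback is None:
            return source, round_number
    raise RuntimeError(f"still failing after {max_rounds} rounds: {feedback}")


if __name__ == "__main__":
    final_source, rounds = build_until_green(fake_model)
    print(f"passed after {rounds} round(s)")
```

A real long-running workflow would chain many such loops across installation, deployment, and GUI testing, which is why the number of rounds matters more than the size of any single code generation.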

Replicating a Stock Market App with Real-Time Market Data API Integration

Another official case involves rebuilding a stock market app from scratch.

The Qwen official blog stated that the Hybrid-Agent system independently completed a high-fidelity replication of the native macOS Stocks application. The process included: interacting with the native application to understand UI layout and functional details, generating SwiftUI source code based on interaction logs, integrating the LongBridge real-time market data API to fetch live market data, and automatically compiling, building, and launching the replicated application.

The model autonomously executed 10 functional verification tests, including real-time quote loading, stock selection and switching, multi-period view switching, search filtering, and detailed data panel display, all of which passed.

This demonstration is more intuitive: the model does not just generate a static page, but must understand the structure, data sources, and interaction logic of a market data app, and then turn it into a runnable desktop application.

Generating Code from Images: Image/Video to SVG, and Web Prototype Generation

The Qwen official blog stated that Qwen3.7-Plus can convert images, videos, UI screenshots, and design references into executable code, covering everything from SVG reproduction to complete web page generation.

In image/video-to-SVG tasks, the model needs to identify geometric structures, colors, layouts, hierarchical relationships, and dynamic changes, and then express them in code. For icons, illustrations, animations, graphic design, and information visualization, the product value of this capability lies in turning "seen reference images" into "editable code assets."
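
What "editable code assets" means in practice can be shown with a minimal Python sketch. It assumes the recognition step has already produced a list of shapes; that input format and the `shapes_to_svg` helper are hypothetical, and the sketch says nothing about how Qwen3.7-Plus itself represents an image.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def shapes_to_svg(shapes, width: int, height: int) -> str:
    """Serialize recognized shapes into SVG markup.

    Each shape is a dict with a "kind" (an SVG element name such as "rect"
    or "circle") plus that element's attributes. Shapes are emitted in list
    order, so later entries are drawn on top of earlier ones.
    """
    svg = ET.Element(
        "svg",
        {
            "xmlns": SVG_NS,
            "width": str(width),
            "height": str(height),
            "viewBox": f"0 0 {width} {height}",
        },
    )
    for shape in shapes:
        attributes = {key: str(value) for key, value in shape.items() if key != "kind"}
        ET.SubElement(svg, shape["kind"], attributes)
    return ET.tostring(svg, encoding="unicode")


if __name__ == "__main__":
    # Assumed output of the visual-recognition step: a blue card with a white dot.
    recognized = [
        {"kind": "rect", "x": 10, "y": 10, "width": 80, "height": 50, "fill": "#1e90ff"},
        {"kind": "circle", "cx": 50, "cy": 35, "r": 12, "fill": "#ffffff"},
    ]
    print(shapes_to_svg(recognized, 100, 70))
```

Because the result is plain markup, a designer can change a color or radius by editing one attribute, which is not possible with a flat bitmap.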

In web design tasks, the model must not only reproduce the page style but also organize the layout, write frontend code, handle interaction logic, and integrate multimodal materials into the final page.

Meanwhile, Qwen3.7-Plus can serve as a visual agent, combining visual understanding with tool usage to solve tasks such as spot-the-difference, jigsaw puzzles, Klotski, maze solving, and puzzle assembly.

The process here is not "look once and give an answer." The model first understands the image structure and constraints, then converts the visual problem into a computable problem representation, and finally autonomously writes and executes code to solve, search, or verify.
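
For maze solving, the "convert to a computable representation, then search" step can be sketched in Python as follows. The sketch assumes the visual stage has already turned the picture into a character grid (`S` for start, `E` for exit, `#` for walls); that encoding and the function name are illustrative, and breadth-first search is simply one standard choice of solver, not necessarily what the model writes.

```python
from collections import deque
from typing import List, Optional, Tuple

Cell = Tuple[int, int]


def solve_maze(grid: List[str]) -> Optional[List[Cell]]:
    """Return a shortest path of (row, col) cells from S to E, or None."""
    start = end = None
    for r, row in enumerate(grid):
        for c, char in enumerate(row):
            if char == "S":
                start = (r, c)
            elif char == "E":
                end = (r, c)
    if start is None or end is None:
        raise ValueError("grid must contain both S and E")

    # Breadth-first search; parents doubles as the visited set.
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == end:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # The exit is unreachable.


if __name__ == "__main__":
    maze = [
        "S.#",
        ".##",
        "..E",
    ]
    print(solve_maze(maze))
```

Once the puzzle is in this form, the answer can also be checked mechanically, for example by confirming that every step in the returned path moves to an adjacent open cell.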

How to Interpret the Benchmarks: Screen Understanding Beats GPT-5.4, But Not First in All Categories

In multimodal benchmark tests, Qwen3.7-Plus has several notable figures:

Screen Understanding and Mobile Control: ScreenSpot Pro score of 79.0, higher than GPT-5.4 (67.4) and Gemini 3.1 Pro (68.1); AndroidWorld score of 81.0, also exceeding Gemini 3.1 Pro (70.7) and Opus-4.6 Max (62.0).

Mathematical Visual Reasoning: MathVision score of 90.3, close to GPT-5.4's 91.0, and exceeding Gemini 3.1 Pro's 87.4.

Search-Enhanced Visual Question Answering: SimpleVQA score of 81.7, WorldVQA score of 61.1, basically on par with Opus-4.6 Max in this track.

Chart Recognition: CharXiv (RQ) score of 85.9, the highest among all compared models.

On pure text capabilities, the team stated that Qwen3.7-Plus is "overall close to Max-level models."

It scored 70.3 on Terminal Bench 2.0, exceeding Opus-4.6 Max (65.4), K2.6 Thinking (66.7), and DeepSeek-V4-Pro Max (67.9).

It scored 62.3 on Deep-Planning (complex multi-step planning), also ahead of comparable models.

However, there are weaknesses.

It scored 77.7 on SWE-Verified (real software engineering tasks), lower than Opus-4.6 Max (80.8) and DeepSeek-V4-Pro Max (80.6); it scored 34.7 on HLE (extremely hard reasoning), lower than GPT-5.4 (40.0).
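
To see the leads and gaps at a glance, the head-to-head figures quoted in this section can be turned into point margins with a short Python sketch. The scores are copied from the text above; the table layout and the `margins` helper are illustrative, and comparisons for which the article gives no rival score are left out.

```python
# (benchmark, Qwen3.7-Plus score, rival model, rival score), as reported above.
REPORTED = [
    ("ScreenSpot Pro", 79.0, "GPT-5.4", 67.4),
    ("ScreenSpot Pro", 79.0, "Gemini 3.1 Pro", 68.1),
    ("AndroidWorld", 81.0, "Gemini 3.1 Pro", 70.7),
    ("AndroidWorld", 81.0, "Opus-4.6 Max", 62.0),
    ("MathVision", 90.3, "GPT-5.4", 91.0),
    ("MathVision", 90.3, "Gemini 3.1 Pro", 87.4),
    ("Terminal Bench 2.0", 70.3, "Opus-4.6 Max", 65.4),
    ("Terminal Bench 2.0", 70.3, "K2.6 Thinking", 66.7),
    ("Terminal Bench 2.0", 70.3, "DeepSeek-V4-Pro Max", 67.9),
    ("SWE-Verified", 77.7, "Opus-4.6 Max", 80.8),
    ("SWE-Verified", 77.7, "DeepSeek-V4-Pro Max", 80.6),
    ("HLE", 34.7, "GPT-5.4", 40.0),
]


def margins(rows):
    """Return (benchmark, rival, margin) with margin = Qwen score minus rival score."""
    return [
        (benchmark, rival, round(qwen - rival_score, 1))
        for benchmark, qwen, rival, rival_score in rows
    ]


if __name__ == "__main__":
    for benchmark, rival, margin in margins(REPORTED):
        verdict = "ahead of" if margin > 0 else "behind"
        print(f"{benchmark}: {abs(margin):.1f} points {verdict} {rival}")
```

Laid out this way, the pattern in the reported numbers is clear: the largest leads are on the screen-understanding and mobile-control benchmarks, while the deficits are on software engineering and hard reasoning.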

What Do Netizens Think?

The official Qwen account @Alibaba_Qwen posted the announcement at 1:54 AM on June 2, showcasing the multimodal hybrid Agent in action with demo videos. As of press time, the post had reached 200,000 views.

Some users on X noted that Qwen3.7-Plus has to handle not only a wide variety of screens but also a wide variety of tools and messy workflows.

Other users commented that Qwen's strategy this time is very clear, betting on Agent and GUI control, which is the right direction now.

Many users stated that integrating "see, think, write, act" into one model is extremely convenient. It is simply like "integrating a whole employee system!"

In related comments, many technical users focused on two directions:

First, the score of 79 on ScreenSpot Pro, which many consider a key threshold for whether "GUI Agents can truly be commercialized." Qwen3.7-Plus currently holds the highest score among the tested models.

Second, the 98% on Kernel Bench L3. This metric measures the model's ability to optimize GPU kernels, and 98% means it can produce solutions that surpass the PyTorch default compiler on almost all problems. Some users pointed out that this area was previously almost a "forbidden zone" that only specialist engineers could enter.

Horizontal Comparison with MiniMax M3

The two models were released almost simultaneously, with different positioning.

MiniMax M3 focuses on open source, promising to release its technical report and model weights within 10 days. Its core differentiation is the 1M ultra-long context (M3's per-token computation cost at 1M context is only 1/20 of the previous generation's) and extremely strong long-running Agent capabilities (it completed an FP8 matrix multiplication optimization task using 147 benchmark submissions and 1,959 tool calls).

The MiniMax team had M3 independently replicate an ICLR 2025 award-winning paper. The task required understanding text, images, curves, data, and formulas, loading the paper, code, and experiment logs into the long context, and using programming and Agent capabilities to complete the replication. M3 ran autonomously for nearly 12 hours and ultimately executed the core experiments successfully.

Qwen3.7-Plus currently offers API calls only, without open-sourcing weights. Its core differentiation is the deep integration of multimodal and GUI operation capabilities, and plug-and-play compatibility with mainstream development frameworks.

Both compete directly in programming Agent capabilities, but with different focuses: M3 emphasizes autonomous scientific research and code optimization capabilities under long contexts, while Qwen3.7-Plus emphasizes the end-to-end closed loop of visual perception and interface operation.