Differentiation among major companies in the multi-modal "Deepseek Moment": ByteDance focuses on "efficiency," Kuaishou targets "professionalism," and Alibaba concentrates on "e-commerce"!

Wallstreetcn
2026.02.12 06:35

Huachuang Securities stated that at the start of the year, domestic multimodal models were updated in quick succession, with models such as Kling 3.0 and Seedance 2.0 significantly enhancing "controllability," marking AI video's leap from entertainment to industrial production. By cutting the waste rate of gacha-style "card drawing" (regenerating repeatedly until a usable result appears), the marginal cost of video production converges toward computing costs. ByteDance focuses on efficiency infrastructure, Kuaishou delves into professional narrative, and Alibaba strengthens the e-commerce vertical, together driving a revolution in content supply and a revaluation of IP.

At the beginning of the year, multimodal updates arrived in a rapid wave: Kuaishou launched Kling 3.0 on January 31, ByteDance released Seedance 2.0 on February 7, and on February 10 ByteDance's Seedream 5.0 and Alibaba's Qwen-Image-2.0 further strengthened the "text-to-image / image editing" foundation.

Yao Lei of Huachuang Securities Research Institute made a straightforward judgment in her report on the 12th: video generation is no longer a technology showcase but is evolving into a tool that can be integrated into workflows. "AI video generation is transitioning from blind-box entertainment to precise industrial production." She attributes the slow pace of commercialization to the uncontrollable marginal costs of "card drawing": the same demand has to be generated and reworked over and over, and the waste rate consumes both time and budget.

The focus of the Kling 3.0 and Seedance 2.0 upgrades is not merely better image quality; controllability has been elevated to the higher priority. Subject consistency across shots, semantic adherence to complex instructions, and the ability to edit after generation all combine to lower the waste rate. The report concludes that these technical advances give AI video a foundation for entering large-scale B-end workflows, with e-commerce advertising and short-drama/comic production feeling the impact first.

Going deeper, the report breaks the impact into two layers. One is product-route differentiation: ByteDance leans toward "efficiency infrastructure," while Kuaishou leans toward "professional storytelling." The other is a supply-side revolution recalibrating cost structures: the marginal cost of content production increasingly resembles computing costs. As investment clues, the report identifies content IP, content copyright, AI video tools and models, and inference demand on clouds and platforms as benefiting directions.

The real problem being solved: the uncontrollable costs of "card drawing"

The report repeatedly stresses one logical chain: the past difficulty in commercializing AI video was not that models "could not produce," but that what they produced was too unstable. With the same script, the same materials, and the same prompts, output quality fluctuates widely, forcing creators to run extra generation rounds to gamble on a usable result, so marginal costs spiral out of control.

The report argues that the significance of the new generation of models lies in demoting raw "generation capability" and moving "controllability" to the forefront: by strengthening native multimodal architectures, instruction alignment, subject consistency, and semantic adherence, the waste rate falls and overall video production costs decline with it. The threshold for commercialization is thus redefined, from "can it be done" to "can it be delivered stably."

Kling 3.0 bets on "blockbuster feel": physical realism and long logical narratives take precedence

The research report summarizes the keywords of Kling 3.0 into two aspects: systematic upgrades of basic capabilities and the integration of generation and editing (Omni).

On the video side, Kling 3.0's upgrades focus on: stronger subject consistency in multi-shot and continuous-action scenes; finer parsing of complex text instructions; reduced reference confusion when multiple people share a frame; and an emphasis on "precise mapping of text to visual roles," including multilingual and dialect-accented performances with natural lip-sync and expressions.

The Omni model is the other highlighted change: it makes localized, controllable modifications on top of generated content, reducing the need to start over. The report also notes two capabilities aimed at professional creation. One is video-subject creation, extracting a character's features and original voice to achieve precise lip-sync matching and driving. The other is native custom storyboarding, which raises the single-generation duration to 15 seconds and allows duration, shot type, perspective, narrative content, and camera movement to be specified at the shot level.

On the image side, Kling Image 3.0 is likewise treated as part of "workflow completion": it supports up to 10 reference images to lock in the subject's outline, core elements, and tonal base; elements from multiple reference images can be freely specified, added, deleted, or modified; it supports batch image output for storyboard and material-package production; and it improves high-definition output and detail rendering.

Seedance 2.0 turns video into an "orchestrated" industrial tool

The report positions Seedance 2.0 more as an "industrial standard." At the foundational level it emphasizes plausible physics, natural movement, precise instruction understanding, and stable style; on top of that it highlights three classes of capability: consistency optimization (from facial features to clothing, font details, and scene transitions); controllable replication of complex camera movements and actions; and precise replication of creative templates and complex effects.

More critical is the interaction paradigm. The report argues that Seedance 2.0's "@material name" syntax for designating images, videos, and audio essentially decomposes black-box generation into a controllable production process: the model can separately extract a @video's camera movement, an @image's details, and an @audio's rhythm, sharply reducing the waste rate.
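Purely as an illustration of the paradigm described above (not ByteDance's actual API; the material names, types, and resolver function below are all hypothetical), the "@material name" scheme can be pictured as a prompt whose named references are resolved against an uploaded material library:

```python
import re

# Hypothetical material library: name -> (type, file). Seedance's real
# interface is not documented in the report; this only sketches the idea.
materials = {
    "ref_clip": ("video", "ref_clip.mp4"),
    "hero_shot": ("image", "hero_shot.png"),
    "beat_track": ("audio", "beat_track.mp3"),
}

def resolve_references(prompt: str, library: dict) -> list:
    """Extract @name tokens from the prompt and look each up in the library."""
    names = re.findall(r"@(\w+)", prompt)
    missing = [n for n in names if n not in library]
    if missing:
        raise KeyError(f"unknown materials: {missing}")
    return [(n, *library[n]) for n in names]

prompt = ("Copy the camera movement of @ref_clip, keep the product from "
          "@hero_shot, and cut to the rhythm of @beat_track.")
refs = resolve_references(prompt, materials)
for name, kind, path in refs:
    print(name, kind, path)
```

The point of the sketch is the decomposition itself: each referenced material carries a distinct role (camera movement, visual detail, rhythm), which is what turns a black-box prompt into a checkable production step.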

The usage limits given in the report also read like production constraints: image input of up to 9 images; video input of up to 3 clips totaling no more than 15 seconds; audio input of up to 3 MP3 uploads totaling no more than 15 seconds; a combined cap of 12 files for mixed input; generation duration of up to 15 seconds (selectable from 4 to 15 seconds); and built-in sound-effect/music output. As for entry points, "first and last frames" and "all-purpose reference" correspond to different ways of organizing materials.
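The numeric caps above come straight from the report; purely as a sketch of how such limits might be enforced client-side before submission (the checker function and its data shapes are assumptions, not a real Seedance interface):

```python
# Caps as reported for Seedance 2.0; the checker itself is hypothetical.
MAX_IMAGES = 9
MAX_VIDEOS, MAX_VIDEO_SECONDS = 3, 15
MAX_AUDIOS, MAX_AUDIO_SECONDS = 3, 15
MAX_TOTAL_FILES = 12

def check_inputs(images, videos, audios):
    """images: image count; videos/audios: lists of clip durations in seconds.
    Returns the list of violated constraints (empty means the batch is valid)."""
    errors = []
    if images > MAX_IMAGES:
        errors.append("too many images")
    if len(videos) > MAX_VIDEOS or sum(videos) > MAX_VIDEO_SECONDS:
        errors.append("video limit exceeded")
    if len(audios) > MAX_AUDIOS or sum(audios) > MAX_AUDIO_SECONDS:
        errors.append("audio limit exceeded")
    if images + len(videos) + len(audios) > MAX_TOTAL_FILES:
        errors.append("more than 12 files in total")
    return errors

print(check_inputs(images=8, videos=[6, 5], audios=[10]))   # within all caps
print(check_inputs(images=10, videos=[10, 8], audios=[]))   # breaks two caps
```

Encoding the limits this way makes the "production constraint" framing concrete: a batch either passes every cap or is rejected before any compute is spent.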

ByteDance focuses on "efficiency infrastructure," Kuaishou on "professional narrative," and Alibaba leans more towards e-commerce verticals

The research report's judgment on the competitive landscape does not focus much on "score rankings," but is more concerned with the strategic differentiation of manufacturers.

The report summarizes ByteDance's route as low-threshold, low-cost, tool-oriented, and generalized, similar to an advanced form of Jianying (CapCut), aiming to lower content production costs across the internet and feed the gains back into its ecosystem. Kuaishou's Kling bets on physical simulation, realistic complex scenes, and character consistency, making it better suited to professional content demanding high coherence, such as film demos and movie plots. Alibaba's Qwen focuses more on vertical scenarios (e-commerce), pushing high-fidelity updates to its image models and strengthening product-digitization capabilities.

The three paths do not point to the same business model: one pursues large-scale throughput, one pursues high-quality narrative delivery, and one pursues "usable production" in vertical industries.

Supply-side Revolution in Content: Marginal Costs Converging to Computing Costs, IP Becomes Scarcer

In its commercial projection, the report describes the "supply-side revolution" in quite radical terms: once image and video foundational capabilities improve in tandem, the marginal cost of content production will increasingly converge toward computing costs.

In the short term, it is most optimistic about two kinds of change: marketing and e-commerce service providers improve material-output efficiency, lifting gross margins; and the comic and short-drama industries may see a capacity explosion. In the medium to long term, the contradiction shifts to the IP end. When content becomes easy to produce, the pricing of scarcity concentrates on IP: top IP and derivative products gain value, and mid-tier IP may also be revalued through AI videoization. Meanwhile, giants with strong computing infrastructure (cloud) and closed-loop traffic scenarios (platforms) will benefit most directly from the surge in inference-side calls.