
The pricing positions GLM-Image as a cost-effective option for enterprises generating marketing materials, presentations, and other text-heavy visual content at scale.
Technical approach and benchmark performance
GLM-Image employs a hybrid architecture combining a 9-billion-parameter autoregressive model with a 7-billion-parameter diffusion decoder, according to Zhipu’s technical report. The autoregressive component handles instruction understanding and overall image composition, while the diffusion decoder focuses on rendering fine details and accurate text.
The architecture addresses challenges in generating knowledge-intensive visual content where both semantic understanding and precise text rendering matter, such as presentation slides, infographics, and commercial posters.
On the CVTG-2K benchmark, which measures accuracy in placing text across multiple image locations, GLM-Image achieved a Word Accuracy score of 0.9116, ranking first among open-source models. The model also led the LongText-Bench test for rendering extended text passages, scoring 0.952 for English and 0.979 for Chinese across eight scenarios including signs, posters, and dialog boxes.
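For readers unfamiliar with word-level scoring, the sketch below shows one plausible way such a metric can be computed: compare the words a prompt asked for against the words an OCR pass reads back from the generated image. It is an illustration only, not the official CVTG-2K scoring code. The function names, the normalization rule, and the sample strings are all assumptions.

```python
from collections import Counter


def normalize(word):
    # Assumed normalization: lowercase and strip surrounding punctuation.
    return word.lower().strip(".,!?;:\"'()")


def word_accuracy(expected_words, recognized_words):
    """Share of expected words that also appear in the recognized text.

    Uses multiset matching, so a word requested twice must be read twice.
    Hypothetical definition for illustration; the benchmark may differ.
    """
    expected = Counter(normalize(w) for w in expected_words)
    recognized = Counter(normalize(w) for w in recognized_words)
    total = sum(expected.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, recognized[word]) for word, count in expected.items())
    return matched / total


# Made-up example: the image renders "Sole" where the prompt asked for "Sale".
expected = "Grand Opening Sale 50% Off".split()
recognized = "Grand Opening Sole 50% Off".split()
score = word_accuracy(expected, recognized)
print(f"{score:.2f}")  # 4 of 5 words match
```

Under this reading, a score of 0.9116 would mean that roughly nine in ten requested words were rendered correctly on average.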
The model natively supports multiple resolutions from 1024×1024 to 2048×2048 pixels without requiring retraining, the report added.
Hardware optimization strategy
Training GLM-Image on Ascend hardware required Zhipu to develop custom optimization techniques for Huawei’s chip architecture. The company built a training suite that implements dynamic graph multi-level pipelined deployment, enabling different stages of the training process to run concurrently and reducing bottlenecks.
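The general idea of pipelining, in which each stage starts on the next batch while later stages are still working on earlier ones, can be sketched with plain Python threads and queues. This is a toy illustration of the concept, not Zhipu's training suite or any Ascend API. The stage functions and the `run_pipeline` helper are hypothetical, and because of Python's global interpreter lock the example shows the structure of the overlap rather than a real speedup.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream; batches must not be None


def run_stage(work, inbox, outbox):
    # Process items until the sentinel arrives, then forward it downstream.
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(work(item))


def run_pipeline(batches, stages):
    # Bounded queues feed each stage so a slow stage applies back-pressure;
    # the final output queue is unbounded so the last stage never blocks.
    queues = [queue.Queue(maxsize=2) for _ in stages] + [queue.Queue()]
    threads = [
        threading.Thread(target=run_stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for batch in batches:
        queues[0].put(batch)
    queues[0].put(SENTINEL)
    for thread in threads:
        thread.join()

    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            return results
        results.append(item)


# Hypothetical stand-ins for two stages of a training step.
def prepare(batch):
    return batch + 1


def compute(batch):
    return batch * 10


results = run_pipeline(range(5), [prepare, compute])
print(results)  # [10, 20, 30, 40, 50]
```

Each stage runs in its own thread, so `prepare` can begin on the next batch while `compute` is still handling the previous one. That overlap is what reduces idle time between stages.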