AI model overview: the variants under the Z.ai brand

A practical orientation to every AI model variant in the Z.ai portfolio — chat, code, and multimodal options across multiple parameter classes, with clear guidance on which variant fits which workload and which hosting path makes sense for your infrastructure.

3 variant tracks · 6B–100B+ parameter sweep · 128K max context · 26+ languages

Headline Facts

Z.ai hosts three AI model variant tracks: general-purpose chat (GLM-4 line), code-specialised (GLM Coder), and multimodal. Each track spans multiple parameter classes from 6B local builds to 100B+ hosted flagships. Open-weight builds cover the smaller sizes; flagship inference runs through the BigModel API. All variants share an OpenAI-compatible API contract.

The Z.ai AI model portfolio at a glance

Three variant tracks, a consistent parameter-sweep philosophy, and a shared API contract that lets developers move between variants without switching frameworks.

When a developer or product team asks "what AI model does Z.ai offer," the answer is not a single model — it is a structured portfolio of variants that cover three primary task domains and span a wide range of parameter classes. Understanding the portfolio as a map rather than a menu makes the selection decision considerably easier. The three tracks are: the general-purpose chat line, the code-specialised branch, and the multimodal variants. Each track ships at multiple parameter sizes, and the sizes are not arbitrary — they are calibrated to three distinct deployment environments: consumer hardware, a single production GPU, and hosted cloud infrastructure.

The general-purpose chat line is the default entry point for most developers. It covers the GLM-4 generation and its successor, which Zhipu AI markets as GLM-4.5+. The GLM-4.5+ flagship delivers 128K token context, strong instruction following across complex multi-step prompts, and multilingual coverage across 26 languages, all available through the BigModel API. For teams that need local inference rather than a hosted endpoint, the ChatGLM4-9B open-weight build is the closest equivalent in the same generational lineage — smaller, but architecturally identical and prompt-compatible with the hosted version.
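Because every variant shares the OpenAI-compatible contract, a hosted chat call looks like any other OpenAI-style request. The sketch below is illustrative only: the base URL, API key placeholder, and model identifier are assumptions, not confirmed values from the BigModel documentation.

```python
from openai import OpenAI

# Placeholders: substitute the endpoint and model id from your own
# BigModel credentials and the platform's published model list.
client = OpenAI(
    api_key="YOUR_BIGMODEL_API_KEY",
    base_url="https://example-bigmodel-endpoint/v1",
)

response = client.chat.completions.create(
    model="glm-4.5-flagship",  # illustrative identifier
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Summarise the trade-offs of 4-bit quantisation."},
    ],
)
print(response.choices[0].message.content)
```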

The code-specialised branch addresses a specific class of developer tools and IDE integrations that require higher accuracy on programming tasks than the general-purpose model delivers at a given parameter size. The GLM Coder builds are fine-tuned on a curated corpus of source code, documentation, and unit tests, and they consistently outperform the base GLM checkpoint on HumanEval and MBPP at equivalent scale. For a team building a code assistant, switching from the general-purpose API endpoint to the code variant is a one-identifier change — the API contract, authentication, and billing surface are identical.
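A hedged sketch of what that one-identifier change looks like in practice. Both model names are illustrative placeholders, and the call shape is the standard OpenAI-compatible one rather than anything specific to the code variant.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BIGMODEL_API_KEY",
    base_url="https://example-bigmodel-endpoint/v1",  # placeholder endpoint
)

def complete(model_id: str, prompt: str) -> str:
    """One call shape for every variant; only the model identifier changes."""
    resp = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Routing the same request to the chat line vs. the code branch
# (both identifiers are illustrative placeholders):
complete("glm-4-chat", "Explain this stack trace: ...")
complete("glm-coder", "Explain this stack trace: ...")
```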

The multimodal variants extend the text-only capability with image understanding. They accept interleaved image and text inputs and can describe images, answer visual questions, and parse documents that contain embedded figures or tables. The multimodal builds are available through the hosted API; there are no open-weight multimodal releases in the current generational cycle. Teams using multimodal inputs should evaluate latency carefully: image tokenisation adds per-request overhead that is not present in text-only calls, and the effective token cost of an image input can be substantial at the pricing tiers the BigModel platform uses.
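For orientation, an interleaved image-and-text request in the OpenAI-compatible format would look roughly like the sketch below. The model identifier and endpoint are assumptions; the content-part structure shown is the standard OpenAI vision shape, which may differ in detail from what the BigModel API accepts.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_BIGMODEL_API_KEY",
    base_url="https://example-bigmodel-endpoint/v1",  # placeholder endpoint
)

response = client.chat.completions.create(
    model="glm-multimodal",  # illustrative identifier
    messages=[{
        "role": "user",
        # Interleaved content parts: text and image in a single turn.
        "content": [
            {"type": "text",
             "text": "What does the table in this figure show?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/figure.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```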

Parameter sweep and hosting fit

Why parameter class matters, how each size class maps to a deployment environment, and what trade-offs to expect when moving between them.

The parameter sweep across the Z.ai AI model family is not accidental. Zhipu AI ships each generation at multiple sizes because the community has established a clear mapping between parameter count, hardware requirements, and quality threshold for common workloads. The 6B and 9B builds sit at the consumer end: they run on a GPU with 8–16 GB VRAM in float16, and with 4-bit quantisation they fit into 6–8 GB, which covers the majority of developer laptops with a dedicated GPU. Quality at this size class is strong for most single-turn tasks and short multi-turn conversations, but starts to show strain on complex reasoning tasks that require holding many premises in mind simultaneously.
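A minimal sketch of a 4-bit quantised local load with Hugging Face transformers and bitsandbytes, the common route to the 6–8 GB VRAM footprint described above. The repository identifier is a placeholder, not a confirmed model path.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "zai-org/chatglm4-9b"  # placeholder Hugging Face repo id

# Store weights in 4-bit, compute in float16: a 9B-class model
# typically fits in roughly 6-8 GB of VRAM this way.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU automatically
    trust_remote_code=True,
)

inputs = tokenizer("Summarise the following passage: ...", return_tensors="pt")
inputs = inputs.to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```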

The mid-size parameter class — roughly 32B — occupies the sweet spot for teams running a dedicated inference server on a single high-end GPU or a small multi-GPU node. At 32B, the model handles longer reasoning chains, more complex instruction structures, and longer context fills with noticeably more consistency than the 9B build. This is the parameter class that tends to produce the most positive surprises during production pilots: quality that exceeds what the benchmark numbers suggest, particularly on the multilingual tasks where the larger parameter count enables better cross-lingual reasoning.

The 100B+ flagship is the hosted-only tier. No team is running this locally without specialised infrastructure, and Zhipu AI does not distribute the weights at this scale as open-weight builds. The BigModel API provides access to the flagship, and the per-token pricing reflects the inference infrastructure cost. For teams that genuinely need the quality ceiling — long-document synthesis, complex multi-turn reasoning, high-stakes content generation — the flagship tier is the right answer. For teams that are evaluating whether they need it, the practical test is to run representative production samples through the 9B and 32B tiers first; if quality is acceptable, the smaller build is almost always preferable on cost grounds.

Choosing the right variant for your workload

A decision framework for matching the Z.ai AI model variant to a specific workload, based on task type, infrastructure, context requirements, and language coverage needs.

The variant selection decision reduces to four questions asked in order. First, what is the primary task type? Code completion and debugging should route to GLM Coder; image-containing inputs should route to the multimodal variant; everything else starts with the general-purpose chat line. Second, what is the available inference infrastructure? Local consumer GPU routes to the 6B or 9B open-weight build; a single production server routes to the 32B class; cloud-hosted infrastructure or no local GPU routes to the hosted API. Third, how long is the input? Tasks involving documents longer than roughly 16K tokens require the hosted flagship or the long-context variant. Fourth, what languages are involved? For Chinese and English, any parameter class works well; for other languages, larger parameter classes provide more headroom.
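The four questions translate directly into a small decision function. The sketch below encodes the routing described above; the variant labels are shorthand for this illustration, not confirmed product identifiers.

```python
def select_variant(task: str, infra: str, input_tokens: int,
                   languages: set[str]) -> str:
    # Q1: task type -- code and image workloads route to specialised tracks.
    if task == "code":
        return "glm-coder"
    if task == "multimodal":
        return "glm-multimodal (hosted)"
    # Q2: infrastructure -- pick the parameter class you can actually run.
    if infra == "consumer-gpu":
        choice = "chatglm4-9b (open weight)"
    elif infra == "production-gpu":
        choice = "glm-4 ~32B"
    else:
        choice = "glm-4.5-flagship (hosted)"
    # Q3: context -- inputs past ~16K tokens override to the hosted flagship.
    if input_tokens > 16_000:
        choice = "glm-4.5-flagship (hosted, 128K context)"
    # Q4: languages -- beyond zh/en, prefer a larger class for headroom.
    if languages - {"zh", "en"} and choice.startswith("chatglm4-9b"):
        choice += " -- consider stepping up to the 32B class"
    return choice

print(select_variant("chat", "consumer-gpu", 4_000, {"en", "de"}))
```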

One pattern that recurs among teams adopting the Z.ai AI model family is starting at a higher parameter class than necessary and scaling down only after quality validation. The more cost-efficient approach is to start with the smallest variant that plausibly covers the task, evaluate it on a representative sample, and scale up only if the quality gap is evident in practice rather than assumed from benchmark numbers. The NIST AI Risk Management Framework (AI RMF) recommends systematic evaluation against actual use-case samples before production commitment, and that advice applies directly to variant selection. The benchmarks page provides public evaluation data to anchor the starting point of that evaluation process.
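A sketch of that smallest-first evaluation loop, assuming you supply your own sample set, model runner, and scoring metric; the tier identifiers and the 0.85 threshold are illustrative placeholders.

```python
TIERS = ["chatglm4-9b", "glm-4-32b", "glm-4.5-flagship"]  # illustrative ids

def first_acceptable_tier(samples, run_model, score, threshold=0.85):
    """Return the cheapest tier whose mean score clears the threshold.

    samples:   list of {"prompt": ..., "reference": ...} dicts (your data)
    run_model: callable (tier_id, prompt) -> model output (your runner)
    score:     callable (output, reference) -> float in [0, 1] (your metric)
    """
    mean_score = 0.0
    for tier in TIERS:
        outputs = [run_model(tier, s["prompt"]) for s in samples]
        mean_score = sum(
            score(o, s["reference"]) for o, s in zip(outputs, samples)
        ) / len(samples)
        if mean_score >= threshold:
            return tier, mean_score  # stop at the first tier that passes
    return TIERS[-1], mean_score     # otherwise fall back to the flagship
```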

"The 9B open-weight build handled our internal summarisation pipeline well enough that we never migrated to the hosted API. Having a model that runs on-premise, uses the same prompt structure as the cloud version, and costs nothing per token made the business case straightforward."
Yelena O. Trubchaninova
Distributed Systems Engineer · Foxwarden Compute Trust · Albuquerque, NM
Z.ai AI model variant comparison — specialty, parameters, and hosting fit

| Variant | Specialty | Parameters | Hosting fit | Notes |
| --- | --- | --- | --- | --- |
| ChatGLM4-9B | General chat, local inference | 9B | Consumer GPU (8–16 GB VRAM) | Open-weight on Hugging Face; prompt-compatible with hosted API |
| GLM-4 (mid) | General chat, longer reasoning | ~32B | Single production GPU or multi-GPU node | Strong mid-tier quality; popular for enterprise pilot deployments |
| GLM-4.5+ (flagship) | Long-context reasoning, instruction following | 100B+ | Hosted BigModel API only | 128K context; highest quality ceiling; per-token billing |
| GLM Coder | Code completion, debugging, documentation | 6B–32B | Local GPU or hosted API | HumanEval and MBPP optimised; same API contract as chat variants |
| GLM Multimodal | Image + text understanding | Hosted only | Hosted BigModel API only | Image tokenisation adds per-request overhead; evaluate latency before committing |

Z.ai AI model variants: frequently asked questions

Five questions covering the variant portfolio, selection guidance, parameter sweep, multimodal support, and self-hosting options.

What AI model variants does Z.ai offer?

Z.ai offers three main AI model variant tracks under the GLM family: general-purpose chat models (the GLM-4 and GLM-4.5+ line), a code-specialised branch (GLM Coder), and multimodal variants that accept image and text inputs. Each track ships at multiple parameter sizes to support local inference, single-GPU server deployments, and hosted flagship use cases.

What parameter sizes does the Z.ai AI model family span?

The Z.ai AI model family spans from 6B open-weight builds suitable for consumer GPUs through mid-size options in the 9B–32B range for single-server deployments, up to 100B+ flagship models available only through the hosted BigModel API. The open-weight builds use the same base architecture as the hosted flagship, so prompts transfer between parameter classes without structural changes.

How do I choose the right Z.ai AI model for my workload?

The primary decision axes are workload type (chat vs. code vs. multimodal), available compute (local GPU vs. cloud-hosted), and context length requirement. For general text workloads on local hardware, ChatGLM4-9B is the right starting point. For code-heavy tasks, GLM Coder outperforms the general-purpose model at the same parameter class. For 128K-context or high-throughput production use, the hosted GLM-4.5+ API is the practical choice.

Does Z.ai offer a multimodal AI model?

Yes. The GLM family includes multimodal variants that accept image and text inputs alongside the text-only flagship line. The multimodal builds are available through the BigModel API and support image description, visual question answering, and document parsing where the source material contains embedded images. Image tokenisation adds per-request overhead, so latency should be evaluated before production commitment.

Can I self-host a Z.ai AI model?

Yes, for the smaller parameter classes. The ChatGLM open-weight builds and the smaller GLM Coder variants are available on Hugging Face under permissive licenses for local deployment. The 6B and 9B builds run on consumer GPUs with 8–16 GB VRAM; quantised builds extend that to lower-end hardware. Flagship 100B+ models are hosted-only through the BigModel API and are not distributed as open weights.

AI model variants in the wider Z.ai reference

Each variant track has a dedicated reference page; this overview is the entry point for readers who are still orienting to the portfolio before diving into a specific track.

This page gives the overview, but each AI model variant track has dedicated depth elsewhere on this site. The GLM AI model page covers the general-purpose chat line in full, including the parameter sweep, instruction tuning approach, and multilingual coverage. The ChatGLM page documents the open-weight download lineage and the local inference workflow. For the text-only LLM framing and context window history, the Zhipu AI LLM page covers that arc from the original 2K-context build to the current 128K standard. The benchmarks page places the variants on public evaluation charts, and the latest release page describes the current generation without being tied to a version number that will age quickly.