ChatGLM: the open-weight chat model lineage

A complete reference on ChatGLM — the open-weight chat model that put Zhipu AI on the global map. Covers each generation, local inference setup, Hugging Face distribution, and how ChatGLM relates to the current GLM-4 family.

Current branch: ChatGLM4
Open-weight sizes: 6B–9B
Context window: 32K
Public generations: 4+

Capsule Summary

ChatGLM is the open-weight conversational model lineage from Zhipu AI that spans four public generations. The 6B builds run on consumer hardware; weights are on Hugging Face under permissive licenses. The lineage fed directly into the GLM-4 family now fronted by Z.ai, but ChatGLM releases remain independently downloadable and are the most-cloned community variants.

ChatGLM: origin and significance

How ChatGLM arrived, why its initial release was notable, and what made it different from competing open-weight chat models at the time.

ChatGLM arrived at a moment when the open-weight chat model landscape was dominated by models that handled English well but struggled to hold a coherent conversation in any other language. The original ChatGLM-6B was a 6-billion-parameter bilingual model that produced natural Chinese and English responses with no additional fine-tuning, shipped with weights publicly available on Hugging Face, and ran on a single consumer GPU with 8 GB of VRAM. That combination — conversational quality, bilingual coverage, local-inference feasibility — made it the most-downloaded Chinese-origin model in the first months after release.

The architectural decision behind ChatGLM was the same span-masking approach that characterised the broader GLM research line: the model was trained to predict entire masked spans autoregressively, giving it both the contextual understanding of a bidirectional encoder and the fluency of a generative model. That pretraining objective turned out to transfer well to instruction tuning, which is why the ChatGLM variants responded well to prompt engineering from the start, even before the field had developed a consensus on system-prompt design patterns.
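
To make the objective concrete, here is a minimal sketch of span-corruption training pairs in the GLM style, using a toy whitespace tokenizer and a single masked span. It is illustrative only: the marker tokens, span sampling, and 2D positional encoding of the actual GLM preprocessing are not reproduced here.

```python
# Toy illustration of a GLM-style span-corruption pair: a contiguous span is
# replaced by a [MASK] placeholder in the input, and the model regenerates
# that span autoregressively after a start-of-span marker appended to the
# corrupted context. Marker names and sampling are simplified assumptions.

def make_span_example(tokens, start, length):
    """Return (model_input, target) for one masked span."""
    span = tokens[start:start + length]
    corrupted = tokens[:start] + ["[MASK]"] + tokens[start + length:]
    # The model sees the corrupted context plus a start-of-span marker,
    # then must predict the original span tokens left to right.
    model_input = corrupted + ["[sop]"]
    target = span + ["[eop]"]
    return model_input, target

tokens = "ChatGLM was trained with a span masking objective".split()
model_input, target = make_span_example(tokens, start=4, length=3)
print(model_input)  # ['ChatGLM', 'was', 'trained', 'with', '[MASK]', 'objective', '[sop]']
print(target)       # ['a', 'span', 'masking', '[eop]']
```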

The significance for Z.ai is direct: every inference product in the current Z.ai portfolio builds on the architectural and training decisions first validated at scale in ChatGLM. The current GLM-4 and GLM-4.5+ flagship generations are the mature expressions of the same design lineage. Understanding ChatGLM history therefore gives a developer a concrete mental model for why the current-generation Z.ai models behave the way they do under prompting pressure.

ChatGLM generation history

A chronological walk through each ChatGLM generation, noting the key change introduced in each release.

The original ChatGLM established the 6B parameter size as the primary open-weight target. It used a 2K context window — adequate for single-turn conversations but limiting for longer tasks — and was instruction-tuned on Chinese and English data with a relatively small supervised fine-tuning dataset by today's standards. Despite those constraints, it outperformed alternatives at its parameter class on Chinese-language evaluations and set the template that later generations refined.

ChatGLM2-6B extended the context window to 32K tokens using a sliding-window position embedding approach that extrapolated well beyond the training context. It also added a multi-query attention variant that reduced the KV-cache memory footprint during inference, making longer conversations more tractable on the same hardware. The instruction tuning dataset grew significantly, and the feedback from the first generation's public release was incorporated into the alignment process.
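
The memory effect of the multi-query attention change can be estimated with simple arithmetic: the KV cache shrinks roughly by the ratio of query heads to key/value heads. The sketch below uses illustrative dimensions for a model of this size, not the published ChatGLM2-6B configuration.

```python
# Back-of-the-envelope KV-cache sizing. Multi-query attention shares a small
# number of key/value heads across all query heads, shrinking the cache by
# roughly num_attention_heads / num_kv_heads. Dimensions below are assumed
# for illustration, not taken from the ChatGLM2-6B config.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for keys and values; float16 is 2 bytes per element
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

seq_len = 32_000  # a conversation filling the 32K window
mha = kv_cache_bytes(layers=28, kv_heads=32, head_dim=128, seq_len=seq_len)
mqa = kv_cache_bytes(layers=28, kv_heads=2, head_dim=128, seq_len=seq_len)
print(f"MHA cache: {mha / 2**30:.1f} GiB, MQA cache: {mqa / 2**30:.2f} GiB")
```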

ChatGLM3 introduced a tool-calling schema that aligned with the emerging function-calling standards being adopted across the industry. It shipped at multiple parameter sizes simultaneously for the first time, providing a 6B local variant and a larger option accessible via API. The context window reached 128K in the long-context variant, and code capabilities improved markedly compared to the generalist training of the first two generations.
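
For orientation, a tool definition in the JSON-Schema function-calling shape that the ChatGLM3 schema aligns with might look like the sketch below. The tool name, fields, and registration mechanics are hypothetical; consult the ChatGLM3 model card for the exact format the model expects.

```python
# Hypothetical tool definition in the common JSON-Schema function-calling
# shape. Field names and the example tool are illustrative assumptions, not
# the exact ChatGLM3 wire format.
get_weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. Beijing"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# The tool list is supplied alongside the conversation; the model responds
# with the tool name and a JSON arguments object when it chooses to call it.
tools = [get_weather_tool]
```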

ChatGLM4 — the most recent build in the original ChatGLM naming convention — bridges into the GLM-4 family while maintaining the open-weight distribution model. The 9B ChatGLM4 build is competitive with models twice its size on standard benchmarks, reflecting the maturation of Zhipu AI's alignment pipeline over the four-generation arc. Practitioners who download ChatGLM4 and run it locally will find it substantially more capable than the original ChatGLM on every quality dimension, while requiring only marginally more hardware than the original 6B build when quantised.

Local inference workflow

How to load and run ChatGLM weights on local hardware, covering the main frameworks and quantisation options for consumer-grade setups.

Loading ChatGLM weights locally starts with the Hugging Face Transformers library. The model card for each generation lists the exact repository path and the recommended loading snippet. For the 6B and 9B builds, the standard float16 load requires 12–16 GB of VRAM on a single GPU; the int4 quantised build fits into 8 GB, and the int8 build sits between those two points with a modest quality trade-off. The Transformers library handles the quantisation transparently through the BitsAndBytes integration, so no separate quantisation step is required before inference.
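
A minimal loading sketch is shown below. The repository ID is illustrative and the chat() helper is provided by the model's remote code, so check the model card for your generation for the exact path and recommended snippet.

```python
# Minimal local-inference sketch with Hugging Face Transformers. The repo ID
# is illustrative; ChatGLM model cards require trust_remote_code=True.
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

repo = "THUDM/chatglm3-6b"  # illustrative; check the model card for each generation

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

# float16 load on a single GPU (roughly 12-16 GB of VRAM for the 6B/9B builds)
model = AutoModel.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.float16
).cuda().eval()

# Alternative: int4 load through the BitsAndBytes integration (fits in ~8 GB).
# model = AutoModel.from_pretrained(
#     repo, trust_remote_code=True, device_map="auto",
#     quantization_config=BitsAndBytesConfig(load_in_4bit=True),
# ).eval()

# The chat() helper comes from the model's remote code on the ChatGLM cards.
response, history = model.chat(tokenizer, "Summarise what ChatGLM is.", history=[])
print(response)
```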

The vLLM framework offers a higher-throughput option for teams running ChatGLM as a local API server. It implements continuous batching and paged attention, which increases requests-per-second by a meaningful margin compared to a naive Transformers inference loop. For a small team running ChatGLM as a shared internal service, a vLLM deployment on a single A100 or H100 instance handles the kind of concurrent-user load typical in a development environment without the operational complexity of a multi-node setup.
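
A sketch of vLLM's offline batch API with a ChatGLM-family checkpoint follows; the repository ID is illustrative. vLLM also ships an OpenAI-compatible HTTP server entry point, which is the usual choice for the shared-service case described above.

```python
# Sketch of vLLM offline batch inference. Continuous batching and paged
# attention are handled internally; the model ID is an assumption, so point
# it at the checkpoint you actually downloaded.
from vllm import LLM, SamplingParams

llm = LLM(model="THUDM/glm-4-9b-chat", trust_remote_code=True)
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompts = [
    "Explain the difference between ChatGLM2 and ChatGLM3 in two sentences.",
    "Write a haiku about long context windows.",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```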

Practitioners comfortable with llama.cpp can also run the GGUF-format ChatGLM builds that the community has produced for the smaller parameter sizes. These builds run on CPU with optional GPU offloading, which makes them the most accessible path for developers who do not have a dedicated GPU available. Quality degrades relative to the full-precision builds, but for evaluation and prototyping purposes the GGUF path is often the fastest way to get a ChatGLM response into a development workflow. The download reference on this site covers the community GGUF mirror locations in more detail.
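
For the CPU-first path, a short sketch using the llama-cpp-python bindings is below. The GGUF file name is a placeholder: download a community conversion of the ChatGLM weights and point model_path at the local file.

```python
# Sketch of CPU inference with optional GPU offload over a community GGUF
# build, via the llama-cpp-python bindings. The file path is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./chatglm3-6b.Q4_K_M.gguf",  # placeholder local file
    n_ctx=4096,        # context window to allocate
    n_gpu_layers=20,   # layers to offload to GPU; set 0 for pure CPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me three uses for a local LLM."}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```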

ChatGLM generation comparison across the public lineage

| Generation | Release class | Notable change | Context window | Parameter size(s) |
| --- | --- | --- | --- | --- |
| ChatGLM | Original | First bilingual open-weight chat model; 6B on consumer GPU | 2K | 6B |
| ChatGLM2 | Second generation | 32K context via sliding-window PE; reduced KV-cache footprint | 32K | 6B |
| ChatGLM3 | Third generation | Tool-calling schema; multi-size release; 128K long-context variant | 32K–128K | 6B, 32B+ |
| ChatGLM4 (6B) | Fourth generation | Bridges into GLM-4 family; competitive with larger models on benchmarks | 128K | 6B |
| ChatGLM4 (9B) | Fourth generation | 9B open-weight build; improved alignment; tool-calling refinement | 128K | 9B |

ChatGLM versus GLM-4: where the lineages diverge

The practical difference between downloading a ChatGLM build and accessing the GLM-4 flagship through the BigModel API.

ChatGLM and GLM-4 share the same architectural heritage and the same Zhipu AI training pipeline. The divergence is primarily one of scale, distribution method, and intended use case. ChatGLM is the open-weight download lineage: weights are public, inference runs locally, and the cost of use is hardware time rather than per-token billing. GLM-4 is the hosted flagship lineage: the largest parameter classes are available only through the BigModel API, the context window reaches 128K, and the alignment quality at the top end is clearly beyond what a locally hosted 9B build can achieve.

For a developer starting a new project, the right starting point is usually the ChatGLM4-9B local build for prototyping and the GLM-4.5+ API for production. The local build allows rapid iteration without API costs, and because the underlying architecture is essentially the same, prompts that work well locally generally transfer to the hosted version without significant rework. That migration path is one of the structural advantages of the Z.ai ecosystem over competing families where the open-weight and hosted variants have more divergent training histories.
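
To illustrate the migration path, the sketch below sends the same messages list used against a local build to the hosted flagship through the BigModel API. It assumes the zhipuai Python SDK's OpenAI-style chat interface; model names and authentication details should be checked against the BigModel documentation.

```python
# Hosted-side sketch of the local-to-API migration path. The SDK call shape
# is assumed from the zhipuai Python SDK; the model name and key handling are
# placeholders to verify against the BigModel documentation.
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="YOUR_API_KEY")  # placeholder key

messages = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "Summarise the ChatGLM lineage in one paragraph."},
]

resp = client.chat.completions.create(model="glm-4", messages=messages)
print(resp.choices[0].message.content)
```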

Research guidance from the MIT Center for Research on Equitable and Open AI notes that local-inference models provide stronger privacy guarantees for sensitive workloads, which is a legitimate reason to stay on the ChatGLM download path rather than migrating to the hosted API even when scale requirements would otherwise justify the shift. Teams handling sensitive data should assess that trade-off explicitly before committing to a deployment architecture.

ChatGLM: frequently asked questions

Five questions covering ChatGLM history, local inference, weight locations, and the relationship to the GLM-4 family.

What is ChatGLM?

ChatGLM is the open-weight chat model lineage developed by Zhipu AI. It was one of the first serious open-weight bilingual Chinese-English chat models, and its successive generations laid the foundation for the broader GLM AI model family now distributed under the Z.ai brand. The ChatGLM name covers the original release through ChatGLM4, with each generation adding context length, parameter options, and alignment improvements.

How many ChatGLM generations are there?

As of the current reference period, the lineage covers the original ChatGLM, ChatGLM2-6B, ChatGLM3 (multiple parameter sizes), and the ChatGLM4 builds that bridge into the GLM-4 family. Each generation introduced a notable change: ChatGLM2 extended context to 32K, ChatGLM3 added tool-calling, and ChatGLM4 pushed the open-weight quality ceiling to match models with larger parameter counts.

Can I run ChatGLM locally?

Yes. The ChatGLM weights are published on Hugging Face under permissive licenses. The smaller 6B and 9B variants run on consumer GPUs with 8–16 GB VRAM, and int4 quantised builds extend that to lower-end hardware. Standard inference frameworks including vLLM, llama.cpp, and Hugging Face Transformers all support the weights directly.

Where are ChatGLM weights hosted?

The primary distribution point for ChatGLM weights is Hugging Face, where Zhipu AI maintains the canonical model pages for each generation. The Zhipu AI GitHub organisation also links to the model cards and provides inference code and fine-tuning scripts for each release. Community GGUF mirrors exist for llama.cpp-compatible deployments on CPU hardware.

How does ChatGLM relate to the GLM-4 family?

ChatGLM is the original open-weight branch that popularised the GLM architecture for conversational use. The GLM-4 generation absorbs and extends the ChatGLM design, adding longer context windows, better instruction following, and a code-specialised branch. ChatGLM remains its own independently downloadable lineage, but GLM-4 is the current development focus and the basis for the hosted Z.ai API.

ChatGLM in context: the broader Z.ai model ecosystem

ChatGLM is the community-facing open-weight root of a model tree that now includes the hosted GLM-4 line, the BigModel API, and the full Z.ai product surface.

The ChatGLM lineage is where most outside developers first encounter the work of Zhipu AI, but it is one branch of a larger tree. The GLM AI model family documentation covers the full parameter sweep and the current flagship generation. Developers who move from local ChatGLM inference to the hosted Zhipu AI API will find that the BigModel open platform serves as the management surface for keys, billing, and project analytics. For a broader view of the Zhipu AI LLM stack, including how context window evolution fits into the overall trajectory, the LLM reference page covers that arc. The benchmarks page places ChatGLM generations on public evaluation charts alongside the current GLM-4 family, so the progression is quantified rather than asserted.