Zhipuai download: GLM and ChatGLM weight access
A practical guide to the zhipuai download paths for open-weight GLM and ChatGLM builds — where files live, how releases are named, how to verify integrity, and how to get a self-hosted instance running from scratch.
Hugging Face · Primary mirror
safetensors · Official format
SHA-256 · Integrity check
8 GB+ · Min VRAM (quantised)
Practical Recap
For most first-time users, the zhipuai download path is simple: find the THUDM organisation on Hugging Face, pick the ChatGLM or GLM variant that matches your VRAM budget, download with huggingface-cli or git-lfs, and verify the SHA-256 hashes before loading. Community GGUF mirrors exist for llama.cpp workflows.
Where the open-weight builds live
Three primary locations host GLM and ChatGLM weights for zhipuai download: Hugging Face, GitHub releases, and community quantised mirrors.
The canonical zhipuai download location for open-weight GLM and ChatGLM builds is the THUDM organisation on Hugging Face. THUDM is the research group within Tsinghua University that co-developed the GLM family with Zhipu AI. Every major model release appears there, typically within days of the associated paper or announcement. The organisation page lists each model repository with the release date, parameter count, and a direct link to the model card.
The second location is the GitHub releases section of the THUDM repositories. GitHub releases are primarily used for non-weight assets — inference scripts, evaluation harnesses, configuration files, and model cards. The weights themselves are too large for GitHub's standard LFS limits, so only lighter artefacts travel that path. When a new model ships, the GitHub release will typically contain links back to the Hugging Face repository for the actual weight files.
The third location is community quantised mirrors. After a model ships in full precision, contributors in the broader open-source community produce GGUF quantisations for llama.cpp, GPTQ versions for GPU inference, and AWQ variants for further VRAM reduction. These mirrors are not produced by Zhipu AI but follow the same naming conventions and often appear within 24 to 72 hours of the official release. They are listed on Hugging Face under the contributor's own namespace rather than THUDM.
File naming conventions
GLM weight files follow a consistent naming pattern that encodes the model family, parameter count, and optional quantisation level.
Official safetensors releases use a shard-based naming scheme: model-00001-of-00008.safetensors for a model split across eight shards. The number of shards scales with model size — smaller variants ship as a single file while flagship builds split across eight or more shards. Each shard is listed in a model.safetensors.index.json file that maps layer names to shards, which is what loading libraries use to reconstruct the full model.
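To see how the index stitches the shards together, a short sketch can tally which shard holds each tensor. This is a minimal illustration, assuming the index file sits in the working directory and follows the standard transformers weight_map layout:

```python
# Count how many tensors each shard contributes, using the weight_map
# that sharded safetensors releases ship alongside the weights.
import json
from collections import Counter

with open("model.safetensors.index.json") as f:
    index = json.load(f)

# weight_map maps each tensor name (e.g. a layer's weights) to its shard file
shard_counts = Counter(index["weight_map"].values())
for shard, n_tensors in sorted(shard_counts.items()):
    print(f"{shard}: {n_tensors} tensors")
```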
Community GGUF files embed the quantisation level in the filename: chatglm3-6b-q4_k_m.gguf encodes the base model name, parameter count, and quantisation scheme. The Q4_K_M format is the most common community choice, balancing quality loss against VRAM reduction. Q8_0 retains near-full-precision quality at roughly half the VRAM of the original safetensors. Checking the model card for the specific quantisation trade-off table before choosing is worth the two minutes it takes.
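As a quick illustration of how a quantised file is consumed, here is a minimal llama-cpp-python sketch. The filename and parameters are illustrative, and GGUF support for a given GLM variant should be confirmed on the mirror's model card:

```python
# Minimal llama-cpp-python loading sketch for a community GGUF build.
from llama_cpp import Llama

llm = Llama(
    model_path="./chatglm3-6b-q4_k_m.gguf",  # Q4_K_M: common quality/VRAM balance
    n_ctx=4096,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers to the GPU; use 0 for CPU-only
)
print(llm("Say hello in one sentence.", max_tokens=32)["choices"][0]["text"])
```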
| File pattern | Content | Typical use |
|---|---|---|
| model-0000N-of-0000M.safetensors | Full-precision weight shards (official release) | Production inference on multi-GPU servers or cloud inference |
| model.safetensors.index.json | Shard-to-layer mapping index | Required companion file for sharded safetensors loading |
| *-q4_k_m.gguf | 4-bit community quantisation (llama.cpp format) | Laptop and single-GPU consumer inference via llama.cpp or Ollama |
| *-gptq-int4.zip | GPTQ 4-bit quantisation for GPU inference | Lower-VRAM GPU inference with transformers + auto-gptq |
| tokenizer*.json | Tokenizer configuration and vocabulary files | Required alongside weights; must match the model version exactly |
Integrity verification
Every file in a Hugging Face repository ships with a SHA-256 hash recorded in the repository metadata, verifiable with a single CLI command.
Hugging Face records a SHA-256 hash for every LFS-tracked file in the repository metadata, and the local hub cache stores downloaded blobs under that hash. The huggingface-cli scan-cache command surfaces the cached files and their revisions. During download, the huggingface_hub helpers check each fetched file against its expected size and raise an error on a mismatch; for full cryptographic assurance, compare a locally computed SHA-256 against the hash shown on the file's repository page. That extra step matters most in production pipelines, where a corrupted download can cause a silent inference error rather than an obvious crash.
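A minimal verification sketch, assuming huggingface_hub is installed; the shard filename and the EXPECTED_SHA256 placeholder are illustrative and should be replaced with values from the actual release:

```python
# Download one shard, then compare a locally computed SHA-256 against
# the hash published on the file's Hugging Face page.
import hashlib
from huggingface_hub import hf_hub_download

EXPECTED_SHA256 = "<hash copied from the Hugging Face file page>"  # placeholder

path = hf_hub_download(
    repo_id="THUDM/chatglm3-6b",
    filename="model-00001-of-00007.safetensors",  # shard names vary by release
)

digest = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        digest.update(chunk)

assert digest.hexdigest() == EXPECTED_SHA256, "hash mismatch: re-download the file"
```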
For GGUF community mirrors, integrity verification depends on the contributor's practice. Reputable community quantisers publish SHA-256 hashes in the model card and link the originating commit in the upstream THUDM repository. If a GGUF mirror does not link back to its source commit, treat it with caution and prefer a mirror that does. NIST's AI Risk Management Framework addresses the broader topic of machine-learning supply-chain integrity, including validating third-party model artefacts before deployment.
Getting started with self-hosted GLM inference
A four-step sequence covers the minimum viable self-hosted GLM setup from download to first generation.
Step one is choosing the right variant for your hardware. The 6B ChatGLM builds run at 4-bit quantisation on a consumer GPU with 8 GB VRAM. The 9B GLM-4 builds need 16 GB VRAM at Q4 or a full 24 GB at 8-bit. The 32B+ builds require either multiple GPUs or a cloud instance. Checking the model card for the community-reported VRAM requirements before downloading saves time.
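As a rough sanity check before consulting the model card, the weights-only footprint is parameter count times bits per weight; real requirements run higher once the KV cache, activations, and framework overhead are added, which is why the figures above exceed these floors:

```python
# Back-of-envelope weights-only VRAM floor. Actual requirements are
# higher: KV cache, activations, and framework overhead all add on top.
def weights_gb(params_billion: float, bits: int) -> float:
    return params_billion * bits / 8  # 1B params at 8-bit is roughly 1 GB

for label, params, bits in [("6B @ Q4", 6, 4), ("9B @ Q4", 9, 4), ("9B @ 8-bit", 9, 8)]:
    print(f"{label}: ~{weights_gb(params, bits):.1f} GB for weights alone")
```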
Step two is downloading. The fastest path for most users is huggingface-cli download THUDM/chatglm3-6b --local-dir ./chatglm3-6b. This fetches all files in the repository, including the tokenizer and configuration. For bandwidth-limited connections, adding --include "*.safetensors" fetches only the weights; tokenizer files can be downloaded separately in seconds.
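The same fetch can be scripted from Python with huggingface_hub's snapshot_download; the allow_patterns argument mirrors the --include flag:

```python
# Python equivalent of the huggingface-cli download command above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="THUDM/chatglm3-6b",
    local_dir="./chatglm3-6b",
    # Uncomment to fetch weights only; tokenizer files can follow separately.
    # allow_patterns=["*.safetensors"],
)
```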
Step three is loading. The standard pattern uses the transformers library with AutoModelForCausalLM.from_pretrained pointing at the local directory. For GGUF files, llama-cpp-python or Ollama handles loading directly. The model card for each variant specifies which loading pattern is recommended and whether the trust_remote_code flag is required.
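A minimal loading sketch with transformers, assuming a variant whose remote code registers with AutoModelForCausalLM; some older ChatGLM releases load via AutoModel instead, so defer to the model card:

```python
# Load the downloaded weights from the local directory. ChatGLM/GLM
# repositories ship custom modelling code, so trust_remote_code=True
# is typically required; device_map="auto" needs the accelerate package.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./chatglm3-6b"  # local directory from step two
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    trust_remote_code=True,
    device_map="auto",
)
```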
Step four is a prompt test. A single-turn prompt of ten to twenty tokens confirms the model loaded correctly and the tokenizer is aligned. If the output is garbled, the most common cause is a tokenizer mismatch — the tokenizer files in the local directory belong to a different model version than the weights.
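Continuing the sketch above, a single-turn generation serves as the smoke test; garbled output at this point usually means the tokenizer and weights come from different versions:

```python
# Short single-turn prompt to confirm weights and tokenizer are aligned.
inputs = tokenizer("Briefly introduce yourself.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```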
"Running ChatGLM locally changed how our team evaluates models. The zhipuai download path via Hugging Face was straightforward — we had a 6B instance running on a workstation GPU in under an hour, including the integrity checks."
Researcher · Coral Branch Cognitive · Charleston, WV
GLM download frequently asked questions
Five questions on where to find weights, file formats, hardware requirements, integrity, and licensing.
Where can I do a zhipuai download of GLM weights?
GLM and ChatGLM open weights are published on Hugging Face under the THUDM organisation and on the project's GitHub releases page. Community quantised mirrors for GGUF and other formats also appear on Hugging Face from third-party contributors shortly after each official release.
How do I verify the integrity of a downloaded GLM model?
Hugging Face publishes SHA-256 hashes for each file in the repository. The huggingface_hub Python library verifies file integrity automatically during download. For community GGUF mirrors, look for a model card that links the originating THUDM commit and lists explicit SHA-256 values.
What file formats are available for GLM downloads?
Official releases ship as safetensors shards for full-precision use. Community contributors publish GGUF quantisations for llama.cpp inference and GPTQ versions for lower-VRAM GPU inference. Always download the matching tokenizer files alongside the weights.
What hardware do I need to run ChatGLM locally?
The smallest ChatGLM variants run on a consumer GPU with 8 GB VRAM using a 4-bit quantised build. Mid-size variants need 16–24 GB VRAM at full precision, and flagship builds require multiple high-end GPUs or a hosted inference environment.
Are the GLM weights free to use commercially?
Licensing varies by release. Most ChatGLM builds ship under a custom model license that permits personal and research use freely but adds conditions for commercial deployments above a usage threshold. Always check the specific LICENSE file in the model repository before committing to a production deployment.
Zhipuai download in context
How the download reference connects to the surrounding Z.ai ecosystem pages.
The zhipuai download path is one of two ways to access GLM models — the other being the hosted Zhipuai API through the BigModel AI platform. Downloaded weights are the right choice when latency, privacy, or cost at scale make a local deployment preferable. The ChatGLM reference covers the specific lineage that most downloaders start with, and the GLM model family page explains how the full catalog is organised by generation. For teams who want to build on the downloaded weights with fine-tuning or evaluation scripts, the Zhipu AI GitHub page maps the relevant repositories. The integrations reference covers loading GLM weights into common inference frameworks like Ollama and llama.cpp.