Also, according to the gpt-oss model card, 20b scores 60.7 on SWE-Bench Verified (GLM claims they measured 34 for that model) and 120b scores 62.7, versus the 59.7 GLM reports.
Tolerating this is very bad form from OpenRouter: they default-select the lowest-priced provider, meaning people who just jump into OpenRouter and don't know about this fuckery get facepalm'd by the perceived model quality.
ssh admin.hotaisle.app
Yes, this should be made easier to just get a VM with it pre-installed. Working on that.
It took me quite some time to figure out the magic combination of versions and commits, and to build each dependency successfully to run on an MI325x.
Here is the magic (assuming a 4x MI325x box)...
docker run -it --rm \
--pull=always \
--ipc=host \
--network=host \
--privileged \
--cap-add=CAP_SYS_ADMIN \
--device=/dev/kfd \
--device=/dev/dri \
--device=/dev/mem \
--group-add render \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v /home/hotaisle:/mnt/data \
-v /root/.cache:/mnt/model \
rocm/vllm-dev:nightly
# inside the container: redirect the HF cache to the host-mounted one so downloaded weights persist across runs
mv /root/.cache /root/.cache.foo
ln -s /mnt/model /root/.cache
VLLM_ROCM_USE_AITER=1 vllm serve zai-org/GLM-4.7-FP8 \
--tensor-parallel-size 4 \
--kv-cache-dtype fp8 \
--quantization fp8 \
--enable-auto-tool-choice \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--load-format fastsafetensors \
--enable-expert-parallel \
--allowed-local-media-path / \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--mm-encoder-tp-mode data
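Once the server is up, you can smoke-test the OpenAI-compatible endpoint it exposes. A minimal sketch, assuming vLLM's default port 8000 and that the model is served under its repo name (the prompt and token budget here are just placeholders):

```python
import json
import urllib.request

# Chat-completions payload for vLLM's OpenAI-compatible API.
payload = {
    "model": "zai-org/GLM-4.7-FP8",
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 16,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is actually running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])

print(json.dumps(payload, indent=2))
```

If tool calling is what you care about (that's what the --enable-auto-tool-choice / --tool-call-parser flags are for), add a "tools" array to the same payload and check that the response comes back with "tool_calls" populated.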