Introducing zml-smi
Summary
zml-smi is a universal diagnostic and monitoring tool for GPUs, TPUs, and NPUs that combines features of nvidia-smi and nvtop. It currently supports NVIDIA, AMD, Google TPU, and AWS Trainium devices, with support for more hardware planned. It reports detailed host, process, and device metrics across these platforms, all from within a sandboxed environment. The article covers usage, the available metrics, and the complex workaround needed to integrate AMD support without breaking the sandbox.