One thing I wonder is why no one has made a fork of PyTorch yet that removes all the API surface that doesn't produce GPU friendly code. Make dtype and device arg mandatory without defaults, remove in place operations that trigger a CPU sync, etc. This would increase confidence that written code will run on the GPU and pass torch.export() on the first try.