Nvidia GPUs are the clear leader, from raw performance to the development ecosystem. But how large is the performance gap, actually? A quick web search turned up this:
> Under the right circumstances, we found that Gaudi 2 had the highest LLM training performance vs. the same-generation NVIDIA A100 and AMD MI250 GPUs, with an average speedup of 1.22x vs. the A100-80GB, 1.34x vs. the A100-40GB, and 1.59x vs. the MI250.
>
> On each platform, we ran the same training scripts from LLM Foundry using MPT models with a sequence length of 2048, BF16 mixed precision, and the ZeRO Stage-3 distributed training algorithm. On NVIDIA or AMD systems, this algorithm is implemented via PyTorch FSDP with sharding_strategy: FULL_SHARD. On Intel systems, this is currently done via DeepSpeed ZeRO with Stage: 3, but FSDP support is expected to be added in the near future.
>
> On each system, we also used the most optimized implementation of scaled-dot-product-attention (SDPA) available:
>
> - NVIDIA: Triton FlashAttention-2
> - AMD: ROCm ComposableKernel FlashAttention-2
> - Intel: Gaudi TPC FusedSDPA
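For reference, the FULL_SHARD setup described in the excerpt looks roughly like the following in plain PyTorch. This is a minimal sketch, not LLM Foundry's actual code; the tiny model and the BF16 mixed-precision policy are my own illustrative choices.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)

# Assumes launch via `torchrun`, which sets the env vars init_process_group reads.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Stand-in model; the benchmark trained MPT models from LLM Foundry.
model = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True).cuda()

# FULL_SHARD shards parameters, gradients, and optimizer state across all
# ranks -- the FSDP counterpart of DeepSpeed ZeRO Stage 3.
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=MixedPrecision(
        param_dtype=torch.bfloat16,   # BF16 mixed precision, as in the benchmark
        reduce_dtype=torch.bfloat16,
    ),
)

# On Gaudi, the DeepSpeed equivalent is a config along the lines of:
# {"bf16": {"enabled": true}, "zero_optimization": {"stage": 3}}
```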
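On the attention side, the excerpt lists vendor-tuned kernels, but the common entry point in model code is PyTorch's built-in SDPA, which dispatches to a fused flash-attention kernel where one is available. A minimal sketch (the `torch.nn.attention` module requires a recent PyTorch, roughly 2.3+; shapes are arbitrary except the 2048 sequence length, and this built-in kernel is not the same as the vendor-specific ones above):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# batch=2, heads=8, seq_len=2048 (as in the benchmark), head_dim=64
q = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict dispatch to the fused flash-attention backend rather than the
# slower math fallback.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([2, 8, 2048, 64])
```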
On the tooling side, PyTorch seems to be the most popular framework: the official release supports Nvidia and AMD, while Intel maintains a modified build that supports Gaudi. During the early LLM arms race, big companies understandably prioritized the best-performing hardware. But once foundation models mature and the work shifts toward fine-tuning and domain adaptation, how necessary is it really to fight over the very best GPUs? At our AI R&D company of one to two thousand people, the most commonly used GPU is the A5000. So I feel Nvidia still faces some long-term risks, and I hope Jensen Huang can lead Nvidia into even more successful AI application areas. Nvidia's rise from its early days shows his long-range vision.
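As a footnote to the PyTorch point above, user code can stay fairly backend-agnostic. A minimal sketch, assuming Intel's Gaudi bridge package (`habana_frameworks.torch`) is installed on Gaudi machines; ROCm builds of PyTorch reuse the `cuda` device namespace, so Nvidia and AMD share one code path:

```python
import torch

if torch.cuda.is_available():
    # Covers both NVIDIA (CUDA) and AMD (ROCm/HIP) builds of PyTorch.
    device = torch.device("cuda")
else:
    try:
        # Intel's Gaudi bridge registers an "hpu" device when imported.
        import habana_frameworks.torch.core  # noqa: F401
        device = torch.device("hpu")
    except ImportError:
        device = torch.device("cpu")

x = torch.ones(4, device=device)
print(x.device)
```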