## Benchmark

### Comparison with Hugging Face Transformers

**Make big models trainable on consumer GPUs.** Tested on a 32GB V100 with bert-large-uncased, we reach throughput comparable to Hugging Face Transformers with a much smaller GPU memory footprint.
| repo | max batch size (#examples) | time per batch (s) | throughput (#examples/s) |
|---|---|---|---|
| transformers | 11 | 1.11 | 9.9 |
| transformers + fp16 | 14 | 0.53 | 26.4 |
| modelcenter | 256 | 10.3 | 24.9 |
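To make the measurement concrete, below is a minimal sketch of how the Transformers-side numbers above could be reproduced: it searches for the largest batch that fits on one GPU and times a single training step. The masked-LM objective, sequence length 512, synthetic inputs, and the coarse doubling search are assumptions for illustration, not the exact benchmark script.

```python
import time
import torch
from transformers import AutoModelForMaskedLM

def timed_step(model, optimizer, batch_size, seq_len=512, vocab_size=30522):
    """Run one forward/backward/optimizer step; return seconds, or None on OOM."""
    try:
        # synthetic token ids stand in for real data (assumption for the sketch)
        input_ids = torch.randint(0, vocab_size, (batch_size, seq_len), device="cuda")
        torch.cuda.synchronize()
        start = time.time()
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        torch.cuda.synchronize()
        return time.time() - start
    except RuntimeError as e:  # CUDA OOM surfaces as a RuntimeError
        if "out of memory" not in str(e):
            raise
        torch.cuda.empty_cache()
        return None

model = AutoModelForMaskedLM.from_pretrained("bert-large-uncased").cuda().train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Coarse doubling search for the largest batch that fits; a linear or binary
# search around the last successful value would refine the result.
best, best_time, batch = 0, None, 1
while True:
    elapsed = timed_step(model, optimizer, batch)
    if elapsed is None:
        break
    best, best_time = batch, elapsed
    batch *= 2

if best:
    print(f"max batch size {best}, {best_time:.2f} s/step, "
          f"{best / best_time:.1f} examples/s")
else:
    print("bert-large-uncased does not fit at batch size 1")
```

Note that the first step includes one-time CUDA warmup, so averaging over several steps gives more stable timings.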
Tested on a single consumer GPU (11GB 2080Ti), training bert-large-uncased is not feasible with Hugging Face Transformers, even with fp16, but ModelCenter makes it possible.
| repo | max batch size (#examples) |
|---|---|
| transformers | 0 |
| transformers + fp16 | 0 |
| modelcenter | 72 |
**Make huge models train easily.** Tested on a 40GB A100 with T5-11B, we make it possible to train with a batch size of 16 using only two GPUs.
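A rough memory estimate shows why two 40GB GPUs are not enough for naive data-parallel training of T5-11B, and why ZeRO-style partitioning and offloading are needed. The sketch below assumes standard mixed-precision Adam (fp16 weights and gradients plus fp32 optimizer states); it is back-of-the-envelope arithmetic, not a measured number.

```python
# Back-of-the-envelope memory for T5-11B under mixed-precision Adam
# (assumed setup: fp16 weights/gradients, fp32 master weights + Adam moments).
params = 11e9                 # parameter count of T5-11B

weights_fp16 = 2 * params     # 2 bytes per fp16 weight
grads_fp16 = 2 * params       # 2 bytes per fp16 gradient
adam_fp32 = 12 * params       # fp32 master weights + momentum + variance

total_gb = (weights_fp16 + grads_fp16 + adam_fp32) / 1e9
print(f"~{total_gb:.0f} GB for model states alone")  # ~176 GB, before activations

# Two 40GB A100s hold 80 GB in total, so plain data parallelism cannot even
# store the model states; partitioning them across GPUs and offloading to CPU
# (ZeRO-style techniques, as used by BMTrain) is what makes batch size 16 on
# two GPUs feasible.
```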
### Comparison with DeepSpeed ZeRO

See also BMTrain's Performance section for this comparison.