PyTorch GPT-X

My own PyTorch GPT-X

1. Abstract

Train a GPT-3-style model on V100 GPUs (16GB memory) using an improved Transformer.

2. Model

Transformer

Additional Module

① ReZero

ReZero is All You Need: Fast Convergence at Large Depth link
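
A minimal sketch of the ReZero residual connection, assuming a generic sublayer; module and parameter names here are illustrative, not the repository's actual code:

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Residual block with a learnable gate initialized to zero (ReZero).

    Instead of LayerNorm, the sublayer output is scaled by a scalar alpha
    that starts at 0, so each block initially acts as the identity.
    """
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = nn.Parameter(torch.zeros(1))  # starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x + alpha * F(x): at init only the identity path contributes,
        # which keeps very deep stacks trainable without normalization.
        return x + self.alpha * self.sublayer(x)
```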

② Explicit Sparse Transformer

Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection link
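
A sketch of the explicit top-k selection applied to the attention scores, following the idea in the paper; the function name and `top_k` default are assumptions for illustration, and `top_k` must not exceed the key length:

```python
import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, top_k: int = 8):
    """Scaled dot-product attention with explicit top-k selection.

    Scores outside the k largest per query row are masked to -inf,
    so softmax concentrates probability on the selected keys.
    """
    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5       # (..., L_q, L_k)
    kth = scores.topk(top_k, dim=-1).values[..., -1:]      # k-th largest score per row
    masked = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(masked, dim=-1) @ v
```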

③ Macaron Architecture

Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View link
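
A rough sketch of a Macaron-style block (FFN half-step, self-attention, FFN half-step, with each FFN residual scaled by 0.5); the layer names and pre-norm placement are assumptions, not the repository's exact design:

```python
import torch.nn as nn

class MacaronBlock(nn.Module):
    """Macaron-style Transformer block: FFN -> self-attention -> FFN,
    where the two FFN residuals are half-steps (scaled by 0.5)."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn2 = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + 0.5 * self.ffn1(self.norm1(x))   # first half-step FFN
        h = self.norm2(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                          # self-attention step
        x = x + 0.5 * self.ffn2(self.norm3(x))   # second half-step FFN
        return x
```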

④ RealFormer, Residual Attention

RealFormer: Transformer Likes Residual Attention link
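
A sketch of RealFormer's residual attention: the raw (pre-softmax) attention scores from the previous layer are added to the current layer's scores. Function and argument names are illustrative:

```python
import torch
import torch.nn.functional as F

def residual_attention(q, k, v, prev_scores=None):
    """RealFormer-style residual attention.

    The previous layer's raw attention scores are added to the current
    scores before softmax; the raw scores are returned so the next layer
    can continue the residual chain.
    """
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    if prev_scores is not None:
        scores = scores + prev_scores        # residual connection on scores
    out = F.softmax(scores, dim=-1) @ v
    return out, scores
```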

Train

DeepSpeed
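
A minimal sketch of wrapping a model with `deepspeed.initialize`; the config values below are placeholders and may differ from the repository's actual DeepSpeed JSON:

```python
import deepspeed
import torch.nn as nn

# Illustrative config only; batch size, ZeRO stage and optimizer are assumptions.
ds_config = {
    "train_batch_size": 16,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},            # half precision to fit 16GB V100s
    "zero_optimization": {"stage": 2},    # shard optimizer states and gradients
    "optimizer": {"type": "Adam", "params": {"lr": 1.2e-4}},
}

model = nn.Linear(128, 128)  # placeholder; substitute the GPT-X model here

# deepspeed.initialize wraps the model in an engine that handles fp16,
# ZeRO partitioning and distributed setup.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# Training step sketch:
#   loss = engine(batch)     # forward through the wrapped model
#   engine.backward(loss)
#   engine.step()
```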

TODO

  • ReZero
  • RealFormer (residual attention)
  • Macaron architecture
  • Macaron architecture – layer scale 0.5
  • Explicit Sparse Transformer
  • PyTorch Lightning
  • DeepSpeed training on a single GPU
  • DeepSpeed parallel training on 2 V100 GPUs with 16GB memory each

Parameters for Few-Shot Learning

The 175B-parameter model is very large, but a large model is needed for few-shot learning, so this repository tries to use DeepSpeed to train an extremely large model.

GPT-3 Config

| model_name | n_params | n_layer | d_model | n_heads | d_head | batch_size (tokens) | learning_rate |
|------------|----------|---------|---------|---------|--------|---------------------|---------------|
| GPT-3 175B | 175B     | 96      | 12288   | 96      | 128    | 3.2M                | 0.6 x 10^-4   |
| GPT-3 13B  | 13B      | 40      | 5140    | 40      | 128    | 2M                  | 1.0 x 10^-4   |
| GPT-3 6.7B | 6.7B     | 32      | 4096    | 32      | 128    | 2M                  | 1.2 x 10^-4   |
| GPT-3 2.7B | 2.7B     | 32      | 2560    | 32      | 80     | 1M                  | 1.6 x 10^-4   |
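
For illustration, the 2.7B row mapped onto a simple config object; the class and field names are assumptions, and note that d_model = n_heads x d_head = 32 x 80 = 2560:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    """Hyperparameters matching the table columns above (illustrative names)."""
    n_layer: int
    d_model: int
    n_heads: int
    d_head: int
    batch_size_tokens: int
    learning_rate: float

# GPT-3 2.7B row: 32 layers, d_model 2560, 32 heads, head dim 80
gpt3_2_7b = GPTConfig(
    n_layer=32,
    d_model=2560,
    n_heads=32,
    d_head=80,
    batch_size_tokens=1_000_000,   # "1M" in the table is tokens per batch
    learning_rate=1.6e-4,
)
```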

References

Transformer

DeepSpeed

ReZero

Explicit Sparse Transformer

Macaron Architecture

RealFormer

GitHub

https://github.com/nawnoes/pytorch-gpt-x