Linear
- class model_center.layer.Linear(*args: Any, **kwargs: Any)
Bases: bmtrain.DistributedModule
A fully connected layer, which performs \(\pmb{y} = \mathbf{W} \pmb{x} + \pmb{b}\)
- Parameters
dim_in (int) – input dimension of \(\pmb{x}\)
dim_out (int) – output dimension of \(\pmb{y}\)
dtype (optional) – Defaults to torch.half.
init_mean (float, optional) – mean of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\). Defaults to 0.
init_std (float, optional) – std of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\). Defaults to 1.
bias (bool, optional) – whether to add bias term \(\pmb{b}\). Defaults to False.
- forward(x: torch.Tensor)
- Parameters
x (torch.Tensor of shape (batch, seq_len, dim_in)) – The input of the linear layer.
- Returns
The output of the linear transform y.
- Return type
torch.Tensor of shape (batch, seq_len, dim_out)
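Example (a minimal usage sketch; it assumes a bmtrain distributed context has already been initialized, and the dimensions below are purely illustrative):
```python
import torch
from model_center.layer import Linear

# Linear is a bmtrain.DistributedModule, so bmtrain initialization
# (e.g. bmt.init_distributed()) is assumed to have happened already.
linear = Linear(dim_in=768, dim_out=3072, dtype=torch.half, bias=True)

x = torch.randn(2, 128, 768, dtype=torch.half, device="cuda")  # (batch, seq_len, dim_in)
y = linear(x)                                                  # (batch, seq_len, dim_out) -> (2, 128, 3072)
```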
Embedding
- class model_center.layer.Embedding(*args: Any, **kwargs: Any)
Bases: bmtrain.DistributedModule
Embed a sequence of indices through an embedding lookup matrix \(\mathbf{W}\).
- Parameters
vocab_size (int) – indices must be in the range \([0, \text{vocab_size})\)
embedding_size (int) – the output dimension of the embedding lookup matrix.
dtype (optional) – Defaults to torch.half.
init_mean (float, optional) – mean of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\). Defaults to 0.
init_std (float, optional) – std of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\). Defaults to 1.
- forward(ids: torch.Tensor)
- Parameters
ids (torch.Tensor of shape (batch_size, seq_len)) – Indices of input sequence tokens.
- Returns
The embedding output.
- Return type
torch.Tensor of shape (batch_size, seq_len, embedding_size)
- projection(x: torch.Tensor)
Projection based on the embedding’s weight. For example, if the embedding maps vocab_size to embed_size, the projection maps embed_size back to vocab_size.
- Parameters
x (torch.Tensor of shape (batch, seq_len, dim_model)) – Input of the projection.
- Returns
The projection output.
- Return type
torch.Tensor of shape (batch, seq_len, vocab_output_size)
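Example (a minimal sketch under the same bmtrain assumption as above; vocabulary and embedding sizes are illustrative):
```python
import torch
from model_center.layer import Embedding

embedding = Embedding(vocab_size=30522, embedding_size=768)

ids = torch.randint(0, 30522, (2, 128), device="cuda")  # (batch_size, seq_len)
hidden = embedding(ids)                                  # (2, 128, 768)

# Reuse the same weight to map hidden states back to vocabulary logits,
# e.g. for a tied language-modeling head.
logits = embedding.projection(hidden)                    # (2, 128, 30522)
```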
RelativePositionEmbedding
- class model_center.layer.RelativePositionEmbedding(*args: Any, **kwargs: Any)
Bases: bmtrain.DistributedModule
- Parameters
num_heads (int) – number of heads used in attention module.
num_buckets (int, optional) – Defaults to 32.
max_distance (int, optional) – Defaults to 128.
bidirectional (bool, optional) – Defaults to False.
dtype (optional) – Defaults to torch.half.
init_mean (float, optional) – Defaults to 0.0.
init_std (float, optional) – Defaults to 1.
- forward(query_len, key_len)
Provides relative position embeddings for key and query of num_heads attention heads.
- Parameters
query_len (int) – Length of query.
key_len (int) – Length of key.
- Returns
Relative position embedding.
- Return type
torch.Tensor of shape (num_heads, query_len, key_len)
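Example (a minimal sketch; num_heads=12 and bidirectional=True are illustrative choices, and a bmtrain distributed context is assumed):
```python
from model_center.layer import RelativePositionEmbedding

rel_pos = RelativePositionEmbedding(
    num_heads=12, num_buckets=32, max_distance=128, bidirectional=True
)

# A bias tensor that can be added to the attention scores of every head.
position_bias = rel_pos(query_len=128, key_len=128)  # (num_heads, 128, 128) -> (12, 128, 128)
```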
RotaryEmbedding
- class model_center.layer.RotaryEmbedding(rotary_dim: int)
Bases: torch.nn.modules.module.Module
- Parameters
rotary_dim (int) – rotary dimension
- forward(h_q, h_k)
- Parameters
h_q (torch.Tensor of shape (batch_size * num_head, len_q, dim_head)) – The query tensor.
h_k (torch.Tensor of shape (batch_size * num_head, len_k, dim_head)) – The key tensor.
- Returns
The rotated h_q of shape (batch_size * num_head, len_q, dim_head) and h_k of shape (batch_size * num_head, len_k, dim_head).
- Return type
Tuple of torch.Tensor
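Example (a minimal sketch; the flattened batch_size * num_head layout follows the shapes documented above, and all sizes are illustrative):
```python
import torch
from model_center.layer import RotaryEmbedding

rotary = RotaryEmbedding(rotary_dim=64)

batch_size, num_head, dim_head, len_q, len_k = 2, 12, 64, 128, 128
h_q = torch.randn(batch_size * num_head, len_q, dim_head, dtype=torch.half, device="cuda")
h_k = torch.randn(batch_size * num_head, len_k, dim_head, dtype=torch.half, device="cuda")

# Apply the rotary position embedding; shapes stay unchanged.
h_q, h_k = rotary(h_q, h_k)
```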
LayerNorm
- class model_center.layer.LayerNorm(*args: Any, **kwargs: Any)
Bases: bmtrain.DistributedModule
LayerNorm if bias = True: \(y = {x-\text{E}[x] \over \sqrt{\text{Var}[x]+\text{eps}}} * w + \text{bias}\)
RMS LayerNorm if bias = False: \(y = {x \over \sqrt{\text{Var}[x]+\text{eps}}} * w\)
- Parameters
dim_norm (int) – norm dimension
dtype (optional) – Defaults to torch.half.
bias (bool, optional) – whether to add the \(\text{bias}\) term. Defaults to True.
eps (float, optional) – \(\text{eps}\) term. Defaults to 1e-5.
init_var (float, optional) – weight will be all initialized to init_var. Defaults to 1.0.
- forward(x: torch.Tensor)
- Parameters
x (torch.Tensor of shape (batch_size, seq_len, dim_norm)) – Input tensor that needs to be normalized.
- Returns
The layernorm output.
- Return type
torch.Tensor of shape (batch_size, seq_len, dim_norm)
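Example (a minimal sketch; bmtrain distributed context assumed, sizes illustrative):
```python
import torch
from model_center.layer import LayerNorm

# bias=True gives the standard LayerNorm; bias=False gives the RMS variant
# described above.
layernorm = LayerNorm(dim_norm=768, bias=True, eps=1e-5)

x = torch.randn(2, 128, 768, dtype=torch.half, device="cuda")  # (batch_size, seq_len, dim_norm)
y = layernorm(x)                                               # (2, 128, 768)
```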
Attention
- class model_center.layer.Attention(*args: Any, **kwargs: Any)
Bases: bmtrain.DistributedModule
Attention module consisting of the Q, K, V projections, their combination, and an output projection. For more details, see Attention Is All You Need.
- Parameters
dim_in (int) – input dimension.
dim_head (int) – dimension of each heads used in attention.
num_heads (int) – number of heads used in attention.
dim_out (int, optional) – output dimension. Defaults to None, which means dim_in = dim_out.
dtype (optional) – Defaults to torch.half.
init_mean (float, optional) – mean of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\) for fully-connected module used in attention module. Defaults to 0.
init_std (float, optional) – std of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\) for fully-connected module used in attention module. Defaults to 0.02.
bias (bool, optional) – whether to use bias term in fully-connected layers used in attention module. Defaults to False.
mask_value (float, optional) – mask value of the masked position. Defaults to -inf.
pos_bias_type (str, optional) – relative for relative position bias, rotary for rotary position embedding. Defaults to none.
attn_scale (bool, optional) – whether to scale before softmax, i.e., \(\text{softmax}({Q K^T \over \sqrt{\text{dim_model}}})\). Defaults to False.
dropout_p (float, optional) – Defaults to 0.
- forward(query: torch.Tensor, key_value: torch.Tensor, mask: torch.Tensor, position_bias: Optional[torch.Tensor] = None)
- Parameters
query (torch.Tensor of shape (batch, len_q, dim_model)) – The query hidden states fed into the attention module.
key_value (torch.Tensor of shape (batch, len_k, dim_model)) – The key/value hidden states fed into the attention module.
mask (torch.Tensor of shape (batch, len_q, len_k)) – Attention mask used to avoid performing attention on padding token positions.
position_bias (torch.Tensor of shape (num_heads, len_q, len_k) or (1, num_heads, len_k, len_q), optional) – Provides positional information about the key_value and query tensors.
- Returns
The attention output.
- Return type
out (torch.Tensor of shape (batch, len_q, dim_model))
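Example (a minimal sketch; the boolean "True = attend" mask convention and the causal mask shown here are illustrative assumptions, as is the bmtrain setup and every size):
```python
import torch
from model_center.layer import Attention

attention = Attention(dim_in=768, dim_head=64, num_heads=12, attn_scale=True)

batch, len_q, len_k = 2, 128, 128
query = torch.randn(batch, len_q, 768, dtype=torch.half, device="cuda")
key_value = torch.randn(batch, len_k, 768, dtype=torch.half, device="cuda")

# Causal (lower-triangular) mask broadcast over the batch; the boolean
# "True = keep this position" convention is an assumption for illustration.
mask = torch.tril(torch.ones(len_q, len_k, dtype=torch.bool, device="cuda"))
mask = mask.unsqueeze(0).expand(batch, -1, -1)

out = attention(query, key_value, mask)  # (batch, len_q, dim_model) -> (2, 128, 768)
```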
FeedForward
- class model_center.layer.FeedForward(*args: Any, **kwargs: Any)
Bases: bmtrain.DistributedModule
FeedForward module
- Parameters
dim_in (int) – input dimension.
dim_ff (int) – middle dimension.
dim_out (int, optional) – output dimension. Defaults to None, which means dim_in = dim_out.
dtype (optional) – Defaults to torch.half.
init_mean (float, optional) – mean of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\) for fully-connected module used in feed-forward layer. Defaults to 0.
init_std (float, optional) – std of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\) for fully-connected module used in feed-forward layer. Defaults to 0.02.
bias (bool, optional) – whether to use bias term in fully-connected layers used in feed-forward module. Defaults to False.
activate_fn (str, optional) – Defaults to gated_gelu.
dropout_p (float, optional) – Defaults to 0.
- forward(x: torch.Tensor)
- Parameters
x (torch.Tensor of shape (batch, seq_len, dim_in)) – The input of the feed-forward module.
- Returns
The output of feed-forward module.
- Return type
torch.Tensor of shape (batch, seq_len, dim_out)
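Example (a minimal sketch; bmtrain distributed context assumed, and dim_out is left at its default so the output dimension equals dim_in):
```python
import torch
from model_center.layer import FeedForward

ffn = FeedForward(dim_in=768, dim_ff=3072, activate_fn="gated_gelu")

x = torch.randn(2, 128, 768, dtype=torch.half, device="cuda")  # (batch, seq_len, dim_in)
y = ffn(x)                                                     # (2, 128, 768)
```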