model_center.layer module

Linear

class model_center.layer.Linear(*args: Any, **kwargs: Any)

Bases: bmtrain.DistributedModule

A fully connected layer, which performs \(\pmb{y} = \mathbf{W} \pmb{x} + \pmb{b}\)

Parameters
  • dim_in (int) – input dimension of \(\pmb{x}\)

  • dim_out (int) – output dimension of \(\pmb{y}\)

  • dtype (optional) – Defaults to torch.half.

  • init_mean (float, optional) – mean of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\). Defaults to 0.

  • init_std (float, optional) – std of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\). Defaults to 1.

  • bias (bool, optional) – whether to add bias term \(\pmb{b}\). Defaults to False.

forward(x: torch.Tensor)
Parameters

x (torch.Tensor of shape (batch, seq_len, dim_in)) – The input of the linear layer.

Returns

The output of the linear transformation \(\pmb{y}\).

Return type

torch.Tensor of shape (batch, seq_len, dim_out)
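
Example: a minimal usage sketch. It assumes bmtrain.init_distributed() has already been called and a CUDA device is available, since parameters default to torch.half; the dimensions are placeholders.

    import torch
    import bmtrain as bmt
    from model_center.layer import Linear

    bmt.init_distributed()                    # assumed setup for distributed parameters

    # project hidden states from dim_in = 768 to dim_out = 3072, with a bias term
    linear = Linear(dim_in=768, dim_out=3072, bias=True)

    x = torch.randn(2, 128, 768, dtype=torch.half, device="cuda")   # (batch, seq_len, dim_in)
    y = linear(x)                                                   # (batch, seq_len, dim_out)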

Embedding

class model_center.layer.Embedding(*args: Any, **kwargs: Any)

Bases: bmtrain.DistributedModule

Embed a sequence of indices through an embedding lookup matrix \(\mathbf{W}\).

Parameters
  • vocab_size (int) – indices should be in the range \([0, \text{vocab_size})\)

  • embedding_size (int) – the output dimension of the embedding lookup matrix.

  • dtype (optional) – Defaults to torch.half.

  • init_mean (float, optional) – mean of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\). Defaults to 0.

  • init_std (float, optional) – std of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\). Defaults to 1.

forward(ids: torch.Tensor)
Parameters

ids (torch.Tensor of shape (batch_size, seq_len)) – Indices of input sequence tokens.

Returns

The embedding output.

Return type

torch.Tensor of shape (batch_size, seq_len, embedding_size)

projection(x: torch.Tensor)

Projection based on the embedding’s weight. For example, while the embedding maps vocab_size to embedding_size, the projection maps embedding_size back to vocab_size.

Parameters

x (torch.Tensor of shape (batch, seq_len, dim_model)) – The input of the projection.

Returns

The projection output.

Return type

torch.Tensor of shape (batch, seq_len, vocab_output_size)
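
Example: a minimal sketch of the lookup and the tied projection, under the same bmtrain/CUDA assumptions as the Linear example; the vocabulary and dimension sizes are placeholders.

    import torch
    import bmtrain as bmt
    from model_center.layer import Embedding

    bmt.init_distributed()                    # assumed setup for distributed parameters

    emb = Embedding(vocab_size=32000, embedding_size=768)

    ids = torch.randint(0, 32000, (2, 128), device="cuda")   # (batch_size, seq_len)
    h = emb(ids)                                             # (batch_size, seq_len, embedding_size)

    # reuse the embedding weight as an output head: hidden states back to vocabulary logits
    logits = emb.projection(h)                               # (batch_size, seq_len, vocab_size)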

RelativePositionEmbedding

class model_center.layer.RelativePositionEmbedding(*args: Any, **kwargs: Any)

Bases: bmtrain.DistributedModule

Relative Position Embedding

Parameters
  • num_heads (int) – number of heads used in attention module.

  • num_buckets (int, optional) – Defaults to 32.

  • max_distance (int, optional) – Defaults to 128.

  • bidirectional (bool, optional) – Defaults to False.

  • dtype (optional) – Defaults to torch.half.

  • init_mean (float, optional) – Defaults to 0.0.

  • init_std (float, optional) – Defaults to 1.

forward(query_len, key_len)

Provides relative position embeddings for key and query of num_heads attention heads.

Parameters
  • query_len (int) – Length of query.

  • key_len (int) – Length of key.

Returns

Relative position embedding.

Return type

torch.Tensor of shape (num_heads, query_len, key_len)
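
Example: a minimal sketch producing a bias that can be passed to an attention module as position_bias, under the same bmtrain/CUDA assumptions; the head count and lengths are placeholders.

    import bmtrain as bmt
    from model_center.layer import RelativePositionEmbedding

    bmt.init_distributed()                    # assumed setup for distributed parameters

    rel_pos = RelativePositionEmbedding(
        num_heads=12,
        num_buckets=32,
        max_distance=128,
        bidirectional=True,                   # encoder-style; decoders usually keep the default False
    )

    position_bias = rel_pos(128, 128)         # (num_heads, query_len, key_len)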

RotaryEmbedding

class model_center.layer.RotaryEmbedding(rotary_dim: int)

Bases: torch.nn.modules.module.Module

Rotary Position Embedding

Parameters

rotary_dim (int) – rotary dimension

forward(h_q, h_k)
Parameters
  • h_q (torch.Tensor of shape (batch_size * num_head, len_q, dim_head)) – The query tensor.

  • h_k (torch.Tensor of shape (batch_size * num_head, len_k, dim_head)) – The key tensor.

Returns

h_q of shape (batch_size * num_head, len_q, dim_head) and h_k of shape (batch_size * num_head, len_k, dim_head), with rotary position embedding applied.

Return type

Tuple[torch.Tensor, torch.Tensor]
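
Example: a minimal sketch of applying rotary embeddings to query and key heads. RotaryEmbedding is a plain torch.nn.Module, so no distributed setup is needed; the flattening of the batch and head dimensions shown here is an assumption for illustration.

    import torch
    from model_center.layer import RotaryEmbedding

    batch_size, num_head, dim_head = 2, 12, 64
    len_q = len_k = 128

    rotary = RotaryEmbedding(rotary_dim=dim_head)

    # queries and keys flattened to (batch_size * num_head, len, dim_head)
    h_q = torch.randn(batch_size * num_head, len_q, dim_head)
    h_k = torch.randn(batch_size * num_head, len_k, dim_head)

    h_q, h_k = rotary(h_q, h_k)               # same shapes, with rotary position information applied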

LayerNorm

class model_center.layer.LayerNorm(*args: Any, **kwargs: Any)

Bases: bmtrain.DistributedModule

LayerNorm if bias = True: \(y = {x - \text{E}[x] \over \sqrt{\text{Var}[x] + \text{eps}}} * w + \text{bias}\)

RMS LayerNorm if bias = False: \(y = {x \over \sqrt{\text{Var}[x] + \text{eps}}} * w\)

Parameters
  • dim_norm (int) – norm dimension

  • dtype (optional) – Defaults to torch.half.

  • bias (bool, optional) – whether to add the \(\text{bias}\) term. Defaults to True.

  • eps (float, optional) – \(\text{eps}\) term. Defaults to 1e-5.

  • init_var (float, optional) – weight will be all initialized to init_var. Defaults to 1.0.

forward(x: torch.Tensor)
Parameters

x (torch.Tensor of shape (batch_size, seq_len, dim_norm)) – Input tensor that needs to be normalized.

Returns

The layernorm output.

Return type

torch.Tensor of shape (batch_size, seq_len, dim_norm)
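
Example: a minimal sketch contrasting the two variants, under the same bmtrain/CUDA assumptions as the earlier examples.

    import torch
    import bmtrain as bmt
    from model_center.layer import LayerNorm

    bmt.init_distributed()                    # assumed setup for distributed parameters

    ln = LayerNorm(dim_norm=768, bias=True)   # standard LayerNorm
    rms = LayerNorm(dim_norm=768, bias=False) # RMS LayerNorm

    x = torch.randn(2, 128, 768, dtype=torch.half, device="cuda")   # (batch_size, seq_len, dim_norm)
    y = ln(x)                                                       # same shape as x
    z = rms(x)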

Attention

class model_center.layer.Attention(*args: Any, **kwargs: Any)

Bases: bmtrain.DistributedModule

Attention module, consisting of the Q, K, V projections, the attention computation, and the output projection. For more details, see Attention Is All You Need.

Parameters
  • dim_in (int) – input dimension.

  • dim_head (int) – dimension of each heads used in attention.

  • num_heads (int) – number of heads used in attention.

  • dim_out (int, optional) – output dimension. Defaults to None, which means dim_in = dim_out.

  • dtype (optional) – Defaults to torch.half.

  • init_mean (float, optional) – mean of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\) for the fully-connected modules used in the attention module. Defaults to 0.

  • init_std (float, optional) – std of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\) for fully-connected module used in attention module. Defaults to 0.02.

  • bias (bool, optional) – whether to use bias term in fully-connected layers used in attention module. Defaults to False.

  • mask_value (float, optional) – mask value of the masked position. Defaults to -inf.

  • pos_bias_type (str, optional) – relative for relative position bias, rotary for rotary position embedding. Defaults to none.

  • attn_scale (bool, optional) – whether to scale before softmax, i.e., \(\text{softmax}({Q K^T \over \sqrt{\text{dim_model}}})\). Defaults to False.

  • dropout_p (float, optional) – Defaults to 0.

forward(query: torch.Tensor, key_value: torch.Tensor, mask: torch.Tensor, position_bias: Optional[torch.Tensor] = None)

Parameters
  • query (torch.Tensor of shape (batch, len_q, dim_model)) – The input hidden states used to compute the queries.

  • key_value (torch.Tensor of shape (batch, len_k, dim_model)) – The input hidden states used to compute the keys and values.

  • mask (torch.Tensor of shape (batch, len_q, len_k)) – Used to avoid performing attention on padding token indices.

  • position_bias (torch.Tensor of shape (num_heads, len_q, len_k) or (1, num_heads, len_k, len_q)) – Provide positional information about tensor key_value and query.

Returns

The attention output.

Return type

out (torch.Tensor of shape (batch, len_q, dim_model))
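
Example: a self-attention sketch in which query and key_value are the same tensor and a relative position bias is passed in, under the same bmtrain/CUDA assumptions as the earlier examples; the all-True boolean mask (every position may attend to every other) is an assumption about the expected mask convention.

    import torch
    import bmtrain as bmt
    from model_center.layer import Attention, RelativePositionEmbedding

    bmt.init_distributed()                    # assumed setup for distributed parameters

    attn = Attention(
        dim_in=768,
        dim_head=64,
        num_heads=12,
        attn_scale=True,
        pos_bias_type="relative",
    )
    rel_pos = RelativePositionEmbedding(num_heads=12, bidirectional=True)

    batch, seq_len = 2, 128
    hidden = torch.randn(batch, seq_len, 768, dtype=torch.half, device="cuda")
    mask = torch.ones(batch, seq_len, seq_len, dtype=torch.bool, device="cuda")
    position_bias = rel_pos(seq_len, seq_len) # (num_heads, len_q, len_k)

    out = attn(hidden, hidden, mask, position_bias)   # (batch, len_q, dim_model)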

FeedForward

class model_center.layer.FeedForward(*args: Any, **kwargs: Any)

Bases: bmtrain.DistributedModule

FeedForward module

Parameters
  • dim_in (int) – input dimension.

  • dim_ff (int) – middle dimension.

  • dim_out (int, optional) – output dimension. Defaults to None, which means dim_in = dim_out.

  • dtype (optional) – Defaults to torch.half.

  • init_mean (float, optional) – mean of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\) for fully-connected module used in feed-forward layer. Defaults to 0.

  • init_std (float, optional) – std of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\) for fully-connected module used in feed-forward layer. Defaults to 0.02.

  • bias (bool, optional) – whether to use bias term in fully-connected layers used in feed-forward module. Defaults to False.

  • activate_fn (str, optional) – Defaults to gated_gelu.

  • dropout_p (float, optional) – Defaults to 0.

forward(x: torch.Tensor)
Parameters

x (torch.Tensor of shape (batch, seq_len, dim_in)) – The input of the feed-forward module.

Returns

The output of the feed-forward module.

Return type

torch.Tensor of shape (batch, seq_len, dim_out)
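
Example: a minimal sketch of the feed-forward block, under the same bmtrain/CUDA assumptions as the earlier examples.

    import torch
    import bmtrain as bmt
    from model_center.layer import FeedForward

    bmt.init_distributed()                    # assumed setup for distributed parameters

    ffn = FeedForward(dim_in=768, dim_ff=3072, activate_fn="gated_gelu")

    x = torch.randn(2, 128, 768, dtype=torch.half, device="cuda")   # (batch, seq_len, dim_in)
    y = ffn(x)                                                      # (batch, seq_len, dim_out = dim_in)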