model_center.layer module

Linear

class model_center.layer.Linear(*args: Any, **kwargs: Any)

Bases: bmtrain.DistributedModule

A fully connected layer, which performs \(\pmb{y} = \mathbf{W} \pmb{x} + \pmb{b}\)

Parameters
  • dim_in (int) – input dimension of \(\pmb{x}\)

  • dim_out (int) – output dimension of \(\pmb{y}\)

  • dtype (optional) – Defaults to torch.half.

  • init_mean (float, optional) – mean of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\). Defaults to 0.

  • init_std (float, optional) – std of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\). Defaults to 1.

  • bias (bool, optional) – whether to add bias term \(\pmb{b}\). Defaults to False.

forward(x: torch.Tensor)
Parameters

x (torch.Tensor of shape (batch, seq_len, dim_in)) – The input of the linear layer.

Returns

The output of the linear transformation \(\pmb{y}\).

Return type

torch.Tensor of shape (batch, seq_len, dim_out)
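
Example: a minimal usage sketch. It assumes bmtrain.init_distributed() has already been called and a CUDA device is available, since parameters default to torch.half; the dimensions are placeholders.

    import torch
    import bmtrain as bmt
    from model_center.layer import Linear

    bmt.init_distributed()                    # assumed setup for distributed parameters

    # project hidden states from dim_in = 768 to dim_out = 3072, with a bias term
    linear = Linear(dim_in=768, dim_out=3072, bias=True)

    x = torch.randn(2, 128, 768, dtype=torch.half, device="cuda")   # (batch, seq_len, dim_in)
    y = linear(x)                                                   # (batch, seq_len, dim_out)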

Embedding

class model_center.layer.Embedding(*args: Any, **kwargs: Any)

Bases: bmtrain.DistributedModule

Embed a sequence of indices through an embedding lookup matrix \(\mathbf{W}\).

Parameters
  • vocab_size (int) – indices should be in the range \([0, \text{vocab_size})\)

  • embedding_size (int) – the output dimension of the embedding lookup matrix.

  • dtype (optional) – Defaults to torch.half.

  • init_mean (float, optional) – mean of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\). Defaults to 0.

  • init_std (float, optional) – std of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\). Defaults to 1.

forward(ids: torch.Tensor)
Parameters

ids (torch.Tensor of shape (batch_size, seq_len)) – Indices of input sequence tokens.

Returns

The embedding output.

Return type

torch.Tensor of shape (batch_size, seq_len, embedding_size)

projection(x: torch.Tensor)

Projection based on the embedding’s weight. For example, while the embedding maps vocab_size to embedding_size, the projection maps embedding_size back to vocab_size.

Parameters

x (torch.Tensor of shape (batch, seq_len, dim_model)) – The input of the projection.

Returns

The projection output.

Return type

torch.Tensor of shape (batch, seq_len, vocab_output_size)
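
Example: a minimal sketch of the lookup and the tied projection, under the same bmtrain/CUDA assumptions as the Linear example; the vocabulary and dimension sizes are placeholders.

    import torch
    import bmtrain as bmt
    from model_center.layer import Embedding

    bmt.init_distributed()                    # assumed setup for distributed parameters

    emb = Embedding(vocab_size=32000, embedding_size=768)

    ids = torch.randint(0, 32000, (2, 128), device="cuda")   # (batch_size, seq_len)
    h = emb(ids)                                             # (batch_size, seq_len, embedding_size)

    # reuse the embedding weight as an output head: hidden states back to vocabulary logits
    logits = emb.projection(h)                               # (batch_size, seq_len, vocab_size)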

RelativePositionEmbedding

class model_center.layer.RelativePositionEmbedding(*args: Any, **kwargs: Any)

Bases: bmtrain.DistributedModule

Relative Position Embedding

Parameters
  • num_heads (int) – number of heads used in attention module.

  • num_buckets (int, optional) – Defaults to 32.

  • max_distance (int, optional) – Defaults to 128.

  • bidirectional (bool, optional) – Defaults to False.

  • dtype (optional) – Defaults to torch.half.

  • init_mean (float, optional) – Defaults to 0.0.

  • init_std (float, optional) – Defaults to 1.

forward(query_len, key_len)

Provides relative position embeddings for key and query of num_heads attention heads.

Parameters
  • query_len (int) – Length of query.

  • key_len (int) – Length of key.

Returns

Relative position embedding.

Return type

torch.Tensor of shape (num_heads, query_len, key_len)
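
Example: a minimal sketch producing a bias that can be passed to an attention module as position_bias, under the same bmtrain/CUDA assumptions; the head count and lengths are placeholders.

    import bmtrain as bmt
    from model_center.layer import RelativePositionEmbedding

    bmt.init_distributed()                    # assumed setup for distributed parameters

    rel_pos = RelativePositionEmbedding(
        num_heads=12,
        num_buckets=32,
        max_distance=128,
        bidirectional=True,                   # encoder-style; decoders usually keep the default False
    )

    position_bias = rel_pos(128, 128)         # (num_heads, query_len, key_len)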

RotaryEmbedding

class model_center.layer.RotaryEmbedding(rotary_dim: int)

Bases: torch.nn.modules.module.Module

Rotary Position Embedding

Parameters

rotary_dim (int) – rotary dimension

forward(h_q, h_k)
Parameters
  • h_q (torch.Tensor of shape (batch_size * num_head, len_q, dim_head)) – The query tensor.

  • h_k (torch.Tensor of shape (batch_size * num_head, len_k, dim_head)) – The key tensor.

Returns

h_q of shape (batch_size * num_head, len_q, dim_head) and h_k of shape (batch_size * num_head, len_k, dim_head), with rotary position embedding applied.

Return type

Tuple[torch.Tensor, torch.Tensor]
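
Example: a minimal sketch of applying rotary embeddings to query and key heads. RotaryEmbedding is a plain torch.nn.Module, so no distributed setup is needed; the flattening of the batch and head dimensions shown here is an assumption for illustration.

    import torch
    from model_center.layer import RotaryEmbedding

    batch_size, num_head, dim_head = 2, 12, 64
    len_q = len_k = 128

    rotary = RotaryEmbedding(rotary_dim=dim_head)

    # queries and keys flattened to (batch_size * num_head, len, dim_head)
    h_q = torch.randn(batch_size * num_head, len_q, dim_head)
    h_k = torch.randn(batch_size * num_head, len_k, dim_head)

    h_q, h_k = rotary(h_q, h_k)               # same shapes, with rotary position information applied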

LayerNorm

class model_center.layer.LayerNorm(*args: Any, **kwargs: Any)

Bases: bmtrain.DistributedModule

LayerNorm if bias = True: \(y = {x - \text{E}[x] \over \sqrt{\text{Var}[x] + \text{eps}}} * w + \text{bias}\)

RMS LayerNorm if bias = False: \(y = {x \over \sqrt{\text{Var}[x] + \text{eps}}} * w\)

Parameters
  • dim_norm (int) – norm dimension

  • dtype (optional) – Defaults to torch.half.

  • bias (bool, optional) – whether to add the \(\text{bias}\) term. Defaults to True.

  • eps (float, optional) – \(\text{eps}\) term. Defaults to 1e-5.

  • init_var (float, optional) – weight will be all initialized to init_var. Defaults to 1.0.

forward(x: torch.Tensor)
Parameters

x (torch.Tensor of shape (batch_size, seq_len, dim_norm)) – Input tensor that needs to be normalized.

Returns

The layernorm output.

Return type

torch.Tensor of shape (batch_size, seq_len, dim_norm)
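
Example: a minimal sketch contrasting the two variants, under the same bmtrain/CUDA assumptions as the earlier examples.

    import torch
    import bmtrain as bmt
    from model_center.layer import LayerNorm

    bmt.init_distributed()                    # assumed setup for distributed parameters

    ln = LayerNorm(dim_norm=768, bias=True)   # standard LayerNorm
    rms = LayerNorm(dim_norm=768, bias=False) # RMS LayerNorm

    x = torch.randn(2, 128, 768, dtype=torch.half, device="cuda")   # (batch_size, seq_len, dim_norm)
    y = ln(x)                                                       # same shape as x
    z = rms(x)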

Attention

class model_center.layer.Attention(*args: Any, **kwargs: Any)

Bases: bmtrain.DistributedModule

Attention module, consisting of the Q, K, V projections, the attention computation, and the output projection. For more details, see Attention Is All You Need.

Parameters
  • dim_in (int) – input dimension.

  • dim_head (int) – dimension of each heads used in attention.

  • num_heads (int) – number of heads used in attention.

  • dim_out (int, optional) – output dimension. Defaults to None, which means dim_in = dim_out.

  • dtype (optional) – Defaults to torch.half.

  • init_mean (float, optional) – mean of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\) for the fully-connected modules used in the attention module. Defaults to 0.

  • init_std (float, optional) – std of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\) for fully-connected module used in attention module. Defaults to 0.02.

  • bias (bool, optional) – whether to use bias term in fully-connected layers used in attention module. Defaults to False.

  • mask_value (float, optional) – mask value of the masked position. Defaults to -inf.

  • pos_bias_type (str, optional) – relative for relative position bias, rotary for rotary position embedding. Defaults to none.

  • attn_scale (bool, optional) – whether to scale before softmax, i.e., \(\text{softmax}({Q K^T \over \sqrt{\text{dim_model}}})\). Defaults to False.

  • dropout_p (float, optional) – Defaults to 0.

forward(query: torch.Tensor, key_value: torch.Tensor, mask: torch.Tensor, position_bias: Optional[torch.Tensor] = None)

Parameters
  • query (torch.Tensor of shape (batch, len_q, dim_model)) – The input hidden states used to compute the queries.

  • key_value (torch.Tensor of shape (batch, len_k, dim_model)) – The input hidden states used to compute the keys and values.

  • mask (torch.Tensor of shape (batch, len_q, len_k)) – Used to avoid performing attention on padding token indices.

  • position_bias (torch.Tensor of shape (num_heads, len_q, len_k) or (1, num_heads, len_k, len_q)) – Provide positional information about tensor key_value and query.

Returns

The attention output.

Return type

out (torch.Tensor of shape (batch, len_q, dim_model))
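
Example: a self-attention sketch in which query and key_value are the same tensor and a relative position bias is passed in, under the same bmtrain/CUDA assumptions as the earlier examples; the all-True boolean mask (every position may attend to every other) is an assumption about the expected mask convention.

    import torch
    import bmtrain as bmt
    from model_center.layer import Attention, RelativePositionEmbedding

    bmt.init_distributed()                    # assumed setup for distributed parameters

    attn = Attention(
        dim_in=768,
        dim_head=64,
        num_heads=12,
        attn_scale=True,
        pos_bias_type="relative",
    )
    rel_pos = RelativePositionEmbedding(num_heads=12, bidirectional=True)

    batch, seq_len = 2, 128
    hidden = torch.randn(batch, seq_len, 768, dtype=torch.half, device="cuda")
    mask = torch.ones(batch, seq_len, seq_len, dtype=torch.bool, device="cuda")
    position_bias = rel_pos(seq_len, seq_len) # (num_heads, len_q, len_k)

    out = attn(hidden, hidden, mask, position_bias)   # (batch, len_q, dim_model)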

FeedForward

class model_center.layer.FeedForward(*args: Any, **kwargs: Any)

Bases: bmtrain.DistributedModule

FeedForward module

Parameters
  • dim_in (int) – input dimension.

  • dim_ff (int) – middle dimension.

  • dim_out (int, optional) – output dimension. Defaults to None, which means dim_in = dim_out.

  • dtype (optional) – Defaults to torch.half.

  • init_mean (float, optional) – mean of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\) for fully-connected module used in feed-forward layer. Defaults to 0.

  • init_std (float, optional) – std of \(\mathbf{W}\sim\mathcal{N}(\text{mean}, \text{std}^2)\) for fully-connected module used in feed-forward layer. Defaults to 0.02.

  • bias (bool, optional) – whether to use bias term in fully-connected layers used in feed-forward module. Defaults to False.

  • activate_fn (str, optional) – Defaults to gated_gelu.

  • dropout_p (float, optional) – Defaults to 0.

forward(x: torch.Tensor)
Parameters

x (torch.Tensor of shape (batch, seq_len, dim_in)) – The input of the feed-forward module.

Returns

The output of the feed-forward module.

Return type

torch.Tensor of shape (batch, seq_len, dim_out)
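
Example: a minimal sketch of the feed-forward block, under the same bmtrain/CUDA assumptions as the earlier examples.

    import torch
    import bmtrain as bmt
    from model_center.layer import FeedForward

    bmt.init_distributed()                    # assumed setup for distributed parameters

    ffn = FeedForward(dim_in=768, dim_ff=3072, activate_fn="gated_gelu")

    x = torch.randn(2, 128, 768, dtype=torch.half, device="cuda")   # (batch, seq_len, dim_in)
    y = ffn(x)                                                      # (batch, seq_len, dim_out = dim_in)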