block

Encoder

class model_center.layer.Encoder(num_layers: int, dim_model: int, dim_ff: int, num_heads: int, dim_head: int, dtype: torch.dtype = torch.float16, int8: bool = False, norm_init_var: float = 1.0, norm_bias: bool = False, norm_eps: float = 1e-05, att_init_mean: float = 0.0, att_init_std: float = 0.02, att_bias: bool = False, att_mask_value: float = - inf, ffn_init_mean: float = 0.0, ffn_init_std: float = 0.02, ffn_bias: bool = False, ffn_activate_fn: str = 'gated_gelu', pos_bias_type: str = 'none', post_layer_norm: bool = False, length_scale: bool = False, attn_scale: bool = False, dropout_p: float = 0, parallel_ffn: bool = False)

Bases: torch.nn.modules.module.Module

A stack of encoder transformer blocks followed by a final layernorm.

Parameters
forward(hidden_states: torch.Tensor, attention_mask: torch.Tensor, position_bias: Optional[torch.Tensor] = None)
Parameters
  • hidden_states (torch.Tensor of shape (batch, seq_enc, dim_model)) – Input of the encoder, e.g. the embedding of a batch of sequences.

  • attention_mask (torch.Tensor of shape (batch, seq_enc, seq_enc)) – Prevents invalid positions (e.g. padding) from participating in the attention computation.

  • position_bias (torch.Tensor of shape (num_heads, seq_enc, seq_enc)) – Provides positional information to the self-attention blocks.

Returns

The encoder output.

Return type

torch.Tensor of shape (batch, seq_enc, dim_model)
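
A minimal usage sketch (all hyperparameters are illustrative; it assumes a CUDA device, a boolean attention mask where True marks positions allowed to attend, and a BMTrain environment initialized via bmtrain.init_distributed(), e.g. under torchrun):

    import torch
    import bmtrain as bmt
    from model_center.layer import Encoder

    bmt.init_distributed()  # ModelCenter layers are built on BMTrain parameters

    batch, seq_enc, dim_model = 2, 16, 768
    encoder = Encoder(num_layers=2, dim_model=dim_model, dim_ff=3072,
                      num_heads=12, dim_head=64)

    hidden_states = torch.randn(batch, seq_enc, dim_model,
                                dtype=torch.float16, device="cuda")
    # Assumed mask convention: True marks positions allowed to attend.
    attention_mask = torch.ones(batch, seq_enc, seq_enc,
                                dtype=torch.bool, device="cuda")

    out = encoder(hidden_states, attention_mask)  # (batch, seq_enc, dim_model)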

Decoder

class model_center.layer.Decoder(num_layers: int, dim_model: int, dim_ff: int, num_heads: int, dim_head: int, dtype: torch.dtype = torch.float16, int8: bool = False, norm_init_var: float = 1.0, norm_bias: bool = False, norm_eps: float = 1e-05, att_init_mean: float = 0.0, att_init_std: float = 0.02, att_bias: bool = False, att_mask_value: float = - inf, ffn_init_mean: float = 0.0, ffn_init_std: float = 0.02, ffn_bias: bool = False, ffn_activate_fn: str = 'gated_gelu', pos_bias_type: str = 'none', length_scale: bool = False, attn_scale: bool = False, dropout_p: float = 0, parallel_ffn: bool = False)

Bases: torch.nn.modules.module.Module

A stack of decoder transformer blocks followed by a final layernorm.

Parameters
forward(hidden_states: torch.Tensor, attention_mask: torch.Tensor, position_bias: torch.Tensor, cross_hidden_states=None, cross_attention_mask=None, cross_position_bias=None)
Parameters
  • hidden_states (torch.Tensor of shape (batch, seq_dec, dim_model)) – Input of the decoder, e.g. the embedding of a batch of sequences.

  • attention_mask (torch.Tensor of shape (batch, seq_dec, seq_dec)) – Prevents invalid positions from participating in the self-attention computation.

  • position_bias (torch.Tensor of shape (num_heads, seq_dec, seq_dec)) – Provides positional information to the decoder self-attention blocks.

  • cross_hidden_states (torch.Tensor of shape (batch, seq_enc, dim_model)) – Input of the cross-attention blocks, typically the output of the encoder.

  • cross_attention_mask (torch.Tensor of shape (batch, seq_dec, seq_enc)) – Prevents invalid encoder positions from participating in the cross-attention computation.

  • cross_position_bias (torch.Tensor of shape (num_heads, seq_dec, seq_enc)) – Provides positional information to the cross-attention blocks.

Returns

The decoder output.

Return type

torch.Tensor of shape (batch, seq_dec, dim_model)
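
A minimal encoder-decoder sketch under the same assumptions as the Encoder example above (illustrative hyperparameters, CUDA device, boolean masks, BMTrain initialized):

    import torch
    import bmtrain as bmt
    from model_center.layer import Encoder, Decoder

    bmt.init_distributed()

    batch, seq_enc, seq_dec, dim_model = 2, 16, 8, 768
    common = dict(num_layers=2, dim_model=dim_model, dim_ff=3072,
                  num_heads=12, dim_head=64)
    encoder, decoder = Encoder(**common), Decoder(**common)

    enc_hidden = torch.randn(batch, seq_enc, dim_model, dtype=torch.float16, device="cuda")
    dec_hidden = torch.randn(batch, seq_dec, dim_model, dtype=torch.float16, device="cuda")

    enc_mask = torch.ones(batch, seq_enc, seq_enc, dtype=torch.bool, device="cuda")
    # Causal mask: each decoder position attends only to itself and earlier positions.
    dec_mask = torch.tril(torch.ones(seq_dec, seq_dec, device="cuda")).bool().expand(batch, -1, -1)
    cross_mask = torch.ones(batch, seq_dec, seq_enc, dtype=torch.bool, device="cuda")

    enc_out = encoder(enc_hidden, enc_mask)
    # position_bias=None is assumed to be accepted when pos_bias_type='none' (the default).
    dec_out = decoder(dec_hidden, dec_mask, None,
                      cross_hidden_states=enc_out,
                      cross_attention_mask=cross_mask)  # (batch, seq_dec, dim_model)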

TransformerBlock

class model_center.layer.TransformerBlock(dim_model: int, dim_ff: int, num_heads: int, dim_head: int, is_decoder: bool = False, dtype=torch.float16, int8=False, norm_init_var: float = 1.0, norm_bias: bool = False, norm_eps: float = 1e-05, att_init_mean: float = 0.0, att_init_std: float = 0.02, att_bias: bool = False, att_mask_value: float = - inf, ffn_init_mean: float = 0.0, ffn_init_std: float = 0.02, ffn_bias: bool = False, ffn_activate_fn: str = 'gated_gelu', pos_bias_type: str = 'none', post_layer_norm: bool = False, parallel_ffn: bool = False, length_scale: bool = False, attn_scale: bool = False, dropout_p: float = 0)

Bases: torch.nn.modules.module.Module

The complete transformer block: a self-attention block, an optional cross-attention block (when is_decoder is True), and a feed-forward block, applied in sequence.

Parameters
forward(self_hidden_states: torch.Tensor, self_attention_mask: torch.Tensor, self_position_bias: Optional[torch.Tensor] = None, cross_hidden_states=None, cross_attention_mask=None, cross_position_bias=None)
Parameters
  • self_hidden_states (torch.Tensor of shape (batch, seq_self, dim_model)) – Input of the transformer block (self-attention block). It can be the raw embedding of a batch of sequences.

  • self_attention_mask (torch.Tensor of shape (batch, seq_self, seq_self)) – Prevents invalid positions from participating in the self-attention computation.

  • self_position_bias (torch.Tensor of shape (num_heads, seq_self, seq_self)) – Provides positional information to the self-attention block.

  • cross_hidden_states (torch.Tensor of shape (batch, seq_cross, dim_model)) – Input of the cross-attention block, typically the encoder output.

  • cross_attention_mask (torch.Tensor of shape (batch, seq_self, seq_cross)) – Prevents invalid positions from participating in the cross-attention computation.

  • cross_position_bias (torch.Tensor of shape (num_heads, seq_self, seq_cross)) – Provides positional information to the cross-attention block.

Returns

The output of transformer block.

Return type

torch.Tensor of shape (batch, seq_self, dim_model)
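
A minimal usage sketch of a self-attention-only block (is_decoder left at its default of False), under the same assumptions as the Encoder example above:

    import torch
    import bmtrain as bmt
    from model_center.layer import TransformerBlock

    bmt.init_distributed()

    batch, seq_self, dim_model = 2, 16, 768
    block = TransformerBlock(dim_model=dim_model, dim_ff=3072,
                             num_heads=12, dim_head=64)

    hidden = torch.randn(batch, seq_self, dim_model, dtype=torch.float16, device="cuda")
    mask = torch.ones(batch, seq_self, seq_self, dtype=torch.bool, device="cuda")

    out = block(hidden, mask)  # (batch, seq_self, dim_model)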

FFNBlock

class model_center.layer.FFNBlock(dim_model: int, dim_ff: int, dtype=torch.float16, int8=False, norm_init_var: float = 1.0, norm_bias: bool = False, norm_eps: float = 1e-05, ffn_init_mean: float = 0.0, ffn_init_std: float = 0.02, ffn_bias: bool = False, ffn_activate_fn: str = 'gated_gelu', post_layer_norm: bool = False, length_scale: bool = False, dropout_p: float = 0)

Bases: torch.nn.modules.module.Module

The complete feed-forward block: a layernorm, a feed-forward layer, and a residual connection, applied in sequence.

Parameters
forward(hidden_states: torch.Tensor)
Parameters

hidden_states (torch.Tensor of shape (batch, seq_self, dim_model)) – Hidden states before the feed-forward layer.

Returns

The output of the feed-forward block.

Return type

torch.Tensor of shape (batch, seq_self, dim_model)
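
A minimal usage sketch under the same assumptions as the Encoder example above:

    import torch
    import bmtrain as bmt
    from model_center.layer import FFNBlock

    bmt.init_distributed()

    batch, seq_self, dim_model = 2, 16, 768
    ffn = FFNBlock(dim_model=dim_model, dim_ff=3072)

    hidden = torch.randn(batch, seq_self, dim_model, dtype=torch.float16, device="cuda")
    out = ffn(hidden)  # (batch, seq_self, dim_model); the residual connection is applied inside the block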

SelfAttentionBlock

class model_center.layer.SelfAttentionBlock(dim_model: int, num_heads: int, dim_head: int, dtype=torch.float16, int8=False, norm_init_var: float = 1.0, norm_bias: bool = False, norm_eps: float = 1e-05, att_init_mean: float = 0.0, att_init_std: float = 0.02, att_bias: bool = False, att_mask_value: float = - inf, pos_bias_type: str = 'none', post_layer_norm: bool = False, length_scale: bool = False, attn_scale: bool = False, dropout_p: float = 0)

Bases: torch.nn.modules.module.Module

The complete self-attention block: a layernorm, a self-attention layer, and a residual connection, applied in sequence.

Parameters
forward(hidden_states: torch.Tensor, attention_mask: torch.Tensor, position_bias: Optional[torch.Tensor] = None)
Parameters
  • hidden_states (torch.Tensor of shape (batch, seq_self, dim_model)) – Input of the self-attention block, e.g. the embedding of a batch of sequences.

  • attention_mask (torch.Tensor of shape (batch, seq_self, seq_self)) – Prevents invalid positions from participating in the attention computation.

  • position_bias (torch.Tensor of shape (num_heads, seq_self, seq_self)) – Provides positional information to the self-attention block.

Returns

The output of the self-attention block.

Return type

torch.Tensor of shape (batch, seq_self, dim_model)
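
A minimal usage sketch under the same assumptions as the Encoder example above:

    import torch
    import bmtrain as bmt
    from model_center.layer import SelfAttentionBlock

    bmt.init_distributed()

    batch, seq_self, dim_model = 2, 16, 768
    attn = SelfAttentionBlock(dim_model=dim_model, num_heads=12, dim_head=64)

    hidden = torch.randn(batch, seq_self, dim_model, dtype=torch.float16, device="cuda")
    mask = torch.ones(batch, seq_self, seq_self, dtype=torch.bool, device="cuda")

    out = attn(hidden, mask)  # (batch, seq_self, dim_model)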

CrossAttentionBlock

class model_center.layer.CrossAttentionBlock(dim_model: int, num_heads: int, dim_head: int, dtype=torch.float16, int8=False, norm_init_var: float = 1.0, norm_bias: bool = False, norm_eps: float = 1e-05, att_init_mean: float = 0.0, att_init_std: float = 0.02, att_bias: bool = False, att_mask_value: float = - inf, pos_bias_type: str = 'none', post_layer_norm: bool = False, length_scale: bool = False, attn_scale: bool = False, dropout_p: float = 0)

Bases: torch.nn.modules.module.Module

The complete cross-attention block: a layernorm, a cross-attention layer, and a residual connection, applied in sequence.

Parameters
forward(hidden_states: torch.Tensor, key_value_states: torch.Tensor, attention_mask: torch.Tensor, position_bias: Optional[torch.Tensor] = None)
Parameters
  • hidden_states (torch.Tensor of shape (batch, seq_self, dim_model)) – Input of the cross-attention block; serves as the query in the cross-attention operation.

  • key_value_states (torch.Tensor of shape (batch, seq_cross, dim_model)) – Serves as the key and value in the cross-attention operation.

  • attention_mask (torch.Tensor of shape (batch, seq_self, seq_cross)) – Prevents invalid positions from participating in the attention computation.

  • position_bias (torch.Tensor of shape (num_heads, seq_self, seq_cross)) – Provides positional information to the cross-attention block.

Returns

The output of cross-attention block.

Return type

torch.Tensor of shape (batch, seq_self, dim_model)
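
A minimal usage sketch under the same assumptions as the Encoder example above; the query sequence and the key/value sequence may have different lengths:

    import torch
    import bmtrain as bmt
    from model_center.layer import CrossAttentionBlock

    bmt.init_distributed()

    batch, seq_self, seq_cross, dim_model = 2, 8, 16, 768
    xattn = CrossAttentionBlock(dim_model=dim_model, num_heads=12, dim_head=64)

    query_hidden = torch.randn(batch, seq_self, dim_model, dtype=torch.float16, device="cuda")
    key_value = torch.randn(batch, seq_cross, dim_model, dtype=torch.float16, device="cuda")
    mask = torch.ones(batch, seq_self, seq_cross, dtype=torch.bool, device="cuda")

    out = xattn(query_hidden, key_value, mask)  # (batch, seq_self, dim_model)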