CPM2

CPM2Config

class model_center.model.CPM2Config(vocab_size=26240, dim_model=768, num_heads=12, dim_head=64, dim_ff=256, num_encoder_layers=12, num_decoder_layers=12, dropout_p=0, emb_init_mean=0.0, emb_init_std=1, pos_bias_type='relative', position_bias_num_buckets=32, position_bias_max_distance=128, pos_init_mean=0.0, pos_init_std=1, norm_init_var=1.0, norm_bias=False, norm_eps=1e-06, att_init_mean=0.0, att_init_std=0.02, att_bias=False, att_mask_value=-inf, ffn_init_mean=0.0, ffn_init_std=0.02, ffn_bias=False, ffn_activate_fn='gated_gelu', proj_init_mean=0.0, proj_init_std=1, proj_bias=False, length_scale=False, attn_scale=False, half=True, int8=False, cls_head=None, post_layer_norm=False)

This is a configuration class that stores the configuration of the CPM-2 model. It inherits from the Config class and is used to instantiate a CPM-2 model according to the specified parameters, defining the model architecture. You can set specific parameters to control the output of the model.

For example, dim_model determines the hidden dimension of the encoder layers. You can keep the default value of 768 or set a custom dimension.
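As a sketch, a configuration can be created with a few parameters overridden and the rest left at the defaults listed in the signature above (assuming model_center is installed; parameter names are taken from that signature):

```python
from model_center.model import CPM2Config

# Override a few architecture parameters; everything else keeps the
# defaults shown in the CPM2Config signature above.
config = CPM2Config(
    dim_model=1024,  # hidden dimension of encoder/decoder layers
    num_heads=16,    # attention heads per layer
    dim_head=64,     # per-head dimension
)
```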

CPM2Model

class model_center.model.CPM2(config: model_center.model.config.cpm2_config.CPM2Config)
forward(enc_input: torch.Tensor, enc_length: torch.Tensor, dec_input: torch.Tensor, dec_length: torch.Tensor)
This model inherits from BaseModel and is also a PyTorch torch.nn.Module subclass.

You can use it as a regular PyTorch Module.

Parameters
  • enc_input (torch.Tensor of shape (batch, seq_enc)) – Indices of input sequence tokens for encoder. It will be embedded by model’s internal embedding lookup matrix.

  • enc_length (torch.Tensor of shape (batch)) – Length of input sequence for encoder before padding.

  • dec_input (torch.Tensor of shape (batch, seq_dec)) – Indices of input sequence tokens for decoder. It will be embedded by model’s internal embedding lookup matrix.

  • dec_length (torch.Tensor of shape (batch)) – Length of input sequence for decoder before padding.

Returns

The CPM-2 output: prediction scores of the language modeling head before SoftMax.

Return type

torch.Tensor of shape (batch, seq_dec, vocab_output_size) or (batch, seq_dec, cls_head)
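The enc_length and dec_length tensors tell the model where padding begins in each sequence, so it can mask out padded positions internally. A minimal sketch of that masking idea in plain Python (independent of model_center; the real model builds analogous boolean mask tensors from these lengths):

```python
def padding_mask(lengths, seq_len):
    """Build a per-sequence padding mask: True for real token
    positions, False for padded positions past each length."""
    return [[pos < length for pos in range(seq_len)]
            for length in lengths]

# A batch of two sequences padded to length 5: the first has
# 3 real tokens, the second has 5.
mask = padding_mask([3, 5], 5)
# mask[0] → [True, True, True, False, False]
```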

CPM2Tokenizer

class model_center.tokenizer.CPM2Tokenizer(vocab_file, max_sentinels=190, max_len=None, q2b=False, sod_token='<s>', eod_token='<eod>', pad_token='<pad>', unk_token='<unk>', line_token='</n>', space_token='</_>')
decode(tokens)

Decode ids into a string.

encode(text)

Encode a string into ids.

tokenize(text)

Tokenize a string.
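The three methods compose: tokenize splits text into tokens, encode maps text to ids, and decode inverts encode. A toy whitespace tokenizer illustrating this contract (this is not the actual CPM2Tokenizer, which is built from a vocabulary file and handles sentinel and special tokens):

```python
class ToyTokenizer:
    """Whitespace tokenizer over a fixed vocabulary, illustrating
    only the tokenize/encode/decode contract."""

    def __init__(self, vocab):
        self.id_of = {tok: i for i, tok in enumerate(vocab)}
        self.tok_of = {i: tok for tok, i in self.id_of.items()}

    def tokenize(self, text):
        # Split text into tokens.
        return text.split()

    def encode(self, text):
        # Map each token to its integer id.
        return [self.id_of[tok] for tok in self.tokenize(text)]

    def decode(self, ids):
        # Map ids back to tokens and rejoin into a string.
        return " ".join(self.tok_of[i] for i in ids)

tok = ToyTokenizer(["hello", "world"])
ids = tok.encode("hello world")  # [0, 1]
text = tok.decode(ids)           # "hello world"
```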