rllm.nn.encoder.TransTabPreEncoder

class rllm.nn.encoder.TransTabPreEncoder(out_dim: int, metadata: Dict[ColType, List[Dict[str, Any]]], categorical_columns: List[str] | None = None, numerical_columns: List[str] | None = None, binary_columns: List[str] | None = None, tokenizer: BertTokenizerFast | None = None, tokenizer_dir: str = './tokenizer', hidden_dropout_prob: float = 0.0, layer_norm_eps: float = 1e-05, use_align_layer: bool = True, disable_tokenizer_parallel: bool = True, ignore_duplicate_cols: bool = False)

Bases: TablePreEncoder

Pre-encoder for the TransTab model.

Converts a TableData feat_dict into token embeddings consumable by downstream Transformer layers, handling tokenizer management, column deduplication, and word/numeric sub-encoders.

Parameters:
  • out_dim (int) – Output embedding dimensionality (d_model).

  • metadata (Dict[ColType, List[Dict[str, Any]]]) – Per-column statistics metadata.

  • categorical_columns (List[str], optional) – Categorical column names. Default: None.

  • numerical_columns (List[str], optional) – Numerical column names. Default: None.

  • binary_columns (List[str], optional) – Binary column names. Default: None.

  • tokenizer (BertTokenizerFast, optional) – Pre-initialised tokenizer; takes precedence over tokenizer_dir. Default: None.

  • tokenizer_dir (str) – Tokenizer directory; "bert-base-uncased" is downloaded here when absent. Default: "./tokenizer".

  • hidden_dropout_prob (float) – Dropout for the word-embedding sub-encoder. Default: 0.0.

  • layer_norm_eps (float) – LayerNorm \(\varepsilon\). Default: 1e-5.

  • use_align_layer (bool) – Apply a linear projection before concatenation. Default: True.

  • disable_tokenizer_parallel (bool) – Set TOKENIZERS_PARALLELISM=false. Default: True.

  • ignore_duplicate_cols (bool) – Auto-rename duplicates instead of raising. Default: False.
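
The tokenizer-related parameters above can be read as a simple precedence rule. The sketch below is illustrative only: resolve_tokenizer_source and disable_tokenizer_parallelism are hypothetical helpers, not part of rllm, and treating an empty directory as "absent" is an assumption. Only the TOKENIZERS_PARALLELISM variable name comes from the parameter description.

```python
import os
from typing import Optional


def resolve_tokenizer_source(tokenizer: Optional[object],
                             tokenizer_dir: str = "./tokenizer") -> str:
    """Report which tokenizer source would be used (hypothetical helper)."""
    # A pre-initialised tokenizer takes precedence over tokenizer_dir.
    if tokenizer is not None:
        return "instance"
    # Assumption: a non-empty directory counts as an existing local tokenizer.
    if os.path.isdir(tokenizer_dir) and os.listdir(tokenizer_dir):
        return "local_dir"
    # Otherwise "bert-base-uncased" would be downloaded into tokenizer_dir.
    return "download"


def disable_tokenizer_parallelism() -> None:
    """Mirror what disable_tokenizer_parallel=True is documented to do."""
    os.environ["TOKENIZERS_PARALLELISM"] = "false"


print(resolve_tokenizer_source(object()))  # a passed-in tokenizer always wins
```

Passing a tokenizer instance is the easiest way to avoid any download at construction time, since tokenizer_dir is only consulted when tokenizer is None.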

forward(x: DataFrame | Dict[ColType, Tensor | Tuple[Tensor, ...]] | TableData, *, shuffle: bool = False, align_and_concat: bool = True, return_dict: bool = False, requires_grad: bool = False) -> Dict[str, Tensor] | Dict[ColType, Tensor] | Tensor

Encode a table batch into embeddings.

Parameters:
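  • x (DataFrame | Dict[ColType, Tensor | Tuple[Tensor, ...]] | TableData) – Input table batch, given as a DataFrame, a feature dict keyed by column type, or a materialised TableData.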

  • shuffle (bool) – Shuffle column order within each type. Default: False.

  • align_and_concat (bool) – Apply alignment projection and concatenate all type embeddings. Default: True.

  • return_dict (bool) – When align_and_concat=False, return a Dict[ColType, Tensor] instead of a concatenated tensor. Default: False.

  • requires_grad (bool) – Enable gradients during encoding. Default: False.

Returns:

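A dict {"embedding": [B, S, H], "attention_mask": [B, S]} when align_and_concat=True, where B is the batch size, S the token sequence length, and H the embedding dimension. Otherwise a Dict[ColType, Tensor] if return_dict=True, or a single concatenated Tensor if return_dict=False.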
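
To make the align_and_concat=True return shape concrete, here is a minimal sketch that concatenates per-type embeddings and masks along the sequence axis using plain nested lists. It is not the library's implementation: concat_by_type and zeros are hypothetical helpers, the type keys are plain strings rather than ColType members, and the alignment projection is omitted.

```python
from typing import Dict, List

Embedding = List[List[List[float]]]  # [B, S_type, H]
Mask = List[List[int]]               # [B, S_type]


def concat_by_type(embs: Dict[str, Embedding],
                   masks: Dict[str, Mask]) -> Dict[str, list]:
    """Join per-type outputs along the sequence axis (hypothetical helper)."""
    keys = list(embs)
    batch_size = len(embs[keys[0]])
    embedding = [[tok for k in keys for tok in embs[k][b]]
                 for b in range(batch_size)]
    attention_mask = [[m for k in keys for m in masks[k][b]]
                      for b in range(batch_size)]
    return {"embedding": embedding, "attention_mask": attention_mask}


def zeros(b: int, s: int, h: int) -> Embedding:
    """Build a [b, s, h] nested list of zeros."""
    return [[[0.0] * h for _ in range(s)] for _ in range(b)]


B, H = 2, 4
embs = {"numerical": zeros(B, 3, H), "categorical": zeros(B, 5, H)}
masks = {"numerical": [[1, 1, 1] for _ in range(B)],
         "categorical": [[1, 1, 1, 0, 0] for _ in range(B)]}  # 0 marks padding
out = concat_by_type(embs, masks)

emb = out["embedding"]
print(len(emb), len(emb[0]), len(emb[0][0]))  # 2 8 4
print(out["attention_mask"][0])               # [1, 1, 1, 1, 1, 1, 0, 0]
```

Here S is the sum of the per-type sequence lengths (3 + 5 = 8), and the attention mask keeps the padding positions of each type in place after concatenation.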

load(ckpt_dir: str) -> None

Restore tokenizer, column config, and encoder weights from ckpt_dir.

save(path: str) -> None

Save tokenizer, column config, and encoder weights to path.

update(cat: List[str] | None = None, num: List[str] | None = None, bin: List[str] | None = None) -> None

Extend column lists with new names and recheck for duplicates.

Parameters:
  • cat (List[str], optional) – New categorical columns.

  • num (List[str], optional) – New numerical columns.

  • bin (List[str], optional) – New binary columns.

Raises:

ValueError – On duplicate columns when ignore_duplicate_cols is False.
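
The duplicate check described above might look roughly like the following sketch. resolve_columns is a hypothetical stand-alone function, not the method itself, and the type-tag prefix used for renaming is an assumption: the documentation only says duplicates are auto-renamed, not how.

```python
from collections import Counter
from typing import Dict, List, Optional


def resolve_columns(cat: Optional[List[str]] = None,
                    num: Optional[List[str]] = None,
                    bin_: Optional[List[str]] = None,
                    ignore_duplicate_cols: bool = False) -> Dict[str, List[str]]:
    """Check column lists for duplicates (illustrative, not rllm's code)."""
    groups = {"cat": list(cat or []), "num": list(num or []),
              "bin": list(bin_ or [])}
    counts = Counter(n for names in groups.values() for n in names)
    dups = sorted(n for n, c in counts.items() if c > 1)
    if not dups:
        return groups
    if not ignore_duplicate_cols:
        raise ValueError(f"Duplicate columns: {dups}")
    # Drop repeats inside each list, keeping first-seen order.
    groups = {tag: list(dict.fromkeys(names)) for tag, names in groups.items()}
    # Names still shared across types get a type-tag prefix (assumed scheme).
    cross = Counter(n for names in groups.values() for n in names)
    return {tag: [f"[{tag}]{n}" if cross[n] > 1 else n for n in names]
            for tag, names in groups.items()}


print(resolve_columns(["age", "city"], ["age"], ignore_duplicate_cols=True))
# {'cat': ['[cat]age', 'city'], 'num': ['[num]age'], 'bin': []}
```

With ignore_duplicate_cols left at False, the same call raises ValueError, matching the documented behaviour.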