rllm.nn.encoder.TransTabPreEncoder
- class rllm.nn.encoder.TransTabPreEncoder(out_dim: int, metadata: Dict[ColType, List[Dict[str, Any]]], categorical_columns: List[str] | None = None, numerical_columns: List[str] | None = None, binary_columns: List[str] | None = None, tokenizer: BertTokenizerFast | None = None, tokenizer_dir: str = './tokenizer', hidden_dropout_prob: float = 0.0, layer_norm_eps: float = 1e-05, use_align_layer: bool = True, disable_tokenizer_parallel: bool = True, ignore_duplicate_cols: bool = False)
Bases: TablePreEncoder
Pre-encoder for TransTab (“TransTab”).
Converts a TableData feat_dict into token embeddings consumable by downstream Transformer layers, handling tokenizer management, column deduplication, and word/numeric sub-encoders.
- Parameters:
out_dim (int) – Output embedding dimensionality (d_model).
metadata (Dict[ColType, List[Dict[str, Any]]]) – Per-column statistics metadata.
categorical_columns (List[str], optional) – Categorical column names. Default: None.
numerical_columns (List[str], optional) – Numerical column names. Default: None.
binary_columns (List[str], optional) – Binary column names. Default: None.
tokenizer (BertTokenizerFast, optional) – Pre-initialised tokenizer; takes precedence over tokenizer_dir. Default: None.
tokenizer_dir (str) – Tokenizer directory; "bert-base-uncased" is downloaded here when absent. Default: "./tokenizer".
hidden_dropout_prob (float) – Dropout for the word-embedding sub-encoder. Default: 0.0.
layer_norm_eps (float) – LayerNorm \(\varepsilon\). Default: 1e-5.
use_align_layer (bool) – Apply a linear projection before concatenation. Default: True.
disable_tokenizer_parallel (bool) – Set TOKENIZERS_PARALLELISM=false. Default: True.
ignore_duplicate_cols (bool) – Auto-rename duplicates instead of raising. Default: False.
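The tokenizer-related parameters above imply a resolution order. The following is a minimal pure-Python sketch of that order as documented, not the library's code. The helper name resolve_tokenizer_source, its return strings, and the use of a directory-existence check to decide "absent" are assumptions made for illustration only.

```python
import os
from typing import Optional


def resolve_tokenizer_source(tokenizer: Optional[object] = None,
                             tokenizer_dir: str = "./tokenizer",
                             disable_tokenizer_parallel: bool = True) -> str:
    """Hypothetical helper mirroring the documented precedence rules."""
    if disable_tokenizer_parallel:
        # Documented side effect: TOKENIZERS_PARALLELISM is set to "false".
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
    if tokenizer is not None:
        # A pre-initialised tokenizer takes precedence over tokenizer_dir.
        return "use provided tokenizer"
    if os.path.isdir(tokenizer_dir):
        # Assumption: an existing directory means a tokenizer is already there.
        return f"load tokenizer from {tokenizer_dir}"
    # Otherwise "bert-base-uncased" would be downloaded into tokenizer_dir.
    return f"download bert-base-uncased into {tokenizer_dir}"
```

Passing an already-loaded tokenizer avoids any dependence on the contents of tokenizer_dir, which may be useful in environments without network access.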
- forward(x: DataFrame | Dict[ColType, Tensor | Tuple[Tensor, ...]] | TableData, *, shuffle: bool = False, align_and_concat: bool = True, return_dict: bool = False, requires_grad: bool = False) → Dict[str, Tensor] | Dict[ColType, Tensor] | Tensor
Encode a table batch into embeddings.
- Parameters:
x (TableData) – Materialised input table batch. Per the signature, a DataFrame or a Dict[ColType, Tensor | Tuple[Tensor, ...]] is also accepted.
shuffle (bool) – Shuffle column order within each type. Default: False.
align_and_concat (bool) – Apply alignment projection and concatenate all type embeddings. Default: True.
return_dict (bool) – Return Dict[ColType, Tensor] instead of a concatenated tensor when align_and_concat=False. Default: False.
requires_grad (bool) – Enable gradients during encoding. Default: False.
- Returns:
{"embedding": [B, S, H], "attention_mask": [B, S]} when align_and_concat=True; otherwise a Dict[ColType, Tensor] (when return_dict=True) or a concatenated tensor.
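The three return forms of forward depend on two flags. As a reading aid, the sketch below encodes that decision table in plain Python. The function forward_output_kind is hypothetical and only restates the documented behaviour; it does not call the encoder.

```python
def forward_output_kind(align_and_concat: bool = True,
                        return_dict: bool = False) -> str:
    """Hypothetical helper: which documented return form a flag pair selects."""
    if align_and_concat:
        # Aligned and concatenated: a dict keyed by strings.
        return 'Dict[str, Tensor] with "embedding" [B, S, H] and "attention_mask" [B, S]'
    if return_dict:
        # No alignment, one entry per column type.
        return "Dict[ColType, Tensor]"
    # No alignment, per-type embeddings concatenated into one tensor.
    return "Tensor"
```

Note that return_dict has no documented effect while align_and_concat is left at its default of True.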
- load(ckpt_dir: str) → None
Restore tokenizer, column config, and encoder weights from ckpt_dir.
- update(cat: List[str] | None = None, num: List[str] | None = None, bin: List[str] | None = None) → None
Extend column lists with new names and recheck for duplicates.
- Parameters:
cat (List[str], optional) – New categorical columns.
num (List[str], optional) – New numerical columns.
bin (List[str], optional) – New binary columns.
- Raises:
ValueError – On duplicate columns when ignore_duplicate_cols is False.
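The duplicate check described for update can be illustrated with a small standalone sketch. It is not the library's implementation: the function check_columns, the argument name bin_cols, and the "[type]name" renaming scheme are assumptions chosen for illustration, and the sketch only handles names that repeat across column types.

```python
from collections import Counter
from typing import Dict, List, Optional


def check_columns(cat: Optional[List[str]] = None,
                  num: Optional[List[str]] = None,
                  bin_cols: Optional[List[str]] = None,
                  ignore_duplicate_cols: bool = False) -> Dict[str, List[str]]:
    """Hypothetical duplicate check across categorical/numerical/binary lists."""
    groups = {
        "cat": list(cat or []),
        "num": list(num or []),
        "bin": list(bin_cols or []),
    }
    # Count every column name across all three lists.
    counts = Counter(name for cols in groups.values() for name in cols)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates and not ignore_duplicate_cols:
        # Mirrors the documented ValueError when duplicates are not ignored.
        raise ValueError(f"Duplicate column names: {duplicates}")
    if duplicates:
        # Assumed renaming scheme: prefix each duplicate with its type tag.
        groups = {
            tag: [f"[{tag}]{c}" if c in duplicates else c for c in cols]
            for tag, cols in groups.items()
        }
    return groups
```

With ignore_duplicate_cols=True, a name such as "age" listed as both categorical and numerical would become "[cat]age" and "[num]age" under this assumed scheme, so each type keeps a distinct column.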