rllm.preprocessing.tokenize_merged_cols

rllm.preprocessing.tokenize_merged_cols(df: DataFrame, col_types: dict, tokenizer_config: TokenizerConfig, target_col: str | None = None) tuple | None[source]

Merge all text columns per row and then tokenize. Depending on configuration, each text segment may include its column name as a prefix before row-wise concatenation. If no eligible text column exists, the function returns None.

Parameters:
  • df (DataFrame) – Input table.

  • col_types (dict) – Mapping of column name to ColType.

  • tokenizer_config (TokenizerConfig) – Tokenizer configuration.

  • target_col (Optional[str]) – Target column excluded from text merge.

Returns:

(input_ids, attention_mask) with shape \((B, L)\) if text columns exist; otherwise None.

Return type:

Optional[tuple]