rllm.preprocessing.process_tokenized_column

rllm.preprocessing.process_tokenized_column(col_series: Series, col_name: str, tokenizer_config: TokenizerConfig, include_colname: bool = True, name_value_sep: str = ' ') → tuple[Tensor, Tensor]

Tokenize a single text column into token ids and attention masks. When include_colname is True, each cell is prefixed with its column name, joined by name_value_sep, before tokenization. The function returns two batched tensors that share the same sequence length.

Parameters:

col_series (Series) – Input text column.
col_name (str) – Name of the column; used as the prefix when include_colname is True.
tokenizer_config (TokenizerConfig) – Tokenizer configuration.
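include_colname (bool) – Whether to prepend the column name to each cell before tokenization. Defaults to True.
name_value_sep (str) – Separator placed between the column name and the cell value. Defaults to ' ' (a single space).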

Returns:

(input_ids, attention_mask), both with shape \((N, L)\), where \(N\) is the number of rows in col_series and \(L\) is the shared sequence length.

Return type:

tuple[torch.Tensor, torch.Tensor]
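
Example:

The following is a minimal sketch of the behaviour described above, not rLLM's actual implementation. It uses plain Python lists instead of a Series and tensors so that it runs without dependencies. The helpers build_cell_texts and toy_tokenize are hypothetical names introduced for illustration, the whitespace tokenizer stands in for whatever TokenizerConfig selects, and the assumption that the prefix is formed as column name, then separator, then value follows from the parameter descriptions rather than from the source code.

```python
def build_cell_texts(values, col_name, include_colname=True, name_value_sep=" "):
    """Hypothetical helper: optionally prefix each cell with its column name."""
    texts = []
    for value in values:
        text = str(value)
        if include_colname:
            # Assumed layout: "<col_name><name_value_sep><value>"
            text = f"{col_name}{name_value_sep}{text}"
        texts.append(text)
    return texts


def toy_tokenize(texts, pad_id=0):
    """Toy whitespace tokenizer (an assumption, not the real tokenizer).

    Returns (input_ids, attention_mask) as nested lists, right-padded so
    that every row has the same length L.
    """
    vocab = {}
    rows = []
    for text in texts:
        # Assign ids 1, 2, 3, ... in order of first appearance; 0 is padding.
        rows.append([vocab.setdefault(tok, len(vocab) + 1) for tok in text.split()])
    max_len = max((len(row) for row in rows), default=0)
    input_ids = [row + [pad_id] * (max_len - len(row)) for row in rows]
    attention_mask = [[1] * len(row) + [0] * (max_len - len(row)) for row in rows]
    return input_ids, attention_mask


# Two cells from a column named "review".
texts = build_cell_texts(["great movie", "bad"], "review")
input_ids, attention_mask = toy_tokenize(texts)
```

Here N = 2 and L = 3: both outputs have one row per input cell, and the attention mask is 0 only at the padded position of the shorter row. How the real function pads, truncates, or handles missing values depends on tokenizer_config and is not specified on this page.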