rllm.preprocessing.TokenizerConfig

class rllm.preprocessing.TokenizerConfig(tokenizer: Callable[[list[str]], Any], batch_size: int | None = None, pad_token_id: int = 0, tokenize_combine: bool = True, include_colname: bool = True, save_colname_token_ids: bool = False, segment_sep: str = ' ', name_value_sep: str = ' ')

Bases: object

Configuration for text tokenization across preprocessing utilities. It controls batching, padding behavior, and whether multiple text columns are merged before tokenization. It also defines how column names are joined with cell values when building input strings.
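As a rough illustration of how the merging and separator options might interact, the sketch below builds input strings for a single row. The helper `build_row_texts` and the sample row are hypothetical and are not part of rllm. The sketch assumes that each segment is formed as column name, `name_value_sep`, then cell value, and that segments are joined with `segment_sep` when `tokenize_combine` is enabled. The library's actual string construction may differ.

```python
def build_row_texts(row, tokenize_combine=True, include_colname=True,
                    segment_sep=" ", name_value_sep=" "):
    """Hypothetical helper (not rllm API): turn one row into input strings.

    `row` maps column name -> cell text. Keyword defaults mirror the
    defaults shown in the TokenizerConfig signature.
    """
    segments = [
        f"{name}{name_value_sep}{value}" if include_colname else str(value)
        for name, value in row.items()
    ]
    # Merged mode yields one string per row; otherwise one string per column.
    return [segment_sep.join(segments)] if tokenize_combine else segments


row = {"title": "Dune", "genre": "sci-fi"}  # made-up sample data

merged = build_row_texts(row, segment_sep=" | ", name_value_sep=": ")
per_column = build_row_texts(row, tokenize_combine=False, include_colname=False)

print(merged)      # ['title: Dune | genre: sci-fi']
print(per_column)  # ['Dune', 'sci-fi']
```

With the default single-space separators, the same row would become `'title Dune genre sci-fi'` under this assumption, so a more distinctive `segment_sep` can make segment boundaries easier for a tokenizer to pick up.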

Parameters:

tokenizer (Callable[[list[str]], Any]) – Tokenizer callable that accepts a list of strings.
batch_size (Optional[int]) – Mini-batch size for tokenization. Defaults to None.
pad_token_id (int) – Padding token ID used when masks are generated. Defaults to 0.
tokenize_combine (bool) – Whether to tokenize all text columns as one merged string per row. Defaults to True.
include_colname (bool) – Whether to prepend column names to cell values. Defaults to True.
save_colname_token_ids (bool) – Whether to cache tokenized column-name IDs for downstream reuse. Defaults to False.
segment_sep (str) – Separator between merged text segments. Defaults to ' ' (a single space).
name_value_sep (str) – Separator between a column name and its text value. Defaults to ' ' (a single space).
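The following sketch shows one shape a `tokenizer` callable could take, together with a guess at how `batch_size` and `pad_token_id` might be used around it. Everything here is illustrative: `toy_tokenizer`, `tokenize_in_batches`, and `attention_mask` are hypothetical names, the `"input_ids"` return key is an assumption (the signature only promises `Any`), and deriving the mask by comparing IDs against `pad_token_id` is one plausible reading of the parameter description rather than confirmed library behavior.

```python
from typing import Any

PAD_TOKEN_ID = 0  # same value as the documented default for pad_token_id
_vocab: dict[str, int] = {}  # toy vocabulary shared across calls


def toy_tokenizer(texts: list[str]) -> dict[str, Any]:
    """Hypothetical whitespace tokenizer matching Callable[[list[str]], Any]."""
    rows = [[_vocab.setdefault(w, len(_vocab) + 1) for w in t.split()]
            for t in texts]
    width = max((len(r) for r in rows), default=0)
    # Right-pad every row in this call to the longest row in this call.
    return {"input_ids": [r + [PAD_TOKEN_ID] * (width - len(r)) for r in rows]}


def tokenize_in_batches(texts, tokenizer, batch_size=None):
    """Call the tokenizer once, or once per mini-batch when batch_size is set."""
    step = batch_size or max(len(texts), 1)
    ids = []
    for start in range(0, len(texts), step):
        ids.extend(tokenizer(texts[start:start + step])["input_ids"])
    return ids


def attention_mask(ids, pad_token_id=PAD_TOKEN_ID):
    """Assumed mask rule: 1 for real tokens, 0 for padding positions."""
    return [[int(tok != pad_token_id) for tok in row] for row in ids]


texts = ["title Dune", "title Dune genre sci-fi", "genre"]
ids = tokenize_in_batches(texts, toy_tokenizer, batch_size=2)
mask = attention_mask(ids)

print(ids)   # [[1, 2, 0, 0], [1, 2, 3, 4], [3]]
print(mask)  # [[1, 1, 0, 0], [1, 1, 1, 1], [1]]
```

In this toy setup padding is applied per mini-batch, so rows from different batches can have different lengths. A callable like this would correspond to passing `tokenizer=toy_tokenizer, batch_size=2` in the constructor signature above.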