rllm.preprocessing.save_column_name_tokens

rllm.preprocessing.save_column_name_tokens(col_types: dict, tokenizer: Callable, pad_token_id: int, standardize_func: Callable) → dict[str, tuple[Tensor, Tensor]]

Tokenize all column names once and cache their token tensors. This is useful when column-name tokens are reused across many samples. The returned mapping stores one (input_ids, attention_mask) pair per column name.

Parameters:
  • col_types (dict) – Mapping of column names to ColType.

  • tokenizer (Callable) – Tokenizer callable applied to each column name.

  • pad_token_id (int) – Padding token ID.

  • standardize_func (Callable) – Function that normalizes tokenizer output.

Returns:

Mapping from column name to a (token ids, attention mask) pair, each a 1-D tensor of shape (L,).

Return type:

dict[str, tuple[torch.Tensor, torch.Tensor]]
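
Example:

The caching pattern described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `toy_tokenizer` and `cache_column_name_tokens` are hypothetical names, plain lists stand in for the torch.Tensor pairs, the string values in `col_types` stand in for ColType members, and building the mask by comparing ids against `pad_token_id` is an assumed use of that parameter. `standardize_func` is left out because its exact contract is not stated here.

```python
from typing import Callable, Dict, List, Tuple

# Plain lists stand in for the (input_ids, attention_mask) tensor pair.
TokenPair = Tuple[List[int], List[int]]


def toy_tokenizer(text: str) -> List[int]:
    """Hypothetical tokenizer: one id per whitespace-separated word."""
    # Adding 1 keeps every id >= 1, so no id collides with pad id 0.
    return [sum(ord(ch) for ch in word) % 1000 + 1 for word in text.split()]


def cache_column_name_tokens(
    col_types: Dict[str, str],
    tokenizer: Callable[[str], List[int]],
    pad_token_id: int,
) -> Dict[str, TokenPair]:
    """Tokenize each column name once; map name -> (input_ids, attention_mask)."""
    cache: Dict[str, TokenPair] = {}
    for name in col_types:
        # Treat underscores as word separators before tokenizing.
        input_ids = tokenizer(name.replace("_", " "))
        # Assumption: the mask is 0 where the id equals the pad id, else 1.
        attention_mask = [0 if tok == pad_token_id else 1 for tok in input_ids]
        cache[name] = (input_ids, attention_mask)
    return cache


# String values are placeholders for ColType members.
col_types = {"age": "numerical", "home_city": "categorical"}
cache = cache_column_name_tokens(col_types, toy_tokenizer, pad_token_id=0)

# Later samples look the tokens up instead of re-tokenizing the name.
ids, mask = cache["home_city"]
```

Because the cache is keyed by column name, tokenization cost is paid once per column rather than once per sample.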