rllm.preprocessing.save_column_name_tokens
- rllm.preprocessing.save_column_name_tokens(col_types: dict, tokenizer: Callable, pad_token_id: int, standardize_func: Callable) → dict[str, tuple[Tensor, Tensor]]
Tokenize all column names once and cache their token tensors. This is useful when column-name tokens are reused across many samples. The returned mapping stores one (input_ids, attention_mask) pair per column name.
- Parameters:
col_types (dict) – Mapping of column names to ColType.
tokenizer (Callable) – Tokenizer callable.
pad_token_id (int) – Padding token ID.
standardize_func (Callable) – Function that normalizes tokenizer output.
- Returns:
Mapping from column name to token ids and attention mask, each with shape \((L,)\).
- Return type:
dict[str, tuple[torch.Tensor, torch.Tensor]]
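The caching idea described above can be illustrated with a minimal, dependency-free sketch. It is not the rllm implementation: `toy_tokenizer`, `toy_standardize`, and `cache_column_name_tokens` are hypothetical names, plain Python lists stand in for `torch.Tensor`, and the way `pad_token_id` is used here (as a fallback for names that produce no tokens) is an assumption, since this page does not say how the real function uses it.

```python
from typing import Callable


def toy_tokenizer(text: str) -> dict:
    """Hypothetical tokenizer: one positive id per word (split on '_' and spaces)."""
    words = text.replace("_", " ").split()
    ids = [sum(ord(c) for c in w) % 1000 + 1 for w in words]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def toy_standardize(output: dict) -> tuple[list[int], list[int]]:
    """Hypothetical normalizer: reduce tokenizer output to an (ids, mask) pair."""
    return output["input_ids"], output["attention_mask"]


def cache_column_name_tokens(
    col_types: dict,
    tokenizer: Callable,
    pad_token_id: int,
    standardize_func: Callable,
) -> dict[str, tuple[list[int], list[int]]]:
    """Tokenize each column name once and keep the result keyed by name."""
    cache = {}
    for name in col_types:
        input_ids, attention_mask = standardize_func(tokenizer(name))
        if not input_ids:
            # Assumption: an empty name falls back to a single padding token.
            input_ids, attention_mask = [pad_token_id], [0]
        cache[name] = (input_ids, attention_mask)
    return cache


# Column types are placeholder strings here; rllm uses ColType values.
col_types = {"age": "numerical", "home_city": "categorical"}
cache = cache_column_name_tokens(col_types, toy_tokenizer, 0, toy_standardize)

# Later, per-sample code looks tokens up instead of re-tokenizing the name.
input_ids, attention_mask = cache["home_city"]
```

Because the cache is keyed by column name, the tokenizer runs once per column rather than once per sample, which is the saving the function description refers to.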