rllm.preprocessing.tokenize_strings

rllm.preprocessing.tokenize_strings(seqs: list[str], tokenizer: Callable, pad_token_id: int, standardize_func: Callable, batch_size: int | None = None) → tuple[Tensor, Tensor]

Tokenize a list of strings and build batched model inputs. Tokenization can run in one shot or in mini-batches to reduce peak memory. The output is standardized to (input_ids, attention_mask) tensors.

Parameters:
  • seqs (list[str]) – Strings to tokenize.

  • tokenizer (Callable) – Tokenizer callable.

  • pad_token_id (int) – Padding token ID.

  • standardize_func (Callable) – Function that normalizes tokenizer output into ids and masks.

  • batch_size (Optional[int]) – Mini-batch size used during tokenization. If None, all strings are tokenized in one shot.

Returns:

(input_ids, attention_mask), both with shape \((B, L)\).

Return type:

tuple[torch.Tensor, torch.Tensor]
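
Example:

The following is a minimal sketch of the flow described above, not the library's implementation. It uses plain Python lists instead of tensors so that it runs without extra dependencies. The names tokenize_sketch, toy_tokenizer, pad_and_mask, and PAD_ID are hypothetical, and the call signatures assumed here for the tokenizer and the standardizing function are illustrative assumptions; the callables that rllm actually expects may differ.

```python
from typing import Callable, Optional

PAD_ID = 0  # hypothetical padding token ID


def toy_tokenizer(batch: list[str]) -> list[list[int]]:
    """Hypothetical tokenizer: maps each whitespace-separated word to its length."""
    return [[len(word) for word in text.split()] for text in batch]


def pad_and_mask(
    rows: list[list[int]], pad_token_id: int
) -> tuple[list[list[int]], list[list[int]]]:
    """Hypothetical standardizer: right-pad rows to a common length and build masks."""
    max_len = max((len(row) for row in rows), default=0)
    input_ids = [row + [pad_token_id] * (max_len - len(row)) for row in rows]
    attention_mask = [[1] * len(row) + [0] * (max_len - len(row)) for row in rows]
    return input_ids, attention_mask


def tokenize_sketch(
    seqs: list[str],
    tokenizer: Callable,
    pad_token_id: int,
    standardize_func: Callable,
    batch_size: Optional[int] = None,
) -> tuple[list[list[int]], list[list[int]]]:
    """Tokenize in one shot (batch_size=None) or in mini-batches, then standardize."""
    if batch_size is not None and batch_size < 1:
        raise ValueError("batch_size must be a positive integer or None")
    # With batch_size=None the whole list forms a single chunk.
    step = batch_size if batch_size is not None else max(len(seqs), 1)
    rows: list[list[int]] = []
    for start in range(0, len(seqs), step):
        # Only one chunk is passed to the tokenizer at a time.
        rows.extend(tokenizer(seqs[start:start + step]))
    # Pad once at the end so every row shares the same length L.
    return standardize_func(rows, pad_token_id)


seqs = ["a bb ccc", "dddd", "ee f"]
ids, mask = tokenize_sketch(seqs, toy_tokenizer, PAD_ID, pad_and_mask, batch_size=2)
ids_one_shot, mask_one_shot = tokenize_sketch(seqs, toy_tokenizer, PAD_ID, pad_and_mask)
```

In this sketch, padding is applied only after all mini-batches have been tokenized, so the mini-batched result and the one-shot result have the same shape. Whether rllm pads per mini-batch or globally is not stated in this page.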