rllm.preprocessing.tokenize_strings

rllm.preprocessing.tokenize_strings(seqs: list[str], tokenizer: Callable, pad_token_id: int, standardize_func: Callable, batch_size: int | None = None) → tuple[Tensor, Tensor]

Tokenize a list of strings and build batched model inputs. Tokenization runs either in a single call or in mini-batches of batch_size strings, which reduces peak memory. The tokenizer output is standardized into a pair of (input_ids, attention_mask) tensors.

Parameters:

seqs (list[str]) – Strings to tokenize.
tokenizer (Callable) – Tokenizer callable.
pad_token_id (int) – Padding token ID.
standardize_func (Callable) – Function that normalizes tokenizer output into ids and masks.
batch_size (int | None) – Mini-batch size. If None, all strings are tokenized in a single call.
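
The batch_size behavior described above can be illustrated with a small, dependency-free sketch. The helper name `iter_minibatches` is hypothetical and is not part of rllm; it only shows one plausible way to split the input so that `None` yields a single chunk and an integer yields consecutive mini-batches.

```python
from typing import Iterator, Optional


def iter_minibatches(
    seqs: list[str], batch_size: Optional[int] = None
) -> Iterator[list[str]]:
    """Yield consecutive chunks of `seqs` (hypothetical helper, not rllm API)."""
    if batch_size is None:
        # One shot: the whole list is processed as a single chunk.
        yield list(seqs)
        return
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer or None")
    for start in range(0, len(seqs), batch_size):
        # The final chunk may be shorter than batch_size.
        yield seqs[start:start + batch_size]


seqs = ["a b", "c", "d e f", "g", "h i"]
one_shot = list(iter_minibatches(seqs))
chunks = list(iter_minibatches(seqs, batch_size=2))
```

With a smaller batch_size, fewer strings are passed to the tokenizer at once, which is where the reduction in peak memory would come from.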

Returns:

(input_ids, attention_mask), both with shape \((B, L)\), where \(B\) is the number of input strings and \(L\) is the padded sequence length.

Return type:

tuple[torch.Tensor, torch.Tensor]
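
As a rough illustration of the \((B, L)\) output, the sketch below pads variable-length token ID lists to a common length and builds the matching mask. It uses plain Python lists instead of tensors to stay dependency-free. The names `pad_and_mask` and `toy_tokenizer` are hypothetical, and right-padding is an assumption: this page does not state which side rllm pads on.

```python
def pad_and_mask(
    token_ids: list[list[int]], pad_token_id: int
) -> tuple[list[list[int]], list[list[int]]]:
    """Right-pad each row to the longest row and build a 0/1 attention mask.

    Hypothetical helper for illustration; the padding side is an assumption.
    """
    max_len = max((len(ids) for ids in token_ids), default=0)
    input_ids, attention_mask = [], []
    for ids in token_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


# Toy whitespace tokenizer (hypothetical): assigns IDs in order of first
# appearance, starting at 1 so that 0 stays free for padding.
vocab: dict[str, int] = {}


def toy_tokenizer(text: str) -> list[int]:
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


token_ids = [toy_tokenizer(s) for s in ["the cat sat", "the dog"]]
input_ids, attention_mask = pad_and_mask(token_ids, pad_token_id=0)
```

Here both results have 2 rows of length 3, matching \(B = 2\) and \(L = 3\). With PyTorch installed, `torch.tensor(input_ids)` and `torch.tensor(attention_mask)` would convert them to tensors of that shape.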