rllm.preprocessing.tokenize_strings
- rllm.preprocessing.tokenize_strings(seqs: list[str], tokenizer: Callable, pad_token_id: int, standardize_func: Callable, batch_size: int | None = None) → tuple[Tensor, Tensor]
Tokenize a list of strings and build batched model inputs. Tokenization can run in one shot or in mini-batches to reduce peak memory. The output is standardized to (input_ids, attention_mask) tensors.
- Parameters:
seqs (list[str]) – Strings to tokenize.
tokenizer (Callable) – Tokenizer callable.
pad_token_id (int) – Padding token ID.
standardize_func (Callable) – Function that normalizes tokenizer output into ids and masks.
batch_size (Optional[int]) – Mini-batch size. None means all strings are tokenized in one shot.
- Returns:
(input_ids, attention_mask), both with shape \((B, L)\).
- Return type:
tuple[torch.Tensor, torch.Tensor]
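
The mini-batch behaviour described above can be illustrated with a minimal sketch. This is not the rllm implementation: `tokenize_strings_sketch`, `toy_tokenizer`, and `toy_standardize` are hypothetical names introduced here, the sketch returns nested Python lists instead of tensors so that it has no dependencies, and it assumes right-padding to the longest sequence across all mini-batches.

```python
from typing import Callable, Optional


def tokenize_strings_sketch(
    seqs: list[str],
    tokenizer: Callable,
    pad_token_id: int,
    standardize_func: Callable,
    batch_size: Optional[int] = None,
) -> tuple[list[list[int]], list[list[int]]]:
    """Illustrative stand-in for tokenize_strings (hypothetical, list-based)."""
    # batch_size=None means a single chunk that holds every string.
    step = len(seqs) if batch_size is None else batch_size
    step = max(step, 1)  # guard against empty input or a zero step

    all_ids: list[list[int]] = []
    all_masks: list[list[int]] = []
    for start in range(0, len(seqs), step):
        chunk = seqs[start:start + step]
        # standardize_func turns raw tokenizer output into (ids, masks).
        ids, masks = standardize_func(tokenizer(chunk))
        all_ids.extend(ids)
        all_masks.extend(masks)

    # Assumption: right-pad every row to the longest sequence overall.
    max_len = max((len(row) for row in all_ids), default=0)
    padded_ids = [row + [pad_token_id] * (max_len - len(row)) for row in all_ids]
    padded_masks = [row + [0] * (max_len - len(row)) for row in all_masks]
    return padded_ids, padded_masks


def toy_tokenizer(batch: list[str]) -> dict:
    # Hypothetical tokenizer: one id per whitespace-separated word (its length).
    return {"input_ids": [[len(word) for word in s.split()] for s in batch]}


def toy_standardize(output: dict) -> tuple[list[list[int]], list[list[int]]]:
    # Hypothetical normalizer: pull out ids and mark every real token with 1.
    ids = output["input_ids"]
    masks = [[1] * len(row) for row in ids]
    return ids, masks


seqs = ["a bb ccc", "dd", "e f"]
input_ids, attention_mask = tokenize_strings_sketch(
    seqs, toy_tokenizer, 0, toy_standardize, batch_size=2
)
print(input_ids)
print(attention_mask)
```

In this sketch the mini-batch size only changes how many strings the tokenizer sees per call; because padding is applied after all chunks are collected, the result is the same as with `batch_size=None`.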