rllm.preprocessing.standardize_tokenizer_output

rllm.preprocessing.standardize_tokenizer_output(tok_output, pad_token_id: int) → tuple[Tensor, Tensor]

Standardize tokenizer outputs into (input_ids, attention_mask).

Supported input formats:

Mapping (for example, transformers.BatchEncoding) with input_ids and optional attention_mask.
Tuple/List: (input_ids, attention_mask) or List[List[int]].
Single object exposing input_ids and optional attention_mask.
Raw ids only: List[int] / List[List[int]] / np.ndarray / torch.Tensor.
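As a rough illustration of how the input formats above could be told apart, the sketch below separates the ids from an optional mask. It is not the library's implementation: the helper name split_ids_and_mask is hypothetical, and the rule that only a 2-tuple is read as (input_ids, attention_mask) is an assumption made here to keep the pair form distinct from a List[List[int]] batch.

```python
from collections.abc import Mapping


def split_ids_and_mask(tok_output):
    """Hypothetical helper: return (input_ids, attention_mask or None)."""
    # Mapping form: a dict-like object keyed by "input_ids".
    if isinstance(tok_output, Mapping):
        return tok_output["input_ids"], tok_output.get("attention_mask")
    # Single object exposing input_ids (attention_mask is optional).
    if hasattr(tok_output, "input_ids"):
        return tok_output.input_ids, getattr(tok_output, "attention_mask", None)
    # Assumption: only a 2-tuple is treated as (input_ids, attention_mask).
    if isinstance(tok_output, tuple) and len(tok_output) == 2:
        return tok_output[0], tok_output[1]
    # Anything else is treated as raw ids with no mask.
    return tok_output, None
```

In this sketch a returned mask of None signals that the mask still has to be derived from the ids and pad_token_id.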

Behavior:

Converts inputs to 2D tensors \((B, L)\); ragged sequences are padded with pad_token_id.
If attention_mask is missing, it is derived from (input_ids != pad_token_id).
Ensures input_ids and attention_mask share the same shape and use torch.long dtype.
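The padding and mask-derivation behavior can be sketched in plain Python, without torch, to make the logic easy to follow. This is a minimal sketch, not the function's actual code: the name pad_and_mask is hypothetical, and padding on the right is an assumption, since the padding side is not stated here. The real function returns torch.long tensors rather than nested lists.

```python
def pad_and_mask(sequences, pad_token_id):
    """Hypothetical sketch: pad ragged id lists and derive a 0/1 mask."""
    # Promote a single flat sequence (List[int]) to a batch of one.
    if sequences and not isinstance(sequences[0], (list, tuple)):
        sequences = [sequences]
    # L is the length of the longest sequence in the batch.
    max_len = max((len(s) for s in sequences), default=0)
    # Assumption: pad on the right with pad_token_id.
    input_ids = [
        list(s) + [pad_token_id] * (max_len - len(s)) for s in sequences
    ]
    # Mask is 1 wherever the id differs from pad_token_id, else 0.
    attention_mask = [
        [int(tok != pad_token_id) for tok in row] for row in input_ids
    ]
    return input_ids, attention_mask


ids, mask = pad_and_mask([[5, 6, 7], [8]], pad_token_id=0)
# ids  == [[5, 6, 7], [8, 0, 0]]
# mask == [[1, 1, 1], [1, 0, 0]]
```

One consequence of deriving the mask this way is that any genuine token whose id equals pad_token_id is masked out too, so passing an explicit attention_mask is the safer choice when the pad id can also occur as real content.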

Notation:

\(B\) is the batch size (the number of tokenized samples).
\(L\) is the sequence length shared by every row of the standardized output, after padding/truncation alignment.

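Parameters:

tok_output – Tokenizer output in any of the supported input formats.
pad_token_id (int) – Token id used to pad ragged sequences and, when attention_mask is missing, to derive it.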

Returns:

(input_ids, attention_mask), both with shape \((B, L)\) and dtype torch.long.

Return type:

tuple[torch.Tensor, torch.Tensor]