rllm.preprocessing.standardize_tokenizer_output
- rllm.preprocessing.standardize_tokenizer_output(tok_output, pad_token_id: int) → tuple[Tensor, Tensor]
Standardize tokenizer outputs into (input_ids, attention_mask).
Supported input formats:
- Mapping (for example, transformers.BatchEncoding) with input_ids and optional attention_mask.
- Tuple/List: (input_ids, attention_mask) or List[List[int]].
- Single object exposing input_ids and optional attention_mask.
- Raw ids only: List[int] / List[List[int]] / np.ndarray / torch.Tensor.
Behavior:
- Converts inputs to 2D tensors \((B, L)\); ragged sequences are padded with pad_token_id.
- If attention_mask is missing, it is derived from (input_ids != pad_token_id).
- Ensures input_ids and attention_mask share the same shape and use torch.long dtype.
Notation:
\(B\) is batch size (the number of tokenized samples).
\(L\) is sequence length after padding/truncation alignment in the standardized output.
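The padding and mask-derivation rules described above can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: pad_and_mask is a hypothetical helper, it returns nested lists rather than torch.long tensors, and it assumes right-padding, which the documentation does not specify.

```python
from typing import List, Sequence, Tuple


def pad_and_mask(
    sequences: Sequence[Sequence[int]], pad_token_id: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Hypothetical helper: pad ragged id lists and derive a 0/1 mask."""
    # L is the longest sequence in the batch; B is len(sequences).
    max_len = max((len(seq) for seq in sequences), default=0)
    # Assumption: pad on the right so every row has length L.
    input_ids = [
        list(seq) + [pad_token_id] * (max_len - len(seq)) for seq in sequences
    ]
    # No mask was supplied, so derive it as (input_ids != pad_token_id).
    attention_mask = [
        [int(tok != pad_token_id) for tok in row] for row in input_ids
    ]
    return input_ids, attention_mask


ids, mask = pad_and_mask([[101, 7, 8, 102], [101, 9, 102]], pad_token_id=0)
print(ids)   # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

One consequence of the derivation rule is that any real token whose ID equals pad_token_id (for example, when the padding token is reused as an end-of-sequence token) is masked out as well. Supplying an explicit attention_mask avoids relying on the derived one in that case.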
- Parameters:
tok_output – Raw output from a tokenizer.
pad_token_id (int) – Padding token ID.
- Returns:
(input_ids, attention_mask), both with shape \((B, L)\) and dtype torch.long.
- Return type:
tuple[torch.Tensor, torch.Tensor]