rllm.preprocessing.standardize_tokenizer_output

rllm.preprocessing.standardize_tokenizer_output(tok_output, pad_token_id: int) → tuple[Tensor, Tensor]

Standardize tokenizer outputs into (input_ids, attention_mask).

Supported input formats:

  • Mapping (for example, transformers.BatchEncoding) with input_ids and optional attention_mask.

  • Tuple/List: (input_ids, attention_mask) or List[List[int]].

  • Single object exposing input_ids and optional attention_mask.

  • Raw ids only: List[int] / List[List[int]] / np.ndarray / torch.Tensor.
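The format handling listed above could be sketched roughly as follows. This is only an illustration, not the library's implementation: split_tokenizer_output is a hypothetical helper name, and the rule that only a 2-tuple is read as an (input_ids, attention_mask) pair is an assumption, since the source does not say how a two-element list is distinguished from a batch of two raw sequences.

```python
from collections.abc import Mapping


def split_tokenizer_output(tok_output):
    """Hypothetical helper: return (input_ids, attention_mask or None)."""
    # Mapping such as a dict: the attention_mask key is optional.
    if isinstance(tok_output, Mapping):
        return tok_output["input_ids"], tok_output.get("attention_mask")
    # Single object exposing input_ids (and maybe attention_mask) as attributes.
    if hasattr(tok_output, "input_ids"):
        return tok_output.input_ids, getattr(tok_output, "attention_mask", None)
    # Assumption: only a 2-tuple is read as an (input_ids, attention_mask) pair.
    if isinstance(tok_output, tuple) and len(tok_output) == 2:
        return tok_output[0], tok_output[1]
    # Anything else is treated as raw ids with no mask.
    return tok_output, None
```

A returned mask of None marks the case where the mask still has to be derived from pad_token_id.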

Behavior:

  • Converts inputs to 2D tensors \((B, L)\); ragged sequences are padded with pad_token_id.

  • If attention_mask is missing, it is derived from (input_ids != pad_token_id).

  • Ensures input_ids and attention_mask share the same shape and use torch.long dtype.
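The padding and mask rules above can be illustrated with a minimal sketch. It uses plain Python lists in place of tensors so that it runs without dependencies; pad_and_mask is a hypothetical helper, and both right-side padding and the promotion of a flat List[int] to a batch of one are assumptions not confirmed by the source.

```python
def pad_and_mask(sequences, pad_token_id):
    """Hypothetical helper: pad ragged id lists to (B, L) and derive a mask."""
    # Assumption: a flat List[int] is treated as a batch of one sequence.
    if sequences and isinstance(sequences[0], int):
        sequences = [sequences]
    max_len = max((len(seq) for seq in sequences), default=0)
    # Right-pad every row to the longest length (padding side is an assumption).
    input_ids = [
        list(seq) + [pad_token_id] * (max_len - len(seq)) for seq in sequences
    ]
    # The mask is 1 where the id differs from pad_token_id, else 0.
    attention_mask = [
        [int(tok != pad_token_id) for tok in row] for row in input_ids
    ]
    return input_ids, attention_mask


ids, mask = pad_and_mask([[5, 6, 7], [8]], pad_token_id=0)
print(ids)   # [[5, 6, 7], [8, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

One consequence of deriving the mask this way: any real token whose id equals pad_token_id is masked as well, which matters when a tokenizer reuses its end-of-sequence id for padding. In that case, passing an explicit attention_mask avoids the ambiguity.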

Notation:

  • \(B\) is batch size (the number of tokenized samples).

  • \(L\) is the sequence length shared by every row of the standardized output, after padding/truncation alignment.

Parameters:
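  • tok_output – Raw output from a tokenizer, in any of the supported input formats.

  • pad_token_id (int) – Token ID used to pad ragged sequences and, when attention_mask is missing, to derive it.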

Returns:

(input_ids, attention_mask), both with shape \((B, L)\) and dtype torch.long.

Return type:

tuple[torch.Tensor, torch.Tensor]