rllm.preprocessing.df_to_tensor

rllm.preprocessing.df_to_tensor(df: DataFrame, col_types: Dict[str, ColType], target_col: str | None = None, fillna_config: FillNAConfig | None = None, categorical_missing_values: Sequence | None = None, binary_true_values: Sequence[str] | None = None, tokenizer_config: TokenizerConfig | None = None, text_embedder_config: TextEmbedderConfig | None = None, timestamp_format: str | None = None, timestamp_fields: Sequence[str] | None = None, concat: bool = True, cat_hardcode: bool = True) → Tuple[Dict[ColType, Tensor | Tuple[Tensor, Tensor]], Tensor | None]

Convert a typed DataFrame into model-ready tensor features, dispatching each column according to its ColType. The function applies cleaning and missing-value handling, then collects the resulting tensors in a single feature dict keyed by column type. Optionally, it also extracts target_col as the target tensor y for supervised training.

Parameters:
  • df – Input DataFrame

  • col_types – Dictionary mapping column names to column types

  • target_col – Name of the target column. When None, no target is extracted and y is returned as None.

  • fillna_config – Fill-NA configuration shared by all supported column types. When None, FillNAConfig defaults are used.

  • categorical_missing_values – Optional extra values treated as missing when encoding categorical columns. Passed to encode_categorical().

  • binary_true_values – Optional list of string values that should be interpreted as 1 in binary columns. Passed to convert_binary().

  • tokenizer_config – Configuration for tokenization; if provided and tokenize_combine is True, all TEXT columns are jointly tokenized and stored as a single entry in feat_dict[ColType.TEXT].

  • text_embedder_config – Configuration for text embedding.

  • timestamp_format – Optional format string for parsing TIMESTAMP columns. None lets pd.to_datetime infer the format.

  • timestamp_fields – Optional list of time components to extract from TIMESTAMP columns (subset of ["YEAR", "MONTH", "DAY", "DAYOFWEEK", "HOUR", "MINUTE", "SECOND"]).

  • concat – Whether to concatenate/stack features of the same column type (e.g., numerical and categorical along the last dim, text and timestamp along the feature/channel dim).

  • cat_hardcode – Whether to cast categorical features to integer type.
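
The interplay of timestamp_format and timestamp_fields can be pictured with a small standard-library sketch. It is an illustration of the idea only, not rllm's implementation: the helper extract_fields and the FIELD_GETTERS table are hypothetical names, and DAYOFWEEK is assumed to follow the pandas convention of Monday being 0.

```python
from datetime import datetime

# Field names are copied from the timestamp_fields description above.
# The getters are an assumption about what each component means.
FIELD_GETTERS = {
    "YEAR": lambda t: t.year,
    "MONTH": lambda t: t.month,
    "DAY": lambda t: t.day,
    "DAYOFWEEK": lambda t: t.weekday(),  # assumed: Monday == 0, as in pandas
    "HOUR": lambda t: t.hour,
    "MINUTE": lambda t: t.minute,
    "SECOND": lambda t: t.second,
}


def extract_fields(values, timestamp_format, timestamp_fields):
    """Parse each string with timestamp_format and keep only the requested fields."""
    rows = []
    for value in values:
        parsed = datetime.strptime(value, timestamp_format)
        rows.append([FIELD_GETTERS[name](parsed) for name in timestamp_fields])
    return rows


# Hypothetical input column with an explicit format string.
rows = extract_fields(
    ["2024-03-15 08:30:00", "2023-12-31 23:59:59"],
    timestamp_format="%Y-%m-%d %H:%M:%S",
    timestamp_fields=["YEAR", "MONTH", "DAYOFWEEK"],
)
print(rows)  # one row per timestamp, one entry per requested field
```

Each input timestamp becomes one row with as many entries as there are requested fields, which is why a shorter timestamp_fields list yields a narrower TIMESTAMP feature.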

Returns:

(feat_dict, y), where feat_dict contains feature tensors keyed by column type and y is the target tensor (None if target_col is not given). When TEXT columns are tokenized, the corresponding value is a tuple (input_ids, attention_mask); otherwise it is an embedded tensor.

Return type:

tuple
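
Because a feat_dict value may be either a single tensor or an (input_ids, attention_mask) tuple, downstream code usually has to branch on the value's type. The sketch below shows one way to do that. It uses stand-ins so that it runs without rllm or torch: plain strings replace ColType keys, nested lists replace tensors, and describe_features is a hypothetical helper, not part of the library. It also assumes the first dimension of each value indexes rows.

```python
def describe_features(feat_dict):
    """Summarize each entry of a feat_dict-like mapping by kind and row count."""
    summary = {}
    for col_type, value in feat_dict.items():
        if isinstance(value, tuple):
            # Tokenized text: a pair of (input_ids, attention_mask).
            input_ids, attention_mask = value
            summary[col_type] = ("tokenized", len(input_ids), len(attention_mask))
        else:
            # Any other column type: a single tensor-like object.
            summary[col_type] = ("tensor", len(value))
    return summary


# Stand-in data shaped like the documented return value (assumed layout).
feat_dict = {
    "NUMERICAL": [[0.5, 1.2], [0.1, 3.4]],
    "TEXT": ([[101, 7592, 102], [101, 2088, 102]], [[1, 1, 1], [1, 1, 1]]),
}
y = [0, 1]

summary = describe_features(feat_dict)
print(summary)
```

The same branching works on real tensors, since len() on a tensor returns the size of its first dimension.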