rllm.preprocessing.df_to_tensor
- rllm.preprocessing.df_to_tensor(df: DataFrame, col_types: Dict[str, ColType], target_col: str | None = None, fillna_config: FillNAConfig | None = None, categorical_missing_values: Sequence | None = None, binary_true_values: Sequence[str] | None = None, tokenizer_config: TokenizerConfig | None = None, text_embedder_config: TextEmbedderConfig | None = None, timestamp_format: str | None = None, timestamp_fields: Sequence[str] | None = None, concat: bool = True, cat_hardcode: bool = True) → Tuple[Dict[ColType, Tensor | Tuple[Tensor, Tensor]], Tensor | None]
Convert a typed DataFrame into model-ready tensor features by dispatching each column according to its
ColType. The function applies cleaning and missing-value handling, then assembles the resulting tensors into a unified feature dict. It can also extract target_col as y for supervised training.
- Parameters:
df – Input DataFrame
col_types – Dictionary mapping column names to column types
target_col – Name of target column
fillna_config – Fill-NA configuration shared by all supported column types. When None, FillNAConfig defaults are used.
categorical_missing_values – Optional extra values treated as missing when encoding categorical columns. Passed to encode_categorical().
binary_true_values – Optional list of string values that should be interpreted as 1 in binary columns. Passed to convert_binary().
tokenizer_config – Configuration for tokenization; if provided and tokenize_combine is True, all TEXT columns are jointly tokenized and stored as a single entry in feat_dict[ColType.TEXT].
text_embedder_config – Configuration for text embedding.
timestamp_format – Optional format string for parsing TIMESTAMP columns. None lets pd.to_datetime infer the format.
timestamp_fields – Optional list of time components to extract from TIMESTAMP columns (subset of ["YEAR", "MONTH", "DAY", "DAYOFWEEK", "HOUR", "MINUTE", "SECOND"]).
concat – Whether to concatenate/stack features of the same column type (e.g., numerical and categorical along the last dim, text and timestamp along the feature/channel dim).
cat_hardcode – Whether to cast categorical features to integer type.
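To make the timestamp_format and timestamp_fields parameters concrete, the sketch below shows what extracting a subset of time components from a parsed timestamp might look like. It uses only the Python standard library and is not the library's implementation: extract_timestamp_fields and EXTRACTORS are hypothetical names, an explicit format string is required here (whereas the documented function can let pd.to_datetime infer it), and the Monday = 0 convention for DAYOFWEEK is an assumption.

```python
from datetime import datetime

# Component names come from the timestamp_fields documentation above.
# The mapping itself is an illustrative assumption, not library code.
EXTRACTORS = {
    "YEAR": lambda t: t.year,
    "MONTH": lambda t: t.month,
    "DAY": lambda t: t.day,
    "DAYOFWEEK": lambda t: t.weekday(),  # assumption: Monday = 0, Sunday = 6
    "HOUR": lambda t: t.hour,
    "MINUTE": lambda t: t.minute,
    "SECOND": lambda t: t.second,
}


def extract_timestamp_fields(values, timestamp_format, timestamp_fields):
    """Hypothetical helper: parse each string and return one row of integers per value."""
    rows = []
    for raw in values:
        parsed = datetime.strptime(raw, timestamp_format)
        # Keep only the requested components, in the requested order.
        rows.append([EXTRACTORS[name](parsed) for name in timestamp_fields])
    return rows


rows = extract_timestamp_fields(
    ["2024-03-15 08:30:00", "2023-12-31 23:59:59"],
    "%Y-%m-%d %H:%M:%S",
    ["YEAR", "MONTH", "DAYOFWEEK"],
)
print(rows)  # [[2024, 3, 4], [2023, 12, 6]]
```

Each input value yields one row with one integer per requested component, which is the kind of rectangular layout that can then be turned into a tensor.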
- Returns:
- (feat_dict, y) where feat_dict contains feature tensors by column type,
and y is the target tensor (None if no target_col). When TEXT columns are tokenized, the corresponding value is a tuple of
(input_ids, attention_mask); otherwise it is an embedded tensor.
- Return type:
tuple
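Because a value in feat_dict may be either a single tensor or an (input_ids, attention_mask) tuple, downstream code usually needs to branch on the entry's shape. The following is a minimal sketch of that branching, written so it runs without the library installed: the ColType enum here is a stand-in that only mirrors the TEXT and TIMESTAMP members named above, nested lists stand in for tensors, and describe_features is a hypothetical helper.

```python
from enum import Enum


class ColType(Enum):
    """Stand-in for the library's ColType so this sketch is self-contained."""
    TEXT = "text"
    TIMESTAMP = "timestamp"


def describe_features(feat_dict):
    """Hypothetical helper: label each entry as tokenized or plain and count its rows."""
    summary = {}
    for col_type, value in feat_dict.items():
        if isinstance(value, tuple):
            # Tokenized TEXT columns: a pair of (input_ids, attention_mask).
            input_ids, attention_mask = value
            summary[col_type] = ("tokenized", len(input_ids))
        else:
            # Any other entry: a single feature tensor.
            summary[col_type] = ("tensor", len(value))
    return summary


# Nested lists stand in for tensors; the values are made up for illustration.
feat_dict = {
    ColType.TIMESTAMP: [[2024, 3, 4], [2023, 12, 6]],
    ColType.TEXT: ([[5, 6, 7], [8, 9, 0]], [[1, 1, 1], [1, 1, 0]]),
}
summary = describe_features(feat_dict)
print(summary[ColType.TEXT])       # ('tokenized', 2)
print(summary[ColType.TIMESTAMP])  # ('tensor', 2)
```

With real tensors the same isinstance check on tuple should apply, since the documented return type distinguishes Tensor from Tuple[Tensor, Tensor].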