rllm.data.table_data.TableData
- class rllm.data.table_data.TableData(df: DataFrame, col_types: Dict[str, ColType], name: str | None = None, table_type: TableType | None = None, pkey: str | None = None, fkeys: Sequence[str] | None = None, time_col: str | None = None, lazy_feature: bool = False, feat_dict: Dict[ColType, Tensor] | None = None, metadata: Dict[ColType, List[dict[str, Any]]] | None = None, target_col: str | None = None, y: Tensor | None = None, text_embedder_config: TextEmbedderConfig | None = None, tokenizer_config: TokenizerConfig | None = None, convert_text_coltypes: Set[ColType] | None = None, fillna_config: FillNAConfig | None = None, timestamp_format: str | None = None, timestamp_fields: Sequence[str] | None = None, **kwargs)
Bases: BaseTable
A base class for creating single table data.
TableData is designed with lazy feature generation in mind. Call lazy_materialize to materialize feat_dict and metadata.
TableData always sets pkey as df.index to simplify processing. If pkey is None, cls.NONEPKEY is used instead.
TableData uses BaseStorage to store ordinary properties such as df and pkey. Extra private properties should be listed in cls.PRIVATE_PROPERTIES.
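The lazy pattern described above can be sketched with the standard library alone. This is a minimal illustration, not rllm code: LazyTable, its row-based storage, and the NONEPKEY value are hypothetical stand-ins, and only the names pkey, lazy_feature, feat_dict, and lazy_materialize come from the documentation.

```python
from typing import Any, Dict, List, Optional


class LazyTable:
    """Hypothetical stand-in that mimics the lazy pattern; not rllm code."""

    NONEPKEY = "__none_pkey__"  # assumed placeholder value, for illustration only

    def __init__(
        self,
        rows: List[Dict[str, Any]],
        pkey: Optional[str] = None,
        lazy_feature: bool = False,
    ) -> None:
        self.rows = rows
        # Fall back to a sentinel name when no primary key is given.
        self.pkey = pkey if pkey is not None else self.NONEPKEY
        self.feat_dict: Optional[Dict[str, List[Any]]] = None
        if not lazy_feature:
            # Eager mode: build the features at construction time.
            self.lazy_materialize()

    def lazy_materialize(self) -> None:
        """Build the column-wise feature store; repeated calls do nothing."""
        if self.feat_dict is not None:
            return
        cols = [c for c in self.rows[0] if c != self.pkey] if self.rows else []
        self.feat_dict = {c: [r[c] for r in self.rows] for c in cols}


eager = LazyTable([{"id": 1, "age": 30}, {"id": 2, "age": 41}], pkey="id")
lazy = LazyTable([{"id": 1, "age": 30}], pkey="id", lazy_feature=True)
```

In this sketch, eager.feat_dict is populated as soon as the object is built, while lazy.feat_dict stays None until lazy.lazy_materialize() is called.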
- Parameters:
df (DataFrame) – The tabular data frame containing the dataset.
col_types (Dict[str, ColType]) – A dictionary mapping each column in the data frame to a semantic type (e.g., CATEGORICAL, NUMERICAL).
name (str, optional) – The name of the table. If None, table_ + uuid4 is used instead. (default: None)
table_type (TableType, optional) – The type of the table. (default: None, in which case it will be inferred)
pkey (str, optional) – The column name used as the primary key for the table. (default: None, in which case df.index.name is used)
fkey_to_table_map (Dict[str, str], optional) – A dictionary mapping foreign keys to the tables they reference. (default: None, in which case it will be inferred)
time_col (str, optional) – The timestamp column name for time-aware tables.
lazy_feature (bool, optional) – Whether to generate features lazily. If set to True, features are generated only when the lazy_materialize method is called. (default: False, for compatibility)
feat_dict (Dict[ColType, Tensor], optional) – A dictionary storing tensors for each column type. (default: None, in which case it will be generated)
metadata (Dict[ColType, List[dict[str, Any]]], optional) – Metadata for each column type, specifying the statistics and properties of the columns. (default: None)
target_col (str, optional) – The column name used as the target for prediction tasks. (default: None)
y (Tensor, optional) – A tensor containing the target values. (default: None, in which case it will be generated)
convert_text_coltypes (Set[ColType], optional) – Specifies which column types to convert automatically to TEXT type for tokenization. This is useful for models like TransTab that require text processing for certain feature types. When provided, columns of the specified types (excluding target_col) are converted to TEXT type to enable tokenization-based processing.
**kwargs – Additional key-value attributes to set as instance variables.
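The effect of convert_text_coltypes on a col_types mapping can be pictured as below. This is a sketch under stated assumptions: the ColType enum here is a local stand-in with only three members, and convert_to_text is a hypothetical helper, not a function of the library.

```python
from enum import Enum
from typing import Dict, Optional, Set


class ColType(Enum):
    """Local stand-in for the library's column-type enum (assumption)."""

    CATEGORICAL = "categorical"
    NUMERICAL = "numerical"
    TEXT = "text"


def convert_to_text(
    col_types: Dict[str, ColType],
    convert: Set[ColType],
    target_col: Optional[str] = None,
) -> Dict[str, ColType]:
    """Return a copy in which columns of the chosen types become TEXT.

    The target column keeps its original type, as the parameter
    description states.
    """
    return {
        col: ColType.TEXT if ctype in convert and col != target_col else ctype
        for col, ctype in col_types.items()
    }


col_types = {
    "city": ColType.CATEGORICAL,
    "age": ColType.NUMERICAL,
    "label": ColType.CATEGORICAL,
}
converted = convert_to_text(col_types, {ColType.CATEGORICAL}, target_col="label")
```

Here only city changes type: age is not in the requested set, and label is protected because it is the target column.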
- property cols: List[str]
The columns of the table data, including index and target columns.
- count_categorical_features() → dict[str, int]
Return the categorical features and the number of unique values in each.
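The shape of the returned mapping can be illustrated with plain Python. The helper count_unique and its column-dictionary input are hypothetical; how the library itself treats missing values is not stated here, so this sketch simply counts distinct values.

```python
from typing import Any, Dict, List


def count_unique(
    columns: Dict[str, List[Any]], categorical: List[str]
) -> Dict[str, int]:
    """Map each categorical column name to its number of distinct values."""
    return {name: len(set(columns[name])) for name in categorical}


columns = {
    "city": ["Paris", "Rome", "Paris", "Oslo"],
    "plan": ["free", "pro", "free", "free"],
    "age": [30, 41, 30, 25],
}
counts = count_unique(columns, categorical=["city", "plan"])
```

With this input, counts maps city to 3 and plan to 2; age is left out because it is not listed as categorical.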
- property feat_cols: List[str]
The input feature columns of the dataset.
- property fkeys: List[str]
The foreign keys of the table.
- property index_col: str | None
The name of the index column. TableData always uses pkey as df.index.name.
- infer_table_type() → TableType
Infer the table type. The table is inferred as a data table unless it has no primary key and multiple foreign keys. This function may not be accurate, so check the result.
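The stated rule can be written as a small function. This is only a reading of the sentence above, not the library's implementation: TableKind and its member names are hypothetical (the documentation does not name the non-data table type), and "multiple" is taken to mean more than one.

```python
from enum import Enum
from typing import Optional, Sequence


class TableKind(Enum):
    """Hypothetical stand-in for the library's table-type enum."""

    DATA = "data"
    RELATIONSHIP = "relationship"  # assumed name for the non-data case


def infer_kind(pkey: Optional[str], fkeys: Sequence[str]) -> TableKind:
    """Default to a data table unless there is no pkey and several fkeys."""
    if pkey is None and len(fkeys) > 1:
        return TableKind.RELATIONSHIP
    return TableKind.DATA


users = infer_kind(pkey="user_id", fkeys=[])
ratings = infer_kind(pkey=None, fkeys=["user_id", "movie_id"])
```

A table with no primary key and a single foreign key still falls back to the data-table case under this reading, which is one reason the result is worth checking by hand.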
- lazy_materialize(keep_df: bool = True, text_embedder_config: TextEmbedderConfig | None = None, tokenizer_config: TokenizerConfig | None = None, fillna_config: FillNAConfig | None = None, timestamp_format: str | None = None, timestamp_fields: Sequence[str] | None = None)
Materialize the feat_dict and metadata.
- Parameters:
keep_df (bool, optional) – Whether to keep the raw dataframe. (default: True)
text_embedder_config (TextEmbedderConfig, optional) – Config for text embedding. (default: None)
tokenizer_config (TokenizerConfig, optional) – Config for tokenization. (default: None)
fillna_config (FillNAConfig, optional) – Strategy for filling missing values. (default: None)
timestamp_format (str, optional) – Format string for parsing TIMESTAMP columns. (default: None)
timestamp_fields (Sequence[str], optional) – Time components to extract from TIMESTAMP columns. (default: None)
- property num_cols
The number of feature columns used.
- property num_rows
The number of rows in the dataset.
- property task_type: TaskType
The task type of the dataset.