encode_csv
CLI module for encoding CSV data files.
Functions:
- encode_batch: Encode a batch of data.
- load_encoders_from_config: Load the encoders from the data config.
- main: Encode the data according to the configuration.
encode_batch
Encode a batch of data.
This function applies configured encoders to specified columns within a batch. Each encoder's batch_encode
method is called to transform the column data.
Parameters:
- batch (LazyBatch): The input batch of data (a Hugging Face LazyBatch).
- encoders_config (dict[str, Any]): A dictionary where keys are column names and values are encoder objects to be applied to that column.
Returns:
- dict[str, list]: A dictionary representing the encoded batch, with all original columns present and encoded columns updated according to the encoders.
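The behaviour described above can be illustrated with a minimal sketch. It is not the code in src/stimulus/cli/encode_csv.py: a plain dict stands in for the LazyBatch, `UppercaseEncoder` and `encode_batch_sketch` are hypothetical names, and skipping columns that are missing from the batch is an assumption. Only the `batch_encode` method name and the column-to-encoder mapping come from the description.

```python
from typing import Any


class UppercaseEncoder:
    """Hypothetical toy encoder; only the batch_encode method name comes from the docs."""

    def batch_encode(self, values: list[str]) -> list[str]:
        return [value.upper() for value in values]


def encode_batch_sketch(
    batch: dict[str, list],
    encoders_config: dict[str, Any],
) -> dict[str, list]:
    """Apply each configured encoder to its column and keep all other columns."""
    # Copy every column first so columns without an encoder pass through unchanged.
    result = {name: list(values) for name, values in batch.items()}
    for column, encoder in encoders_config.items():
        if column not in batch:
            # Assumption: a configured column absent from the batch is skipped.
            continue
        result[column] = encoder.batch_encode(batch[column])
    return result


# A plain dict stands in for the Hugging Face LazyBatch here.
batch = {"seq": ["acgt", "ttga"], "label": [0, 1]}
encoded = encode_batch_sketch(batch, {"seq": UppercaseEncoder()})
print(encoded)
```

In this sketch the `label` column is returned untouched because no encoder is configured for it, which mirrors the "all original columns present" wording of the return value.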
Source code in src/stimulus/cli/encode_csv.py
load_encoders_from_config
Load the encoders from the data config.
Parameters:
- data_config_path (str): Path to the data config file.
Returns:
Source code in src/stimulus/cli/encode_csv.py
main
Encode the data according to the configuration.
Parameters:
- data_path (str): Path to input data (CSV, parquet, or HuggingFace dataset directory).
- config_yaml (str): Path to config YAML file.
- out_path (str): Path to output encoded dataset directory.
- num_proc (Optional[int], default: None): Number of processes to use for encoding.
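Since data_path may point to a CSV file, a parquet file, or a dataset directory, some form of dispatch on the input type is implied. The sketch below shows one way this could be done; `detect_input_kind` is a hypothetical helper, and deciding by file suffix and directory check is an assumption, not necessarily how main handles it.

```python
from pathlib import Path


def detect_input_kind(data_path: str) -> str:
    """Classify an input path as 'csv', 'parquet', or 'dataset_dir' (hypothetical helper)."""
    path = Path(data_path)
    # Assumption: any existing directory is treated as a saved dataset directory.
    if path.is_dir():
        return "dataset_dir"
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return "csv"
    if suffix == ".parquet":
        return "parquet"
    raise ValueError(f"Unsupported input format: {data_path}")


print(detect_input_kind("train.csv"))
print(detect_input_kind("train.parquet"))
```

Raising on an unrecognised suffix keeps the failure close to the point where the path was supplied, rather than surfacing later during encoding.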
Source code in src/stimulus/cli/encode_csv.py