encoders

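This file contains encoder classes for encoding various types of data.

Classes:

  • AbstractEncoder

    Abstract class for encoders.

  • NumericEncoder

    Encoder for float/int data.

  • NumericRankEncoder

    Encoder for float/int data that encodes the data based on their rank.

  • StrClassificationEncoder

    A string classification encoder that converts lists of strings into numeric labels using scikit-learn.

  • TextOneHotEncoder

    One hot encoder for text data.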

AbstractEncoder

Bases: ABC

Abstract class for encoders.

Encoders convert raw data into torch.Tensor objects. Each encoder implements its own encoding method and may accept a different type of input data.

Methods:

  • encode

    encodes a single data point

  • encode_all

    encodes a list of data points into a torch.tensor

  • encode_multiprocess

    encodes a list of data points using multiprocessing (not declared on this base class; see TextOneHotEncoder.encode_multiprocess)

  • decode

    decodes a single data point

Methods:

  • decode

    Decode a single data point.

  • encode

    Encode a single data point.

  • encode_all

    Encode a list of data points.
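
The contract described above can be illustrated with a small, dependency-free sketch. EncoderInterface is a stand-in that mirrors the three abstract methods of AbstractEncoder so the snippet runs without stimulus or torch installed; UppercaseEncoder and can_instantiate are hypothetical names introduced only for this illustration, and encode_all returns a list here rather than a torch.Tensor.

```python
from abc import ABC, abstractmethod
from typing import Any


class EncoderInterface(ABC):
    """Stand-in with the same three abstract methods as AbstractEncoder."""

    @abstractmethod
    def encode(self, data: Any) -> Any: ...

    @abstractmethod
    def encode_all(self, data: list[Any]) -> Any: ...

    @abstractmethod
    def decode(self, data: Any) -> Any: ...


class UppercaseEncoder(EncoderInterface):
    """Hypothetical toy encoder: 'encodes' a string by upper-casing it."""

    def encode(self, data: str) -> str:
        return data.upper()

    def encode_all(self, data: list[str]) -> list[str]:
        # The real interface returns a torch.Tensor; a list keeps this sketch dependency-free.
        return [self.encode(item) for item in data]

    def decode(self, data: str) -> str:
        return data.lower()


def can_instantiate(cls: type) -> bool:
    """Return True if cls can be constructed, False if abstract methods are still missing."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

A subclass becomes instantiable only once all three methods are overridden; the base class itself cannot be constructed.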

decode abstractmethod

decode(data: Any) -> Any

Decode a single data point.

This is an abstract method; child classes must override it.

Parameters:

  • data (Any) –

    a single encoded data point

Returns:

  • decoded_data_point ( Any ) –

    the decoded data point

Source code in src/stimulus/data/encoding/encoders.py
@abstractmethod
def decode(self, data: Any) -> Any:
    """Decode a single data point.

    This is an abstract method; child classes must override it.

    Args:
        data (Any): a single encoded data point

    Returns:
        decoded_data_point (Any): the decoded data point
    """
    raise NotImplementedError

encode abstractmethod

encode(data: Any) -> Any

Encode a single data point.

This is an abstract method; child classes must override it.

Parameters:

  • data (Any) –

    a single data point

Returns:

  • encoded_data_point ( Any ) –

    the encoded data point

Source code in src/stimulus/data/encoding/encoders.py
@abstractmethod
def encode(self, data: Any) -> Any:
    """Encode a single data point.

    This is an abstract method; child classes must override it.

    Args:
        data (Any): a single data point

    Returns:
        encoded_data_point (Any): the encoded data point
    """
    raise NotImplementedError

encode_all abstractmethod

encode_all(data: list[Any]) -> Tensor

Encode a list of data points.

This is an abstract method; child classes must override it.

Parameters:

  • data (list[Any]) –

    a list of data points

Returns:

  • encoded_data ( Tensor ) –

    encoded data points

Source code in src/stimulus/data/encoding/encoders.py
@abstractmethod
def encode_all(self, data: list[Any]) -> torch.Tensor:
    """Encode a list of data points.

    This is an abstract method; child classes must override it.

    Args:
        data (list[Any]): a list of data points

    Returns:
        encoded_data (torch.Tensor): encoded data points
    """
    raise NotImplementedError

NumericEncoder

NumericEncoder(dtype: dtype = float32)

Bases: AbstractEncoder

Encoder for float/int data.

Attributes:

  • dtype (dtype) –

    The data type of the encoded data. Default = torch.float32 (32-bit floating point)

Parameters:

  • dtype (dtype, default: float32 ) –

    the data type of the encoded data. Default = torch.float (32-bit floating point)

Methods:

  • decode

    Decodes the data.

  • encode

    Encodes the data.

  • encode_all

    Encodes the data.

Source code in src/stimulus/data/encoding/encoders.py
def __init__(self, dtype: torch.dtype = torch.float32) -> None:
    """Initialize the NumericEncoder class.

    Args:
        dtype (torch.dtype): the data type of the encoded data. Default = torch.float (32-bit floating point)
    """
    self.dtype = dtype

decode

decode(data: Tensor) -> list[float]

Decodes the data.

Parameters:

  • data (Tensor) –

    the encoded data

Returns:

  • decoded_data ( list[float] ) –

    the decoded data

Source code in src/stimulus/data/encoding/encoders.py
def decode(self, data: torch.Tensor) -> list[float]:
    """Decodes the data.

    Args:
        data (torch.Tensor): the encoded data

    Returns:
        decoded_data (list[float]): the decoded data
    """
    return data.cpu().numpy().tolist()

encode

encode(data: float) -> Tensor

Encodes the data.

This method takes a single data point and returns it as a one-element tensor.

Parameters:

  • data (float) –

    a single data point

Returns:

  • encoded_data_point ( Tensor ) –

    the encoded data point

Source code in src/stimulus/data/encoding/encoders.py
def encode(self, data: float) -> torch.Tensor:
    """Encodes the data.

    This method takes a single data point and returns it as a one-element tensor.

    Args:
        data (float): a single data point

    Returns:
        encoded_data_point (torch.Tensor): the encoded data point
    """
    return self.encode_all([data])

encode_all

encode_all(data: list[float]) -> Tensor

Encodes the data.

This method takes as input a list of data points, or a single float, and returns a torch.tensor.

Parameters:

  • data (list[float]) –

    a list of data points or a single data point

Returns:

  • encoded_data ( Tensor ) –

    the encoded data

Source code in src/stimulus/data/encoding/encoders.py
def encode_all(self, data: list[float]) -> torch.Tensor:
    """Encodes the data.

    This method takes as input a list of data points, or a single float, and returns a torch.tensor.

    Args:
        data (list[float]): a list of data points or a single data point

    Returns:
        encoded_data (torch.Tensor): the encoded data
    """
    if not isinstance(data, list):
        data = [data]

    self._check_input_dtype(data)
    self._warn_float_is_converted_to_int(data)

    return torch.tensor(data, dtype=self.dtype)
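
The first steps of encode_all (wrapping a scalar in a list, then validating the items) can be sketched in plain Python. The bodies of _check_input_dtype and _warn_float_is_converted_to_int are not shown on this page, so normalize_numeric_input below is a hypothetical stand-in, and the choice of ValueError is an assumption rather than the library's documented behaviour.

```python
from typing import Union

Number = Union[int, float]


def normalize_numeric_input(data: Union[Number, list[Number]]) -> list[Number]:
    """Wrap a scalar in a list and reject non-numeric items (hypothetical helper)."""
    if not isinstance(data, list):
        data = [data]
    for item in data:
        if not isinstance(item, (int, float)):
            # Assumption: the real check may raise a different exception type.
            raise ValueError(f"Expected int or float, got {type(item).__name__}")
    return data
```

After this normalization, the source simply hands the list to torch.tensor with the configured dtype.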

NumericRankEncoder

NumericRankEncoder(*, scale: bool = False)

Bases: AbstractEncoder

Encoder for float/int data that encodes the data based on their rank.

Attributes:

  • scale (bool) –

    whether to scale the ranks to be between 0 and 1. Default = False

Methods:

  • encode

    raises NotImplementedError, since a single value cannot be ranked

  • encode_all

    encodes a list of data points into a torch.tensor of ranks

  • decode

    raises NotImplementedError, since decoding is not yet supported

  • _check_input_dtype

    checks if the input data is int or float data

Parameters:

  • scale (bool, default: False ) –

    whether to scale the ranks to be between 0 and 1. Default = False

Methods:

  • decode

    Raises NotImplementedError, since decoding requires encoder information, which is not yet supported.

  • encode

    Raises NotImplementedError, since encoding a single float does not make sense.

  • encode_all

    Encodes the data.

Source code in src/stimulus/data/encoding/encoders.py
def __init__(self, *, scale: bool = False) -> None:
    """Initialize the NumericRankEncoder class.

    Args:
        scale (bool): whether to scale the ranks to be between 0 and 1. Default = False
    """
    self.scale = scale

decode

decode(data: Any) -> Any

Raises NotImplementedError, since decoding requires encoder information, which is not yet supported.

Source code in src/stimulus/data/encoding/encoders.py
def decode(self, data: Any) -> Any:
    """Returns an error since decoding does not make sense without encoder information, which is not yet supported."""
    raise NotImplementedError("Decoding is not yet supported for NumericRank.")

encode

encode(data: Any) -> Tensor

Raises NotImplementedError, since encoding a single float does not make sense. Use encode_all instead.

Source code in src/stimulus/data/encoding/encoders.py
def encode(self, data: Any) -> torch.Tensor:
    """Returns an error since encoding a single float does not make sense."""
    raise NotImplementedError("Encoding a single float does not make sense. Use encode_all instead.")

encode_all

encode_all(data: list[Union[int, float]]) -> Tensor

Encodes the data.

This method takes as input a list of data points, and returns the ranks of the data points. The ranks are normalized to be between 0 and 1, when scale is set to True.

Parameters:

  • data (list[Union[int, float]]) –

    a list of numeric values

Returns:

  • encoded_data ( Tensor ) –

    the encoded data

Source code in src/stimulus/data/encoding/encoders.py
def encode_all(self, data: list[Union[int, float]]) -> torch.Tensor:
    """Encodes the data.

    This method takes as input a list of data points, and returns the ranks of the data points.
    The ranks are normalized to be between 0 and 1, when scale is set to True.

    Args:
        data (list[Union[int, float]]): a list of numeric values

    Returns:
        encoded_data (torch.Tensor): the encoded data
    """
    if not isinstance(data, list):
        data = [data]
    self._check_input_dtype(data)

    # Get ranks (0 is lowest, n-1 is highest);
    # normalize them to be between 0 and 1 only when scale is True
    array_data: np.ndarray = np.array(data)
    ranks: np.ndarray = np.argsort(np.argsort(array_data))
    if self.scale:
        ranks = ranks / max(len(ranks) - 1, 1)
    return torch.tensor(ranks)
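
The double-argsort trick used above can be hard to read at first glance. The following is a plain-Python sketch of the same idea, not the library's implementation; rank_encode is a hypothetical name, and ties are broken by input order here, which may differ from np.argsort's default sort.

```python
def rank_encode(values: list[float], scale: bool = False) -> list[float]:
    """Return the rank of each value (0 = lowest), optionally scaled to [0, 1]."""
    # First argsort: the indices that would sort the values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Second argsort: invert that permutation, giving each element its rank.
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    if scale:
        # Same guard as the source: avoid dividing by zero for a single element.
        denominator = max(len(ranks) - 1, 1)
        return [r / denominator for r in ranks]
    return ranks
```

For example, rank_encode([10, 30, 20]) gives [0, 2, 1], and with scale=True the same input gives [0.0, 1.0, 0.5].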

StrClassificationEncoder

StrClassificationEncoder(*, scale: bool = False)

Bases: AbstractEncoder

A string classification encoder that converts lists of strings into numeric labels using scikit-learn.

When scale is set to True, the labels are scaled to be between 0 and 1.

Attributes:

  • scale (bool) –

    Whether to scale the labels to be between 0 and 1. Default = False

Methods:

  • encode

    (str) -> int: Raises a NotImplementedError, as encoding a single string is not meaningful in this context.

  • encode_all

    (list[str]) -> torch.Tensor: Encodes an entire list of string data into a numeric representation using LabelEncoder and returns a torch tensor. Ensures that the provided data items are valid strings prior to encoding.

  • decode

    (Any) -> Any: Raises a NotImplementedError, as decoding is not supported with the current design.

  • _check_dtype

    (list[str]) -> None: Validates that all items in the data list are strings, raising a ValueError otherwise.

Parameters:

  • scale (bool, default: False ) –

    whether to scale the labels to be between 0 and 1. Default = False

Methods:

  • decode

    Raises NotImplementedError, since decoding requires encoder information, which is not yet supported.

  • encode

    Raises NotImplementedError, since encoding a single string does not make sense.

  • encode_all

    Encodes the data.

Source code in src/stimulus/data/encoding/encoders.py
def __init__(self, *, scale: bool = False) -> None:
    """Initialize the StrClassificationEncoder class.

    Args:
        scale (bool): whether to scale the labels to be between 0 and 1. Default = False
    """
    self.scale = scale

decode

decode(data: Any) -> Any

Raises NotImplementedError, since decoding requires encoder information, which is not yet supported.

Source code in src/stimulus/data/encoding/encoders.py
def decode(self, data: Any) -> Any:
    """Returns an error since decoding does not make sense without encoder information, which is not yet supported."""
    raise NotImplementedError("Decoding is not yet supported for StrClassification.")

encode

encode(data: str) -> int

Raises NotImplementedError, since encoding a single string does not make sense. Use encode_all instead.

Parameters:

  • data (str) –

    a single string

Source code in src/stimulus/data/encoding/encoders.py
def encode(self, data: str) -> int:
    """Returns an error since encoding a single string does not make sense.

    Args:
        data (str): a single string
    """
    raise NotImplementedError("Encoding a single string does not make sense. Use encode_all instead.")

encode_all

encode_all(data: Union[str, list[str]]) -> Tensor

Encodes the data.

This method takes a list of strings (a single string is wrapped in a list), encodes the labels with LabelEncoder from scikit-learn, and returns the result as a torch tensor. For more info visit: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

Parameters:

  • data (Union[str, list[str]]) –

    a list of strings or single string

Returns:

  • encoded_data ( Tensor ) –

    the encoded data

Source code in src/stimulus/data/encoding/encoders.py
def encode_all(self, data: Union[str, list[str]]) -> torch.Tensor:
    """Encodes the data.

    This method takes a list of strings (a single string is wrapped in a list), encodes the labels
    with LabelEncoder from scikit-learn, and returns the result as a torch tensor.
    For more info visit: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

    Args:
        data (Union[str, list[str]]): a list of strings or single string

    Returns:
        encoded_data (torch.Tensor): the encoded data
    """
    if not isinstance(data, list):
        data = [data]

    self._check_dtype(data)

    encoder = preprocessing.LabelEncoder()
    encoded_data = torch.tensor(encoder.fit_transform(data))
    if self.scale:
        encoded_data = encoded_data / max(len(encoded_data) - 1, 1)

    return encoded_data
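
To make the label mapping and the scaling step concrete, here is a plain-Python sketch that mimics what the code above does: classes are sorted, as LabelEncoder does, and each label is replaced by the index of its class. label_encode is a hypothetical name and this is an illustration, not the library's implementation.

```python
def label_encode(labels: list[str], scale: bool = False) -> list[float]:
    """Map each label to the index of its class in sorted order."""
    classes = sorted(set(labels))
    index_of = {label: i for i, label in enumerate(classes)}
    encoded = [index_of[label] for label in labels]
    if scale:
        # Mirrors the source: the divisor is the number of samples minus one,
        # not the number of distinct classes minus one.
        denominator = max(len(encoded) - 1, 1)
        return [value / denominator for value in encoded]
    return encoded
```

Because the divisor depends on the number of samples, the largest scaled label reaches 1 only when every label is distinct. With ["b", "a", "c", "a"] the unscaled labels are [1, 0, 2, 0] and the largest scaled value is 2/3.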

TextOneHotEncoder

TextOneHotEncoder(
    alphabet: str = "acgt",
    *,
    convert_lowercase: bool = False,
    padding: bool = False
)

Bases: AbstractEncoder

One hot encoder for text data.

NOTE: Encoding is based on the given alphabet. If a character c is not in the alphabet, c will be represented by a vector of zeros.

Attributes:

  • alphabet (str) –

    the alphabet to one hot encode the data with.

  • convert_lowercase (bool) –

    whether to convert the sequence and alphabet to lowercase. Default is False.

  • padding (bool) –

    whether to pad the sequences with zeros. Default is False.

  • encoder (OneHotEncoder) –

    preprocessing.OneHotEncoder object initialized with self.alphabet

Methods:

  • encode

    encodes a single data point

  • encode_all

    encodes a list of data points into a torch.tensor

  • encode_multiprocess

    encodes a list of data points using multiprocessing

  • decode

    decodes a single data point

  • _sequence_to_array

    transforms a sequence into a numpy array

Parameters:

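  • alphabet (str, default: 'acgt' ) –

    the alphabet to one hot encode the data with.

  • convert_lowercase (bool, default: False ) –

    whether to convert the sequence and alphabet to lowercase. Default is False.

  • padding (bool, default: False ) –

    whether to pad the sequences with zeros. Default is False.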

Raises:

  • TypeError

    If the input alphabet is not a string.

Methods:

  • decode

    Decodes one-hot encoded tensor back to sequences.

  • encode

    One hot encodes a single sequence.

  • encode_all

    Encodes a list of sequences.

  • encode_multiprocess

    Encodes a list of sequences using multiprocessing.

Source code in src/stimulus/data/encoding/encoders.py
def __init__(self, alphabet: str = "acgt", *, convert_lowercase: bool = False, padding: bool = False) -> None:
    """Initialize the TextOneHotEncoder class.

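    Args:
        alphabet (str): the alphabet to one hot encode the data with.
        convert_lowercase (bool): whether to convert the sequence and alphabet to lowercase. Default is False.
        padding (bool): whether to pad the sequences with zeros. Default is False.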

    Raises:
        TypeError: If the input alphabet is not a string.
    """
    if not isinstance(alphabet, str):
        error_msg = f"Expected a string input for alphabet, got {type(alphabet).__name__}"
        logger.error(error_msg)
        raise TypeError(error_msg)

    if convert_lowercase:
        alphabet = alphabet.lower()

    self.alphabet = alphabet
    self.convert_lowercase = convert_lowercase
    self.padding = padding

    self.encoder = preprocessing.OneHotEncoder(
        categories=[list(alphabet)],
        handle_unknown="ignore",
    )  # handle_unknown='ignore' ensures that a vector of zeros is returned for unknown characters, such as 'N's in DNA sequences
    self.encoder.fit(np.array(list(alphabet)).reshape(-1, 1))

decode

decode(data: Tensor) -> Union[str, list[str]]

Decodes one-hot encoded tensor back to sequences.

Parameters:

  • data (Tensor) –

    2D or 3D tensor of one-hot encoded sequences

      - 2D shape: (sequence_length, alphabet_size)
      - 3D shape: (batch_size, sequence_length, alphabet_size)

NOTE: decoding a 3D tensor assumes all sequences have the same length.

Returns:

  • Union[str, list[str]]

    Single sequence string (2D input) or list of sequence strings (3D input)

Raises:

  • ValueError

    If the input data is not a 2D or 3D tensor

Source code in src/stimulus/data/encoding/encoders.py
def decode(self, data: torch.Tensor) -> Union[str, list[str]]:
    """Decodes one-hot encoded tensor back to sequences.

    Args:
        data (torch.Tensor): 2D or 3D tensor of one-hot encoded sequences
            - 2D shape: (sequence_length, alphabet_size)
            - 3D shape: (batch_size, sequence_length, alphabet_size)

    NOTE: decoding a 3D tensor assumes all sequences have the same length.

    Returns:
        Union[str, list[str]]: Single sequence string or list of sequence strings

    Raises:
        ValueError: If the input data is not a 2D or 3D tensor
    """
    expected_2d_tensor = 2
    expected_3d_tensor = 3

    if data.dim() == expected_2d_tensor:
        # Single sequence
        data_np = data.numpy().reshape(-1, len(self.alphabet))
        decoded = self.encoder.inverse_transform(data_np).flatten()
        return "".join([i for i in decoded if i is not None])

    if data.dim() == expected_3d_tensor:
        # Multiple sequences
        batch_size, seq_len, _ = data.shape
        data_np = data.reshape(-1, len(self.alphabet)).numpy()
        decoded = self.encoder.inverse_transform(data_np)
        sequences = decoded.reshape(batch_size, seq_len)
        # Convert to masked array where None values are masked
        masked_sequences = np.ma.masked_equal(sequences, None)
        # Fill masked values with "-"
        filled_sequences = masked_sequences.filled("-")
        return ["".join(seq) for seq in filled_sequences]

    raise ValueError(f"Expected 2D or 3D tensor, got {data.dim()}D")
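
The row-by-row decoding logic can be sketched without scikit-learn or torch. decode_one_hot below is a hypothetical helper operating on nested lists; it assumes each row is either all zeros or contains a single 1, and it lets the caller choose what an all-zero row becomes.

```python
def decode_one_hot(rows: list[list[int]], alphabet: str = "acgt", unknown: str = "-") -> str:
    """Turn one-hot rows back into a string; all-zero rows become `unknown`."""
    chars = []
    for row in rows:
        if 1 in row:
            # The position of the 1 selects the character in the alphabet.
            chars.append(alphabet[row.index(1)])
        else:
            # An all-zero row carries no information about the original character.
            chars.append(unknown)
    return "".join(chars)
```

In the source above, the two branches treat all-zero rows differently: the 2D branch drops them from the output string (comparable to unknown="" here), while the 3D branch replaces them with "-".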

encode

encode(data: str) -> Tensor

One hot encodes a single sequence.

Takes a single string sequence and returns a torch tensor of shape (sequence_length, alphabet_length). The returned tensor corresponds to the one hot encoding of the sequence. Unknown characters are represented by a vector of zeros.

Parameters:

  • data (str) –

    single sequence

Returns:

  • encoded_data_point ( Tensor ) –

    one hot encoded sequence

Raises:

  • TypeError

    If the input data is not a string.

Examples:

>>> encoder = TextOneHotEncoder(alphabet="acgt")
>>> encoder.encode("acgt")
tensor([[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1]])
>>> encoder.encode("acgtn")
tensor([[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 0]])
>>> encoder = TextOneHotEncoder(alphabet="ACgt")
>>> encoder.encode("acgt")
tensor([[0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1]])
>>> encoder.encode("ACgt")
tensor([[1, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 1]])

Source code in src/stimulus/data/encoding/encoders.py
def encode(self, data: str) -> torch.Tensor:
    """One hot encodes a single sequence.

    Takes a single string sequence and returns a torch tensor of shape (sequence_length, alphabet_length).
    The returned tensor corresponds to the one hot encoding of the sequence.
    Unknown characters are represented by a vector of zeros.

    Args:
        data (str): single sequence

    Returns:
        encoded_data_point (torch.Tensor): one hot encoded sequence

    Raises:
        TypeError: If the input data is not a string.

    Examples:
        >>> encoder = TextOneHotEncoder(alphabet="acgt")
        >>> encoder.encode("acgt")
        tensor([[1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1]])
        >>> encoder.encode("acgtn")
        tensor([[1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1],
                [0, 0, 0, 0]])

        >>> encoder = TextOneHotEncoder(alphabet="ACgt")
        >>> encoder.encode("acgt")
        tensor([[0, 0, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1]])
        >>> encoder.encode("ACgt")
        tensor([[1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1]])
    """
    sequence_array = self._sequence_to_array(data)
    transformed = self.encoder.transform(sequence_array)
    numpy_array = np.squeeze(np.stack(transformed.toarray()))
    return torch.from_numpy(numpy_array)
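
The encoding rule (one column per alphabet character, a row of zeros for anything outside the alphabet) can be written out in plain Python. This is a sketch of the rule only, not the scikit-learn-based implementation above; one_hot is a hypothetical name, and it returns nested lists instead of a tensor.

```python
def one_hot(sequence: str, alphabet: str = "acgt") -> list[list[int]]:
    """One-hot encode a sequence; unknown characters map to a row of zeros."""
    column = {char: i for i, char in enumerate(alphabet)}
    rows = []
    for char in sequence:
        row = [0] * len(alphabet)
        if char in column:
            row[column[char]] = 1
        rows.append(row)
    return rows
```

The lookup is case-sensitive, which matches the "ACgt" example in the docstring: lowercase "a" and "c" are not in that alphabet and therefore encode to zeros.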

encode_all

encode_all(data: Union[str, list[str]]) -> Tensor

Encodes a list of sequences.

Takes a list of string sequences and returns a torch tensor of shape (number_of_sequences, sequence_length, alphabet_length). The returned tensor corresponds to the one hot encoding of the sequences. Unknown characters are represented by a vector of zeros.

Parameters:

  • data (Union[str, list[str]]) –

    list of sequences or a single sequence

Returns:

  • encoded_data ( Tensor ) –

    one hot encoded sequences

Raises:

  • TypeError

    If the input data is not a list or a string.

  • ValueError

    If all sequences do not have the same length when padding is False.

Examples:

>>> encoder = TextOneHotEncoder(alphabet="acgt", padding=True)
>>> encoder.encode_all(["acgt", "acgtn"])
tensor([[[1, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1],
         [0, 0, 0, 0]],  # "acgt" is padded with a row of zeros

        [[1, 0, 0, 0],
         [0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1],
         [0, 0, 0, 0]]])  # "n" is not in the alphabet

Source code in src/stimulus/data/encoding/encoders.py
def encode_all(self, data: Union[str, list[str]]) -> torch.Tensor:
    """Encodes a list of sequences.

    Takes a list of string sequences and returns a torch tensor of shape (number_of_sequences, sequence_length, alphabet_length).
    The returned tensor corresponds to the one hot encoding of the sequences.
    Unknown characters are represented by a vector of zeros.

    Args:
        data (Union[str, list[str]]): list of sequences or a single sequence

    Returns:
        encoded_data (torch.Tensor): one hot encoded sequences

    Raises:
        TypeError: If the input data is not a list or a string.
        ValueError: If all sequences do not have the same length when padding is False.

    Examples:
        >>> encoder = TextOneHotEncoder(alphabet="acgt", padding=True)
        >>> encoder.encode_all(["acgt", "acgtn"])
        tensor([[[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 1, 0],
                 [0, 0, 0, 1],
                 [0, 0, 0, 0]],  # "acgt" is padded with a row of zeros

                [[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 1, 0],
                 [0, 0, 0, 1],
                 [0, 0, 0, 0]]])  # "n" is not in the alphabet
    """
    encoded_data = None  # to prevent UnboundLocalError
    # encode data
    if isinstance(data, str):
        encoded_data = self.encode(data)
        return torch.stack([encoded_data])
    if isinstance(data, list):
        # TODO instead maybe we can run encode_multiprocess when data size is larger than a certain threshold.
        encoded_list = self.encode_multiprocess(data)
    else:
        error_msg = f"Expected list or string input for data, got {type(data).__name__}"
        logger.error(error_msg)
        raise TypeError(error_msg)

    # handle padding
    if self.padding:
        max_length = max([len(d) for d in encoded_list])
        encoded_data = torch.stack([F.pad(d, (0, 0, 0, max_length - len(d))) for d in encoded_list])
    else:
        lengths = {len(d) for d in encoded_list}
        if len(lengths) > 1:
            error_msg = "All sequences must have the same length when padding is False."
            logger.error(error_msg)
            raise ValueError(error_msg)
        encoded_data = torch.stack(encoded_list)

    if encoded_data is None:
        raise ValueError("Encoded data is None. This should not happen.")

    return encoded_data
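
The padding branch above can be pictured with nested lists. The sketch below is a plain-Python analogue, assuming each encoded sequence is a list of rows of width alphabet_size; pad_and_stack is a hypothetical name, and the real code uses F.pad and torch.stack instead.

```python
def pad_and_stack(
    encoded: list[list[list[int]]], alphabet_size: int, padding: bool
) -> list[list[list[int]]]:
    """Append zero rows so all sequences share one length, or fail if padding is off."""
    lengths = {len(seq) for seq in encoded}
    if not padding:
        if len(lengths) > 1:
            raise ValueError("All sequences must have the same length when padding is False.")
        return encoded
    max_length = max(lengths)
    # Zero rows are appended at the end of each shorter sequence.
    return [
        seq + [[0] * alphabet_size for _ in range(max_length - len(seq))]
        for seq in encoded
    ]
```

A padded row is indistinguishable from the encoding of an unknown character, since both are all zeros.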

encode_multiprocess

encode_multiprocess(data: list[str]) -> list[Tensor]

Encodes a list of sequences using multiprocessing.

Source code in src/stimulus/data/encoding/encoders.py
def encode_multiprocess(self, data: list[str]) -> list[torch.Tensor]:
    """Encodes a list of sequences using multiprocessing."""
    with mp.Pool() as pool:
        return pool.map(self.encode, data)