data_transformation_generators

This module contains generator classes that add noise to data or augment it.

AbstractAugmentationGenerator

AbstractAugmentationGenerator()

Bases: AbstractDataTransformer

Abstract class for augmentation generators.

Every augmentation function should set the seed itself, because running it under multiprocessing can leave the seed unset.

Source code in src/stimulus/data/transform/data_transformation_generators.py
def __init__(self) -> None:
    """Initialize the augmentation generator."""
    super().__init__()
    self.add_row = True

transform abstractmethod

transform(data: Any) -> Any

Transforms a single data point.

This is an abstract method that should be implemented by the child class.

Parameters:

  • data (Any) –

    the data to be transformed

Returns:

  • transformed_data ( Any ) –

    the transformed data

Source code in src/stimulus/data/transform/data_transformation_generators.py
@abstractmethod
def transform(self, data: Any) -> Any:
    """Transforms a single data point.

    This is an abstract method that should be implemented by the child class.

    Args:
        data (Any): the data to be transformed

    Returns:
        transformed_data (Any): the transformed data
    """
    #  np.random.seed(self.seed)
    raise NotImplementedError

transform_all abstractmethod

transform_all(data: list) -> list

Transforms a list of data points.

This is an abstract method that should be implemented by the child class.

Parameters:

  • data (list) –

    the data to be transformed

Returns:

  • transformed_data ( list ) –

    the transformed data

Source code in src/stimulus/data/transform/data_transformation_generators.py
@abstractmethod
def transform_all(self, data: list) -> list:
    """Transforms a list of data points.

    This is an abstract method that should be implemented by the child class.

    Args:
        data (list): the data to be transformed

    Returns:
        transformed_data (list): the transformed data
    """
    #  np.random.seed(self.seed)
    raise NotImplementedError

AbstractDataTransformer

AbstractDataTransformer()

Bases: ABC

Abstract class for data transformers.

Data transformers implement either in-place or augmentation transformations. The add_row attribute specifies which: it is a boolean (True for augmentation, False for in-place) set in the constructor of each child class.

Child classes should override the transform and transform_all methods.

transform_all should always return a list

Both methods should take an optional seed argument, None by default, to comply with stimulus' core principle of reproducibility. The method implementation should initialize the seed with np.random.seed(seed).

Attributes:

  • add_row (bool) –

    whether the transformer adds rows to the data
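
A minimal sketch of a concrete child class, assuming the base-class behaviour documented on this page. AbstractDataTransformer is re-declared here as a stand-in so that the snippet runs on its own, and UpperCase is a hypothetical transformer used only for illustration; it is not part of stimulus.

```python
from abc import ABC, abstractmethod
from typing import Any


class AbstractDataTransformer(ABC):
    """Stand-in mirroring the base class documented on this page."""

    def __init__(self) -> None:
        self.add_row: bool = False
        self.seed: int = 42

    @abstractmethod
    def transform(self, data: Any) -> Any:
        raise NotImplementedError

    @abstractmethod
    def transform_all(self, data: list) -> list:
        raise NotImplementedError


class UpperCase(AbstractDataTransformer):
    """Hypothetical in-place transformer that upper-cases strings."""

    def __init__(self) -> None:
        super().__init__()
        self.add_row = False  # in-place: no rows are added to the data

    def transform(self, data: str) -> str:
        return data.upper()

    def transform_all(self, data: list) -> list:
        # Always return a list, as the base-class contract asks.
        return [self.transform(item) for item in data]


transformer = UpperCase()
result = transformer.transform_all(["acgt", "ttga"])
```

An augmentation generator would follow the same shape but set add_row to True in its constructor.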

Source code in src/stimulus/data/transform/data_transformation_generators.py
def __init__(self) -> None:
    """Initialize the data transformer."""
    self.add_row: bool = False
    self.seed: int = 42

transform abstractmethod

transform(data: Any) -> Any

Transforms a single data point.

This is an abstract method that should be implemented by the child class.

Parameters:

  • data (Any) –

    the data to be transformed

Returns:

  • transformed_data ( Any ) –

    the transformed data

Source code in src/stimulus/data/transform/data_transformation_generators.py
@abstractmethod
def transform(self, data: Any) -> Any:
    """Transforms a single data point.

    This is an abstract method that should be implemented by the child class.

    Args:
        data (Any): the data to be transformed

    Returns:
        transformed_data (Any): the transformed data
    """
    #  np.random.seed(self.seed)
    raise NotImplementedError

transform_all abstractmethod

transform_all(data: list) -> list

Transforms a list of data points.

This is an abstract method that should be implemented by the child class.

Parameters:

  • data (list) –

    the data to be transformed

Returns:

  • transformed_data ( list ) –

    the transformed data

Source code in src/stimulus/data/transform/data_transformation_generators.py
@abstractmethod
def transform_all(self, data: list) -> list:
    """Transforms a list of data points.

    This is an abstract method that should be implemented by the child class.

    Args:
        data (list): the data to be transformed

    Returns:
        transformed_data (list): the transformed data
    """
    #  np.random.seed(self.seed)
    raise NotImplementedError

AbstractNoiseGenerator

AbstractNoiseGenerator()

Bases: AbstractDataTransformer

Abstract class for noise generators.

Every noise function should set the seed itself, because running it under multiprocessing can leave the seed unset.

Source code in src/stimulus/data/transform/data_transformation_generators.py
def __init__(self) -> None:
    """Initialize the noise generator."""
    super().__init__()
    self.add_row = False

transform abstractmethod

transform(data: Any) -> Any

Transforms a single data point.

This is an abstract method that should be implemented by the child class.

Parameters:

  • data (Any) –

    the data to be transformed

Returns:

  • transformed_data ( Any ) –

    the transformed data

Source code in src/stimulus/data/transform/data_transformation_generators.py
@abstractmethod
def transform(self, data: Any) -> Any:
    """Transforms a single data point.

    This is an abstract method that should be implemented by the child class.

    Args:
        data (Any): the data to be transformed

    Returns:
        transformed_data (Any): the transformed data
    """
    #  np.random.seed(self.seed)
    raise NotImplementedError

transform_all abstractmethod

transform_all(data: list) -> list

Transforms a list of data points.

This is an abstract method that should be implemented by the child class.

Parameters:

  • data (list) –

    the data to be transformed

Returns:

  • transformed_data ( list ) –

    the transformed data

Source code in src/stimulus/data/transform/data_transformation_generators.py
@abstractmethod
def transform_all(self, data: list) -> list:
    """Transforms a list of data points.

    This is an abstract method that should be implemented by the child class.

    Args:
        data (list): the data to be transformed

    Returns:
        transformed_data (list): the transformed data
    """
    #  np.random.seed(self.seed)
    raise NotImplementedError

GaussianChunk

GaussianChunk(
    chunk_size: int, seed: int = 42, std: float = 1
)

Bases: AbstractAugmentationGenerator

Subset data around a random midpoint.

This augmentation strategy chunks the input sequences, with the middle position of each chunk drawn from a Gaussian distribution.

Concretely, it moves the middle position (i.e. the peak summit) to another position. The new position is drawn from a Gaussian distribution, so positions close to the original midpoint are more likely to be chosen than the rest. A chunk of size chunk_size around the new midpoint is then returned. transform_all repeats this process for each sequence.
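
The steps above can be sketched as a standalone function. This is an illustration rather than the library code: gaussian_chunk is a hypothetical name, and the sketch draws the offset with the standard library's random.Random.gauss instead of np.random.normal, so the actual chunks will differ from those produced by GaussianChunk for the same seed.

```python
import random


def gaussian_chunk(sequence: str, chunk_size: int, std: float = 1.0, seed: int = 42) -> str:
    """Return a chunk_size window centred near the middle of sequence."""
    if len(sequence) <= chunk_size:
        raise ValueError("The input data must be longer than the chunk size")

    rng = random.Random(seed)  # local generator, so the result depends only on seed
    middle = len(sequence) // 2
    new_middle = int(middle + rng.gauss(0, std))  # shifted midpoint

    start = max(new_middle - chunk_size // 2, 0)  # clamp at the left edge
    end = new_middle + chunk_size // 2
    if end < len(sequence):
        return sequence[start : start + chunk_size]
    return sequence[-chunk_size:]  # window ran past the right edge


sequence = "ACGTACGTACGTACGTACGT"
chunk = gaussian_chunk(sequence, chunk_size=8, std=2.0)
centred = gaussian_chunk(sequence, chunk_size=8, std=0.0)  # no shift: "GTACGTAC"
```

With std set to 0 the midpoint does not move, so the chunk is simply the central window. Larger values of std spread the window further from the centre, and the clamping keeps the result at chunk_size characters.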

Parameters:

  • chunk_size (int) –

    Size of chunks to extract

  • seed (int, default: 42 ) –

    Random seed for reproducibility

  • std (float, default: 1 ) –

    Standard deviation for the Gaussian distribution

Methods:

  • transform

    Chunks a sequence of size chunk_size from the middle position +/- a value obtained through a gaussian distribution.

  • transform_all

    Adds chunks to multiple lists using multiprocessing.

Source code in src/stimulus/data/transform/data_transformation_generators.py
def __init__(self, chunk_size: int, seed: int = 42, std: float = 1) -> None:
    """Initialize the Gaussian chunk generator.

    Args:
        chunk_size: Size of chunks to extract
        seed: Random seed for reproducibility
        std: Standard deviation for the Gaussian distribution
    """
    super().__init__()
    self.chunk_size = chunk_size
    self.seed = seed
    self.std = std

transform

transform(data: str) -> str

Chunks a sequence of size chunk_size from the middle position +/- a value obtained through a gaussian distribution.

Parameters:

  • data (str) –

    the sequence to be transformed

Returns:

  • transformed_data ( str ) –

    the chunk of the sequence

Raises:

  • ValueError –

    if the input data is not longer than the chunk size

Source code in src/stimulus/data/transform/data_transformation_generators.py
def transform(self, data: str) -> str:
    """Chunks a sequence of size chunk_size from the middle position +/- a value obtained through a gaussian distribution.

    Args:
        data (str): the sequence to be transformed

    Returns:
        transformed_data (str): the chunk of the sequence

    Raises:
        ValueError: if the input data is not longer than the chunk size
    """
    np.random.seed(self.seed)

    # make sure that the data is longer than chunk_size otherwise raise an error
    if len(data) <= self.chunk_size:
        raise ValueError("The input data is shorter than the chunk size")

    # Get the middle position of the input sequence
    middle_position = len(data) // 2

    # Change the middle position by a value obtained through a gaussian distribution
    new_middle_position = int(middle_position + np.random.normal(0, self.std))

    # Get the start and end position of the chunk
    start_position = new_middle_position - self.chunk_size // 2
    end_position = new_middle_position + self.chunk_size // 2

    # if the start position is negative, set it to 0
    start_position = max(start_position, 0)

    # Get the chunk of size chunk_size from the start position if the end position is smaller than the length of the data
    if end_position < len(data):
        return data[start_position : start_position + self.chunk_size]
    # Otherwise return the chunk of the sequence from the end of the sequence of size chunk_size
    return data[-self.chunk_size :]

transform_all

transform_all(data: list) -> list

Adds chunks to multiple lists using multiprocessing.

Parameters:

  • data (list) –

    the sequences to be transformed

Returns:

  • transformed_data ( list ) –

    the transformed sequences

Source code in src/stimulus/data/transform/data_transformation_generators.py
def transform_all(self, data: list) -> list:
    """Adds chunks to multiple lists using multiprocessing.

    Args:
        data (list): the sequences to be transformed

    Returns:
        transformed_data (list): the transformed sequences
    """
    with mp.Pool(mp.cpu_count()) as pool:
        function_specific_input = list(data)
        # map passes each sequence as one argument; starmap would unpack a string into characters
        return pool.map(self.transform, function_specific_input)

GaussianNoise

GaussianNoise(
    mean: float = 0, std: float = 1, seed: int = 42
)

Bases: AbstractNoiseGenerator

Add Gaussian noise to data.

This noise generator adds Gaussian noise to float values.
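
A minimal sketch of seeded Gaussian noise, under the assumption that one independent draw is added to each value. add_gaussian_noise is a hypothetical helper, and it uses the standard library's random.Random.gauss in place of np.random.normal, so the numbers differ from what GaussianNoise returns for the same seed.

```python
import random


def add_gaussian_noise(values: list, mean: float = 0.0, std: float = 1.0, seed: int = 42) -> list:
    """Add one independent Gaussian draw to each value."""
    rng = random.Random(seed)  # seeded once per call
    return [value + rng.gauss(mean, std) for value in values]


first = add_gaussian_noise([1.0, 2.0, 3.0])
second = add_gaussian_noise([1.0, 2.0, 3.0])  # same seed, same output
shifted = add_gaussian_noise([1.0, 2.0, 3.0], mean=0.5, std=0.0)  # pure offset
```

Because the seed is set at the start of each call, repeated calls with the same input return identical output. The same reasoning suggests that the single-point transform adds the same offset on every call for a given seed.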

Parameters:

  • mean (float, default: 0 ) –

    Mean of the Gaussian noise

  • std (float, default: 1 ) –

    Standard deviation of the Gaussian noise

  • seed (int, default: 42 ) –

    Random seed for reproducibility

Methods:

  • transform

    Adds Gaussian noise to a single point of data.

  • transform_all

    Adds Gaussian noise to a list of data points.

Source code in src/stimulus/data/transform/data_transformation_generators.py
def __init__(self, mean: float = 0, std: float = 1, seed: int = 42) -> None:
    """Initialize the Gaussian noise generator.

    Args:
        mean: Mean of the Gaussian noise
        std: Standard deviation of the Gaussian noise
        seed: Random seed for reproducibility
    """
    super().__init__()
    self.mean = mean
    self.std = std
    self.seed = seed

transform

transform(data: float) -> float

Adds Gaussian noise to a single point of data.

Parameters:

  • data (float) –

    the data to be transformed

Returns:

  • transformed_data ( float ) –

    the transformed data point

Source code in src/stimulus/data/transform/data_transformation_generators.py
def transform(self, data: float) -> float:
    """Adds Gaussian noise to a single point of data.

    Args:
        data (float): the data to be transformed

    Returns:
        transformed_data (float): the transformed data point
    """
    np.random.seed(self.seed)
    return data + np.random.normal(self.mean, self.std)

transform_all

transform_all(data: list) -> list

Adds Gaussian noise to a list of data points.

Parameters:

  • data (list) –

    the data to be transformed

Returns:

  • transformed_data ( list ) –

    the transformed data points

Source code in src/stimulus/data/transform/data_transformation_generators.py
def transform_all(self, data: list) -> list:
    """Adds Gaussian noise to a list of data points.

    Args:
        data (list): the data to be transformed

    Returns:
        transformed_data (list): the transformed data points
    """
    np.random.seed(self.seed)
    return list(np.array(data) + np.random.normal(self.mean, self.std, len(data)))

ReverseComplement

ReverseComplement(sequence_type: str = 'DNA')

Bases: AbstractAugmentationGenerator

Reverse complement biological sequences.

This augmentation strategy reverse complements the input nucleotide sequences.
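
As a small illustration of the technique, the sketch below reverse complements a DNA string with the same str.maketrans mapping the class builds for sequence_type 'DNA'. The function name reverse_complement is hypothetical.

```python
# Translation table: A<->T and C<->G, as in the DNA branch of the constructor.
DNA_COMPLEMENT = str.maketrans("ATCG", "TAGC")


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the string."""
    return sequence.translate(DNA_COMPLEMENT)[::-1]


result = reverse_complement("AACG")  # complement "TTGC", reversed "CGTT"
with_unknown = reverse_complement("ATGN")  # "N" is not in the table and is kept
```

Characters missing from the translation table, such as N or lowercase bases, pass through str.translate unchanged and are only reversed.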

Methods:

  • transform

    reverse complements a single data point

  • transform_all

    reverse complements a list of data points

Raises:

  • ValueError

    if the type of the sequence is not DNA or RNA

Parameters:

  • sequence_type (str, default: 'DNA' ) –

    Type of sequence ('DNA' or 'RNA')

Methods:

  • transform

    Returns the reverse complement of a sequence string using the complement_mapping.

  • transform_all

    Reverse complement multiple data points using multiprocessing.

Source code in src/stimulus/data/transform/data_transformation_generators.py
def __init__(self, sequence_type: str = "DNA") -> None:
    """Initialize the reverse complement generator.

    Args:
        sequence_type: Type of sequence ('DNA' or 'RNA')
    """
    super().__init__()
    if sequence_type not in ("DNA", "RNA"):
        raise ValueError(
            "Currently only DNA and RNA sequences are supported. Update the class ReverseComplement to support other types.",
        )
    if sequence_type == "DNA":
        self.complement_mapping = str.maketrans("ATCG", "TAGC")
    elif sequence_type == "RNA":
        self.complement_mapping = str.maketrans("AUCG", "UAGC")

transform

transform(data: str) -> str

Returns the reverse complement of a sequence string using the complement_mapping.

Parameters:

  • data (str) –

    the sequence to be transformed

Returns:

  • transformed_data ( str ) –

    the reverse complement of the sequence

Source code in src/stimulus/data/transform/data_transformation_generators.py
def transform(self, data: str) -> str:
    """Returns the reverse complement of a sequence string using the complement_mapping.

    Args:
        data (str): the sequence to be transformed

    Returns:
        transformed_data (str): the reverse complement of the sequence
    """
    return data.translate(self.complement_mapping)[::-1]

transform_all

transform_all(data: list) -> list

Reverse complement multiple data points using multiprocessing.

Parameters:

  • data (list) –

    the sequences to be transformed

Returns:

  • transformed_data ( list ) –

    the reverse complement of the sequences

Source code in src/stimulus/data/transform/data_transformation_generators.py
def transform_all(self, data: list) -> list:
    """Reverse complement multiple data points using multiprocessing.

    Args:
        data (list): the sequences to be transformed

    Returns:
        transformed_data (list): the reverse complement of the sequences
    """
    with mp.Pool(mp.cpu_count()) as pool:
        function_specific_input = list(data)
        return pool.map(self.transform, function_specific_input)
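
The call above uses pool.map, which hands each sequence to transform as a single argument. The sketch below illustrates how that differs from starmap, using the built-in map and itertools.starmap as single-process stand-ins, since Pool.map and Pool.starmap pass arguments in the same way. The reverse_complement function is a hypothetical helper.

```python
from itertools import starmap


def reverse_complement(sequence: str) -> str:
    """Reverse complement a DNA string."""
    return sequence.translate(str.maketrans("ATCG", "TAGC"))[::-1]


sequences = ["ACGT", "AAC"]

# map: one call per element, and the whole string is the single argument.
mapped = list(map(reverse_complement, sequences))

# starmap: each element is unpacked, so "ACGT" becomes four arguments.
try:
    list(starmap(reverse_complement, sequences))
    starmap_error = None
except TypeError as error:
    starmap_error = error
```

For a list of plain strings, map is therefore the matching call. starmap suits inputs where each element is already a tuple of arguments.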

UniformTextMasker

UniformTextMasker(
    probability: float = 0.1,
    mask: str = "*",
    seed: int = 42,
)

Bases: AbstractNoiseGenerator

Mask characters in text.

This noise generator replaces each character with a masking character with a given probability.
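
A minimal sketch of the masking rule, assuming each character is kept when a uniform draw exceeds probability and replaced by the mask otherwise. mask_text is a hypothetical helper, and it uses the standard library's random.Random instead of np.random.rand, so the masked positions differ from those UniformTextMasker produces for the same seed.

```python
import random


def mask_text(text: str, probability: float = 0.1, mask: str = "*", seed: int = 42) -> str:
    """Replace each character by mask with the given probability."""
    rng = random.Random(seed)  # local generator seeded once per call
    return "".join(c if rng.random() > probability else mask for c in text)


masked = mask_text("ACGTACGTACGT", probability=0.3)
fully_masked = mask_text("ACGT", probability=1.0)  # every draw is <= 1.0: "****"
```

The output always has the same length as the input, because characters are replaced one for one rather than removed.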

Methods:

  • transform

    adds character masking to a single data point

  • transform_all

    adds character masking to a list of data points

Parameters:

  • probability (float, default: 0.1 ) –

    Probability of masking each character

  • mask (str, default: '*' ) –

    Character to use for masking

  • seed (int, default: 42 ) –

    Random seed for reproducibility

Methods:

  • transform

    Adds character masking to the data.

  • transform_all

    Adds character masking to multiple data points using multiprocessing.

Source code in src/stimulus/data/transform/data_transformation_generators.py
def __init__(self, probability: float = 0.1, mask: str = "*", seed: int = 42) -> None:
    """Initialize the text masker.

    Args:
        probability: Probability of masking each character
        mask: Character to use for masking
        seed: Random seed for reproducibility
    """
    super().__init__()
    self.probability = probability
    self.mask = mask
    self.seed = seed

transform

transform(data: str) -> str

Adds character masking to the data.

Parameters:

  • data (str) –

    the data to be transformed

Returns:

  • transformed_data ( str ) –

    the transformed data point

Source code in src/stimulus/data/transform/data_transformation_generators.py
def transform(self, data: str) -> str:
    """Adds character masking to the data.

    Args:
        data (str): the data to be transformed

    Returns:
        transformed_data (str): the transformed data point
    """
    np.random.seed(self.seed)
    return "".join([c if np.random.rand() > self.probability else self.mask for c in data])

transform_all

transform_all(data: list) -> list

Adds character masking to multiple data points using multiprocessing.

Parameters:

  • data (list) –

    the data to be transformed

Returns:

  • transformed_data ( list ) –

    the transformed data points

Source code in src/stimulus/data/transform/data_transformation_generators.py
def transform_all(self, data: list) -> list:
    """Adds character masking to multiple data points using multiprocessing.

    Args:
        data (list): the data to be transformed


    Returns:
        transformed_data (list): the transformed data points
    """
    with mp.Pool(mp.cpu_count()) as pool:
        function_specific_input = list(data)
        # map passes each string as one argument; starmap would unpack a string into characters
        return pool.map(self.transform, function_specific_input)