data_transformation_generators ¶

This file contains data transformer classes: noise generators, which modify data in place, and augmentation generators, which add transformed rows.

Classes:

AbstractAugmentationGenerator –

Abstract class for augmentation generators.
AbstractDataTransformer –

Abstract class for data transformers.
AbstractNoiseGenerator –

Abstract class for noise generators.
GaussianChunk –

Subset data around a random midpoint.
GaussianNoise –

Add Gaussian noise to data.
ReverseComplement –

Reverse complement biological sequences.
UniformTextMasker –

Mask characters in text.

AbstractAugmentationGenerator ¶

AbstractAugmentationGenerator()

Bases: AbstractDataTransformer

Abstract class for augmentation generators.

Every augmentation function should set the seed itself, because running the functions through multiprocessing could otherwise unset it.

Methods:

transform –

Transforms a single data point.
transform_all –

Transforms a list of data points.

Source code in src/stimulus/data/transform/data_transformation_generators.py

def __init__(self) -> None:
    """Initialize the augmentation generator."""
    super().__init__()
    self.add_row = True

transform `abstractmethod` ¶

transform(data: Any) -> Any

Transforms a single data point.

This is an abstract method that should be implemented by the child class.

Parameters:

data (Any) –

the data to be transformed

Returns:

transformed_data ( Any ) –

the transformed data

Source code in src/stimulus/data/transform/data_transformation_generators.py

@abstractmethod
def transform(self, data: Any) -> Any:
    """Transforms a single data point.

    This is an abstract method that should be implemented by the child class.

    Args:
        data (Any): the data to be transformed

    Returns:
        transformed_data (Any): the transformed data
    """
    #  np.random.seed(self.seed)
    raise NotImplementedError

transform_all `abstractmethod` ¶

transform_all(data: list) -> list

Transforms a list of data points.

This is an abstract method that should be implemented by the child class.

Parameters:

data (list) –

the data to be transformed

Returns:

transformed_data ( list ) –

the transformed data

Source code in src/stimulus/data/transform/data_transformation_generators.py

@abstractmethod
def transform_all(self, data: list) -> list:
    """Transforms a list of data points.

    This is an abstract method that should be implemented by the child class.

    Args:
        data (list): the data to be transformed

    Returns:
        transformed_data (list): the transformed data
    """
    #  np.random.seed(self.seed)
    raise NotImplementedError

AbstractDataTransformer ¶

AbstractDataTransformer()

Bases: ABC

Abstract class for data transformers.

Data transformers implement in-place or augmentation transformations. The "add_row" attribute specifies which of the two applies; it should be True or False and is set in the constructor of each child class.

Child classes should override the transform and transform_all methods.

transform_all should always return a list.

Both methods should take an optional seed argument set to None by default to be compliant with stimulus' core principle of reproducibility. Seed should be initialized through np.random.seed(seed) in the method implementation.

Attributes:

add_row (bool) –

whether the transformer adds rows to the data

Methods:

transform –

Transforms a single data point.
transform_all –

Transforms a list of data points.

Source code in src/stimulus/data/transform/data_transformation_generators.py

def __init__(self) -> None:
    """Initialize the data transformer."""
    self.add_row: bool = False
    self.seed: int = 42
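
A minimal sketch of the contract described above, assuming a stand-in base class that mirrors the documented constructor. `TransformerBase` and `UppercaseTransformer` are hypothetical names used only for illustration; they are not part of stimulus.

```python
from abc import ABC, abstractmethod
from typing import Any


class TransformerBase(ABC):
    """Stand-in mirroring the documented AbstractDataTransformer (assumption)."""

    def __init__(self) -> None:
        self.add_row: bool = False  # in-place by default
        self.seed: int = 42

    @abstractmethod
    def transform(self, data: Any) -> Any:
        raise NotImplementedError

    @abstractmethod
    def transform_all(self, data: list) -> list:
        raise NotImplementedError


class UppercaseTransformer(TransformerBase):
    """Hypothetical in-place transformer that upper-cases strings."""

    def transform(self, data: str) -> str:
        return data.upper()

    def transform_all(self, data: list) -> list:
        # transform_all must always return a list
        return [self.transform(item) for item in data]


transformer = UppercaseTransformer()
result = transformer.transform_all(["acgt", "ttaa"])
```

Because both methods are abstract, the base class itself cannot be instantiated; only a child class that overrides both can.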

transform `abstractmethod` ¶

transform(data: Any) -> Any

Transforms a single data point.

This is an abstract method that should be implemented by the child class.

Parameters:

data (Any) –

the data to be transformed

Returns:

transformed_data ( Any ) –

the transformed data

Source code in src/stimulus/data/transform/data_transformation_generators.py

@abstractmethod
def transform(self, data: Any) -> Any:
    """Transforms a single data point.

    This is an abstract method that should be implemented by the child class.

    Args:
        data (Any): the data to be transformed

    Returns:
        transformed_data (Any): the transformed data
    """
    #  np.random.seed(self.seed)
    raise NotImplementedError

transform_all `abstractmethod` ¶

transform_all(data: list) -> list

Transforms a list of data points.

This is an abstract method that should be implemented by the child class.

Parameters:

data (list) –

the data to be transformed

Returns:

transformed_data ( list ) –

the transformed data

Source code in src/stimulus/data/transform/data_transformation_generators.py

@abstractmethod
def transform_all(self, data: list) -> list:
    """Transforms a list of data points.

    This is an abstract method that should be implemented by the child class.

    Args:
        data (list): the data to be transformed

    Returns:
        transformed_data (list): the transformed data
    """
    #  np.random.seed(self.seed)
    raise NotImplementedError

AbstractNoiseGenerator ¶

AbstractNoiseGenerator()

Bases: AbstractDataTransformer

Abstract class for noise generators.

Every noise function should set the seed itself, because running the functions through multiprocessing could otherwise unset it.

Methods:

transform –

Transforms a single data point.
transform_all –

Transforms a list of data points.

Source code in src/stimulus/data/transform/data_transformation_generators.py

def __init__(self) -> None:
    """Initialize the noise generator."""
    super().__init__()
    self.add_row = False
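
The difference between the two `add_row` settings can be sketched as follows. How stimulus itself consumes the flag is not shown on this page, so `apply_transform` and `reverse_all` are hypothetical helpers that illustrate one plausible reading of "adds rows to the data".

```python
from typing import Callable, List


def apply_transform(
    rows: List[str],
    transform_all: Callable[[list], list],
    add_row: bool,
) -> List[str]:
    """Hypothetical consumer of a transformer's add_row flag."""
    transformed = transform_all(rows)
    if add_row:
        # augmentation: keep the original rows and append the new ones
        return rows + transformed
    # in-place (noise): the transformed rows replace the originals
    return transformed


def reverse_all(rows: List[str]) -> List[str]:
    """Toy transform_all used only for this illustration."""
    return [row[::-1] for row in rows]


augmented = apply_transform(["AB", "CD"], reverse_all, add_row=True)
in_place = apply_transform(["AB", "CD"], reverse_all, add_row=False)
```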

transform `abstractmethod` ¶

transform(data: Any) -> Any

Transforms a single data point.

This is an abstract method that should be implemented by the child class.

Parameters:

data (Any) –

the data to be transformed

Returns:

transformed_data ( Any ) –

the transformed data

Source code in src/stimulus/data/transform/data_transformation_generators.py

@abstractmethod
def transform(self, data: Any) -> Any:
    """Transforms a single data point.

    This is an abstract method that should be implemented by the child class.

    Args:
        data (Any): the data to be transformed

    Returns:
        transformed_data (Any): the transformed data
    """
    #  np.random.seed(self.seed)
    raise NotImplementedError

transform_all `abstractmethod` ¶

transform_all(data: list) -> list

Transforms a list of data points.

This is an abstract method that should be implemented by the child class.

Parameters:

data (list) –

the data to be transformed

Returns:

transformed_data ( list ) –

the transformed data

Source code in src/stimulus/data/transform/data_transformation_generators.py

@abstractmethod
def transform_all(self, data: list) -> list:
    """Transforms a list of data points.

    This is an abstract method that should be implemented by the child class.

    Args:
        data (list): the data to be transformed

    Returns:
        transformed_data (list): the transformed data
    """
    #  np.random.seed(self.seed)
    raise NotImplementedError

GaussianChunk ¶

GaussianChunk(
    chunk_size: int, seed: int = 42, std: float = 1
)

Bases: AbstractAugmentationGenerator

Subset data around a random midpoint.

This augmentation strategy extracts a chunk from each input sequence, centered on a position drawn from a Gaussian distribution.

Concretely, it shifts the middle position (i.e. the peak summit) to a new position drawn from a Gaussian distribution, so positions close to the original midpoint are more likely to be chosen than positions farther away. A chunk of size chunk_size around the new midpoint is then returned. transform_all repeats this process for each sequence.

Parameters:

chunk_size (int) –

Size of chunks to extract
seed (int, default: 42 ) –

Random seed for reproducibility
std (float, default: 1 ) –

Standard deviation for the Gaussian distribution

Methods:

transform –

Extracts a chunk of size chunk_size centered on the middle position +/- an offset drawn from a Gaussian distribution.
transform_all –

Chunks multiple sequences using multiprocessing.

Source code in src/stimulus/data/transform/data_transformation_generators.py

def __init__(self, chunk_size: int, seed: int = 42, std: float = 1) -> None:
    """Initialize the Gaussian chunk generator.

    Args:
        chunk_size: Size of chunks to extract
        seed: Random seed for reproducibility
        std: Standard deviation for the Gaussian distribution
    """
    super().__init__()
    self.chunk_size = chunk_size
    self.seed = seed
    self.std = std

transform ¶

transform(data: str) -> str

Extracts a chunk of size chunk_size centered on the middle position +/- an offset drawn from a Gaussian distribution.

Parameters:

data (str) –

the sequence to be transformed

Returns:

transformed_data ( str ) –

the chunk of the sequence

Raises:

ValueError –

if the input data is not longer than the chunk size

Source code in src/stimulus/data/transform/data_transformation_generators.py

def transform(self, data: str) -> str:
    """Extracts a chunk of size chunk_size centered on the middle position +/- a Gaussian offset.

    Args:
        data (str): the sequence to be transformed

    Returns:
        transformed_data (str): the chunk of the sequence

    Raises:
        ValueError: if the input data is not longer than the chunk size
    """
    np.random.seed(self.seed)

    # make sure that the data is longer than chunk_size otherwise raise an error
    if len(data) <= self.chunk_size:
        raise ValueError("The input data is not longer than the chunk size")

    # Get the middle position of the input sequence
    middle_position = len(data) // 2

    # Change the middle position by a value obtained through a gaussian distribution
    new_middle_position = int(middle_position + np.random.normal(0, self.std))

    # Get the start and end position of the chunk
    start_position = new_middle_position - self.chunk_size // 2
    end_position = new_middle_position + self.chunk_size // 2

    # if the start position is negative, set it to 0
    start_position = max(start_position, 0)

    # Get the chunk of size chunk_size from the start position if the end position is smaller than the length of the data
    if end_position < len(data):
        return data[start_position : start_position + self.chunk_size]
    # Otherwise return the chunk of the sequence from the end of the sequence of size chunk_size
    return data[-self.chunk_size :]
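
The window arithmetic in the listing above can be traced with a small standalone sketch. `chunk_window` is a hypothetical helper that takes the offset as an explicit argument instead of drawing it from `np.random.normal`, so the three boundary cases can be checked by hand.

```python
def chunk_window(data: str, chunk_size: int, offset: float) -> str:
    """Hypothetical re-statement of the window logic with an explicit offset."""
    if len(data) <= chunk_size:
        raise ValueError("data must be longer than chunk_size")

    middle = len(data) // 2
    new_middle = int(middle + offset)  # int() truncates toward zero

    # clamp the start at 0 so the slice never wraps around
    start = max(new_middle - chunk_size // 2, 0)
    end = new_middle + chunk_size // 2

    if end < len(data):
        return data[start : start + chunk_size]
    # window would run past the end: fall back to the last chunk_size items
    return data[-chunk_size:]


sequence = "ABCDEFGHIJ"
centered = chunk_window(sequence, 4, 0.0)     # "DEFG"
right_edge = chunk_window(sequence, 4, 4.0)   # "GHIJ"
left_edge = chunk_window(sequence, 4, -10.0)  # "ABCD"
```

In all three cases the result has exactly chunk_size characters, which is what the clamping at both ends is for.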

transform_all ¶

transform_all(data: list) -> list

Chunks multiple sequences using multiprocessing.

Parameters:

data (list) –

the sequences to be transformed

Returns:

transformed_data ( list ) –

the transformed sequences

Source code in src/stimulus/data/transform/data_transformation_generators.py

def transform_all(self, data: list) -> list:
    """Chunks multiple sequences using multiprocessing.

    Args:
        data (list): the sequences to be transformed

    Returns:
        transformed_data (list): the transformed sequences
    """
    with mp.Pool(mp.cpu_count()) as pool:
        function_specific_input = list(data)
        # map passes each sequence as one argument; starmap would unpack each string into characters
        return pool.map(self.transform, function_specific_input)
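
When a one-argument transform is applied to a list of strings, the choice between map and starmap matters. The sketch below uses the built-in `map` and `itertools.starmap` as single-process stand-ins for `Pool.map` and `Pool.starmap`, which follow the same argument-passing convention; `first_two` is a hypothetical transform.

```python
from itertools import starmap


def first_two(sequence: str) -> str:
    """Hypothetical one-argument transform."""
    return sequence[:2]


sequences = ["ACGT", "TTGA"]

# map passes each element as a single argument
mapped = list(map(first_two, sequences))

# starmap unpacks each element, so "ACGT" becomes four separate arguments
try:
    list(starmap(first_two, sequences))
    starmap_failed = False
except TypeError:
    starmap_failed = True

# starmap works once each argument is wrapped in a one-element tuple
wrapped = list(starmap(first_two, [(s,) for s in sequences]))
```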

GaussianNoise ¶

GaussianNoise(
    mean: float = 0, std: float = 1, seed: int = 42
)

Bases: AbstractNoiseGenerator

Add Gaussian noise to data.

This noise generator adds Gaussian noise to float values.

Parameters:

mean (float, default: 0 ) –

Mean of the Gaussian noise
std (float, default: 1 ) –

Standard deviation of the Gaussian noise
seed (int, default: 42 ) –

Random seed for reproducibility

Methods:

transform –

Adds Gaussian noise to a single point of data.
transform_all –

Adds Gaussian noise to a list of data points.

Source code in src/stimulus/data/transform/data_transformation_generators.py

def __init__(self, mean: float = 0, std: float = 1, seed: int = 42) -> None:
    """Initialize the Gaussian noise generator.

    Args:
        mean: Mean of the Gaussian noise
        std: Standard deviation of the Gaussian noise
        seed: Random seed for reproducibility
    """
    super().__init__()
    self.mean = mean
    self.std = std
    self.seed = seed

transform ¶

transform(data: float) -> float

Adds Gaussian noise to a single point of data.

Parameters:

data (float) –

the data to be transformed

Returns:

transformed_data ( float ) –

the transformed data point

Source code in src/stimulus/data/transform/data_transformation_generators.py

def transform(self, data: float) -> float:
    """Adds Gaussian noise to a single point of data.

    Args:
        data (float): the data to be transformed

    Returns:
        transformed_data (float): the transformed data point
    """
    np.random.seed(self.seed)
    return data + np.random.normal(self.mean, self.std)

transform_all ¶

transform_all(data: list) -> list

Adds Gaussian noise to a list of data points.

Parameters:

data (list) –

the data to be transformed

Returns:

transformed_data ( list ) –

the transformed data points

Source code in src/stimulus/data/transform/data_transformation_generators.py

def transform_all(self, data: list) -> list:
    """Adds Gaussian noise to a list of data points.

    Args:
        data (list): the data to be transformed

    Returns:
        transformed_data (list): the transformed data points
    """
    np.random.seed(self.seed)
    return list(np.array(data) + np.random.normal(self.mean, self.std, len(data)))
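
A sketch of what reseeding on every call implies. It assumes the standard-library `random` module as a stand-in for NumPy, so the drawn values differ from those of the class above; `add_noise` and `add_noise_all` are hypothetical names.

```python
import random
from typing import List


def add_noise(value: float, mean: float = 0.0, std: float = 1.0, seed: int = 42) -> float:
    # a fresh generator per call plays the role of reseeding inside transform
    rng = random.Random(seed)
    return value + rng.gauss(mean, std)


def add_noise_all(
    values: List[float], mean: float = 0.0, std: float = 1.0, seed: int = 42
) -> List[float]:
    # seed once, then draw one noise value per element
    rng = random.Random(seed)
    return [v + rng.gauss(mean, std) for v in values]


single_a = add_noise(1.0)
single_b = add_noise(5.0)
batch = add_noise_all([0.0, 0.0, 0.0])
```

With a fixed seed, every single-point call adds the same offset, whereas the batch version draws a different offset for each element while remaining reproducible across runs.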

ReverseComplement ¶

ReverseComplement(sequence_type: str = 'DNA')

Bases: AbstractAugmentationGenerator

Reverse complement biological sequences.

This augmentation strategy reverse complements the input nucleotide sequences.

Raises:

ValueError –

if the type of the sequence is not DNA or RNA

Parameters:

sequence_type (str, default: 'DNA' ) –

Type of sequence ('DNA' or 'RNA')

Methods:

transform –

Returns the reverse complement of a sequence string using the complement_mapping.
transform_all –

Reverse complement multiple data points using multiprocessing.

Source code in src/stimulus/data/transform/data_transformation_generators.py

def __init__(self, sequence_type: str = "DNA") -> None:
    """Initialize the reverse complement generator.

    Args:
        sequence_type: Type of sequence ('DNA' or 'RNA')
    """
    super().__init__()
    if sequence_type not in ("DNA", "RNA"):
        raise ValueError(
            "Currently only DNA and RNA sequences are supported. Update the class ReverseComplement to support other types.",
        )
    if sequence_type == "DNA":
        self.complement_mapping = str.maketrans("ATCG", "TAGC")
    elif sequence_type == "RNA":
        self.complement_mapping = str.maketrans("AUCG", "UAGC")

transform ¶

transform(data: str) -> str

Returns the reverse complement of a sequence string using the complement_mapping.

Parameters:

data (str) –

the sequence to be transformed

Returns:

transformed_data ( str ) –

the reverse complement of the sequence

Source code in src/stimulus/data/transform/data_transformation_generators.py

def transform(self, data: str) -> str:
    """Returns the reverse complement of a sequence string using the complement_mapping.

    Args:
        data (str): the sequence to be transformed

    Returns:
        transformed_data (str): the reverse complement of the sequence
    """
    return data.translate(self.complement_mapping)[::-1]
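
The translate-then-reverse approach can be tried standalone. This is a minimal sketch: `reverse_complement`, `DNA_COMPLEMENT` and `RNA_COMPLEMENT` are hypothetical names that reuse the same translation tables as the constructor above.

```python
DNA_COMPLEMENT = str.maketrans("ATCG", "TAGC")
RNA_COMPLEMENT = str.maketrans("AUCG", "UAGC")


def reverse_complement(sequence: str, mapping: dict = DNA_COMPLEMENT) -> str:
    # complement every base, then reverse the string
    return sequence.translate(mapping)[::-1]


dna = reverse_complement("AACG")                  # "CGTT"
rna = reverse_complement("AUCG", RNA_COMPLEMENT)  # "CGAU"

# characters missing from the table (lowercase, N) are reversed but not complemented
untouched = reverse_complement("ATNg")            # "gNAT"
```

Because the tables only cover uppercase bases, lowercase or ambiguous characters pass through unchanged; callers that need them complemented would have to normalise the input first.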

transform_all ¶

transform_all(data: list) -> list

Reverse complement multiple data points using multiprocessing.

Parameters:

data (list) –

the sequences to be transformed

Returns:

transformed_data ( list ) –

the reverse complement of the sequences

Source code in src/stimulus/data/transform/data_transformation_generators.py

def transform_all(self, data: list) -> list:
    """Reverse complement multiple data points using multiprocessing.

    Args:
        data (list): the sequences to be transformed

    Returns:
        transformed_data (list): the reverse complement of the sequences
    """
    with mp.Pool(mp.cpu_count()) as pool:
        function_specific_input = list(data)
        return pool.map(self.transform, function_specific_input)

UniformTextMasker ¶

UniformTextMasker(
    probability: float = 0.1,
    mask: str = "*",
    seed: int = 42,
)

Bases: AbstractNoiseGenerator

Mask characters in text.

This noise generator replaces each character with a masking character with a given probability.

Parameters:

probability (float, default: 0.1 ) –

Probability of masking each character
mask (str, default: '*' ) –

Character to use for masking
seed (int, default: 42 ) –

Random seed for reproducibility

Methods:

transform –

Adds character masking to the data.
transform_all –

Adds character masking to multiple data points using multiprocessing.

Source code in src/stimulus/data/transform/data_transformation_generators.py

def __init__(self, probability: float = 0.1, mask: str = "*", seed: int = 42) -> None:
    """Initialize the text masker.

    Args:
        probability: Probability of masking each character
        mask: Character to use for masking
        seed: Random seed for reproducibility
    """
    super().__init__()
    self.probability = probability
    self.mask = mask
    self.seed = seed

transform ¶

transform(data: str) -> str

Adds character masking to the data.

Parameters:

data (str) –

the data to be transformed

Returns:

transformed_data ( str ) –

the transformed data point

Source code in src/stimulus/data/transform/data_transformation_generators.py

def transform(self, data: str) -> str:
    """Adds character masking to the data.

    Args:
        data (str): the data to be transformed

    Returns:
        transformed_data (str): the transformed data point
    """
    np.random.seed(self.seed)
    return "".join([c if np.random.rand() > self.probability else self.mask for c in data])
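
A minimal sketch of the per-character masking rule, assuming the standard-library `random` module in place of NumPy (so the masked positions differ from those of the class above). `mask_text` is a hypothetical name.

```python
import random


def mask_text(text: str, probability: float = 0.1, mask: str = "*", seed: int = 42) -> str:
    rng = random.Random(seed)
    # keep a character when its draw exceeds the probability, otherwise mask it
    return "".join(c if rng.random() > probability else mask for c in text)


all_masked = mask_text("ACGTACGT", probability=1.0)
partially_masked = mask_text("ACGTACGT", probability=0.5)
```

Each character gets its own draw, so the output always has the same length as the input, and a fixed seed makes the masked positions reproducible.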

transform_all ¶

transform_all(data: list) -> list

Adds character masking to multiple data points using multiprocessing.

Parameters:

data (list) –

the data to be transformed

Returns:

transformed_data ( list ) –

the transformed data points

Source code in src/stimulus/data/transform/data_transformation_generators.py

def transform_all(self, data: list) -> list:
    """Adds character masking to multiple data points using multiprocessing.

    Args:
        data (list): the data to be transformed

    Returns:
        transformed_data (list): the transformed data points
    """
    with mp.Pool(mp.cpu_count()) as pool:
        function_specific_input = list(data)
        # map passes each string as one argument; starmap would unpack each string into characters
        return pool.map(self.transform, function_specific_input)
