Datasets

cluster_centers(n, d, lower=1, higher=2, rng=None, init=None)

Sample n points in d-dimensional space. Points will be between (lower, ~2*higher) distance from all other points. The first center is sampled in a box around init (the origin (0,0,...,0) by default). Sampling is done sequentially: each proposed center is rejected and resampled if it lies within lower of any accepted center. If sampling a large number of points takes too long, increase higher.

Note: 2*higher is not an exact upper bound on the distances. higher controls the size of the box around a previously accepted center from which each proposal point is drawn; each proposal is sampled from a box with sides of length 2*higher.

Parameters:

  • n (int) –

    number of points to generate

  • d (int) –

    number of dimensions

  • lower (float, default: 1 ) –

    Lower bound of distances to accept. All points will be at least lower away from each other. Defaults to 1.

  • higher (float, default: 2 ) –

    Size of the box around a previous center to sample from. Defaults to 2.

  • rng (default_rng, default: None ) –

    user seeded random number generator. Defaults to None.

  • init (ndarray, default: None ) –

    Location around which the first center is sampled. Defaults to None (the origin). Note the returned points are shuffled, so the first point is not necessarily the one near init.

Returns:

  • list

    n randomly sampled points in d-dimensional space all at least lower away from each other.

Source code in src/simnetpy/datasets/distributions.py
def cluster_centers(n, d, lower=1, higher=2, rng=None, init=None):
    """Sample n point in d-dimensional space. Points will be between
    (lower, ~2*higher) distance from all other points. Initial center will be (0,0, ..,0)
    unless otherwise specified. Sampling done sequentially with next center proposed,
    rejected if too close to all others and resampled until accepted.
    If sampling large number of points & time taken is large increase size of higher.

    Note: 2*higher is not actual distance upper bound. higher controls size of box around previous center
    that we sample a proposal point. Each center sampled from box with sides of size 2*higher.
    Args:
        n (int): number of points to generate
        d (int): number of dimensions
        lower (float, optional): Lower bound of distances to accept. All points will be
                            at least lower away from each other. Defaults to 1.
        higher (float, optional): Size of box around previous center to sample from. Defaults to 2.
        rng (np.random.default_rng, optional): user seeded random number generator. Defaults to None.
        init (np.ndarray, optional): Location of first sample. Defaults to None. Note points are shuffled
        but if none at least one will be the origin (0,0,...,0).

    Returns:
        list: n randomly sampled points in d-dimensional space all at least lower away from each other.
    """
    if rng is None:
        rng = np.random.default_rng()

    if init is None:
        init = np.zeros(d)
    centers = []
    x = init + uniform_sampler(d, std=higher, rng=rng)
    centers.append(x)

    while len(centers) < n:
        # sample point in random direction
        x = uniform_sampler(d, std=higher, rng=rng)

        # find random center and move away in direction x
        i = rng.integers(0, len(centers))
        x = centers[i] + x

        carray = np.array(centers)
        dist = np.linalg.norm(carray - x, axis=1, ord=2)

        if np.all(dist > lower):
            centers.append(x)
    rng.shuffle(centers)  # shuffle so first cluster is not necessarily close to init
    return centers
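
A minimal usage sketch (the import path below is an assumption based on the source location shown above):

import numpy as np
from simnetpy.datasets.distributions import cluster_centers  # assumed import path

rng = np.random.default_rng(42)

# four cluster centers in 3-d space, every pair at least 2.0 apart
centers = cluster_centers(n=4, d=3, lower=2.0, higher=4.0, rng=rng)

carray = np.array(centers)
pairwise = np.linalg.norm(carray[:, None, :] - carray[None, :, :], axis=-1)
print(pairwise)  # all off-diagonal entries are greater than 2.0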

mixed_categorical_clusters(N, d, nclusters=3, sizes='equal', alpha=5, beta=1, nlevels=5, return_skew_factors=True, rng=None)

alpha and beta control the shape of the skew factor distribution, a Beta(alpha, beta). Higher max(a, b) > 1 gives a less flat, more peaked distribution; a < b skews the factors below 0.5, and a > b skews them above 0.5. So alpha=5, beta=1 means the skew passed to the ordinal feature generator is centered on a/(a+b) ~ 0.83 on average, while alpha=1, beta=5 gives an average skew of ~0.16. A lower average skew means noisier data and a harder clustering problem.

Source code in src/simnetpy/datasets/distributions.py
def mixed_categorical_clusters(
    N,
    d,
    nclusters=3,
    sizes="equal",
    alpha=5,
    beta=1,
    nlevels=5,
    return_skew_factors=True,
    rng=None,
):
    """alpha and beta control shape of skew factor distribution
    higher max(abs(a,b))>1 -> less flat more peaked distribution
    a<b -> skewed below 0.5
    a>b -> skewed above 0.5
    so 5,1 means on average skew passed to ordinal feature generator will be centered on a/a+b~0.83
    and 1,5 means average skew will be ~0.16
    lower average skew means noisier data and harder clustering problem

    """

    if rng is None:
        rng = np.random.default_rng()

    if isinstance(sizes, str):
        assert sizes.lower() in [
            "equal",
            "random",
            "roughly_equal",
        ], "if specifying method sizes must be one of [equal, random, roughly_equal]"
        sizes = split_data_into_clusters(N, nclusters, method=sizes)
    elif sizes is None:
        sizes = split_data_into_clusters(N, nclusters, method="equal")

    assert sizes.shape[0] == nclusters, "nclusters and sizes must match"
    assert sizes.sum() == N, f"sizes must add up to N {sizes.sum()} != {N}"

    # N = sizes.sum()

    # generate skew distribution & sample
    rv = stats.beta(a=alpha, b=beta)
    skew_factors = rv.rvs(size=d, random_state=rng)

    # generate features
    X = np.zeros((N, d))
    for i in range(d):
        si = skew_factors[i]
        if si >= 0.5 and si <= 1:
            skew = (
                2 * ((nlevels - 1) / nlevels) * si + (2 - nlevels) / nlevels
            )  # map skew factor from [0.5, 1] to [1/nlevels, 1] i.e. si=0.5, skew=1/nlevels
            X[:, i] = mixed_categorical_cluster_feature(
                sizes, nlevels=nlevels, min_largest_cat=skew, rng=rng
            )
        elif si >= 0 and si < 0.5:
            skew = (
                2 * ((1 - nlevels) / nlevels) * si + 1
            )  # map skew factor from [0.5, 0] to [1/nlevels, 1] i.e si=0 -> skew = 1.0
            X[:, i] = single_categorical_feature(
                N, nlevels=nlevels, min_largest_cat=skew, rng=rng
            )
        else:
            raise ValueError("Error beta distribution ill defined")

    # generate labels
    y = np.zeros(N)
    total = 0
    for i, size in enumerate(sizes):
        y[total : total + size] = i
        total += size

    dataset = Bunch(y=y, X=X)
    if return_skew_factors:
        dataset["skew_factors"] = skew_factors

    return dataset
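
A minimal usage sketch (the import path is an assumption based on the source location shown above; the returned Bunch is accessed dict-style, as in the source):

import numpy as np
from simnetpy.datasets.distributions import mixed_categorical_clusters  # assumed import path

rng = np.random.default_rng(0)

# 300 samples, 10 categorical features, 3 equally sized clusters.
# alpha > beta pushes skew factors towards 1, i.e. a cleaner clustering problem.
data = mixed_categorical_clusters(N=300, d=10, nclusters=3, sizes="equal",
                                  alpha=5, beta=1, nlevels=5, rng=rng)

print(data["X"].shape)                     # (300, 10) matrix of category codes
print(np.bincount(data["y"].astype(int)))  # cluster sizes: [100 100 100]
print(data["skew_factors"])                # per-feature draws from Beta(5, 1)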

multivariate_guassian(N, center, std=1, rng=None)

Sample N points from a multivariate Gaussian with mean at center and covariance std*Identity (i.e. an isotropic, circular Gaussian). The dimension of the Gaussian is inferred from the user-passed center.

Parameters:

  • N (int) –

    number of points to sample

  • center (ndarray) –

    d-dimensional point

  • std (float, default: 1 ) –

    Per-axis scale of the distribution: the covariance matrix is std*Identity, so std acts as the per-axis variance. A full numpy covariance matrix may also be passed. Defaults to 1.

  • rng (default_rng, default: None ) –

    user seeded random number generator. Defaults to None.

Returns:

  • np.ndarray: N samples from the multi-dimensional Gaussian centered at center; an (N x d) matrix

Source code in src/simnetpy/datasets/distributions.py
def multivariate_guassian(N, center, std=1, rng=None):
    """Sample N point from a multivariate guassian with mean at center and
    Covariance of std*Identity (i.e. circular guassian). Dimensions of guassian inferred
    from user passed center.

    Args:
        N (int): number of points to sample
        center (np.ndarray): d-dimensional point
        std (float, optional): Standard deviation along any axis. Covariance is `std`*Identity Matrix. Defaults to 1.
        rng (np.random.default_rng, optional): user seeded random number generator. Defaults to None.

    Returns:
        np.ndarray: N samples from multi-dimensional guassian centered at `center`. (N x d) matrix
    """
    if rng is None:
        rng = np.random.default_rng()

    if not isinstance(std, np.ndarray):
        d = center.shape[0]
        COV = std * np.eye(d)
    else:
        COV = std

    X = rng.multivariate_normal(center, COV, size=N)
    return X
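
A minimal usage sketch (the import path is an assumption based on the source location shown above). Note that std is used directly as the per-axis variance:

import numpy as np
from simnetpy.datasets.distributions import multivariate_guassian  # assumed import path

rng = np.random.default_rng(7)
center = np.array([1.0, -2.0, 0.5])

# 500 samples from an isotropic Gaussian; covariance is 0.25 * identity
X = multivariate_guassian(500, center, std=0.25, rng=rng)
print(X.shape)         # (500, 3)
print(X.mean(axis=0))  # close to center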

multivariate_t(N, center, std=1, df=1, rng=None)

Sample N points from a multivariate t distribution with location at center and shape matrix std*Identity (i.e. an isotropic distribution). The dimension is inferred from the user-passed center.

Parameters:

  • N (int) –

    number of points to sample

  • center (ndarray) –

    d-dimensional point

  • std (float, default: 1 ) –

    Per-axis scale of the distribution: the shape matrix is std*Identity. A full numpy covariance (shape) matrix may also be passed. Defaults to 1.

  • df (float, default: 1 ) –

    Degrees of freedom of the distribution. Defaults to 1. If np.inf, the result is multivariate normal.

  • rng (default_rng, default: None ) –

    user seeded random number generator. Defaults to None.

Returns:

  • np.ndarray: N samples from the multivariate t distribution centered at center; an (N x d) matrix

Source code in src/simnetpy/datasets/distributions.py
def multivariate_t(N, center, std=1, df=1, rng=None):
    """Sample N point from a multivariate guassian with mean at center and
    Covariance of std*Identity (i.e. circular guassian). Dimensions of guassian inferred
    from user passed center.

    Args:
        N (int): number of points to sample
        center (np.ndarray): d-dimensional point
        std (float, optional): Standard deviation along any axis. Covariance is `std`*Identity Matrix. Defaults to 1.
                                Also accepts numpy covariance matrix
        df (float, optional): Degrees of freedom of the distribution. Defaults to 1. If np.inf results are multivariate normal.
        rng (np.random.default_rng, optional): user seeded random number generator. Defaults to None.

    Returns:
        np.ndarray: N samples from multi-dimensional guassian centered at `center`. (N x d) matrix
    """
    if rng is None:
        rng = np.random.default_rng()

    if not isinstance(std, np.ndarray):
        d = center.shape[0]
        COV = std * np.eye(d)
    else:
        COV = std

    # X = rng.multivariate_normal(center, COV, size=N)
    # dist = stats.multivariate_t(center, shape=COV, df=df, seed=rng)
    # X = dist.rvs(size=size)
    X = stats.multivariate_t.rvs(loc=center, shape=COV, df=df, size=N, random_state=rng)

    return X
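
A minimal usage sketch (the import path is an assumption based on the source location shown above), contrasting heavy-tailed and Gaussian limits of the same call:

import numpy as np
from simnetpy.datasets.distributions import multivariate_t  # assumed import path

rng = np.random.default_rng(7)
center = np.zeros(2)

# df=1 gives a heavy-tailed (multivariate Cauchy) cluster around the origin
X_heavy = multivariate_t(500, center, std=1.0, df=1, rng=rng)

# df=np.inf recovers the multivariate normal case
X_normal = multivariate_t(500, center, std=1.0, df=np.inf, rng=rng)

print(X_heavy.shape, X_normal.shape)  # (500, 2) (500, 2)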

distributions

uniform_sampler(d, std=1, rng=None)

Sample a random location in a d-dimensional box, with maximum value std and minimum value -std on each axis.

Parameters:

  • d (int) –

    number of dimensions

  • std (float, default: 1 ) –

    max abs value on any axis. Defaults to 1.

  • rng (default_rng, default: None ) –

    user seeded random number generator. Defaults to None.

Returns:

  • np.ndarray: random point in d dimensional space with max abs value of std on any dimension

Source code in src/simnetpy/datasets/distributions.py
def uniform_sampler(d, std=1, rng=None):
    """sample random location in d dimensional box with
    max value of std and min value -std on any axis

    Args:
        d (int): number of dimensions
        std (float, optional): max abs value on any axis. Defaults to 1.
        rng (np.random.default_rng, optional): user seeded random number generator. Defaults to None.

    Returns:
        np.ndarray: random point in d dimensional space with max abs value of std on any dimension
    """
    if rng is None:
        rng = np.random.default_rng()

    a = [1, -1]

    direction = rng.choice(a, size=d)
    r = rng.uniform(0, std, size=d)
    return direction * r
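
A minimal usage sketch (the import path is an assumption based on the source location shown above):

import numpy as np
from simnetpy.datasets.distributions import uniform_sampler  # assumed import path

rng = np.random.default_rng(3)

# random point in the 5-dimensional box [-2, 2]^5
x = uniform_sampler(5, std=2.0, rng=rng)
print(x)
print(np.all(np.abs(x) <= 2.0))  # True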

multi_mod