Datasets

cluster_centers(n, d, lower=1, higher=2, rng=None, init=None)

Sample n points in d-dimensional space. Points will be between (lower, ~2*higher) distance from all other points. The first center is sampled in a box around init (the origin (0,0,...,0) by default). Sampling is done sequentially: each proposed center is rejected and resampled if it lies within lower of any accepted center. If sampling a large number of points takes too long, increase higher.

Note: 2*higher is not an exact upper bound on the distances. higher controls the size of the box around a previously accepted center from which each proposal point is drawn; each proposal is sampled from a box with sides of length 2*higher.

Parameters:

  • n (int) –

    number of points to generate

  • d (int) –

    number of dimensions

  • lower (float, default: 1 ) –

    Lower bound of distances to accept. All points will be at least lower away from each other. Defaults to 1.

  • higher (float, default: 2 ) –

    Size of the box around a previous center to sample from. Defaults to 2.

  • rng (default_rng, default: None ) –

    user seeded random number generator. Defaults to None.

  • init (ndarray, default: None ) –

    Location around which the first center is sampled. Defaults to None (the origin). Note the returned points are shuffled, so the first point is not necessarily the one near init.

Returns:

  • list

    n randomly sampled points in d-dimensional space all at least lower away from each other.

Source code in src/simnetpy/datasets/distributions.py
def cluster_centers(n, d, lower=1, higher=2, rng=None, init=None):
    """Sample n point in d-dimensional space. Points will be between
    (lower, ~2*higher) distance from all other points. Initial center will be (0,0, ..,0)
    unless otherwise specified. Sampling done sequentially with next center proposed,
    rejected if too close to all others and resampled until accepted.
    If sampling large number of points & time taken is large increase size of higher.

    Note: 2*higher is not actual distance upper bound. higher controls size of box around previous center
    that we sample a proposal point. Each center sampled from box with sides of size 2*higher.
    Args:
        n (int): number of points to generate
        d (int): number of dimensions
        lower (float, optional): Lower bound of distances to accept. All points will be
                            at least lower away from each other. Defaults to 1.
        higher (float, optional): Size of box around previous center to sample from. Defaults to 2.
        rng (np.random.default_rng, optional): user seeded random number generator. Defaults to None.
        init (np.ndarray, optional): Location of first sample. Defaults to None. Note points are shuffled
        but if none at least one will be the origin (0,0,...,0).

    Returns:
        list: n randomly sampled points in d-dimensional space all at least lower away from each other.
    """
    if rng is None:
        rng = np.random.default_rng()

    if init is None:
        init = np.zeros(d)
    centers = []
    x = init + uniform_sampler(d, std=higher, rng=rng)
    centers.append(x)

    while len(centers) < n:
        # sample point in random direction
        x = uniform_sampler(d, std=higher, rng=rng)

        # find random center and move away in direction x
        i = rng.integers(0, len(centers))
        x = centers[i] + x

        carray = np.array(centers)
        dist = np.linalg.norm(carray - x, axis=1, ord=2)

        if np.all(dist > lower):
            centers.append(x)
    rng.shuffle(centers)  # shuffle so first cluster is not necessarily close to init
    return centers
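
A minimal usage sketch (the import path below is an assumption based on the source location shown above):

import numpy as np
from simnetpy.datasets.distributions import cluster_centers  # assumed import path

rng = np.random.default_rng(42)

# four cluster centers in 3-d space, every pair at least 2.0 apart
centers = cluster_centers(n=4, d=3, lower=2.0, higher=4.0, rng=rng)

carray = np.array(centers)
pairwise = np.linalg.norm(carray[:, None, :] - carray[None, :, :], axis=-1)
print(pairwise)  # all off-diagonal entries are greater than 2.0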

mixed_categorical_clusters(N, d, nclusters=3, sizes='equal', alpha=5, beta=1, nlevels=5, return_skew_factors=True, rng=None)

alpha and beta control the shape of the skew factor distribution, a Beta(alpha, beta). Higher max(a, b) > 1 gives a less flat, more peaked distribution; a < b skews the factors below 0.5, and a > b skews them above 0.5. So alpha=5, beta=1 means the skew passed to the ordinal feature generator is centered on a/(a+b) ~ 0.83 on average, while alpha=1, beta=5 gives an average skew of ~0.16. A lower average skew means noisier data and a harder clustering problem.

Source code in src/simnetpy/datasets/distributions.py
def mixed_categorical_clusters(
    N,
    d,
    nclusters=3,
    sizes="equal",
    alpha=5,
    beta=1,
    nlevels=5,
    return_skew_factors=True,
    rng=None,
):
    """alpha and beta control shape of skew factor distribution
    higher max(abs(a,b))>1 -> less flat more peaked distribution
    a<b -> skewed below 0.5
    a>b -> skewed above 0.5
    so 5,1 means on average skew passed to ordinal feature generator will be centered on a/a+b~0.83
    and 1,5 means average skew will be ~0.16
    lower average skew means noisier data and harder clustering problem

    """

    if rng is None:
        rng = np.random.default_rng()

    if isinstance(sizes, str):
        assert sizes.lower() in [
            "equal",
            "random",
            "roughly_equal",
        ], "if specifying method sizes must be one of [equal, random, roughly_equal]"
        sizes = split_data_into_clusters(N, nclusters, method=sizes)
    elif sizes is None:
        sizes = split_data_into_clusters(N, nclusters, method="equal")

    assert sizes.shape[0] == nclusters, "nclusters and sizes must match"
    assert sizes.sum() == N, f"sizes must add up to N {sizes.sum()} != {N}"

    # N = sizes.sum()

    # generate skew distribution & sample
    rv = stats.beta(a=alpha, b=beta)
    skew_factors = rv.rvs(size=d, random_state=rng)

    # generate features
    X = np.zeros((N, d))
    for i in range(d):
        si = skew_factors[i]
        if si >= 0.5 and si <= 1:
            skew = (
                2 * ((nlevels - 1) / nlevels) * si + (2 - nlevels) / nlevels
            )  # map skew factor from [0.5, 1] to [1/nlevels, 1] i.e. si=0.5, skew=1/nlevels
            X[:, i] = mixed_categorical_cluster_feature(
                sizes, nlevels=nlevels, min_largest_cat=skew, rng=rng
            )
        elif si >= 0 and si < 0.5:
            skew = (
                2 * ((1 - nlevels) / nlevels) * si + 1
            )  # map skew factor from [0.5, 0] to [1/nlevels, 1] i.e si=0 -> skew = 1.0
            X[:, i] = single_categorical_feature(
                N, nlevels=nlevels, min_largest_cat=skew, rng=rng
            )
        else:
            raise ValueError("Error beta distribution ill defined")

    # generate labels
    y = np.zeros(N)
    total = 0
    for i, size in enumerate(sizes):
        y[total : total + size] = i
        total += size

    dataset = Bunch(y=y, X=X)
    if return_skew_factors:
        dataset["skew_factors"] = skew_factors

    return dataset
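
A minimal usage sketch (the import path is an assumption based on the source location shown above; the returned Bunch is accessed dict-style, as in the source):

import numpy as np
from simnetpy.datasets.distributions import mixed_categorical_clusters  # assumed import path

rng = np.random.default_rng(0)

# 300 samples, 10 categorical features, 3 equally sized clusters.
# alpha > beta pushes skew factors towards 1, i.e. a cleaner clustering problem.
data = mixed_categorical_clusters(N=300, d=10, nclusters=3, sizes="equal",
                                  alpha=5, beta=1, nlevels=5, rng=rng)

print(data["X"].shape)                     # (300, 10) matrix of category codes
print(np.bincount(data["y"].astype(int)))  # cluster sizes: [100 100 100]
print(data["skew_factors"])                # per-feature draws from Beta(5, 1)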

multivariate_guassian(N, center, std=1, rng=None)

Sample N points from a multivariate Gaussian with mean at center and covariance std*Identity (i.e. an isotropic, circular Gaussian). The dimension of the Gaussian is inferred from the user-passed center.

Parameters:

  • N (int) –

    number of points to sample

  • center (ndarray) –

    d-dimensional point

  • std (float, default: 1 ) –

    Per-axis scale of the distribution: the covariance matrix is std*Identity, so std acts as the per-axis variance. A full numpy covariance matrix may also be passed. Defaults to 1.

  • rng (default_rng, default: None ) –

    user seeded random number generator. Defaults to None.

Returns:

  • np.ndarray: N samples from the multi-dimensional Gaussian centered at center; an (N x d) matrix

Source code in src/simnetpy/datasets/distributions.py
def multivariate_guassian(N, center, std=1, rng=None):
    """Sample N point from a multivariate guassian with mean at center and
    Covariance of std*Identity (i.e. circular guassian). Dimensions of guassian inferred
    from user passed center.

    Args:
        N (int): number of points to sample
        center (np.ndarray): d-dimensional point
        std (float, optional): Standard deviation along any axis. Covariance is `std`*Identity Matrix. Defaults to 1.
        rng (np.random.default_rng, optional): user seeded random number generator. Defaults to None.

    Returns:
        np.ndarray: N samples from multi-dimensional guassian centered at `center`. (N x d) matrix
    """
    if rng is None:
        rng = np.random.default_rng()

    if not isinstance(std, np.ndarray):
        d = center.shape[0]
        COV = std * np.eye(d)
    else:
        COV = std

    X = rng.multivariate_normal(center, COV, size=N)
    return X
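
A minimal usage sketch (the import path is an assumption based on the source location shown above). Note that std is used directly as the per-axis variance:

import numpy as np
from simnetpy.datasets.distributions import multivariate_guassian  # assumed import path

rng = np.random.default_rng(7)
center = np.array([1.0, -2.0, 0.5])

# 500 samples from an isotropic Gaussian; covariance is 0.25 * identity
X = multivariate_guassian(500, center, std=0.25, rng=rng)
print(X.shape)         # (500, 3)
print(X.mean(axis=0))  # close to center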

multivariate_t(N, center, std=1, df=1, rng=None)

Sample N points from a multivariate t distribution with location at center and shape matrix std*Identity (i.e. an isotropic distribution). The dimension is inferred from the user-passed center.

Parameters:

  • N (int) –

    number of points to sample

  • center (ndarray) –

    d-dimensional point

  • std (float, default: 1 ) –

    Per-axis scale of the distribution: the shape matrix is std*Identity. A full numpy covariance (shape) matrix may also be passed. Defaults to 1.

  • df (float, default: 1 ) –

    Degrees of freedom of the distribution. Defaults to 1. If np.inf, the result is multivariate normal.

  • rng (default_rng, default: None ) –

    user seeded random number generator. Defaults to None.

Returns:

  • np.ndarray: N samples from the multivariate t distribution centered at center; an (N x d) matrix

Source code in src/simnetpy/datasets/distributions.py
def multivariate_t(N, center, std=1, df=1, rng=None):
    """Sample N point from a multivariate guassian with mean at center and
    Covariance of std*Identity (i.e. circular guassian). Dimensions of guassian inferred
    from user passed center.

    Args:
        N (int): number of points to sample
        center (np.ndarray): d-dimensional point
        std (float, optional): Standard deviation along any axis. Covariance is `std`*Identity Matrix. Defaults to 1.
                                Also accepts numpy covariance matrix
        df (float, optional): Degrees of freedom of the distribution. Defaults to 1. If np.inf results are multivariate normal.
        rng (np.random.default_rng, optional): user seeded random number generator. Defaults to None.

    Returns:
        np.ndarray: N samples from multi-dimensional guassian centered at `center`. (N x d) matrix
    """
    if rng is None:
        rng = np.random.default_rng()

    if not isinstance(std, np.ndarray):
        d = center.shape[0]
        COV = std * np.eye(d)
    else:
        COV = std

    # X = rng.multivariate_normal(center, COV, size=N)
    # dist = stats.multivariate_t(center, shape=COV, df=df, seed=rng)
    # X = dist.rvs(size=size)
    X = stats.multivariate_t.rvs(loc=center, shape=COV, df=df, size=N, random_state=rng)

    return X
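
A minimal usage sketch (the import path is an assumption based on the source location shown above), contrasting heavy-tailed and Gaussian limits of the same call:

import numpy as np
from simnetpy.datasets.distributions import multivariate_t  # assumed import path

rng = np.random.default_rng(7)
center = np.zeros(2)

# df=1 gives a heavy-tailed (multivariate Cauchy) cluster around the origin
X_heavy = multivariate_t(500, center, std=1.0, df=1, rng=rng)

# df=np.inf recovers the multivariate normal case
X_normal = multivariate_t(500, center, std=1.0, df=np.inf, rng=rng)

print(X_heavy.shape, X_normal.shape)  # (500, 2) (500, 2)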

distributions

uniform_sampler(d, std=1, rng=None)

Sample a random location in a d-dimensional box, with maximum value std and minimum value -std on each axis.

Parameters:

  • d (int) –

    number of dimensions

  • std (float, default: 1 ) –

    max abs value on any axis. Defaults to 1.

  • rng (default_rng, default: None ) –

    user seeded random number generator. Defaults to None.

Returns:

  • np.ndarray: random point in d dimensional space with max abs value of std on any dimension

Source code in src/simnetpy/datasets/distributions.py
def uniform_sampler(d, std=1, rng=None):
    """sample random location in d dimensional box with
    max value of std and min value -std on any axis

    Args:
        d (int): number of dimensions
        std (float, optional): max abs value on any axis. Defaults to 1.
        rng (np.random.default_rng, optional): user seeded random number generator. Defaults to None.

    Returns:
        np.ndarray: random point in d dimensional space with max abs value of std on any dimension
    """
    if rng is None:
        rng = np.random.default_rng()

    a = [1, -1]

    direction = rng.choice(a, size=d)
    r = rng.uniform(0, std, size=d)
    return direction * r
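
A minimal usage sketch (the import path is an assumption based on the source location shown above):

import numpy as np
from simnetpy.datasets.distributions import uniform_sampler  # assumed import path

rng = np.random.default_rng(3)

# random point in the 5-dimensional box [-2, 2]^5
x = uniform_sampler(5, std=2.0, rng=rng)
print(x)
print(np.all(np.abs(x) <= 2.0))  # True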

multi_mod