Datasets
cluster_centers(n, d, lower=1, higher=2, rng=None, init=None)
Sample n points in d-dimensional space. Every point is at least lower away from all other points; the nominal upper distance of ~2*higher is only approximate (see the note below). The initial center is (0, 0, ..., 0) unless otherwise specified. Sampling is sequential: the next center is proposed, rejected if it is too close to any existing center, and resampled until accepted. If sampling a large number of points takes a long time, increase higher.
Note: 2*higher is not an actual upper bound on distance. higher controls the size of the box around the previous center from which a proposal point is sampled. Each center is sampled from a box with sides of length 2*higher.
Parameters:
- n (int) – number of points to generate
- d (int) – number of dimensions
- lower (float, default: 1) – lower bound of distances to accept. All points will be at least lower away from each other.
- higher (float, default: 2) – size of the box around the previous center to sample from.
- rng (np.random.default_rng, default: None) – user seeded random number generator.
- init (np.ndarray, default: None) – location of the first sample. Note that the returned points are shuffled, but if init is None at least one of them will be the origin (0, 0, ..., 0).
Returns:
- list – n randomly sampled points in d-dimensional space, all at least lower away from each other.
Source code in src/simnetpy/datasets/distributions.py
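The sequential propose-and-reject scheme described for cluster_centers can be sketched as follows. This is a minimal illustration written from the description alone, not the library's implementation: the function name sketch_cluster_centers is hypothetical, and the choices of a uniform proposal and a Euclidean distance check are assumptions.

```python
import numpy as np


def sketch_cluster_centers(n, d, lower=1.0, higher=2.0, rng=None, init=None):
    """Hypothetical sketch of the sampling scheme described above."""
    rng = np.random.default_rng() if rng is None else rng
    # The first center is the origin unless the caller supplies one.
    first = np.zeros(d) if init is None else np.asarray(init, dtype=float)
    centers = [first]
    while len(centers) < n:
        prev = centers[-1]
        # Propose a point inside a box with sides of length 2*higher
        # centered on the previously accepted center (assumed uniform).
        proposal = prev + rng.uniform(-higher, higher, size=d)
        # Reject the proposal if it is closer than `lower` to any center.
        dists = np.linalg.norm(np.asarray(centers) - proposal, axis=1)
        if np.all(dists >= lower):
            centers.append(proposal)
    # Shuffle so the initial center is not always the first element.
    order = rng.permutation(n)
    return [centers[i] for i in order]


centers = sketch_cluster_centers(5, 2, rng=np.random.default_rng(0))
```

Because each proposal is drawn around the previous center only, a larger higher gives proposals more room to land away from existing centers, which is consistent with the advice to increase higher when sampling is slow.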
mixed_categorical_clusters(N, d, nclusters=3, sizes='equal', alpha=5, beta=1, nlevels=5, return_skew_factors=True, rng=None)
alpha and beta (a and b below) control the shape of the skew factor distribution:
- higher values, max(abs(a), abs(b)) > 1 -> less flat, more peaked distribution
- a<b -> skewed below 0.5
- a>b -> skewed above 0.5
So 5,1 means that on average the skew passed to the ordinal feature generator will be centered on a/(a+b) ~ 0.83, and 1,5 means the average skew will be ~0.17. A lower average skew means noisier data and a harder clustering problem.
Source code in src/simnetpy/datasets/distributions.py
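The quoted averages can be checked numerically. The sketch below assumes the skew factors are drawn from a Beta(alpha, beta) distribution, which is what the a/(a+b) mean suggests but is not stated outright here; the helper mean_skew is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def mean_skew(alpha, beta, size=100_000):
    # Assumption: skew factors are Beta(alpha, beta) draws, whose
    # theoretical mean is alpha / (alpha + beta).
    return rng.beta(alpha, beta, size=size).mean()


high = mean_skew(5, 1)  # theoretical mean 5/6, about 0.83
low = mean_skew(1, 5)   # theoretical mean 1/6, about 0.17
```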
multivariate_guassian(N, center, std=1, rng=None)
Sample N points from a multivariate Gaussian with mean at center and covariance of std*Identity (i.e. a circular Gaussian). The dimension of the Gaussian is inferred from the center passed by the user.
Parameters:
- N (int) – number of points to sample
- center (ndarray) – d-dimensional point
- std (float, default: 1) – standard deviation along any axis. Covariance is std*Identity Matrix.
- rng (default_rng, default: None) – user seeded random number generator.
Returns:
- np.ndarray – N samples from a multi-dimensional Gaussian centered at center. (N x d) matrix
Source code in src/simnetpy/datasets/distributions.py
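A minimal sketch of the documented behaviour, built on NumPy's Generator.multivariate_normal. The name sketch_multivariate_gaussian is hypothetical and the body is an assumption based on the description, not the library source; the covariance is built as std * Identity exactly as the documentation states.

```python
import numpy as np


def sketch_multivariate_gaussian(N, center, std=1.0, rng=None):
    """Hypothetical sketch: N draws from a circular Gaussian at `center`."""
    rng = np.random.default_rng() if rng is None else rng
    center = np.asarray(center, dtype=float)
    d = center.shape[0]        # dimension inferred from the center
    cov = std * np.eye(d)      # as documented: covariance is std * Identity
    return rng.multivariate_normal(mean=center, cov=cov, size=N)


X = sketch_multivariate_gaussian(
    100, np.array([0.0, 5.0, -2.0]), rng=np.random.default_rng(0)
)
print(X.shape)  # (100, 3)
```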
multivariate_t(N, center, std=1, df=1, rng=None)
Sample N points from a multivariate t distribution centered at center with covariance of std*Identity (i.e. a circular distribution). The dimension of the distribution is inferred from the center passed by the user.
Parameters:
- N (int) – number of points to sample
- center (ndarray) – d-dimensional point
- std (float, default: 1) – standard deviation along any axis. Covariance is std*Identity Matrix. Also accepts a numpy covariance matrix.
- df (float, default: 1) – degrees of freedom of the distribution. If np.inf, results are multivariate normal.
- rng (default_rng, default: None) – user seeded random number generator.
Returns:
- np.ndarray – N samples from a multi-dimensional t distribution centered at center. (N x d) matrix
Source code in src/simnetpy/datasets/distributions.py
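One standard way to draw multivariate t samples is to scale Gaussian draws by a chi-square variable. The sketch below uses that construction with NumPy only; it is an illustration under assumptions, not necessarily how multivariate_t is implemented, and the name sketch_multivariate_t is hypothetical.

```python
import numpy as np


def sketch_multivariate_t(N, center, std=1.0, df=1.0, rng=None):
    """Hypothetical sketch: multivariate t draws via Gaussian / chi-square."""
    rng = np.random.default_rng() if rng is None else rng
    center = np.asarray(center, dtype=float)
    d = center.shape[0]
    std = np.asarray(std, dtype=float)
    # A scalar std gives std * Identity; an array is used as the matrix itself.
    cov = std * np.eye(d) if std.ndim == 0 else std
    z = rng.multivariate_normal(mean=np.zeros(d), cov=cov, size=N)
    if np.isinf(df):
        # Infinite degrees of freedom reduce to the multivariate normal.
        return center + z
    # Each sample is divided by sqrt(W / df) with W ~ chi-square(df).
    w = rng.chisquare(df, size=N) / df
    return center + z / np.sqrt(w)[:, None]


T = sketch_multivariate_t(50, np.zeros(4), df=3, rng=np.random.default_rng(0))
```

Small df values give heavy tails, so occasional samples land far from center; as df grows the draws approach those of the Gaussian sampler.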
distributions
uniform_sampler(d, std=1, rng=None)
Sample a random location in a d-dimensional box with a maximum value of std and a minimum value of -std on any axis.
Parameters:
- d (int) – number of dimensions
- std (float, default: 1) – max abs value on any axis.
- rng (default_rng, default: None) – user seeded random number generator.
Returns:
- np.ndarray – random point in d-dimensional space with max abs value of std on any dimension
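The box sampling described above amounts to one uniform draw per axis. A minimal sketch, assuming independent uniform draws on [-std, std); the name sketch_uniform_sampler is hypothetical and this is not the library source.

```python
import numpy as np


def sketch_uniform_sampler(d, std=1.0, rng=None):
    """Hypothetical sketch: one random point in the box [-std, std)^d."""
    rng = np.random.default_rng() if rng is None else rng
    # Each coordinate is drawn independently and uniformly.
    return rng.uniform(-std, std, size=d)


point = sketch_uniform_sampler(4, std=2.5, rng=np.random.default_rng(0))
```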