Clustering
clustering
leiden_single_component_clustering(g, obj_func='modularity', beta=0.01, nsamples=15)
Cluster a graph using the Leiden method. Event sampling is used to generate a sequence of candidate resolution parameters, and the resolution giving the maximum modularity is selected. Compared with a linear or logarithmic sequence, event sampling yields resolutions that cover a wider range of cluster counts: the samples are spread evenly from a single cluster containing every node to every node in its own cluster.
Parameters:
-
g
(Graph
) –graph to cluster
-
obj_func
(str
, default:'modularity'
) –objective function for the Leiden method. The other option is 'CPM'. Defaults to 'modularity'.
-
beta
(float
, default:0.01
) –randomness in the Leiden algorithm (only used in the refinement step). Defaults to 0.01.
-
nsamples
(int
, default:15
) –number of samples used to approximate the event curve in event sampling. Higher values are more accurate but can be very slow on large networks with many distinct events. Defaults to 15.
Returns:
-
–
np.ndarray: cluster labels
Source code in src/simnetpy/clustering/clustering.py
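The selection step described above (keep the candidate with the highest modularity) can be sketched in pure Python. This is only an illustration and does not call simnetpy or igraph: `modularity` and `best_partition` are hypothetical helper names, and the candidate labellings are written by hand rather than produced by Leiden at different resolutions.

```python
from collections import defaultdict


def modularity(edges, labels, resolution=1.0):
    """Modularity of an undirected, unweighted graph for one labelling."""
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # edges with both ends in the same cluster
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if labels[u] == labels[v]:
            internal[labels[u]] += 1
    total_degree = defaultdict(int)  # summed degree per cluster
    for node, k in degree.items():
        total_degree[labels[node]] += k
    return sum(
        internal[c] / m - resolution * (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


def best_partition(edges, candidates):
    """Return the candidate labelling with the highest modularity."""
    return max(candidates, key=lambda labels: modularity(edges, labels))


# Two triangles joined by a single edge (hypothetical toy graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
candidates = [
    [0, 0, 0, 0, 0, 0],  # all nodes in one cluster
    [0, 0, 0, 1, 1, 1],  # one cluster per triangle
    [0, 1, 2, 3, 4, 5],  # every node in its own cluster
]
best = best_partition(edges, candidates)
```

On this toy graph the one-cluster-per-triangle labelling scores highest, which is the behaviour the resolution search relies on.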
sbm_clustering(g, deg_corr=False, wait=10, nbreaks=2, beta=np.inf, mcmc_niter=10)
Fit stochastic block model on g.
Parameters:
-
g
(Graph or Graph
) –graph to cluster
-
deg_corr
(bool
, default:False
) –flag to use degree corrected model. Defaults to False.
-
wait
(int
, default:10
) –number of steps without a record-breaking event required to end mcmc_equilibrate. Defaults to 10.
-
nbreaks
(int
, default:2
) –number of times that wait steps need to pass consecutively, i.e. wait steps have to pass nbreaks times without a better state occurring. Defaults to 2.
-
beta
(float (or np.inf)
, default:inf
) –inverse temperature; controls the type of proposal moves. beta=1 concentrates on more plausible moves, while beta=np.inf performs completely random moves. Defaults to np.inf. For the exact details, see epsilon in equation 14 of https://arxiv.org/pdf/2003.07070.pdf and epsilon in equation 3 of https://journals.aps.org/pre/pdf/10.1103/PhysRevE.89.012804
-
mcmc_niter
(int
, default:10
) –number of Gibbs sweeps to use in mcmc_merge_split. Higher values give better proposal moves, i.e. the quality of each swap improves, but the time spent on each Monte Carlo step should be kept small. See the discussion on page 7 of https://arxiv.org/pdf/2003.07070.pdf (parameter M, used to estimate xhat). Defaults to 10.
Returns:
-
–
np.ndarray: cluster labels
Source code in src/simnetpy/clustering/clustering.py
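The interaction between wait and nbreaks can be illustrated with a small pure-Python sketch. It is not the underlying MCMC implementation: `steps_until_stop` is a hypothetical helper, the entropy sequence is invented, and the rule coded here is only the reading given in the parameter descriptions above (wait quiet steps, nbreaks times, with no better state in between).

```python
def steps_until_stop(entropies, wait=10, nbreaks=2):
    """Return the 1-based step at which the stopping rule fires, or None."""
    best = float("inf")
    quiet = 0   # consecutive steps without a record-breaking (lower) entropy
    breaks = 0  # completed quiet intervals of length `wait`
    for step, entropy in enumerate(entropies, start=1):
        if entropy < best:
            # A better state was found: start counting again from scratch.
            best = entropy
            quiet = 0
            breaks = 0
        else:
            quiet += 1
            if quiet == wait:
                breaks += 1
                quiet = 0
                if breaks == nbreaks:
                    return step
    return None


# Hypothetical entropy traces: three improvements, then a plateau.
plateau = [10, 9, 8, 8, 8, 8, 8, 8, 8]
# Same start, but one more improvement arrives at step 7.
late_improvement = [10, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7]

stop_plateau = steps_until_stop(plateau, wait=3, nbreaks=2)
stop_late = steps_until_stop(late_improvement, wait=3, nbreaks=2)
```

Under this reading, a late improvement resets both counters, so larger wait or nbreaks values trade run time for a lower chance of stopping before equilibration.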
spectral_clustering(g, laplacian='lrw', cmetric='cosine', max_clusters=50, min_clusters=2)
Perform spectral clustering on a graph using a Laplacian built from its adjacency matrix. First, a spectral decomposition of the Laplacian is computed. Then the eigengap is used to identify the number of clusters K. Finally, the nodes are clustered using K-means with the user-specified metric.
Parameters:
-
g
(Graph or ndarray
) –graph to cluster (also accepts adjacency matrices)
-
laplacian
(str
, default:'lrw'
) –Laplacian to use: random walk 'lrw', symmetric 'lsym', unnormalised 'l' or adjacency 'a'. Defaults to 'lrw'.
-
cmetric
(str
, default:'cosine'
) –metric to use in the K-means clustering step. Any scipy pdist string or callable is accepted. Defaults to 'cosine'.
-
max_clusters
(int
, default:50
) –max number of clusters to accept. Defaults to 50.
-
min_clusters
(int
, default:2
) –min number of clusters to accept. Defaults to 2 (min=1 may not work).
Returns:
-
–
np.ndarray: cluster labels
Source code in src/simnetpy/clustering/clustering.py
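The eigengap step can be sketched as follows. This is a simplified illustration, assuming the eigenvalues of a Laplacian are already available and that K is chosen at the largest difference between consecutive ascending eigenvalues; `eigengap_k` is a hypothetical helper, and the library may use a different gap criterion (for example a ratio rather than a difference).

```python
def eigengap_k(eigenvalues, min_clusters=2, max_clusters=50):
    """Pick K at the largest gap between consecutive sorted eigenvalues."""
    vals = sorted(eigenvalues)
    upper = min(max_clusters, len(vals) - 1)
    if upper < min_clusters:
        raise ValueError("not enough eigenvalues for the requested range")
    # The gap associated with K clusters is vals[K] - vals[K - 1], i.e. the
    # jump between the K-th and (K+1)-th smallest eigenvalues.
    return max(range(min_clusters, upper + 1),
               key=lambda k: vals[k] - vals[k - 1])


# Hypothetical spectrum: three near-zero eigenvalues, then a clear jump.
spectrum = [0.0, 0.01, 0.02, 0.9, 1.0, 1.1]
k = eigengap_k(spectrum)
k_capped = eigengap_k(spectrum, max_clusters=2)
```

The max_clusters and min_clusters bounds simply restrict which gaps are considered, which is why a tight upper bound can override an otherwise obvious gap.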
event_sampling
resolution_event_samples(g, n=15, plot=False, n_to_approx_beta=50, return_dict=False)
Identify a good sequence of values for the resolution parameter in Leiden clustering. Samples are evenly spaced over the different levels of the hierarchy, from all nodes in a single cluster to each node in its own cluster. Implementation of the method described in the paper doi: 10.1038/s41598-018-21352-7
Parameters:
-
g
(Graph
) –igraph network to be clustered
-
n
(int
, default:15
) –length of sequence of resolutions to find. Defaults to 15.
-
plot
(bool
, default:False
) –Flag to plot beta-gamma event sample curve (as in the paper). Defaults to False.
-
n_to_approx_beta
(int
, default:50
) –In large networks the number of events can be very large, and finding all beta events can take a long time. This is the number of samples used to approximate the event curve. A further 50 fixed samples are also used in the approximation, so the total is n_to_approx_beta+50. Defaults to 50.
-
return_dict
(bool
, default:False
) –Flag to return beta samples as well as gamma samples. Defaults to False.
Returns:
-
–
sequence of n resolution (gamma) samples; the beta samples are also returned when return_dict is True.
Source code in src/simnetpy/clustering/event_sampling.py
sample_events(Q, A, P, n_to_approx_beta=100)
Approximate the beta event curve. Speeds up computation for large networks with a large number of events.
Parameters:
-
Q
(ndarray
) –A/P matrix pre-calculated and in squareform (i.e. just triu entries)
-
A
(ndarray
) –Adjacency matrix
-
P
(ndarray
) –expected adjacency matrix (configuration model assumed - k_i*k_j/2m where k_i is degree of ith node and m is number of edges in network)
-
n_to_approx_beta
(int
, default:100
) –number of samples used to approximate the event curve. A further 50 fixed samples are also used in the approximation, so the total is n_to_approx_beta+50. Defaults to 100.
Returns:
-
–
events, bevents: sequence of precalculated gamma and beta events.
Source code in src/simnetpy/clustering/event_sampling.py
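The inputs P and Q described above can be built by hand for a small graph. The sketch below is pure Python and stores only the upper-triangle pairs in dictionaries; `expected_and_ratio` is a hypothetical helper, and the way the library actually lays out Q (a condensed array) is not reproduced here.

```python
from itertools import combinations


def expected_and_ratio(edges, n):
    """Return P (configuration-model expectation) and Q = A / P per node pair."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    two_m = 2 * len(edges)
    edge_set = {frozenset(e) for e in edges}
    P, Q = {}, {}
    for i, j in combinations(range(n), 2):  # upper triangle only
        a_ij = 1.0 if frozenset((i, j)) in edge_set else 0.0
        P[i, j] = degree[i] * degree[j] / two_m  # k_i * k_j / 2m
        Q[i, j] = a_ij / P[i, j] if P[i, j] > 0 else 0.0
    return P, Q


# Two triangles joined by a single edge (hypothetical toy graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
P, Q = expected_and_ratio(edges, n=6)
```

Edges between low-degree nodes get the largest ratios, so, assuming the gamma events correspond to the entries of Q, those edges are the last to stop favouring a merge as the resolution increases.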
quality
cluster_quality(g, y)
Return stats describing cluster quality:
- conductance
- modularity
- triad participation ratio
and "community goodness":
- separability
- density
- clustering coefficient
Note: the ideas of cluster quality and community goodness are taken from https://dl.acm.org/doi/abs/10.1145/2350190.2350193
Parameters:
-
g
(Graph
) –graph whose clustering is evaluated
-
y
(ndarray
) –cluster labels for the nodes of g
Source code in src/simnetpy/clustering/quality.py
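Three of the listed statistics can be written down directly for a single cluster. The sketch below uses the per-cluster definitions from the paper referenced above (internal edges m_S, boundary edges c_S, cluster size n_S); `cluster_stats` is a hypothetical helper, and simnetpy may aggregate or define these quantities differently.

```python
def cluster_stats(edges, labels, cluster):
    """Conductance, separability and density of one cluster."""
    members = {node for node, lab in enumerate(labels) if lab == cluster}
    # m_S: edges with both endpoints inside the cluster.
    internal = sum(1 for u, v in edges if u in members and v in members)
    # c_S: edges with exactly one endpoint inside the cluster.
    boundary = sum(1 for u, v in edges if (u in members) != (v in members))
    n = len(members)
    possible = n * (n - 1) / 2  # number of node pairs inside the cluster
    return {
        "conductance": (boundary / (2 * internal + boundary)
                        if (internal + boundary) else 0.0),
        "separability": internal / boundary if boundary else float("inf"),
        "density": internal / possible if possible else 0.0,
    }


# Two triangles joined by a single edge (hypothetical toy graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = [0, 0, 0, 1, 1, 1]
stats = cluster_stats(edges, labels, cluster=0)
```

Low conductance together with high separability and density indicates a cluster that is well connected internally and weakly connected to the rest of the graph.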
triangle_participation_ratio(g)
Calculate the triangle participation ratio (TPR) for a graph. TPR is the fraction of nodes in g that belong to a triad.
Parameters:
-
g
(Graph
) –graph in which to find triads
Returns:
-
float
–fraction of nodes in a triad
Source code in src/simnetpy/clustering/quality.py
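The definition above can be checked by hand on small graphs. The following pure-Python sketch works on an edge list rather than a Graph object; `tpr_sketch` is a hypothetical name used to avoid confusion with the library function.

```python
from itertools import combinations


def tpr_sketch(edges, n):
    """Fraction of the n nodes that take part in at least one triangle."""
    neighbours = [set() for _ in range(n)]
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    # A node is in a triangle if any two of its neighbours are adjacent.
    in_triangle = sum(
        1 for node in range(n)
        if any(b in neighbours[a]
               for a, b in combinations(neighbours[node], 2))
    )
    return in_triangle / n if n else 0.0


# Two triangles joined by an edge, then the same graph with a pendant node.
triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
tpr_full = tpr_sketch(triangles, n=6)
tpr_pendant = tpr_sketch(triangles + [(5, 6)], n=7)
tpr_path = tpr_sketch([(0, 1), (1, 2)], n=3)
```

The pendant node lowers the ratio because it has a single neighbour and so cannot close a triangle.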
spectral
Extension of the spectral clusterer class developed in https://pypi.org/project/spectralcluster/. Allows adjacency matrices to be passed. The laplacian and eigengap parameters are adjusted to accept strings. Much of the functionality relating to constraint options and refinement preprocessing has been removed.
Spectral
Bases: SpectralClusterer
Source code in src/simnetpy/clustering/spectral.py
predict_from_aff(X=None, S=None, metric='euclidean', norm=True)
Perform spectral clustering on an affinity matrix. Note that larger values in the affinity matrix should indicate higher similarity, i.e. the opposite of a distance.
Parameters:
-
X
(ndarray
, default:None
) –embedding/feature matrix, n x d.
-
S
(ndarray
, default:None
) –pairwise affinity (similarity) matrix, n x n.
-
metric
(str
, default:'euclidean'
) –metric to be used in the pairwise distance.
-
norm
(bool
, default:True
) –whether to normalise the affinity matrix to have mean 0 and standard deviation 1.
Returns:
-
ndarray
–labels, array of shape (n_samples,)