Clustering

clustering

leiden_single_component_clustering(g, obj_func='modularity', beta=0.01, nsamples=15)

Cluster a graph using the Leiden method. Event sampling is used to identify the resolution parameter; the resolution with maximum modularity is selected. Event sampling gives a sequence of resolution parameters covering a wider range of cluster numbers than a linear or logarithmic sequence, evenly spanning from a single cluster to every node in its own cluster.

Parameters:

  • g (Graph) –

    graph to cluster

  • obj_func (str, default: 'modularity' ) –

objective function for the Leiden method. Defaults to 'modularity'; the other option is 'CPM'.

  • beta (float, default: 0.01 ) –

    randomness in leiden algorithm (only used in refinement step). Defaults to 0.01.

  • nsamples (int, default: 15 ) –

number of samples used to approximate the event curve in event sampling. Defaults to 15. Higher values are more accurate but very slow on large networks with many distinct events.

Returns:

  • np.ndarray: cluster labels

Source code in src/simnetpy/clustering/clustering.py
def leiden_single_component_clustering(g, obj_func='modularity', beta=0.01, nsamples=15):
    """cluster graph using leiden method. Uses event sampling to identify resolution parameter.
    resolution with maximum modularity is selected. Event sampling gives sequence of resolution 
    paramater that have more range of cluster numbers than a linear or logarithmic sequence. evenly covers
    one single cluster to all nodes are a separate cluster.  

    Args:
        g (ig.Graph): graph to cluster
        obj_func (str, optional): function for leiden method. Defaults to 'modularity'. Other option CPM
        beta (float, optional): randomness in leiden algorithm (only used in refinement step). Defaults to 0.01.
        nsamples (int, optional): number of samples to use to approximate event curve in event sampling. Defaults to 50.
                                    Higher more accurate but incredibly slow in large networks with many different events.

    Returns:
        np.ndarray: cluster labels
    """    
    gammasq = resolution_event_samples(g,n=nsamples)
    y_pred, gamma_max = find_max_mod_gamma_clstr(gammasq, g, obj_func, beta)
    return y_pred
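
A minimal usage sketch. The import path is assumed from the source location above, and any connected igraph network will do:

import igraph as ig
from simnetpy.clustering import leiden_single_component_clustering  # assumed import path

g = ig.Graph.Famous("Zachary")  # small connected test graph
labels = leiden_single_component_clustering(g, obj_func='modularity', nsamples=15)
print(labels)  # one integer cluster label per node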

sbm_clustering(g, deg_corr=False, wait=10, nbreaks=2, beta=np.inf, mcmc_niter=10)

Fit stochastic block model on g.

Parameters:

  • g (gt.Graph or ig.Graph) –

    graph to cluster

  • deg_corr (bool, default: False ) –

    flag to use degree corrected model. Defaults to False.

  • wait (int, default: 10 ) –

    number of steps required without record breaking event to end mcmc_equilibrate. Defaults to 10.

  • nbreaks (int, default: 2 ) –

number of times `wait` steps need to happen consecutively, i.e. wait steps have to happen nbreaks times without a better state occurring. Defaults to 2.

  • beta (float (or np.inf), default: inf ) –

inverse temperature; controls the type of proposal moves. beta=1 concentrates on more plausible moves, beta=np.inf performs completely random moves. Defaults to np.inf. Exact detail: epsilon in equation 14) of https://arxiv.org/pdf/2003.07070.pdf & epsilon in equation 3) of https://journals.aps.org/pre/pdf/10.1103/PhysRevE.89.012804

  • mcmc_niter (int, default: 10 ) –

number of Gibbs sweeps to use in mcmc_merge_split. Defaults to 10. Higher values give better proposal moves, i.e. the quality of each move improves, but the time spent on each Monte Carlo step should be minimised. See the discussion on page 7 of https://arxiv.org/pdf/2003.07070.pdf (the parameter M used to estimate xhat).

Returns:

  • np.ndarray: cluster labels

Source code in src/simnetpy/clustering/clustering.py
def sbm_clustering(g, deg_corr=False, wait=10, nbreaks=2, beta=np.inf, mcmc_niter=10):
    """Fit stochastic block model on g. 

    Args:
        g (gt.Graph or ig.Graph): graph to cluster
        deg_corr (bool, optional): flag to use degree corrected model. Defaults to False.
        wait (int, optional): number of steps required without record breaking event to end mcmc_equilibrate.
                                Defaults to 10.
        nbreaks (int, optional): number of times `wait` steps need to happen consecutively, i.e. wait steps
                                have to happen nbreaks times without a better state occurring. Defaults to 2.
        beta (float (or np.inf), optional): inverse temperature; controls the type of proposal moves. beta=1 concentrates 
                    on more plausible moves, beta=np.inf performs completely random moves. Defaults to np.inf.
                    exact detail: epsilon in equation 14) https://arxiv.org/pdf/2003.07070.pdf
                                & epsilon in equation 3) https://journals.aps.org/pre/pdf/10.1103/PhysRevE.89.012804
        mcmc_niter (int, optional): number of Gibbs sweeps to use in mcmc_merge_split. Defaults to 10.
                                Higher values give better proposal moves, i.e. the quality of each move improves, but 
                                the time spent on each Monte Carlo step should be minimised.
                                Discussion on page 7 of https://arxiv.org/pdf/2003.07070.pdf (parameter M used to estimate xhat)

    Returns:
        np.ndarray: cluster labels
    """
    if gt == "NOT INSTALLED":
        raise ImportError("The graph-tool module is not installed. SBM clustering unavailable. Use\n\t conda install "+\
                "-c conda-forge graph-tool\nto install in a conda environment."+\
                " See https://git.skewed.de/count0/graph-tool/-/wikis/installation-instructions "+\
                "for help.")

    if not isinstance(g, gt.Graph):
        g = g.to_graph_tool()

    state = gt.minimize_blockmodel_dl(g, state_args={'deg_corr':deg_corr})
    gt.mcmc_equilibrate(state, wait=wait, nbreaks=nbreaks, mcmc_args=dict(niter=mcmc_niter, beta=beta))

    y_pred = state.get_blocks().get_array() #state.b.a
    y_pred = relabel(y_pred)
    return y_pred
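
A usage sketch, assuming graph-tool is installed and the same import path convention; igraph graphs are converted to graph-tool inside the function:

import igraph as ig
from simnetpy.clustering import sbm_clustering  # assumed import path; requires graph-tool

g = ig.Graph.Famous("Zachary")
labels = sbm_clustering(g, deg_corr=True)  # degree-corrected SBM
print(labels)  # contiguous cluster labels (relabelled)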

spectral_clustering(g, laplacian='lrw', cmetric='cosine', max_clusters=50, min_clusters=2)

Perform spectral clustering on a laplacian built from the graph's adjacency matrix. First a spectral decomposition of the laplacian is computed, then the eigengap is used to identify the number of clusters K. Finally, nodes are clustered using K-means with a user-specified metric.

Parameters:

  • g (Graph or ndarray) –

    graph to cluster (also accepts adjacency matrices)

  • laplacian (str, default: 'lrw' ) –

Select the laplacian: random walk 'lrw', symmetric 'lsym', unnormalised 'l', or adjacency 'a'. Defaults to 'lrw'.

  • cmetric (str, default: 'cosine' ) –

    metric to use in Kmeans cluster step. Any scipy pdist string or callable accepted. Defaults to 'cosine'.

  • max_clusters (int, default: 50 ) –

    max number of clusters to accept. Defaults to 50.

  • min_clusters (int, default: 2 ) –

    min number of clusters to accept. Defaults to 2 (min=1 may not work).

Returns:

  • np.ndarray: cluster labels

Source code in src/simnetpy/clustering/clustering.py
def spectral_clustering(g, laplacian='lrw', cmetric='cosine', max_clusters=50, min_clusters=2):
    """perform spectral clustering on graph on laplacian created from adjacency matrix. First Spectral decomp on laplacian. 
    Then uses eigengap to identify number of clusters K. Finally, clusters using K-means with user specified metric.

    cmet

    Args:
        g (ig.Graph or np.ndarray): graph to cluster (also accepts adjacency matrices)
        laplacian (str, optional): Select laplacian from random walk `lrw`, symmetric `lsym`,
                    unnormalised `l` or adjacency `a`. Defaults to 'lrw'.
        cmetric (str, optional): metric to use in Kmeans cluster step. Any scipy pdist string or callable 
                                accepted. Defaults to 'cosine'.
        max_clusters (int, optional): max number of clusters to accept. Defaults to 50.
        min_clusters (int, optional): min number of clusters to accept. Defaults to 2 (min=1 may not work).

    Returns:
        np.ndarray: cluster labels
    """
    if not isinstance(g, np.ndarray):
        A = g.get_adjacency_sparse().toarray()
    else:
        A = g

    spect = Spectral(laplacian_type=laplacian, custom_dist=cmetric, min_clusters= min_clusters, max_clusters=max_clusters)
    y_pred = spect.predict_from_adj(A)
    return y_pred
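
A usage sketch showing both accepted input types (import path assumed):

import igraph as ig
from simnetpy.clustering import spectral_clustering  # assumed import path

g = ig.Graph.Famous("Zachary")
labels = spectral_clustering(g, laplacian='lrw', cmetric='cosine')

# adjacency matrices are also accepted directly
A = g.get_adjacency_sparse().toarray()
labels_from_adj = spectral_clustering(A, laplacian='lsym')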

event_sampling

resolution_event_samples(g, n=15, plot=False, n_to_approx_beta=50, return_dict=False)

Identify a good sequence of values for the resolution parameter in Leiden clustering. Samples are evenly spaced over the different levels of hierarchy, from all nodes in a single cluster to each node in an individual cluster. Implementation of the method described in the paper doi: 10.1038/s41598-018-21352-7

Parameters:

  • g (Graph) –

    igraph network to be clustered

  • n (int, default: 15 ) –

    length of sequence of resolutions to find. Defaults to 15.

  • plot (bool, default: False ) –

    Flag to plot beta-gamma event sample curve (as in the paper). Defaults to False.

  • n_to_approx_beta (int, default: 50 ) –

In large networks the number of events can be very large and finding all beta events can take a long time. This is the number of samples used to approximate the event curve. Defaults to 50. Note: 50 additional fixed samples are used in the approximation, so the total is n_to_approx_beta + 50.

  • return_dict (bool, default: False ) –

    Flag to return beta samples as well as gamma samples. Defaults to False.

Returns:

  • ndarray –

    sequence of n resolution (gamma) samples. If return_dict is True, a dict with keys 'gamma_samples' and 'beta_samples' is returned instead.

Source code in src/simnetpy/clustering/event_sampling.py
def resolution_event_samples(g, n=15, plot=False, n_to_approx_beta=50, return_dict=False):
    """Function to identify a good sequence of resolutions for resolution parameter in leiden clustering.
    Samples are evenly spaced over the difference levels of hierarchy. From all nodes in single cluster to each node in individual cluster.
    Implementation of the method described in paper doi: 10.1038/s41598-018-21352-7

    Args:
        g (ig.Graph): igraph network to be clustered
        n (int, optional): length of sequence of resolutions to find. Defaults to 15.
        plot (bool, optional): Flag to plot beta-gamma event sample curve (as in the paper). Defaults to False.
        n_to_approx_beta (int, optional): In large networks, the number of events can be very large. Finding all beta events can take a long time. 
                                This is the number of samples used to appoximate the event curve. Defaults to 50.
                                Note: 50 other samples are used in the approximation although these are kept fixed so total is n_to_approx_beta+50.

        return_dict (bool, optional): Flag to return beta samples as well as gamma samples. Defaults to False.

    Returns:
        _type_: _description_
    """

    A = g.get_adjacency_sparse()
    A = A.toarray()
    np.fill_diagonal(A,0) # don't want self loops

    D = A.sum(axis=0)


    P = np.outer(D,D)/(D.sum())
    np.fill_diagonal(P, 0)

    P = squareform(P)
    A = squareform(A)

    Q = A/P
    ymax = Q.max()
    ymin = find_ymin(g)

    Bmin = betaf(ymin, A, P)
    Bmax = betaf(ymax, A, P)

    events, bevents = sample_events(Q, A, P, n_to_approx_beta=n_to_approx_beta)

    beta_samples = np.linspace(Bmin,Bmax, num=n, endpoint=False)
    samples = np.array([gammaf(B, A, P, events, bevents) for B in beta_samples])

    if plot:
        yseq = np.logspace(np.log10(ymin), np.log10(ymax), num=n_to_approx_beta)
        bseq = np.array([betaf(yy,A,P) for yy in yseq])

        fig = plot_beta_curve(samples, beta_samples, yseq, bseq)

    if return_dict:
        return {'gamma_samples': samples, 'beta_samples': beta_samples}
    else:
        return samples
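
A sketch pairing the sampled resolutions with igraph's own Leiden implementation (import path assumed; community_leiden is standard igraph API, where the keyword is resolution in python-igraph 0.10+ and resolution_parameter in older versions):

import igraph as ig
from simnetpy.clustering.event_sampling import resolution_event_samples  # assumed import path

g = ig.Graph.Famous("Zachary")
gammas = resolution_event_samples(g, n=15)
# each sample can be used as the resolution parameter of a leiden run
partitions = [g.community_leiden(objective_function='modularity', resolution=gamma)
              for gamma in gammas]
best = max(partitions, key=lambda p: g.modularity(p.membership))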

sample_events(Q, A, P, n_to_approx_beta=100)

Function to approximate beta event curve. Speeds computation for large networks with a large number of events.

Parameters:

  • Q (ndarray) –

    A/P matrix pre-calculated and in squareform (i.e. just triu entries)

  • A (ndarray) –

    Adjacency matrix

  • P (ndarray) –

    expected adjacency matrix (configuration model assumed - k_i*k_j/2m where k_i is degree of ith node and m is number of edges in network)

  • n_to_approx_beta (int, default: 100 ) –

The number of samples used to approximate the event curve. Defaults to 100. Note: 50 additional fixed samples are used in the approximation, so the total is n_to_approx_beta + 50.

Returns:

  • events, bevents: sequence of precalculated gamma and beta events.

Source code in src/simnetpy/clustering/event_sampling.py
def sample_events(Q, A, P, n_to_approx_beta=100):
    """
    Function to approximate beta event curve. Speeds computation for large networks with a large number of events.

    Args:
        Q (np.ndarray): A/P matrix pre-calculated and in squareform (i.e. just triu entries)
        A (np.ndarray): Adjacency matrix
        P (np.ndarray): expected adjacency matrix 
                (configuration model assumed - k_i*k_j/2m where k_i is degree of ith node and m is number of edges in network)
        n_to_approx_beta (int, optional): the number of samples used to approximate the event curve. Defaults to 100.
                                Note: 50 additional fixed samples are used in the approximation,
                                so the total is n_to_approx_beta + 50.

    Returns:
        events, bevents: sequence of precalculated gamma and beta events.
    """
    # Find unique events
    events = np.unique(Q)[1:] # ignore 0 

    # Event curves are very skewed, with a large number of events occurring in the highest 50%.
    # This is also the area of least interest, as most communities there are single or isolated nodes.
    # We look at each percentile between 50 and 100, i.e. 50%, 51%, 52%, ...
    qq = np.linspace(0.5,1, num=51)
    upper_vals = np.array([np.quantile(events,q) for q in qq])

    # we only want to approximate the lower events with nsample points
    mid = np.median(events)
    events = events[events < mid]


    step = events.shape[0] // n_to_approx_beta
    # if there are fewer than n_to_approx_beta events, step=0, which raises an error.
    # use a step size of 1 in this case; equivalent to n_to_approx_beta = events.shape[0]
    if not step:
        step = 1
    # sample an event after every step events (step dictated by n_to_approx_beta)
    events = events[::step]

    # calculate beta value for each event
    events = np.concatenate((events, upper_vals))
    bevents = np.array([betaf(e, A, P) for e in events])

    return events, bevents
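
For reference, a sketch of how the inputs are constructed before the call, mirroring resolution_event_samples above (import path assumed):

import numpy as np
import igraph as ig
from scipy.spatial.distance import squareform
from simnetpy.clustering.event_sampling import sample_events  # assumed import path

g = ig.Graph.Famous("Zachary")
A = g.get_adjacency_sparse().toarray().astype(float)
np.fill_diagonal(A, 0)                # no self loops
D = A.sum(axis=0)                     # degree sequence
P = np.outer(D, D) / D.sum()          # configuration-model expectation k_i*k_j/2m
np.fill_diagonal(P, 0)
A, P = squareform(A), squareform(P)   # keep only upper-triangular entries
Q = A / P
events, bevents = sample_events(Q, A, P, n_to_approx_beta=100)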

quality

cluster_quality(g, y)

Return stats describing cluster quality (conductance, modularity, triad participation ratio) and "community goodness" (separability, density, clustering coefficient). Note: the ideas of cluster quality and community goodness are taken from https://dl.acm.org/doi/abs/10.1145/2350190.2350193

Parameters:

  • g (Graph) –

    graph whose clustering is evaluated

  • y (ndarray) –

    cluster labels, one per node

Returns:

  • dict: modularity ('mod'), conductance ('cond'), triangle participation ratio ('tpr'), separability ('sep'), density ('density') and clustering coefficient ('cc')

Source code in src/simnetpy/clustering/quality.py
def cluster_quality(g, y):
    """Return stats describing cluster quality 
    - conductance
    - modularity
    - triad participation ratio
    and "community goodness"
    - separability
    - density
    - clustering coefficient
    note: ideas of cluster quality and community goodness 
    taken from https://dl.acm.org/doi/abs/10.1145/2350190.2350193

    Args:
        g (ig.Graph): graph whose clustering is evaluated
        y (np.ndarray): cluster labels, one per node

    Returns:
        dict: modularity ('mod'), conductance ('cond'), triangle participation ratio ('tpr'),
            separability ('sep'), density ('density') and clustering coefficient ('cc')
    """
    tpr = avg_triangle_participation_ratio(g, y)
    mod = g.modularity(y)
    cond = avg_conductance(g, y)
    dens = avg_density(g, y)
    sep = avg_separability(g, y)
    cc = avg_clustering(g, y)

    ddict= {'mod':mod, 'cond':cond, 'tpr':tpr,
            'sep':sep, 'density':dens, 'cc':cc}
    return ddict
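
A usage sketch combining the quality stats with one of the clustering functions above (import paths assumed):

import igraph as ig
from simnetpy.clustering import leiden_single_component_clustering  # assumed import path
from simnetpy.clustering.quality import cluster_quality             # assumed import path

g = ig.Graph.Famous("Zachary")
y = leiden_single_component_clustering(g)
stats = cluster_quality(g, y)
print(stats)  # {'mod': ..., 'cond': ..., 'tpr': ..., 'sep': ..., 'density': ..., 'cc': ...}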

triangle_participation_ratio(g)

Calculate the triangle participation ratio (TPR) for a graph. TPR is the fraction of nodes in g that belong to a triad.

Parameters:

  • g (Graph) –

graph to analyse

Returns:

  • float

    fraction of nodes in a triad

Source code in src/simnetpy/clustering/quality.py
def triangle_participation_ratio(g):
    """calculate triad particpant ratio for a graph.
    TPR is the fraction of nodes in g that belong in a triad.

    Args:
        g (ig.Graph): graph to find 

    Returns:
        float: fraction of nodes in a triad
    """
    vintriad = 0
    for v in g.vs:
        v_nbrs = g.neighbors(v.index)
        vs = set(v_nbrs) - {v.index}
        gen_degree = Counter(len(vs & (set(g.neighbors(w)) - {w})) for w in vs)
        ntriangle = sum(k*val for k, val in gen_degree.items())
        if ntriangle:
            vintriad +=1
    tp_ratio = vintriad/g.vcount()
    return tp_ratio
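
A small worked example: a 4-node path has no triangles, and adding one chord puts three of the four nodes into a triad (import path assumed):

import igraph as ig
from simnetpy.clustering.quality import triangle_participation_ratio  # assumed import path

g = ig.Graph(edges=[(0, 1), (1, 2), (2, 3)])  # path graph: no triangles
print(triangle_participation_ratio(g))        # 0.0
g.add_edge(0, 2)                              # creates triangle 0-1-2
print(triangle_participation_ratio(g))        # 0.75: nodes 0, 1 and 2 are in a triad; node 3 is not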

spectral

Extension of the spectral clustering class developed in https://pypi.org/project/spectralcluster/. Allows adjacency matrices to be passed, adjusts the laplacian and eigengap parameters to accept strings, and removes a lot of functionality relating to constraint options and refinement preprocessing.

Spectral

Bases: SpectralClusterer

Source code in src/simnetpy/clustering/spectral.py
class Spectral(sc.SpectralClusterer):
    def __init__(self, 
                min_clusters=2,
                max_clusters=10, 
                laplacian_type='lrw', 
                stop_eigenvalue=1e-2, 
                custom_dist="cosine", 
                eigengap_type='ratio', **kwds):
        ltype = LTYPES[laplacian_type.lower()] # lookup laplacian type class, A/L/Lrw/Lsym (all get lower cased)
        egaptype = EIGENGAPCOMP[eigengap_type] # lookup eigengap comp type

        super().__init__(min_clusters=min_clusters,
                max_clusters=max_clusters, 
                laplacian_type=ltype, 
                stop_eigenvalue=stop_eigenvalue, 
                custom_dist=custom_dist, 
                eigengap_type=egaptype, **kwds)

    def predict_from_adj(self, A, laplacian_type=None):
        if laplacian_type is not None:
            self.laplacian_type = LTYPES[laplacian_type.lower()]

        n_samples = A.shape[0]

        constraint_matrix=None # ignore constraint for now. not clear what use is.
        eigenvectors, n_clusters, _ = self._compute_eigenvectors_ncluster(
            A, constraint_matrix)

        if self.min_clusters is not None:
            n_clusters = max(n_clusters, self.min_clusters)

        # Get spectral embeddings.
        spectral_embeddings = eigenvectors[:, :n_clusters]

        if self.row_wise_renorm:
            # Perform row wise re-normalization.
            rows_norm = np.linalg.norm(spectral_embeddings, axis=1, ord=2)
            spectral_embeddings = spectral_embeddings / np.reshape(
                rows_norm, (n_samples, 1))

        # Run clustering algorithm on spectral embeddings. This defaults
        # to customized K-means.
        # kmeans is raising future warning. rather than rewrite entire class. I added a simple warning filter.
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            labels = self.post_eigen_cluster_function(
                spectral_embeddings=spectral_embeddings,
                n_clusters=n_clusters,
                custom_dist=self.custom_dist,
                max_iter=self.max_iter)
        return labels


    def predict_from_aff(self, X=None, S=None, metric='euclidean', norm=True):
        """Perform spectral clustering on an affinity matrix. 
        Note affinity matrix should have form where larger values indicate higher similarity i.e. opposite of distance 
        Args:
            X:      embedding/feature matrix n x d np.ndarray
            S:    pairwise affinity (Similarity) matrix n x n np.ndarray. 
            metric: metric to be used in pairwise distance
            norm:   wether to normalise Aff to be 0 mean 1 std
        Returns:
            labels: numpy array of shape (n_samples,)
        """
        self.laplacian_type = LTYPES['a'] # set laplacian type as affinity
        if (X is None) and (S is None):
            raise ValueError('One of X or S must be specified. Note if both are specified then S is used.')

        if S is None:
            affinity = self.affinity_matrix(X, metric=metric, norm=norm)
        else:
            affinity = S

        num_embeddings = affinity.shape[0]

        # # Compute affinity matrix.
        # affinity = self.affinity_function(embeddings)
        constraint_matrix = None

        eigenvectors, n_clusters, _ = self._compute_eigenvectors_ncluster(
                affinity, constraint_matrix)

        if self.min_clusters is not None:
            n_clusters = max(n_clusters, self.min_clusters)

        # Get spectral embeddings.
        spectral_embeddings = eigenvectors[:, :n_clusters]

        if self.row_wise_renorm:
            # Perform row wise re-normalization.
            rows_norm = np.linalg.norm(spectral_embeddings, axis=1, ord=2)
            spectral_embeddings = spectral_embeddings / np.reshape(
                rows_norm, (num_embeddings, 1))

        # Run clustering algorithm on spectral embeddings. This defaults
        # to a customized implementation of K-means. However, the implementation was created on old sklearn version.
        # -> kmeans raises a Future Warning for n_init parameter.  Rather than rewrite entire class. I added a simple warning filter.
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            labels = self.post_eigen_cluster_function(
                spectral_embeddings=spectral_embeddings,
                n_clusters=n_clusters,
                custom_dist=self.custom_dist,
                max_iter=self.max_iter)
        return labels

    @staticmethod
    def affinity_matrix(X, metric='euclidean', norm=True):
        S = pairwise_sim(X, method=metric, norm=norm)
        S = -S  # want higher values to be more similar
        return S
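
A sketch of direct class usage on a synthetic two-block adjacency matrix (import path assumed):

import numpy as np
from simnetpy.clustering.spectral import Spectral  # assumed import path

rng = np.random.default_rng(0)
A = (rng.random((40, 40)) < 0.05).astype(float)  # sparse background
A[:20, :20] = rng.random((20, 20)) < 0.5         # dense block 1
A[20:, 20:] = rng.random((20, 20)) < 0.5         # dense block 2
A = np.triu(A, 1)
A = A + A.T                                      # symmetric, no self loops

spect = Spectral(min_clusters=2, max_clusters=10, laplacian_type='lrw', custom_dist='cosine')
labels = spect.predict_from_adj(A)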

predict_from_aff(X=None, S=None, metric='euclidean', norm=True)

Perform spectral clustering on an affinity matrix. Note the affinity matrix should have larger values indicating higher similarity, i.e. the opposite of a distance.

Parameters:

  • X (ndarray) –

    embedding/feature matrix, n x d

  • S (ndarray) –

    pairwise affinity (similarity) matrix, n x n

  • metric (str, default: 'euclidean' ) –

    metric to be used in the pairwise distance

  • norm (bool, default: True ) –

    whether to normalise the affinity matrix to zero mean and unit std

Returns:

  • labels: numpy array of shape (n_samples,)

Source code in src/simnetpy/clustering/spectral.py
def predict_from_aff(self, X=None, S=None, metric='euclidean', norm=True):
    """Perform spectral clustering on an affinity matrix. 
    Note affinity matrix should have form where larger values indicate higher similarity i.e. opposite of distance 
    Args:
        X:      embedding/feature matrix n x d np.ndarray
        S:    pairwise affinity (Similarity) matrix n x n np.ndarray. 
        metric: metric to be used in pairwise distance
        norm:   wether to normalise Aff to be 0 mean 1 std
    Returns:
        labels: numpy array of shape (n_samples,)
    """
    self.laplacian_type = LTYPES['a'] # set laplacian type as affinity
    if (X is None) and (S is None):
        raise ValueError('One of X or S must be specified. Note if both are specified then S is used.')

    if S is None:
        affinity = self.affinity_matrix(X, metric=metric, norm=norm)
    else:
        affinity = S

    num_embeddings = affinity.shape[0]

    # # Compute affinity matrix.
    # affinity = self.affinity_function(embeddings)
    constraint_matrix = None

    eigenvectors, n_clusters, _ = self._compute_eigenvectors_ncluster(
            affinity, constraint_matrix)

    if self.min_clusters is not None:
        n_clusters = max(n_clusters, self.min_clusters)

    # Get spectral embeddings.
    spectral_embeddings = eigenvectors[:, :n_clusters]

    if self.row_wise_renorm:
        # Perform row wise re-normalization.
        rows_norm = np.linalg.norm(spectral_embeddings, axis=1, ord=2)
        spectral_embeddings = spectral_embeddings / np.reshape(
            rows_norm, (num_embeddings, 1))

    # Run clustering algorithm on spectral embeddings. This defaults
    # to a customized implementation of K-means. However, the implementation was created on old sklearn version.
    # -> kmeans raises a Future Warning for n_init parameter.  Rather than rewrite entire class. I added a simple warning filter.
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        labels = self.post_eigen_cluster_function(
            spectral_embeddings=spectral_embeddings,
            n_clusters=n_clusters,
            custom_dist=self.custom_dist,
            max_iter=self.max_iter)
    return labels
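
A sketch of predict_from_aff with a feature matrix; the affinity is built internally by affinity_matrix (import path assumed; data is two well-separated Gaussian blobs):

import numpy as np
from simnetpy.clustering.spectral import Spectral  # assumed import path

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(25, 5)),   # blob 1
               rng.normal(3.0, 0.3, size=(25, 5))])  # blob 2

spect = Spectral(min_clusters=2, max_clusters=10)
labels = spect.predict_from_aff(X=X, metric='euclidean', norm=True)

# a precomputed similarity matrix can be passed instead (larger = more similar):
# labels = spect.predict_from_aff(S=S)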