
Similarity

similarity

chi_square(pu, pv)

Chi-square distance or histogram distance: d(u,v) = \frac{1}{2} \sum_{i=0}^{N} \frac{(p(u_i) - p(v_i))^2}{p(u_i) + p(v_i)}

Parameters:

  • pu (array) –

    vector of probabilities of observing elements of u

  • pv (array) –

    vector of probabilities of observing elements of v

Returns:

  • float

    distance between u and v

Source code in src/simnetpy/similarity/similarity.py
def chi_square(pu, pv):
    """Chi-square distance or histogram distance.
    d(u,v) = 1/2 * sum_{i=0}^{N} (p(u_i) - p(v_i))^2 / (p(u_i) + p(v_i))

    Args:
        pu (np.array): vector of probabilities of observing elements of u
        pv (np.array): vector of probabilities of observing elements of v

    Returns:
        float: distance between u and v
    """
    x = pu + pv
    x[x==0] = 1 # pu and pv are non-negative, so pu_i + pv_i = 0 implies pu_i = pv_i = 0 and the
    # contribution to the distance is 0; set the denominator to 1 in this case to avoid division by zero.

    d = ((pu - pv)**2/(x)).sum()
    d /= 2
    return d
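
A minimal usage sketch, assuming numpy is imported as np and that chi_square is importable as below (the import path is inferred from the source listing; the histogram values are illustrative):

import numpy as np
from simnetpy.similarity import chi_square  # import path assumed

# two discrete probability distributions over the same bins
pu = np.array([0.2, 0.3, 0.5])
pv = np.array([0.1, 0.4, 0.5])

d = chi_square(pu, pv)  # 0.5 * sum((pu - pv)^2 / (pu + pv))
print(d)  # ~0.0238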

multi_modal_similarity(data, N, method, idxmap=None, norm=True)

Compute pairwise similarity for each data modality, using the same metric on each modality.

Note: every matrix in data must have N rows (i.e. cover all N individuals) unless idxmap is specified.

Parameters:

  • data (dict) –

    dictionary of Nmodality feature matrices (N x d_i np.ndarrays)

  • N (int) –

    number of individuals in the pairwise calculation

  • method (str) –

    metric to use in the pairwise similarity calculation (passed to pairwise_sim)

  • idxmap (dict, default: None ) –

    dictionary containing the index of each matrix within the larger set of N individuals. Defaults to None.

  • norm (bool, default: True ) –

    whether to normalise features before computing pairwise similarity (passed to pairwise_sim). Defaults to True.

Returns:

  • np.ndarray: (Nmodality, N, N) pairwise similarity matrix

Source code in src/simnetpy/similarity/similarity.py
def multi_modal_similarity(data, N, method, idxmap=None, norm=True):
    """Compute pairwise similarity for each data modality
    Uses same metric on each modality. 

    Note: every matrix in data must have N rows (i.e. cover all N individuals) unless idxmap is specified.

    Args:
        data (dict): dictionary of Nmodality feature matrices (N x d_i np.ndarrays)
        N (int): number of individuals in pairwise calculation
        method (str): metric to use in the pairwise similarity calculation (passed to pairwise_sim)
        idxmap (dict, optional): dictionary containing the index of each matrix within the larger set. Defaults to None.
        norm (bool, optional): whether to normalise features before computing pairwise similarity. Defaults to True.

    Returns:
        np.ndarray: (Nmodality, N, N) pairwise similarity matrix
    """
    Nm = len(data) # number modalities
    S = np.empty((Nm,N,N), dtype=np.float32)
    S.fill(np.nan)

    for i,(key, X) in enumerate(data.items()):
        D = pairwise_sim(X,method, norm=norm)
        if idxmap is not None:
            idx = idxmap[key]
            j,k = np.meshgrid(idx,idx)
            S[i, j, k] = D 
        else:
            S[i,:,:] = D
    return S
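
A short usage sketch, assuming numpy as np, the import path below, and that 'euclidean' is an accepted metric name for pairwise_sim (these are assumptions, not part of the documented API):

import numpy as np
from simnetpy.similarity import multi_modal_similarity  # import path assumed

rng = np.random.default_rng(0)
N = 50
# two modalities with different feature dimensions but the same N individuals
data = {'clinical': rng.normal(size=(N, 5)),
        'imaging': rng.normal(size=(N, 20))}

S = multi_modal_similarity(data, N, method='euclidean')
print(S.shape)  # (2, 50, 50): one pairwise matrix per modality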

partial_mm_similarity(data, metric, norm=True, snf_aff=False, K=20, mu=0.5)

Calculate multi-modal similarity where rows in certain modalities might be missing. Can use a normal distance metric or the affinity proposed in SNF (Similarity Network Fusion).

Note: the function returns an m x N x N dissimilarity matrix. If the affinity is used, then S = -(Affinity Matrix).

Parameters:

  • data (list) –

    list of data matrices, one per modality. Each array should be N x d; if an individual is missing from a modality, include a row of NaNs for them.

  • metric (str) –

    metric to use in the distance calculation.

  • norm (bool, default: True ) –

    whether to normalise features before computing pairwise distances (passed to pairwise_sim). Defaults to True.

  • snf_aff (bool, default: False ) –

    if True, use the SNF affinity instead of a plain pairwise distance. Defaults to False.

  • K (int, default: 20 ) –

    number of nearest neighbours used in the SNF affinity calculation. Defaults to 20.

  • mu (float, default: 0.5 ) –

    scaling parameter for the SNF affinity kernel. Defaults to 0.5.

Returns:

  • np.ndarray: (Nmodality, N, N) dissimilarity matrix; entries are NaN where a pair is not covered by a modality

Source code in src/simnetpy/similarity/similarity.py
def partial_mm_similarity(data, metric, norm=True, snf_aff=False, K=20, mu=0.5):
    """Calculate multi-modal similarity where rows in certain modalities might be mssing. 
    Can use normal distance metric of Affinity proposed in SNF (Similarity Network Fusion).

    Note: Function returns a m x N x N dissimilarity matrix. If affinity then S = -(Affinity Matrix)

    Args:
        data (list): array of data matrices for each modality. Each array should be N x d. 
                    if data missing include NaN rows in input.
        metric (_type_): metric to use in distance calculation.
        norm (bool, optional): _description_. Defaults to True.
        snf_aff (bool, optional): _description_. Defaults to False.
        K (int, optional): _description_. Defaults to 20.
        mu (float, optional): _description_. Defaults to 0.5.

    Returns:
        _type_: _description_
    """
    Nm = len(data)
    N, d = data[0].shape
    S = np.empty((Nm, N, N))
    S.fill(np.nan)
    for i, X in enumerate(data):
        idx = utils.non_nan_indices(X)
        if snf_aff:
            D = -snf_affinity(X[idx,:], metric=metric, K=K, mu=mu)    
        else:
            D = pairwise_sim(X[idx,:], metric, norm=norm)
        j,k = np.meshgrid(idx,idx)
        S[i, j, k] = D
    return S
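
A sketch of the missing-row handling, under the same assumptions about the import path and metric name:

import numpy as np
from simnetpy.similarity import partial_mm_similarity  # import path assumed

rng = np.random.default_rng(1)
N = 30
X1 = rng.normal(size=(N, 4))
X2 = rng.normal(size=(N, 8))
X2[[3, 7], :] = np.nan  # individuals 3 and 7 are missing from the second modality

S = partial_mm_similarity([X1, X2], metric='euclidean')
print(S.shape)                   # (2, 30, 30)
print(np.isnan(S[1, 3]).all())   # True: rows/columns for missing individuals stay NaN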

threshold

combined_adj(D, K, t)

Create a network from a dissimilarity matrix through a mixture of KNN and global threshold.

Parameters:

  • D (ndarray) –

    nxn dissimilarity matrix. smaller values => more similar.

  • K (int) –

    Number of Neighbours to find for each individual

  • t (float) –

    fraction between 0 and 1: keep the top 100*t% of edges. 0.01 means the top 1% most similar connections.

Returns:

  • np.ndarray: Adjacency matrix of 0 and 1s

Source code in src/simnetpy/similarity/threshold.py
def combined_adj(D, K, t):
    """Create a network from a dissimilarity matrix through a mixture of KNN and global
    threshold.

    Args:
        D (np.ndarray): nxn dissimilarity matrix. smaller values => more similar.
        K (int): Number of Neighbours to find for each individual
        t (float): 0 to 1 top 100*t% of edges to keep. 0.01 means top 1% most similar connections.

    Returns:
        np.ndarray: Adjacency matrix of 0 and 1s
    """
    A_k = knn_adj(D, K)

    A_t = threshold_adj(D, t)
    A = A_k + A_t
    A[A>1] = 1
    return A
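
A sketch of how the two criteria combine, using a toy Euclidean distance matrix (numpy as np; import path assumed):

import numpy as np
from simnetpy.similarity import combined_adj  # import path assumed

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # toy nxn distance matrix

A = combined_adj(D, K=5, t=0.01)  # union of 5-NN edges and the top 1% closest pairs
print(A.shape, A.min(), A.max())  # (100, 100) 0.0 1.0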

knn_adj(D, K)

Create a network from a dissimilarity matrix by finding the top K most similar neighbours for each node. Note: uses a brute-force algorithm that sorts all pairwise values, so it can be slow for large matrices.

Parameters:

  • D (ndarray) –

    nxn dissimilarity matrix. smaller values => more similar.

  • K (int) –

    Number of Neighbours to find for each individual

Returns:

  • np.ndarray: Adjacency matrix of 0 and 1s

Source code in src/simnetpy/similarity/threshold.py
def knn_adj(D,K):
    """Create a network from a dissimilarity matrix by finding top K most similar neighbours for each node. 
    Note: uses brute force algorithm. Checks all possible values. Slow for large matrices

    Args:
        D (np.ndarray): nxn dissimilarity matrix. smaller values => more similar.
        K (int): Number of Neighbours to find for each individual

    Returns:
        np.ndarray: Adjacency matrix of 0 and 1s
    """
    assert isinstance(K, (int, np.integer)), "K must be an integer"

    # Note D is distance matrix not affinity
    # lower values => more similar.
    # output from pairwise_sim has diagonals with minimum value 
    np.fill_diagonal(D, D.max()) # don't want to include diagonals

    A = np.zeros(D.shape)
    # find K nearest neighbours for each individual 
    nn = np.argsort(D, axis=1)[:,:K]
    # add edge (set value = 1) for each nearest neighbour
    np.put_along_axis(A,nn, 1, axis=1) 
    # make symmetric
    A = A.T + A
    # if two individuals both have each other as nn then value will be 2. set == 1
    A[A>1] = 1
    return A
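
A minimal sketch (numpy as np; import path assumed). Note that knn_adj overwrites the diagonal of D in place via np.fill_diagonal, so pass a copy if you need D unchanged:

import numpy as np
from simnetpy.similarity import knn_adj  # import path assumed

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 3))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

A = knn_adj(D.copy(), K=3)       # each node keeps its 3 nearest neighbours
print((A == A.T).all())          # True: the adjacency is symmetrised
print(A.sum(axis=1).min() >= 3)  # True: every node ends up with at least K edges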

log_skewed_knn_adj(D, K, statNN=10, stat='mean', Kquantile=1.0)

Create a network from a dissimilarity matrix by assigning each node a variable number of nearest neighbours, drawn from a log-mapped neighbour distribution (nn_distribution with mapping='log').

Parameters:

  • D (ndarray) –

    nxn dissimilarity matrix. smaller values => more similar.

  • K (int) –

    Controls the number of neighbours in the neighbour distribution, coupled with Kquantile. e.g. K=5, Kquantile=0.5 means 5 will be the median of the distribution; Kquantile=1.0 means 5 will be the maximum.

  • statNN (int, default: 10 ) –

    Number of neighbours in stat calc. Defaults to 10.

  • stat (str, default: 'mean' ) –

    Statistic to calculate from the nearest neighbours, one of mean, median, std. Defaults to 'mean'.

  • Kquantile (float, default: 1.0 ) –

    Quantile of stat dist to map K to. Defaults to 1.0.

Returns:

  • np.ndarray: Adjacency matrix of 0 and 1s

Source code in src/simnetpy/similarity/threshold.py
def log_skewed_knn_adj(D, K, statNN=10, stat='mean', Kquantile=1.0):
    """Create a network from a dissimilarity matrix through a mixture of KNN and global
    threshold.

    Args:
        D (np.ndarray): nxn dissimilarity matrix. smaller values => more similar.
        K (int): Control number of Neighbours in neighbour distribution. Coupled with Kquantile. 
            e.g. K=5, Kquantile=0.5 means 5 will be mean of distribution. kquantile=1.0 mean 5 will be max. 
        statNN (int, optional): Number of neighbours in stat calc. Defaults to 10.
        stat (str, optional): Stat to calculate from neighest neighbours, one of mean, median, std. 
                            Defaults to 'mean'.
        Kquantile (float, optional): Quantile of stat dist to map K to. Defaults to 1.0.

    Returns:
        np.ndarray: Adjacency matrix of 0 and 1s
    """
    assert isinstance(K,int), "K must be an integer"

    np.fill_diagonal(D,D.max())
    NN = nn_distribution(D, K, statNN=statNN, stat=stat, Kquantile=Kquantile, mapping='log')
    D_sorted = np.argsort(D, axis=1)
    A = np.zeros(D.shape)
    for i, nn in enumerate(NN):
        idx = D_sorted[i,:nn]
        A[i, idx] = 1

    A = A.T + A
    A[A>1] = 1
    return A

network_from_sim_mat(D, method='knn', **kwargs)

Sparsify a dissimilarity matrix and build an igraph network from the resulting adjacency matrix.

Parameters:

  • D (ndarray) –

    nxn Dissimilarity matrix. Smaller => more similar

  • method (str, default: 'knn' ) –

    method to use to sparsify matrix. one of [knn, threshold, combined, skewed_knn, log_skewed_knn]. Defaults to 'knn'.

  • **kwargs

    keyword arguments for sparsifying functions

Returns:

  • ig.Graph: Graph created from similarity matrix

Source code in src/simnetpy/similarity/threshold.py
def network_from_sim_mat(D, method='knn', **kwargs):
    """
    Sparsify a dissimilarity matrix and build an igraph network from the resulting adjacency matrix.

    Args:
        D (np.ndarray): nxn Dissimilarity matrix. Smaller => more similar
        method (str, optional): method to use to sparsify matrix. 
                    one of [knn, threshold, combined, skewed_knn, log_skewed_knn]. Defaults to 'knn'.
        **kwargs: keyword arguments for sparsifying functions

    Returns:
        ig.Graph: Graph created from similarity matrix
    """
    A = sparsify_sim_matrix(D, method=method, **kwargs)
    g = mat2graph(A)
    return g
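
A sketch of the full matrix-to-graph step (numpy as np; python-igraph assumed installed; import path assumed). Keyword arguments are forwarded to the chosen sparsifier, here knn_adj:

import numpy as np
from simnetpy.similarity import network_from_sim_mat  # import path assumed

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 5))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

g = network_from_sim_mat(D.copy(), method='knn', K=5)  # K is passed through to knn_adj
print(g.vcount(), g.ecount())  # an igraph Graph over the 80 individuals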

nn_distribution(D, K, statNN=10, stat='mean', Kquantile=1.0, mapping='linear')

Compute the number of nearest neighbours to assign to each node, based on the average dissimilarity to its statNN closest neighbours; nodes in denser neighbourhoods are assigned more neighbours.

Parameters:

  • D (ndarray) –

    nxn dissimilarity matrix. smaller values => more similar.

  • K (int) –

    Controls the number of neighbours in the neighbour distribution, coupled with Kquantile. e.g. K=5, Kquantile=0.5 means 5 will be the median of the distribution; Kquantile=1.0 means 5 will be the maximum.

  • statNN (int, default: 10 ) –

    Number of neighbours in stat calc. Defaults to 10.

  • stat (str, default: 'mean' ) –

    Statistic to calculate from the nearest neighbours, one of mean, median, std. Defaults to 'mean'.

  • Kquantile (float, default: 1.0 ) –

    Quantile of stat dist to map K to. Defaults to 1.0.

  • mapping (str, default: 'linear' ) –

    how to map the normalised neighbour distribution onto [0, Kmax], one of [linear, log]. Defaults to 'linear'.

Returns:

  • np.ndarray: number of nearest neighbours to connect for each node (length n)

Source code in src/simnetpy/similarity/threshold.py
def nn_distribution(D, K, statNN=10, stat='mean', Kquantile=1.0, mapping='linear'):
    """

    Args:
        D (np.ndarray): nxn dissimilarity matrix. smaller values => more similar.
        K (int): Control number of Neighbours in neighbour distribution. Coupled with Kquantile. 
            e.g. K=5, Kquantile=0.5 means 5 will be mean of distribution. kquantile=1.0 mean 5 will be max. 
        statNN (int, optional): Number of neighbours in stat calc. Defaults to 10.
        stat (str, optional): Stat to calculate from neighest neighbours, one of mean, median, std. 
                            Defaults to 'mean'.
        Kquantile (float, optional): Quantile of stat dist to map K to. Defaults to 1.0.

    Returns:
        _type_: _description_
    """
    stat = stat.lower()
    assert stat in ['mean', 'median', 'std'], 'Must be one of mean, median, std'
    assert Kquantile <= 1.0 and Kquantile>=0, "Must be between 0 and 1"
    assert mapping in ['linear', 'log'], "mapping must be one of [linear, log]"
    S = neighbour_stat(D, k=statNN, stat=stat) # find average similarity amongst statNN closest neighbours
    V = - (S - S.min())/(S.max()-S.min()) + 1 # normalise to 0, 1. Note: close to 1 means larger number of similar neighbours

    # We map the distribution from to 0,Kmax so that the number of neighbours each node is assigned goes from 0, Kmax
    Kq = np.quantile(V, Kquantile)
    Kmax = int(K/Kq) 
    if mapping == 'linear':
        V = Kmax*V # normalise to 0,K
        NN = np.digitize(V, bins=np.arange(Kmax))
    elif mapping == 'log':
        upper = np.log(Kmax+1)
        V = upper*V # normalise to 0,log(Kmax+1)
        V = np.exp(V) # V now in [1, Kmax+1]
        NN = np.digitize(V-1, bins=np.arange(Kmax)) # digitize maps x in [0-1] to 1 
                                        #i.e. rounds up so need -1 to get correct range
    return NN
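
A small sketch of the neighbour-count distribution this produces (numpy as np; import path assumed). The exact counts depend on the data; with Kquantile=0.5 the median count should sit roughly at K:

import numpy as np
from simnetpy.similarity import nn_distribution  # import path assumed

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

NN = nn_distribution(D, K=10, statNN=10, stat='mean', Kquantile=0.5)
print(NN.shape)                                # (200,): one neighbour count per node
print(NN.min(), int(np.median(NN)), NN.max())  # median roughly K; denser nodes get more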

skewed_knn_adj(D, K, statNN=10, stat='mean', Kquantile=1.0)

Create a network from a dissimilarity matrix by assigning each node a variable number of nearest neighbours, drawn from a linearly mapped neighbour distribution (nn_distribution with mapping='linear').

Parameters:

  • D (ndarray) –

    nxn dissimilarity matrix. smaller values => more similar.

  • K (int) –

    Controls the number of neighbours in the neighbour distribution, coupled with Kquantile. e.g. K=5, Kquantile=0.5 means 5 will be the median of the distribution; Kquantile=1.0 means 5 will be the maximum.

  • statNN (int, default: 10 ) –

    Number of neighbours in stat calc. Defaults to 10.

  • stat (str, default: 'mean' ) –

    Statistic to calculate from the nearest neighbours, one of mean, median, std. Defaults to 'mean'.

  • Kquantile (float, default: 1.0 ) –

    Quantile of stat dist to map K to. Defaults to 1.0.

Returns:

  • np.ndarray: Adjacency matrix of 0 and 1s

Source code in src/simnetpy/similarity/threshold.py
def skewed_knn_adj(D, K, statNN=10, stat='mean', Kquantile=1.0):
    """Create a network from a dissimilarity matrix through a mixture of KNN and global
    threshold.

    Args:
        D (np.ndarray): nxn dissimilarity matrix. smaller values => more similar.
        K (int): Control number of Neighbours in neighbour distribution. Coupled with Kquantile. 
            e.g. K=5, Kquantile=0.5 means 5 will be mean of distribution. kquantile=1.0 mean 5 will be max. 
        statNN (int, optional): Number of neighbours in stat calc. Defaults to 10.
        stat (str, optional): Stat to calculate from neighest neighbours, one of mean, median, std. 
                            Defaults to 'mean'.
        Kquantile (float, optional): Quantile of stat dist to map K to. Defaults to 1.0.

    Returns:
        np.ndarray: Adjacency matrix of 0 and 1s
    """
    assert isinstance(K,int), "K must be an integer"

    np.fill_diagonal(D,D.max())
    NN = nn_distribution(D, K, statNN=statNN, stat=stat, Kquantile=Kquantile, mapping='linear')
    D_sorted = np.argsort(D, axis=1)
    A = np.zeros(D.shape)
    for i, nn in enumerate(NN):
        idx = D_sorted[i,:nn]
        A[i, idx] = 1

    A = A.T + A
    A[A>1] = 1
    return A
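
A sketch showing that node degrees vary with local density instead of being fixed at K (numpy as np; import path assumed; D is copied because the diagonal is overwritten in place):

import numpy as np
from simnetpy.similarity import skewed_knn_adj  # import path assumed

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 5))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

A = skewed_knn_adj(D.copy(), K=10, Kquantile=0.5)
deg = A.sum(axis=1)
print(deg.min(), deg.max())  # degrees spread around K rather than all equal to K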

sparsify_sim_matrix(D, method='knn', **kwargs)

Sparsify a dissimilarity matrix into a binary adjacency matrix.

Parameters:

  • D (ndarray) –

    nxn Dissimilarity matrix. Smaller => more similar

  • method (str, default: 'knn' ) –

    method to use to sparsify matrix. one of [knn, threshold, combined, skewed_knn, log_skewed_knn]. Defaults to 'knn'.

  • **kwargs

    keyword arguments for sparsifying functions

Returns:

  • np.ndarray: nxn symmetric Adjacency matrix of 0s and 1s

Source code in src/simnetpy/similarity/threshold.py
def sparsify_sim_matrix(D, method='knn', **kwargs):
    """
    function to sparsify dissimilarity matrix into adjacency 

    Args:
        D (np.ndarray): nxn Dissimilarity matrix. Smaller => more similar
        method (str, optional): method to use to sparsify matrix. 
                    one of [knn, threshold, combined, skewed_knn, log_skewed_knn]. Defaults to 'knn'.
        **kwargs: keyword arguments for sparsifying functions

    Returns:
        np.ndarray: nxn symmetric Adjacency matrix of 0s and 1s
    """
    fsparser = {'knn': knn_adj, 'threshold':threshold_adj, 
        'combined':combined_adj, 'skewed_knn':skewed_knn_adj, 
        'log_skewed_knn':log_skewed_knn_adj}
    A = fsparser[method](D, **kwargs)
    return A
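
A sketch of the method dispatch and keyword passthrough (numpy as np; import path assumed):

import numpy as np
from simnetpy.similarity import sparsify_sim_matrix  # import path assumed

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 4))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

A_knn = sparsify_sim_matrix(D.copy(), method='knn', K=4)
A_comb = sparsify_sim_matrix(D.copy(), method='combined', K=4, t=0.02)
print(A_knn.sum() <= A_comb.sum())  # True: 'combined' is the union of KNN and threshold edges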

threshold_adj(D, t)

Threshold a dissimilarity matrix using a quantile of its values. Assumes a distance (smaller = more similar); edges are retained for values below the t-quantile, i.e. the smallest 100*t% of pairwise distances.

Parameters:

  • D (ndarray) –

    nxn dissimilarity matrix. smaller values => more similar.

  • t (float) –

    fraction between 0 and 1: keep the top 100*t% of edges. 0.01 means the top 1% most similar connections.

Returns:

  • np.ndarray: Adjacency matrix of 0 and 1s

Source code in src/simnetpy/similarity/threshold.py
def threshold_adj(D, t):
    """Threshold dissimilarity matrix using quantile of values. Assumes distance. Edges retained are values below
    smallest t%.

    Args:
        D (np.ndarray): nxn dissimilarity matrix. smaller values => more similar.
        t (float): 0 to 1 top 100*t% of edges to keep. 0.01 means top 1% most similar connections.

    Returns:
        np.ndarray: Adjacency matrix of 0 and 1s
    """
    A = np.zeros(D.shape)
    # d = np.triu(D).flatten()
    np.fill_diagonal(D,0) # conversion to squareform requires 0 diagonals
    d = squareform(D)
    t = np.quantile(d, t)
    A[D < t] = 1
    return A
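
A sketch checking that the retained edge fraction tracks t (numpy as np; import path assumed); the diagonal is zeroed here before counting so only genuine edges are tallied:

import numpy as np
from simnetpy.similarity import threshold_adj  # import path assumed

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 5))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

A = threshold_adj(D, t=0.05)
np.fill_diagonal(A, 0)          # discard self-connections before counting
n = D.shape[0]
frac = A.sum() / (n * (n - 1))  # fraction of off-diagonal entries kept
print(round(frac, 3))           # close to 0.05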

threshold_graph(D, t)

Threshold a dissimilarity matrix using a quantile of its values and create an igraph network. Assumes a distance (smaller = more similar); edges are retained for values below the t-quantile, i.e. the smallest 100*t% of pairwise distances.

Parameters:

  • D (ndarray) –

    nxn dissimilarity matrix. smaller values => more similar.

  • t (float) –

    fraction between 0 and 1: keep the top 100*t% of edges. 0.01 means the top 1% most similar connections.

Returns:

  • ig.Graph: Graph created from thresholding connections.

Source code in src/simnetpy/similarity/threshold.py
def threshold_graph(D,t):
    """Threshold dissimilarity matrix using quantile of values and create a igraph network. Assumes distance. Edges retained are values below
    smallest t%.

    Args:
        D (np.ndarray): nxn dissimilarity matrix. smaller values => more similar.
        t (float): 0 to 1 top 100*t% of edges to keep. 0.01 means top 1% most similar connections.

    Returns:
        ig.Graph: Graph created from thresholding connections.
    """
    A = threshold_adj(D,t)
    g = mat2graph(A)
    return g
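
A closing end-to-end sketch (numpy as np; python-igraph assumed installed; import path assumed):

import numpy as np
from simnetpy.similarity import threshold_graph  # import path assumed

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 5))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

g = threshold_graph(D, t=0.02)  # keep roughly the 2% most similar pairs as edges
print(g.vcount())               # 100 vertices
print(g.is_connected())         # a sparse threshold graph may well be disconnected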