Using `SimNetPy`¶

In [6]:

%load_ext autoreload
%autoreload 2
import simnetpy as sn
import numpy as np
import matplotlib.pyplot as plt
from pprint import pprint

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Example Usage¶

Here we show some example uses of the `simnetpy` package.

Generating Data¶

Create some data from a mixture of multivariate Gaussians. Calculate pairwise similarity using the Euclidean metric. Create a network by connecting each data point to its K nearest neighbours (K=10).

In [19]:

N = 100
sizes=np.array([34,33,33])
d = 2
dataset = sn.datasets.mixed_multi_numeric(len(sizes), d, N, sizes=sizes)

# calculate pairwise similarity
S = sn.pairwise_sim(dataset.X, metric='euclidean', norm=True)

# create an igraph Graph from the similarity matrix
gg = sn.network_from_sim_mat(S, method='knn', K=10)
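
To make the three steps above concrete, here is a minimal NumPy sketch of the same pipeline: sample Gaussian clusters, turn Euclidean distances into similarities, and keep each point's K most similar neighbours. It is not SimNetPy's implementation. The helper names, the `1 - dist / dist.max()` normalisation and the symmetrisation rule (an edge exists if either endpoint selects the other) are assumptions made purely for illustration.

```python
import numpy as np

def gaussian_clusters(sizes, d, rng, spread=10.0):
    """Sample len(sizes) spherical Gaussian clusters in d dimensions (hypothetical helper)."""
    centres = rng.uniform(-spread, spread, size=(len(sizes), d))
    X = np.vstack([rng.normal(c, 1.0, size=(n, d)) for c, n in zip(centres, sizes)])
    y = np.repeat(np.arange(len(sizes)), sizes)
    return X, y

def euclidean_similarity(X):
    """Pairwise Euclidean distances rescaled to similarities in [0, 1] (assumed normalisation)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return 1.0 - dist / dist.max()

def knn_adjacency(S, K):
    """Symmetric 0/1 adjacency linking each node to its K most similar other nodes."""
    S = S.astype(float).copy()
    np.fill_diagonal(S, -np.inf)               # never pick a node as its own neighbour
    nearest = np.argsort(-S, axis=1)[:, :K]    # indices of the K largest similarities per row
    A = np.zeros(S.shape, dtype=int)
    rows = np.repeat(np.arange(S.shape[0]), K)
    A[rows, nearest.ravel()] = 1
    return np.maximum(A, A.T)                  # undirected: keep an edge if either end chose it

rng = np.random.default_rng(0)
X, y = gaussian_clusters([34, 33, 33], d=2, rng=rng)
A = knn_adjacency(euclidean_similarity(X), K=10)
print(A.shape, A.sum() // 2)                   # number of nodes and number of undirected edges
```

With a union rule like this, every node has at least K neighbours and some have more. The statistics printed further down (mean degree 12.54 and maximum degree 21 for K=10) show that degrees above K also occur in the SimNetPy graph.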

We can look at the statistics of the graph:

n - number of nodes.
E - number of edges.
Nc - number of connected components.
ncmax - number of nodes in the largest connected component.
assortivity - degree assortativity (degree mixing within the network).
avg_degree - mean vertex degree.
max_degree - maximum vertex degree.
median_degree - median vertex degree.
avg_path_length - average shortest-path length over all pairs of vertices.
diameter - length of the longest shortest path between any two vertices.
avglocalcc - mean local clustering coefficient.
density - network density (fraction of possible edges that are present).
globalcc - global clustering coefficient.

In [20]:

# print graph stats
print('Graph Statistics:')
pprint(gg.graph_stats())
# display(gg.graph_stats())

Graph Statistics:
{'E': 627,
 'Nc': 1,
 'assortivity': 0.13012665522902636,
 'avg_degree': 12.54,
 'avg_path_length': 4.084848484848485,
 'avglocalcc': 0.6728595432775001,
 'density': 0.12666666666666665,
 'diameter': 10,
 'globalcc': 0.6379537085744345,
 'max_degree': 21,
 'median_degree': 12.0,
 'n': 100,
 'ncmax': 100}
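
Two of these statistics follow directly from `n` and `E` for a simple undirected graph, which gives a quick sanity check on the output above. The sketch below uses only the standard formulas (mean degree 2E/n, density E divided by the n(n-1)/2 possible edges) and does not call SimNetPy.

```python
import math

n, E = 100, 627                      # values reported by gg.graph_stats() above

avg_degree = 2 * E / n               # every edge contributes to two vertex degrees
density = E / (n * (n - 1) / 2)      # fraction of the n(n-1)/2 possible edges present

print(avg_degree, density)

# Both agree with the reported 'avg_degree' and 'density' up to floating-point rounding.
assert math.isclose(avg_degree, 12.54)
assert math.isclose(density, 0.12666666666666665)
```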

Cluster the graph using spectral clustering. The number of clusters is selected using the eigengap ratio heuristic.

In [17]:

# cluster
ylabels = sn.clustering.spectral_clustering(gg, laplacian='lrw')
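
For intuition, an eigengap heuristic can be sketched in a few lines of NumPy. This is an illustrative version and not necessarily what `spectral_clustering` does internally: it assumes `'lrw'` refers to the random-walk Laplacian, it picks the number of clusters at the largest difference between consecutive small eigenvalues (SimNetPy's "eigengap ratio" may be defined differently), and the function names and toy graph are hypothetical.

```python
import numpy as np

def eigengap_k(A, max_k=10):
    """Estimate the number of clusters from the largest gap among the smallest eigenvalues."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # The symmetric normalised Laplacian has the same eigenvalues as the
    # random-walk Laplacian I - D^{-1} A, and allows the symmetric solver eigvalsh.
    L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(L_sym)          # returned in ascending order
    gaps = np.diff(eigvals[: max_k + 1])         # gaps[i] = eigvals[i + 1] - eigvals[i]
    return int(np.argmax(gaps)) + 1, eigvals

def three_cliques(m=10):
    """Toy graph: three cliques of m nodes joined in a chain by single edges."""
    A = np.zeros((3 * m, 3 * m))
    for c in range(3):
        block = slice(c * m, (c + 1) * m)
        A[block, block] = 1.0
    np.fill_diagonal(A, 0.0)
    for u, v in [(m - 1, m), (2 * m - 1, 2 * m)]:
        A[u, v] = A[v, u] = 1.0
    return A

k, eigvals = eigengap_k(three_cliques())
print(k)                                         # 3: one near-zero eigenvalue per clique
```

The three smallest eigenvalues sit close to zero because the cliques are almost disconnected, so the largest gap falls after the third eigenvalue.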

Evaluate the accuracy of the predicted labels against the ground truth using:

Adjusted Mutual Information (`ami`)
Adjusted Rand Index (`ari`)
Completeness (`complete`)
Homogeneity (`homo`)
V-Measure (`vm`)

See `sklearn.metrics` for details.

In [16]:

# cluster accuracy
cacc = sn.clustering.cluster_accuracy(dataset.y, ylabels)
print('\nPredicted Cluster Accuracy:')
pprint(cacc)

Predicted Cluster Accuracy:
{'Npred': 3,
 'Ntrue': 3,
 'ami': 0.846420811615343,
 'ari': 0.8829150927802232,
 'complete': 0.8494134300611329,
 'homo': 0.8491790791640881,
 'vm': 0.8492962384461837}
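
As an illustration of what one of these scores measures, the Adjusted Rand Index can be written out from its standard pair-counting definition in pure Python. This is a sketch for intuition only. It is not how SimNetPy or scikit-learn compute the score, and the function name is hypothetical.

```python
from collections import Counter
from math import comb, isclose

def adjusted_rand_index(y_true, y_pred):
    """Pair-counting ARI: 1 for identical partitions, about 0 for random labelings."""
    n = len(y_true)
    # Pairs of points placed together in both labelings, in the true one, in the predicted one.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(y_true, y_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(y_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(y_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)   # agreement expected by chance
    max_index = (pairs_true + pairs_pred) / 2
    if max_index == expected:                         # degenerate case, e.g. a single cluster
        return 1.0
    return (pairs_both - expected) / (max_index - expected)

same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # labels swapped only
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])          # every true pair is split
print(same_partition, crossed)
```

The score ignores the label names themselves, so swapping cluster ids still gives a perfect score, while a partition that splits every true pair scores below zero.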

Evaluate the cluster quality using:

Conductance (`cond`)
Density (`density`)
Modularity (`mod`)
Separability (`sep`)
Triad Participation Ratio (`tpr`)

See "Defining and Evaluating Network Communities based on Ground-truth", Yang and Leskovec, 2012, for definitions.

In [ ]:

# predicted cluster quality
cqual = sn.clustering.cluster_quality(gg, ylabels)
print('\nPredicted Cluster Quality:')
pprint(cqual)

Predicted Cluster Quality:
{'cc': 0.6257503295835909,
 'cond': 0.06369122572576431,
 'density': 0.36662521802464876,
 'mod': 0.603128505047268,
 'sep': 9.055860805860805,
 'tpr': 1.0}
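
Three of these scores depend only on the number of edges inside a cluster and the number crossing its boundary. The sketch below follows the per-community definitions from Yang and Leskovec on a toy graph. It is written independently of SimNetPy, the function name is hypothetical, and how SimNetPy aggregates per-cluster values into the single numbers shown above is not covered here.

```python
def community_scores(edges, members):
    """Conductance, density and separability of one community in an undirected graph."""
    members = set(members)
    n_s = len(members)
    m_s = sum(1 for u, v in edges if u in members and v in members)      # internal edges
    c_s = sum(1 for u, v in edges if (u in members) != (v in members))   # boundary edges
    volume = 2 * m_s + c_s
    return {
        "cond": c_s / volume if volume else 0.0,                 # share of edge volume leaving S
        "density": m_s / (n_s * (n_s - 1) / 2) if n_s > 1 else 0.0,
        "sep": m_s / c_s if c_s else float("inf"),               # internal to boundary edge ratio
    }

# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
scores = community_scores(edges, {0, 1, 2})
print(scores)
```

For the first triangle there are three internal edges and one boundary edge, so the conductance is 1/7, the density is 1 and the separability is 3. Low conductance and high separability both indicate a well-separated cluster.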

Because the ground-truth cluster labels are known, we can evaluate not only the quality of the predicted clusters but also how well the network itself encodes the true cluster structure.

In [18]:

# true cluster quality
cqual_ytrue = sn.clustering.cluster_quality(gg, dataset.y)
print('\nGround Truth Cluster Quality:')
pprint(cqual_ytrue)

Ground Truth Cluster Quality:
{'cc': 0.625365265135371,
 'cond': 0.0966136809881984,
 'density': 0.3537953060011884,
 'mod': 0.5701322404262138,
 'sep': 5.4187917425622345,
 'tpr': 0.9901960784313726}

Comparing Networks¶

Create some data. We generate Student's t and Gaussian distributed data for each of the following cluster settings:

Equal 3/10/30 - 3, 10 or 30 equally sized clusters.
Single Large - 10 clusters; 1 large cluster containing >50% of nodes, 7 small clusters (1-5%) and 2 medium clusters (10%, 20%).
Mixed Sizes - 10 clusters of mixed sizes; 3 larger clusters (20-30% of nodes), 2 medium clusters (5-10%) and 5 smaller clusters (1-5%).

In [22]:

N = 500
d = 2
std = 1
cluster_settings = sn.datasets.single_mod_cluster_problems(N)
distypes = ['guassian', 'studentt']
rng = np.random.default_rng(seed=1871702)

data = {distype: {name: sn.datasets.mixed_multi_numeric(len(sizes), d, N, sizes=sizes, distype=distype, rng=rng) for name, sizes in cluster_settings.items()} for distype in distypes}
print(f'Distributions: {list(data.keys())}')
print(f'Cluster Problems: {list(data["guassian"].keys())}')

Distributions: ['guassian', 'studentt']
Cluster Problems: ['equal_3', 'equal_10', 'equal_30', 'single_large', 'mixed_sizes']
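
For the equal-sized settings, the cluster sizes have to sum to N even when N is not divisible by the number of clusters. One simple way to do this is sketched below. `equal_sizes` is a hypothetical helper, and whether `single_mod_cluster_problems` distributes the remainder in the same way is an assumption.

```python
def equal_sizes(n_points, n_clusters):
    """Split n_points into n_clusters sizes that differ by at most one."""
    base, extra = divmod(n_points, n_clusters)
    # The first `extra` clusters each take one of the leftover points.
    return [base + 1 if i < extra else base for i in range(n_clusters)]

print(equal_sizes(100, 3))           # [34, 33, 33], the sizes used in the first example

for k in (3, 10, 30):
    sizes = equal_sizes(500, k)
    print(k, min(sizes), max(sizes), sum(sizes))
```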

Select the distribution and cluster type.

In [9]:

distype = 'guassian'
clstr_name = 'mixed_sizes'
# clstr_name = 'equal_10'

dataset = data[distype][clstr_name]
S = sn.pairwise_sim(dataset.X, metric='euclidean', norm=True)

Plot the networks. Parameters have been chosen so the density of each network is roughly equal.

In [24]:

namedict = {'knn':'KNN','threshold':'Threshold','combined':'Combined', 'skewed_knn':'Linear Skewed KNN', 'log_skewed_knn':'Log Skewed KNN'}

# Settings for graph creation. 
ggs = [{'method':'knn', 'K':8} , {'method':'threshold', 't':0.02}, 
{'method':'combined', 'K':4, 't':0.0175}, {'method':'skewed_knn', 'K':9}, {'method':'log_skewed_knn', 'K':11}]

# Create networks and scale by max degree
max_deg = []
graphs = {}
for gdict in ggs:
    gg = sn.network_from_sim_mat(S, **gdict)
    graphs[gdict['method']] = gg
    # print(np.mean(gg.degree()))
    max_deg.append(np.max(gg.degree()))
max_deg = np.array(max_deg)
max_deg = max_deg / max_deg.max()

# plotting parameters
alpha = 0.8
edge_alpha=0.6
markersize=8
base_size = 2.5

sn.utils.set_science_style()
fig, axs = plt.subplots(2,3,dpi=300, figsize=(9,6))

# Data
ax = axs[0, 0]
sn.plotting.plot_data_col_by_cluster(dataset.X, dataset.y, PCA=False, marker='.', markersize=base_size, ax=ax, alpha=alpha)
ax.set_title('Data')

# Graphs
for i, (ax, gname) in enumerate(zip(axs.flatten()[1:], graphs.keys())):
    g = graphs[gname]
    prop = max_deg[i]
    msize = markersize*prop
    sn.graph.network_plot_col_by_cluster(g, dataset.X, dataset.y, PCA=False, ax=ax, markersize=msize, 
                    min_markersize=base_size, node_alpha=alpha, edge_alpha=edge_alpha, scale_marker=True)
    ax.set_title(f'{namedict[gname]}')

# f= f'network_comp-n{N}-{distype}-{clstr_name}_nodensity.png'
# p = figfolder / f
# sn.utils.save_mpl_figure(fig, savepath=p, svg=False, dpi=300)
plt.show()
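
The parameters above appear to have been picked by hand. One way to match densities automatically is sketched below for a threshold graph: choose the threshold as a quantile of the pairwise similarities so that the desired fraction of pairs is kept. This is an illustrative approach and not part of SimNetPy. The helper names are hypothetical, and the meaning of `t` in `network_from_sim_mat` may differ from the similarity cut-off used here.

```python
import numpy as np

def threshold_for_density(S, target_density):
    """Similarity cut-off that keeps roughly target_density of all node pairs as edges."""
    upper = S[np.triu_indices_from(S, k=1)]      # each unordered pair counted once
    return np.quantile(upper, 1.0 - target_density)

def threshold_adjacency(S, t):
    """0/1 adjacency with an edge wherever the similarity is at least t."""
    A = (S >= t).astype(int)
    np.fill_diagonal(A, 0)
    return A

def density(A):
    n = A.shape[0]
    return A.sum() / (n * (n - 1))               # A counts each undirected edge twice

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
S_demo = 1.0 - dist / dist.max()                 # assumed normalisation to [0, 1]

t = threshold_for_density(S_demo, target_density=0.03)
A_demo = threshold_adjacency(S_demo, t)
print(round(density(A_demo), 4))
```

The same target density could then be used to tune K for the KNN variants, for example by increasing K until the resulting graph density reaches the target.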
