Using SimNetPy
%load_ext autoreload
%autoreload 2
import simnetpy as sn
import numpy as np
import matplotlib.pyplot as plt
from pprint import pprint
Example Usage
This section walks through some example uses of the simnetpy package.
Generating Data
Create some data from a mixture of multivariate Gaussians. Calculate pairwise similarity using the Euclidean metric. Create a network by connecting each data point to its K nearest neighbours (K=10).
N = 100
sizes=np.array([34,33,33])
d = 2
dataset = sn.datasets.mixed_multi_numeric(len(sizes), d, N, sizes=sizes)
# calculate pairwise similarity
S = sn.pairwise_sim(dataset.X, metric='euclidean', norm=True)
# Create an igraph graph from the similarity matrix
gg = sn.network_from_sim_mat(S, method='knn', K=10)
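To make the K-nearest-neighbour step concrete, here is a minimal NumPy sketch of one way to build such a graph. It is not simnetpy's implementation: the helper `knn_adjacency`, the rescaling of distances into similarities, and the rule of keeping an edge when either endpoint selects the other are all assumptions made for illustration.

```python
import numpy as np

def knn_adjacency(S, K):
    """Symmetric K-nearest-neighbour adjacency from a similarity matrix S (hypothetical helper)."""
    S = np.asarray(S, dtype=float).copy()
    n = S.shape[0]
    np.fill_diagonal(S, -np.inf)           # a point is never its own neighbour
    # indices of the K most similar points in each row
    nbrs = np.argsort(-S, axis=1)[:, :K]
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), K), nbrs.ravel()] = True
    return A | A.T                         # assumption: keep an edge if either end selects it

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
# Euclidean distances, rescaled to [0, 1] and flipped into similarities (assumed normalisation)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
S = 1.0 - D / D.max()
A = knn_adjacency(S, K=5)
degrees = A.sum(axis=1)
```

Under this symmetrisation every node has degree at least K, and popular nodes can have considerably more, which is consistent with a mean degree above K in the resulting graph.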
We can look at the statistics of the graph:
- n - number of nodes
- E - number of edges
- Nc - number of connected components
- ncmax - number of nodes in the largest connected component.
- assortivity - degree assortativity (degree mixing within the network)
- avg_degree - mean degree over all vertices.
- max_degree - maximum degree over all vertices.
- median_degree - median degree over all vertices.
- avg_path_length - Average length of shortest path between all vertices.
- avglocalcc - Mean local clustering coefficient
- density - network density (proportion of possible edges that are present)
- diameter - length of the longest shortest path in the network
- globalcc - Global clustering coefficient
# print graph stats
print('Graph Statistics:')
pprint(gg.graph_stats())
# display(gg.graph_stats())
Graph Statistics:
{'E': 627,
 'Nc': 1,
 'assortivity': 0.13012665522902636,
 'avg_degree': 12.54,
 'avg_path_length': 4.084848484848485,
 'avglocalcc': 0.6728595432775001,
 'density': 0.12666666666666665,
 'diameter': 10,
 'globalcc': 0.6379537085744345,
 'max_degree': 21,
 'median_degree': 12.0,
 'n': 100,
 'ncmax': 100}
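As a quick sanity check, the reported mean degree and density can be recomputed from n and E alone, assuming the graph is undirected with no self-loops or multi-edges:

```python
n, E = 100, 627                      # 'n' and 'E' from the statistics above

avg_degree = 2 * E / n               # every edge contributes to two vertex degrees
density = E / (n * (n - 1) / 2)      # edges present / edges possible

print(avg_degree)                    # 12.54
print(round(density, 6))             # 0.126667
```

Both values agree with the `avg_degree` and `density` entries in the output.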
Cluster the graph using spectral clustering. The number of clusters is selected using the eigengap ratio heuristic.
# cluster
ylabels = sn.clustering.spectral_clustering(gg, laplacian='lrw')
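The idea behind eigengap-based selection can be sketched in a few lines of NumPy. This is one common form of the heuristic (pick the k that sits just below the largest gap in the sorted Laplacian eigenvalues); the exact ratio simnetpy uses is not shown here, so the function `eigengap_num_clusters` and its `max_k` parameter are assumptions for illustration only.

```python
import numpy as np

def eigengap_num_clusters(A, max_k=10):
    """Estimate the number of clusters from the largest Laplacian eigengap (illustrative)."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)        # assumes no isolated nodes
    # symmetric normalised Laplacian; it has the same eigenvalues as I - D^-1 A
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    evals = np.sort(np.linalg.eigvalsh(L))[:max_k + 1]
    gaps = np.diff(evals)                  # gaps[i] = evals[i + 1] - evals[i]
    return int(np.argmax(gaps)) + 1        # k eigenvalues lie below the largest gap

# Toy graph: three 10-node cliques chained together by single edges
blocks, size = 3, 10
A = np.kron(np.eye(blocks), np.ones((size, size)))
np.fill_diagonal(A, 0)
for b in range(blocks - 1):
    i, j = b * size, (b + 1) * size
    A[i, j] = A[j, i] = 1

k = eigengap_num_clusters(A)
```

On the toy graph the three near-zero eigenvalues are followed by a large jump, so the heuristic selects three clusters.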
Evaluate the accuracy using:
- Adjusted Mutual Information
- Adjusted Rand Index
- Completeness
- Homogeneity
- V-Measure
See sklearn.metrics for details.
# cluster accuracy
cacc = sn.clustering.cluster_accuracy(dataset.y, ylabels)
print('\nPredicted Cluster Accuracy:')
pprint(cacc)
Predicted Cluster Accuracy:
{'Npred': 3,
 'Ntrue': 3,
 'ami': 0.846420811615343,
 'ari': 0.8829150927802232,
 'complete': 0.8494134300611329,
 'homo': 0.8491790791640881,
 'vm': 0.8492962384461837}
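For reference, the same five scores can be computed directly with scikit-learn. The sketch below uses a small hand-made labelling; mapping the dictionary keys above onto these particular sklearn functions is an assumption based on the metric names, not something confirmed from the simnetpy source.

```python
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    homogeneity_completeness_v_measure,
)

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
# label names are permuted and one point of true cluster 1 is misassigned
y_pred = [1, 1, 1, 0, 0, 2, 2, 2, 2]

homo, complete, vm = homogeneity_completeness_v_measure(y_true, y_pred)
scores = {
    'ami': adjusted_mutual_info_score(y_true, y_pred),
    'ari': adjusted_rand_score(y_true, y_pred),
    'complete': complete,
    'homo': homo,
    'vm': vm,
}
```

All five scores are invariant to how the cluster labels are named, so a perfect clustering scores 1 even if its label ids differ from the ground truth.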
Evaluate the cluster quality using:
- Conductance
- Density
- Modularity
- Separability
- Triad Participation Ratio
See "Defining and Evaluating Network Communities based on Ground-truth" (Yang and Leskovec, 2012) for definitions.
# predicted cluster quality
cqual = sn.clustering.cluster_quality(gg, ylabels)
print('\nPredicted Cluster Quality:')
pprint(cqual)
Predicted Cluster Quality:
{'cc': 0.6257503295835909,
 'cond': 0.06369122572576431,
 'density': 0.36662521802464876,
 'mod': 0.603128505047268,
 'sep': 9.055860805860805,
 'tpr': 1.0}
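Three of these scores have simple closed forms for a single cluster S with n_S nodes, m_S internal edges and c_S boundary edges: density is m_S / (n_S (n_S - 1) / 2), conductance is c_S / (2 m_S + c_S), and separability is m_S / c_S. The sketch below evaluates them on a toy graph. The helper `cluster_scores` is hypothetical, and how simnetpy aggregates per-cluster values into the single numbers above is not shown here.

```python
import numpy as np

def cluster_scores(A, members):
    """Density, conductance and separability of one cluster (hypothetical helper)."""
    A = np.asarray(A)
    mask = np.zeros(len(A), dtype=bool)
    mask[members] = True
    n_s = mask.sum()
    m_s = A[np.ix_(mask, mask)].sum() / 2    # edges with both ends inside the cluster
    c_s = A[np.ix_(mask, ~mask)].sum()       # edges crossing the cluster boundary
    return {
        'density': m_s / (n_s * (n_s - 1) / 2),
        'cond': c_s / (2 * m_s + c_s),
        'sep': m_s / c_s,
    }

# Toy graph: two triangles joined by the single edge (2, 3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1

scores = cluster_scores(A, [0, 1, 2])
```

For the first triangle this gives a density of 1, a conductance of 1/7 and a separability of 3: low conductance and high separability both indicate a cluster that is well separated from the rest of the network.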
Because the ground-truth cluster labels are known, we can evaluate not only the quality of the predicted clusters but also how well the network itself encodes the true cluster structure.
# true cluster quality
cqual_ytrue = sn.clustering.cluster_quality(gg, dataset.y)
print('\nGround Truth Cluster Quality:')
pprint(cqual_ytrue)
Ground Truth Cluster Quality:
{'cc': 0.625365265135371,
 'cond': 0.0966136809881984,
 'density': 0.3537953060011884,
 'mod': 0.5701322404262138,
 'sep': 5.4187917425622345,
 'tpr': 0.9901960784313726}
Comparing Networks
Create some data. We generate Student's t and Gaussian distributed data for each of the following cluster settings:
- Equal 3/10/30 - 3, 10 or 30 equally sized clusters.
- Single Large - 10 clusters; 1 large cluster containing >50% of nodes, 7 small clusters (1-5%) and 2 medium clusters (10%, 20%).
- Mixed Sizes - 10 clusters of mixed sizes; 3 larger clusters (20-30% of nodes), 2 medium clusters (5-10%) and 5 smaller clusters (1-5%)
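Each setting boils down to an array of cluster sizes that sums to N. As a rough illustration of the equal-size case, the hypothetical helper below splits N points as evenly as possible; it is not part of simnetpy, and how `single_mod_cluster_problems` actually constructs its sizes is not shown here.

```python
import numpy as np

def equal_sizes(N, k):
    """Split N points into k clusters whose sizes differ by at most one (hypothetical helper)."""
    base, extra = divmod(N, k)
    sizes = np.full(k, base)
    sizes[:extra] += 1          # hand the remainder to the first clusters
    return sizes

print(equal_sizes(100, 3))      # [34 33 33]
```

For N=100 and k=3 this reproduces the `sizes` array used in the first example.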
N = 500
d = 2
std = 1
cluster_settings = sn.datasets.single_mod_cluster_problems(N)
distypes = ['guassian', 'studentt']
rng = np.random.default_rng(seed=1871702)
data = {
    distype: {
        name: sn.datasets.mixed_multi_numeric(len(sizes), d, N, sizes=sizes, distype=distype, rng=rng)
        for name, sizes in cluster_settings.items()
    }
    for distype in distypes
}
print(f'Distributions: {list(data.keys())}')
print(f'Cluster Problems: {list(data["guassian"].keys())}')
Distributions: ['guassian', 'studentt']
Cluster Problems: ['equal_3', 'equal_10', 'equal_30', 'single_large', 'mixed_sizes']
Select the distribution and cluster type.
distype = 'guassian'
clstr_name = 'mixed_sizes'
# clstr_name = 'equal_10'
dataset = data[distype][clstr_name]
S = sn.pairwise_sim(dataset.X, metric='euclidean', norm=True)
Plot the networks. Parameters have been chosen so the density of each network is roughly equal.
namedict = {'knn':'KNN','threshold':'Threshold','combined':'Combined', 'skewed_knn':'Linear Skewed KNN', 'log_skewed_knn':'Log Skewed KNN'}
# Settings for graph creation.
ggs = [
    {'method': 'knn', 'K': 8},
    {'method': 'threshold', 't': 0.02},
    {'method': 'combined', 'K': 4, 't': 0.0175},
    {'method': 'skewed_knn', 'K': 9},
    {'method': 'log_skewed_knn', 'K': 11},
]
# Create networks and scale by max degree
max_deg = []
graphs = {}
for gdict in ggs:
    gg = sn.network_from_sim_mat(S, **gdict)
    graphs[gdict['method']] = gg
    # print(np.mean(gg.degree()))
    max_deg.append(np.max(gg.degree()))
max_deg = np.array(max_deg)
max_deg = max_deg / max_deg.max()
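The parameter values above were chosen by hand to give roughly equal densities. For a threshold graph, one way to automate that choice is to read the threshold off the sorted pairwise similarities. The sketch below assumes that an edge is kept when similarity is at least t; the function `threshold_for_density` is hypothetical, and whether simnetpy's `t` is applied in this direction or on this scale is not confirmed here.

```python
import numpy as np

def threshold_for_density(S, target_density):
    """Similarity cut-off that keeps roughly target_density of all node pairs (hypothetical helper)."""
    S = np.asarray(S, dtype=float)
    iu = np.triu_indices_from(S, k=1)          # each unordered pair once
    sims = np.sort(S[iu])[::-1]                # most similar pairs first
    n_edges = max(1, int(round(target_density * sims.size)))
    return sims[n_edges - 1]                   # keep pairs with similarity >= this value

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
S_demo = 1.0 - D / D.max()                     # assumed similarity normalisation

t = threshold_for_density(S_demo, 0.1)
A = S_demo >= t
np.fill_diagonal(A, False)
density = A.sum() / (len(A) * (len(A) - 1))    # A counts each edge twice
```

The same target density could then be used to tune K for the KNN-style methods by trying increasing K until the density is matched.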
# plotting parameters
alpha = 0.8
edge_alpha=0.6
markersize=8
base_size = 2.5
sn.utils.set_science_style()
fig, axs = plt.subplots(2,3,dpi=300, figsize=(9,6))
# Data
ax = axs[0, 0]
sn.plotting.plot_data_col_by_cluster(dataset.X, dataset.y, PCA=False, marker='.', markersize=base_size, ax=ax, alpha=alpha)
ax.set_title('Data')
# Graphs
for i, (ax, gname) in enumerate(zip(axs.flatten()[1:], graphs.keys())):
    g = graphs[gname]
    prop = max_deg[i]
    msize = markersize*prop
    sn.graph.network_plot_col_by_cluster(g, dataset.X, dataset.y, PCA=False, ax=ax, markersize=msize,
                                         min_markersize=base_size, node_alpha=alpha, edge_alpha=edge_alpha, scale_marker=True)
    ax.set_title(f'{namedict[gname]}')
# f= f'network_comp-n{N}-{distype}-{clstr_name}_nodensity.png'
# p = figfolder / f
# sn.utils.save_mpl_figure(fig, savepath=p, svg=False, dpi=300)
plt.show()