Clustering Algorithms¶
Clustering algorithms detect structured regions of varying size and compactness. The choice of algorithm depends on network scale, density, and biological context, influencing the resolution and stability of the resulting modules. Every clustering function in RISK accepts a fraction_shortest_edges argument—a rank-based fraction (0, 1] of the shortest edges retained before the algorithm runs—and returns a sparse cluster-assignment matrix (csr_matrix) that can be reused across multiple statistical tests.
Algorithm Overview¶
| Algorithm | Speed | Primary use | When/Why (assumptions & notes) |
|---|---|---|---|
| Louvain | Fast | Default, scalable to very large networks | Greedy modularity optimization (Blondel et al., 2008); efficient for >10⁴ nodes; may produce disconnected subclusters. |
| Leiden | Fast | Improved Louvain with better resolution | Guarantees well-connected communities; more stable than Louvain (Traag et al., 2019); slightly higher runtime. |
| Markov Clustering | Medium | Detect smaller, compact complexes | Flow-based algorithm (Van Dongen, 2008); good for protein complexes or tightly connected submodules. |
| Walktrap | Medium | Hierarchical detection in mid-sized graphs | Random-walk based (Pons & Latapy, 2005); effective for local structure; slower on >10⁴ nodes. |
| Greedy Modularity | Fast | Coarse partitioning | Optimizes modularity via agglomeration; very fast but suffers from resolution limit (Newman, 2004). |
| Label Propagation | Fast | Quick heuristic | Unsupervised label spreading; no objective function; non-deterministic and unstable (Raghavan et al., 2007). |
| Spinglass | Slow | Small networks; theoretical interest | Statistical mechanics approach (Reichardt & Bornholdt, 2006); finds communities by simulating spin states; computationally intensive. |
Choosing an algorithm: quick guidance¶
- For large networks (>10⁴ nodes): Louvain or Leiden (fast, scalable).
- For small, compact complexes: Markov Clustering (protein complexes, submodules).
- For hierarchical/local structure: Walktrap (medium size).
- For exploratory speed: Greedy or Label Propagation (but less precise).
- For research/theory on small graphs: Spinglass (rarely used in practice).
Louvain Clustering¶
Greedy modularity optimization that scales well to large graphs and serves as the default starting point.
When to use:
- Large networks where speed and scalability are essential.
- Baseline clustering for downstream statistical comparisons.
Parameters:
network(nx.Graph): The network to which the clustering is applied.fraction_shortest_edges(float): Rank-based edge filtering (0, 1]; defaults to0.5.resolution(float): Modularity resolution; lower values yield larger clusters. Defaults to0.1.random_seed(int): Seed for reproducibility. Defaults to888.
Returns:
csr_matrix: Sparse cluster-assignment matrix.
clusters_louvain = risk.cluster_louvain(
network=network,
fraction_shortest_edges=0.25,
resolution=0.5,
random_seed=888,
)
Leiden Clustering¶
Connectivity-refined variant of Louvain that produces better separated clusters while remaining fast.
When to use:
- You need Louvain-like speed but improved connectivity guarantees.
- Stability between repeated runs is important.
Parameters:
network(nx.Graph): The network to which the clustering is applied.fraction_shortest_edges(float): Edge filtering fraction. Defaults to0.5.resolution(float): Resolution parameter; defaults to1.0.random_seed(int): Seed for reproducibility. Defaults to888.
Returns:
csr_matrix: Sparse cluster-assignment matrix.
clusters_leiden = risk.cluster_leiden(
network=network,
fraction_shortest_edges=0.25,
resolution=1.0,
random_seed=888,
)
Greedy Modularity Clustering¶
Agglomerative optimization that quickly produces coarse partitions—useful for exploratory analysis or as a lightweight baseline.
When to use:
- Rapid, coarse-grained partitions are sufficient.
- You want to benchmark other algorithms against a simple baseline.
Parameters:
network(nx.Graph): The network to which the clustering is applied.fraction_shortest_edges(float): Edge filtering fraction. Defaults to0.5.
Returns:
csr_matrix: Sparse cluster-assignment matrix.
clusters_greedy = risk.cluster_greedy(
network=network,
fraction_shortest_edges=0.25,
)
Label Propagation Clustering¶
Heuristic label spreading approach that offers a non-parametric, fast partitioning of the network.
When to use:
- You need a parameter-free, extremely fast method.
- Exploratory analysis where minor instability is acceptable.
Parameters:
network(nx.Graph): The network to which the clustering is applied.fraction_shortest_edges(float): Edge filtering fraction. Defaults to0.5.
Returns:
csr_matrix: Sparse cluster-assignment matrix.
clusters_labelprop = risk.cluster_labelprop(
network=network,
fraction_shortest_edges=0.25,
)
Markov Clustering (MCL)¶
Flow-based clustering that captures compact complexes by simulating random walks with expansion and inflation steps.
When to use:
- Detecting tight-knit protein complexes or submodules.
- Medium-sized networks where higher resolution structure is needed.
Parameters:
network(nx.Graph): The network to which the clustering is applied.fraction_shortest_edges(float): Edge filtering fraction. Defaults to0.5.
Returns:
csr_matrix: Sparse cluster-assignment matrix.
clusters_markov = risk.cluster_markov(
network=network,
fraction_shortest_edges=0.25,
)
Walktrap Clustering¶
Hierarchical clustering that groups vertices visited together during short random walks.
When to use:
- Mid-sized graphs where local community structure is important.
- Scenarios where a dendrogram-like hierarchy is desirable.
Parameters:
network(nx.Graph): The network to which the clustering is applied.fraction_shortest_edges(float): Edge filtering fraction. Defaults to0.5.
Returns:
csr_matrix: Sparse cluster-assignment matrix.
clusters_walktrap = risk.cluster_walktrap(
network=network,
fraction_shortest_edges=0.25,
)
Spinglass Clustering¶
Simulated annealing approach inspired by statistical mechanics—slower but effective for small graphs where fine-grained clusters matter.
When to use:
- Research settings that explore objective functions from statistical physics.
- Small networks where runtime is less critical.
Parameters:
network(nx.Graph): The network to which the clustering is applied.fraction_shortest_edges(float): Edge filtering fraction. Defaults to0.5.
clusters_spinglass = risk.cluster_spinglass(
network=network,
fraction_shortest_edges=0.25,
)