Skip to content

Clustering Algorithms

Clustering algorithms detect structured regions of varying size and compactness. The choice of algorithm depends on network scale, density, and biological context, influencing the resolution and stability of the resulting modules. Every clustering function in RISK accepts a fraction_shortest_edges argument—a rank-based fraction (0, 1] of the shortest edges retained before the algorithm runs—and returns a sparse cluster-assignment matrix (csr_matrix) that can be reused across multiple statistical tests.


Algorithm Overview

Algorithm Speed Primary use When/Why (assumptions & notes)
Louvain Fast Default, scalable to very large networks Greedy modularity optimization (Blondel et al., 2008); efficient for >10⁴ nodes; may produce disconnected subclusters.
Leiden Fast Improved Louvain with better resolution Guarantees well-connected communities; more stable than Louvain (Traag et al., 2019); slightly higher runtime.
Markov Clustering Medium Detect smaller, compact complexes Flow-based algorithm (Van Dongen, 2008); good for protein complexes or tightly connected submodules.
Walktrap Medium Hierarchical detection in mid-sized graphs Random-walk based (Pons & Latapy, 2005); effective for local structure; slower on >10⁴ nodes.
Greedy Modularity Fast Coarse partitioning Optimizes modularity via agglomeration; very fast but suffers from resolution limit (Newman, 2004).
Label Propagation Fast Quick heuristic Unsupervised label spreading; no objective function; non-deterministic and unstable (Raghavan et al., 2007).
Spinglass Slow Small networks; theoretical interest Statistical mechanics approach (Reichardt & Bornholdt, 2006); finds communities by simulating spin states; computationally intensive.

Choosing an algorithm: quick guidance

  • For large networks (>10⁴ nodes): Louvain or Leiden (fast, scalable).
  • For small, compact complexes: Markov Clustering (protein complexes, submodules).
  • For hierarchical/local structure: Walktrap (medium size).
  • For exploratory speed: Greedy or Label Propagation (but less precise).
  • For research/theory on small graphs: Spinglass (rarely used in practice).

Louvain Clustering

Greedy modularity optimization that scales well to large graphs and serves as the default starting point.

When to use:

  • Large networks where speed and scalability are essential.
  • Baseline clustering for downstream statistical comparisons.

Parameters:

  • network (nx.Graph): The network to which the clustering is applied.
  • fraction_shortest_edges (float): Rank-based edge filtering (0, 1]; defaults to 0.5.
  • resolution (float): Modularity resolution; lower values yield larger clusters. Defaults to 0.1.
  • random_seed (int): Seed for reproducibility. Defaults to 888.

Returns: csr_matrix: Sparse cluster-assignment matrix.

clusters_louvain = risk.cluster_louvain(
    network=network,
    fraction_shortest_edges=0.25,
    resolution=0.5,
    random_seed=888,
)

Leiden Clustering

Connectivity-refined variant of Louvain that produces better separated clusters while remaining fast.

When to use:

  • You need Louvain-like speed but improved connectivity guarantees.
  • Stability between repeated runs is important.

Parameters:

  • network (nx.Graph): The network to which the clustering is applied.
  • fraction_shortest_edges (float): Edge filtering fraction. Defaults to 0.5.
  • resolution (float): Resolution parameter; defaults to 1.0.
  • random_seed (int): Seed for reproducibility. Defaults to 888.

Returns: csr_matrix: Sparse cluster-assignment matrix.

clusters_leiden = risk.cluster_leiden(
    network=network,
    fraction_shortest_edges=0.25,
    resolution=1.0,
    random_seed=888,
)

Greedy Modularity Clustering

Agglomerative optimization that quickly produces coarse partitions—useful for exploratory analysis or as a lightweight baseline.

When to use:

  • Rapid, coarse-grained partitions are sufficient.
  • You want to benchmark other algorithms against a simple baseline.

Parameters:

  • network (nx.Graph): The network to which the clustering is applied.
  • fraction_shortest_edges (float): Edge filtering fraction. Defaults to 0.5.

Returns: csr_matrix: Sparse cluster-assignment matrix.

clusters_greedy = risk.cluster_greedy(
    network=network,
    fraction_shortest_edges=0.25,
)

Label Propagation Clustering

Heuristic label spreading approach that offers a non-parametric, fast partitioning of the network.

When to use:

  • You need a parameter-free, extremely fast method.
  • Exploratory analysis where minor instability is acceptable.

Parameters:

  • network (nx.Graph): The network to which the clustering is applied.
  • fraction_shortest_edges (float): Edge filtering fraction. Defaults to 0.5.

Returns: csr_matrix: Sparse cluster-assignment matrix.

clusters_labelprop = risk.cluster_labelprop(
    network=network,
    fraction_shortest_edges=0.25,
)

Markov Clustering (MCL)

Flow-based clustering that captures compact complexes by simulating random walks with expansion and inflation steps.

When to use:

  • Detecting tight-knit protein complexes or submodules.
  • Medium-sized networks where higher resolution structure is needed.

Parameters:

  • network (nx.Graph): The network to which the clustering is applied.
  • fraction_shortest_edges (float): Edge filtering fraction. Defaults to 0.5.

Returns: csr_matrix: Sparse cluster-assignment matrix.

clusters_markov = risk.cluster_markov(
    network=network,
    fraction_shortest_edges=0.25,
)

Walktrap Clustering

Hierarchical clustering that groups vertices visited together during short random walks.

When to use:

  • Mid-sized graphs where local community structure is important.
  • Scenarios where a dendrogram-like hierarchy is desirable.

Parameters:

  • network (nx.Graph): The network to which the clustering is applied.
  • fraction_shortest_edges (float): Edge filtering fraction. Defaults to 0.5.

Returns: csr_matrix: Sparse cluster-assignment matrix.

clusters_walktrap = risk.cluster_walktrap(
    network=network,
    fraction_shortest_edges=0.25,
)

Spinglass Clustering

Simulated annealing approach inspired by statistical mechanics—slower but effective for small graphs where fine-grained clusters matter.

When to use:

  • Research settings that explore objective functions from statistical physics.
  • Small networks where runtime is less critical.

Parameters:

  • network (nx.Graph): The network to which the clustering is applied.
  • fraction_shortest_edges (float): Edge filtering fraction. Defaults to 0.5.
clusters_spinglass = risk.cluster_spinglass(
    network=network,
    fraction_shortest_edges=0.25,
)

Next Step

Statistical Methods