-
Notifications
You must be signed in to change notification settings - Fork 0
/
index.html
282 lines (247 loc) · 17.7 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Bayan Project</title>
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<header>
<div class="container">
<h1 class="logo">Bayan Project</h1>
<nav>
<ul>
<li><a href="#about" class = "false">Home</a></li>
<li><a href="#collaborators" class="false">Collaborators</a></li>
</ul>
</nav>
</div>
</header>
<section id = "about">
<h1 id = "about-title">What is Bayan project about?</h1>
<div id = "about-div">
<h2>
Aim
</h2>
<p>
<ul>
<li>
Developing a reliable and accurate algorithm for descriptive community detection in small and mid-sized networks, with optimality guarantees.
</li>
</ul>
</p>
<h2>
Community Detection
</h2>
<ul id="cd-ul">
<li>
Descriptive community detection is the process of clustering the nodes of a network (graph) based on the structural patterns. It is an unsupervised task with a wide range of applications in life sciences (e.g., biological systems), social sciences (offline/online social networks), physical sciences (VLSI design, complex systems), mathematical sciences (agent-based models, computer vision), and even health sciences (drug discovery) <a href="https://en.wikipedia.org/wiki/Community_structure">(Wikipedia). </a>
<img src="6meupu.gif" id = "cd-gif"> </img>
</li>
<li>
There are several existing methods (algorithms) for approaching this computational problem. One family of algorithms attempts to find communities by maximizing <em>modularity</em>: a network-level quantity indicating the fraction of intra-community edges minus the expected fraction under a degree-preserving null model.
</li>
</ul>
<ul>
</ul>
</div>
</section>
<section id = "current-work">
<h1 id="curr-work">Do we need yet another community detection algorithm? Yes, and here's why:</h1>
<div id="curr-work-div">
<ul>
<li>
In the <a href="https://doi.org/10.1016/j.jocs.2024.102283">first part of this project</a>, we solved the following exact Integer Programming (IP) model for a testbed of over 100 real and random networks with reasonably modular structures. We evaluated nine existing modularity-based community detection methods, <em> <a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">CNM (Greedy)</a>, <a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Louvain</a>, <a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Combo</a>, <a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Belief</a></em>, <em><a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Leiden</a></em>, <em><a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Paris</a></em>, <em><a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">EdMot</a></em>, <em><a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Leicht Newman (LN)</a></em>, and <em><a href="https://github.com/Alexander-Belyi/GNNS">Graph neural network (GNN)</a></em>.<br>
</li>
<img src="ip_formulation.png">
<br>
Notation for the <a href="https://doi.org/10.1080/15427951.2014.950875"> IP model</a>: The input is graph G=(V,E) with |V|=n nodes, |E|=m edges, and adjacency matrix entries a_{ij}. d_i is the degree of node i, gamma is the resolution parameter. The term associated with each pair (i,j) is b_{ij} and referred to as the modularity matrix entry for (i,j). Using the binary decision variable x_{ij} for each pair of distinct nodes (i,j), their cluster membership is either the same (represented by x_{ij}=0) or different (represented by x_{ij}=1). K(i,j) indicates a minimum-cardinality separating set for the nodes i,j. Accordingly, the problem of maximizing the modularity of input graph G can be formulated as the IP model above <a href="https://doi.org/10.1080/15427951.2014.950875">(Dinh and Thai 2015)</a>.
<br><br>
<li>
We investigated the extent to which current modularity-based heuristics succeeds in maximizing modularity by evaluating <br>(1) the ratio of their output modularity to the maximum modularity for a given graph (GOP) and <br>(2) the similarity between their output partition and all modularity-maximizing partitions of that graph using Adjusted Mutual Information (AMI).<br> The following figure shows a scatterplot of GOP on the y-axis and AMI on the x-axis for each algorithm (and their variations).<br>
</li>
<img src="scatter.png"
width="650">
<br>
<li>
Looking at the y-axis values in the figure, we observe that there is a substantial variation in the values of GOP (i.e., extent of sub-optimality) returned by the nine heuristic algorithms. The x-axis values in the figure show the considerable dissimilarity between the sub-optimal partitions and any optimal partition for these networks. Focusing on the position of data points, we observe that they are mostly located above their corresponding 45-degree line. This indicates that sub-optimal partitions tend to be disproportionally dissimilar to any optimal partition. This result goes against the naive viewpoint that close-to-maximum modularity partitions are also close to an optimal partition.
<li>
The average modularity-based heuristic algorithm returns optimal partitions for only 43.9% of these reasonably modular networks. Taken together with the low success rates of heuristic algorithms in maximizing modularity, our results indicate a crucial mismatch between the design philosophy of modularity maximization heuristics for community detection and their performance: Heuristic modularity maximization algorithms rarely return an optimal partition or a partition resembling an optimal partition, even on networks with a reasonably modular structure.
</li>
<li>
The Bayan algorithm solves this fundamental problem and provides optimal partitions and partitions within a guaranteed proximity to an optimal partition (exact solutions and approximate solutions for modularity maximization).
</li>
</ul>
</div>
</section>
<section id = "current-work">
<h1 id="curr-work">How does Bayan compare with other community detection methods w.r.t. retrieving planted partitions?</h1>
<div id="curr-work-div">
<ul>
<li>
In the <a href="https://arxiv.org/pdf/2209.04562">second part of this project</a>, we compared 30 community detection methods in a testbed of structurally diverse and randomly generated LFR and ABCD benchmark graphs. We do not assume that modularity is a silver bullet for community detection. Regardless of using modularity or other approaches, community detection methods can be compared based on their capability in retrieving the ground-truth communities (a.k.a. planted partitions) of LFR and ABCD benchmark graphs using AMI. We generate LFR and ABCD graphs with different values for the mixing parameter (fraction of inter-community edges). Increasing the mixing parameter (mu) increases the disassociation of the structure and ground-truth communities.
</li>
<li>
This short animation shows an <a href="https://github.com/bkamins/ABCDGraphGenerator.jl">ABCD benchmark graph</a> which was generated using a small mixing parameter. The (ground-truth) planted partition is shown by the node colors in the animation. In this standard retrieval test (based on LFR or ABCD benchmarks), each algorithm take the uncolored graph and clusters the nodes based on the structure. The algorithm whose partition is more similar to the planted partition is going to have a higher AMI and therefore a better rank compared to other algorithms.
</li>
<img src="abcd.gif"
width="650">
<br>
<li>
We evaluated the AMI between the ground-truth planted partitions and the partitions returned from each algorithm,
<em> <a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Greedy (CNM)</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Louvain</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Combo</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Belief</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Leiden</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Paris</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">CPM</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Chinese whispers</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">EdMot</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Genetic Algorithm (GA)</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Gemsec</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Infomap</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">k-cut</a>,
<a href="https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.community.label_propagation.asyn_lpa_communities.html">asynchronous label propagation</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">semi-synchronous label propagation</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Walktrap</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Reichardt Bornholdt with Erdos-Renyi (RB)</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Reichardt Bornholdt with configuration model (LN)</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Significant scales</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Surprise</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Weighted Community Clustering (WCC)</a>,
<a href="https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.community.kernighan_lin.kernighan_lin_bisection.html#networkx.algorithms.community.kernighan_lin.kernighan_lin_bisection">Kernighan-Lin bisection</a>,
<a href="https://cdlib.readthedocs.io/en/latest/bibliography.html#algorithms">Diffusion Entropy Reduced (DER)</a>,
<a href="https://github.com/barahona-research-group/PyGenStability">Markov stability with random selection</a>,
<a href="https://github.com/barahona-research-group/PyGenStability">Markov stability with minimum NVI</a>,
<a href="https://graph-tool.skewed.de/static/doc/demos/inference/inference.html">Stochastic Block Model (SBM)</a>,
<a href="https://graph-tool.skewed.de/static/doc/demos/inference/inference.html">SBM with Markov chain Monte Carlo (MCMC)</a>,
<a href="https://graph-tool.skewed.de/static/doc/demos/inference/inference.html">Bayesian Planted Partition (BPP)</a>, and
<a href="https://graph-tool.skewed.de/static/doc/demos/inference/inference.html">BPP with MCMC</a>.
</em><br>
</li>
<li>
For the two Markov stability algorithms (MR and MV), we have used the <em>PyGenStability</em> library (version 0.2.2). For the four inferential methods (SB, SM, BP, and BM), we have used the <em>graph_tool</em> library (version 2.57). For the two algorithms, Kernighan-Lin and asynchronous label propagation, we have used the <em>NetworkX</em> library (version 3.1). For the remaining 21 algorithm, we have used the Community Discovery library (<em>CDlib</em>) (version 0.2.6).<br>
</li>
<img src="ranking.png"
width="650">
<li>
The figure above ranks the 30 algorithms based on AMI in five experiment settings (each with a specific mu parameter value) averaged over 100 LFR graphs.
</li>
<li>
Bayan has a relatively high and stable performance rank over different values of the mu parameter. These results reaffirm that the practical relevance of modularity as an objective function, despite its <a href="https://skewed.de/tiago/posts/modularity-harmful/">theoretical flaws</a>, is yet to be challenged by a practical method that is more accurate and stable. The descriptive ranking results are validated by rigorous statistical tests on the performance of the algorithms (read the Bayan preprint for more info).
</li>
</ul>
</div>
</section>
<section id = "current-work">
<h1 id="curr-work">Interested to learn more? Check out our three peer-reviewed papers below.</h1>
<div id="curr-work-div">
<ul>
<li>
Samin Aref, Mahdi Mostajabdaveh, and Hriday Chheda. "Heuristic modularity maximization algorithms for community detection rarely return an optimal partition or anything similar." in Computational Science - ICCS 2023, LNCS 10476, Springer, 2023 <a href="https://doi.org/10.1007/978-3-031-36027-5_48">doi.org/10.1007/978-3-031-36027-5_48</a> <a href="https://arxiv.org/pdf/2302.14698">Preprint</a>
</li>
<li>
Samin Aref and Mahdi Mostajabdaveh. "Analyzing modularity maximization in approximation, heuristic, and graph neural network algorithms for community detection." Journal of Computational Science, 2024
<a href="https://doi.org/10.1016/j.jocs.2024.102283">doi.org/10.1016/j.jocs.2024.102283</a>
<a href="https://arxiv.org/pdf/2310.10898">Preprint</a>
</li>
<li>
Samin Aref, Mahdi Mostajabdaveh, and Hriday Chheda. "The Bayan algorithm: detecting communities in networks through exact and approximate optimization of modularity." Physical Review E, 2024 <a href="https://arxiv.org/pdf/2209.04562">Preprint</a>
</li>
</section>
<section id = "current-work">
<h1 id="curr-work">You can start using Bayan in Python today.</h1>
<div id="curr-work-div">
<ul>
<li>
The Bayan algorithm is publicly available through the <a href="https://pypi.org/project/bayanpy/">package installer for Python (pip) </a>. </li>
<li>
Resources for the Bayan algorithm in Python: <br>
- <a href="https://colab.research.google.com/drive/1_Clxp5FcEYJVn_w7Q6zm8bX0t1TEg7m_?usp=sharing">Minimal working examples of Bayan in Google Colab</a> <br>
- <a href="https://github.com/saref/bayan">Bayan GitHub Repository</a> <br>
- <a href="https://pypi.org/project/bayanpy/
">Bayan library on pypi</a> <br>
</li>
<li>
Running Bayan for optimization models with more than 2000 variables or constraints (input graphs with more than 44 nodes) requires a free academic license to be installed for the mathematical solver Gurobi (which solves the linear programming relaxation problems within the Bayan algorithm). You may follow the instructions in the <a href="https://github.com/saref/bayan">GitHub Repository</a> to request and install a free academic license for Gurobi if your usage is not commercial and you are affiliated with an academic institution.
</li>
<li>
The Bayan algorithm can be used as follows in Jupyter Python:
</li>
<img src="python_example.png">
</section>
<section id = "collaborators">
<!-- <div id="collab-div"> -->
<h1 id = "collab">Project Members</h1>
<!-- </div> -->
<div id = "personal-samin">
<img src="prof_samin_aref.jpeg" id="myphoto" >
<h1 id = "nametag">Prof. Samin Aref</h1>
<p id = "student">Department of Mechanical & Industrial Engineering, University of Toronto</p>
<p id = "univ"></p>
</a>
<a href="mailto:[email protected]">
<img src="gmailicon.png" id = "gmailicon">
</a>
<a href="https://www.linkedin.com/in/saref/">
<img src="linkedinicon.png" id = "linkedinicon">
</a>
<a href="https://saref.github.io/">
<img src="www.jpg" id = "linkedinicon">
</a>
</div>
<div id = "personal-mahdi" >
<img src="Mahdi.jpg" id="myphoto" >
<h1 id = "nametag">Mahdi Mostajabdaveh, Ph.D.</h1>
<p id = "student">Department of Mathematical & Industrial Engineering, Polytechnique Montreal</p>
<p id = "univ"> </p
<a href="http://mahdimostajab.com/">
<p id="suds"></p>
<a href="mailto:[email protected]">
<img src="gmailicon.png" id = "gmailicon">
</a>
<a href="https://www.linkedin.com/in/mahdi-mostajabdaveh/">
<img src="linkedinicon.png" id = "linkedinicon">
</a>
</div>
<div id = "personal-hriday">
<img src="Display.jpg" id="myphoto" >
<h1 id = "nametag">Hriday Chheda</h1>
<p id = "student">Department of Mechanical & Industrial Engineering, University of Toronto</p>
<p id = "univ"></p>
</a>
<a href="mailto:[email protected]">
<img src="gmailicon.png" id = "gmailicon">
</a>
<a href="https://www.linkedin.com/in/hriday-chheda-264871134/">
<img src="linkedinicon.png" id = "linkedinicon">
</a>
</div>
</section>
<div style="clear: both;"></div>
<section id="nameTheory">
<h1 id="name-h">Origin of the algorithm name "Bayan"?</h1>
<div id="bayan-name-div">
<ul id="name-ul">
<img src="bayan_village.png" id="bayan-img">
<li>
Bayan is a village in the Southeast of Iran located at the midpoint of the hometowns of the developers of this project.
</li>
<li>
Bayan is a small "<em>community</em>" of 22 people in 5 families as of the 2016 census.
</li>
</ul>
</div>
</section>
<section id="acknowledgment">
<h1 id="name-h">Acknowledgments</h1>
<div id="acknowledgment-div">
<ul id="name-ul">
<img src="dsilogo.png" id="dsi-img" height="100"><br>
The Bayan project is supported by the Data Sciences Institute at the University of Toronto.
</ul>
</div>
</section>
</body>
</html>