2 %% Copyright 2007, 2008, 2009 Elsevier Ltd
4 %% This file is part of the 'Elsarticle Bundle'.
5 %% ---------------------------------------------
7 %% It may be distributed under the conditions of the LaTeX Project Public
8 %% License, either version 1.2 of this license or (at your option) any
9 %% later version. The latest version of this license is in
10 %% http://www.latex-project.org/lppl.txt
11 %% and version 1.2 or later is part of all distributions of LaTeX
12 %% version 1999/12/01 or later.
14 %% The list of all files belonging to the 'Elsarticle Bundle' is
15 %% given in the file `manifest.txt'.
18 %% Template article for Elsevier's document class `elsarticle'
19 %% with numbered style bibliographic references
22 \documentclass[preprint,12pt]{elsarticle}
24 %% Use the option review to obtain double line spacing
25 %% \documentclass[authoryear,preprint,review,12pt]{elsarticle}
27 %% Use the options 1p,twocolumn; 3p; 3p,twocolumn; 5p; or 5p,twocolumn
28 %% for a journal layout:
29 %% \documentclass[final,1p,times]{elsarticle}
30 %% \documentclass[final,1p,times,twocolumn]{elsarticle}
31 %% \documentclass[final,3p,times]{elsarticle}
32 %% \documentclass[final,3p,times,twocolumn]{elsarticle}
33 %% \documentclass[final,5p,times]{elsarticle}
34 %% \documentclass[final,5p,times,twocolumn]{elsarticle}
36 %% For including figures, graphicx.sty has been loaded in
37 %% elsarticle.cls. If you prefer to use the old commands
38 %% please give \usepackage{epsfig}
40 %% The amssymb package provides various useful mathematical symbols
42 %% The amsthm package provides extended theorem environments
43 %% \usepackage{amsthm}
45 %% The lineno packages adds line numbers. Start line numbering with
46 %% \begin{linenumbers}, end it with \end{linenumbers}. Or switch it on
47 %% for the whole article with \linenumbers.
48 %% \usepackage{lineno}
51 %% \usepackage[pdftex]{graphicx}
54 \pgfplotsset{width=9cm}
55 \pgfplotsset{compat=1.8}
58 \usepackage{subcaption}
60 \usepackage{algorithm}
61 \usepackage{algpseudocode}
64 \usepackage{amsthm,amssymb}
65 \renewcommand{\qedsymbol}{\rule{0.7em}{0.7em}}
67 \newtheorem{theorem}{Theorem}[subsection]
68 \newtheorem{corollary}{Corollary}[theorem]
69 \newtheorem{claim}[theorem]{Claim}
71 \newtheorem{definition}{Definition}[subsection]
72 \newtheorem{notation}{Notation}[subsection]
73 \newtheorem{example}{Example}[subsection]
74 \usetikzlibrary{decorations.markings}
75 \let\oldproofname=\proofname
76 %% \renewcommand{\proofname}{\rm\bf{Proof:}}
78 \captionsetup{font=normalsize}
80 \journal{Discrete Applied Mathematics}
86 %% Title, authors and addresses
88 %% use the tnoteref command within \title for footnotes;
89 %% use the tnotetext command for theassociated footnote;
90 %% use the fnref command within \author or \address for footnotes;
91 %% use the fntext command for theassociated footnote;
92 %% use the corref command within \author for corresponding author footnotes;
93 %% use the cortext command for theassociated footnote;
94 %% use the ead command for the email address,
95 %% and the form \ead[url] for the home page:
96 %% \title{Title\tnoteref{label1}}
97 %% \tnotetext[label1]{}
98 %% \author{Name\corref{cor1}\fnref{label2}}
99 %% \ead{email address}
100 %% \ead[url]{home page}
103 %% \address{Address\fnref{label3}}
106 \title{Improved Algorithms for Matching Biological Graphs}
108 %% use optional labels to link authors explicitly to addresses:
109 %% \author[label1,label2]{}
110 %% \address[label1]{}
111 %% \address[label2]{}
113 \author{Alp{\'a}r J{\"u}ttner and P{\'e}ter Madarasi}
115 \address{Dept of Operations Research, ELTE}
118 Subgraph isomorphism is a well-known NP-Complete problem, while its
119 special case, the graph isomorphism problem is one of the few problems
120 in NP neither known to be in P nor NP-Complete. Their appearance in
121 many fields of application such as pattern analysis, computer vision
122 questions and the analysis of chemical and biological systems has
123 fostered the design of various algorithms for handling special graph
126 This paper presents VF2++, a new algorithm based on the original VF2,
127 which runs significantly faster on most test cases and performs
128 especially well on special graph classes stemming from biological
129 questions. VF2++ handles graphs of thousands of nodes in practically
130 near linear time including preprocessing. Not only is it an improved
131 version of VF2, but in fact, it is by far the fastest existing
132 algorithm especially on biological graphs.
134 The reason for VF2++' superiority over VF2 is twofold. Firstly, taking
135 into account the structure and the node labeling of the graph, VF2++
136 determines a state order in which most of the unfruitful branches of
137 the search space can be pruned immediately. Secondly, introducing more
138 efficient - nevertheless still easier to compute - cutting rules
139 reduces the chance of going astray even further.
141 In addition to the usual subgraph isomorphism, specialized versions
142 for induced subgraph isomorphism and for graph isomorphism are
143 presented. VF2++ has gained a runtime improvement of one order of
144 magnitude respecting induced subgraph isomorphism and a better
145 asymptotical behaviour in the case of graph isomorphism problem.
147 After having provided the description of VF2++, in order to evaluate
148 its effectiveness, an extensive comparison to the contemporary other
149 algorithms is shown, using a wide range of inputs, including both real
150 life biological and chemical datasets and standard randomly generated
153 The work was motivated and sponsored by QuantumBio Inc., and all the
154 developed algorithms are available as the part of the open source
155 LEMON graph and network optimization library
156 (http://lemon.cs.elte.hu).
160 %% keywords here, in the form: keyword \sep keyword
162 %% PACS codes here, in the form: \PACS code \sep code
164 %% MSC codes here, in the form: \MSC code \sep code
165 %% or \MSC[2008] code \sep code (2000 is the default)
174 \section{Introduction}
177 In the last decades, combinatorial structures, and especially graphs
178 have been considered with ever increasing interest, and applied to the
179 solution of several new and revised questions. The expressiveness,
180 the simplicity and the studiedness of graphs make them practical for
181 modelling and appear constantly in several seemingly independent
182 fields, such as bioinformatics and chemistry.
184 Complex biological systems arise from the interaction and cooperation
185 of plenty of molecular components. Getting acquainted with such
186 systems at the molecular level is of primary importance, since
187 protein-protein interaction, DNA-protein interaction, metabolic
188 interaction, transcription factor binding, neuronal networks, and
189 hormone signaling networks can be understood this way.
191 Many chemical and biological structures can easily be modeled
192 as graphs, for instance, a molecular structure can be
193 considered as a graph, whose nodes correspond to atoms and whose
194 edges to chemical bonds. The similarity and dissimilarity of
195 objects corresponding to nodes are incorporated to the model
196 by \emph{node labels}. Understanding such networks basically
197 requires finding specific subgraphs, thus calls for efficient
198 graph matching algorithms.
200 Other real-world fields related to some
201 variants of graph matching include pattern recognition
202 and machine vision \cite{HorstBunkeApplications}, symbol recognition
203 \cite{CordellaVentoSymbolRecognition}, face identification
204 \cite{JianzhuangYongFaceIdentification}. \\
206 Subgraph and induced subgraph matching problems are known to be
207 NP-Complete\cite{SubgraphNPC}, while the graph isomorphism problem is
208 one of the few problems in NP neither known to be in P nor
209 NP-Complete. Although polynomial time isomorphism algorithms are known
210 for various graph classes, like trees and planar
211 graphs\cite{PlanarGraphIso}, bounded valence
212 graphs\cite{BondedDegGraphIso}, interval graphs\cite{IntervalGraphIso}
213 or permutation graphs\cite{PermGraphIso}, and recently, an FPT algorithm has been presented for the coloured hypergraph isomorphism problem in \cite{ColoredHiperGraphIso}.
215 In the following, some algorithms based on other approaches are
216 summarized, which do not need any restrictions on the graphs. Even though,
217 an overall polynomial behaviour is not expectable from such an
218 alternative, it may often have good practical performance, in fact,
219 it might be the best choice even on a graph class for which polynomial
222 The first practically usable approach was due to
223 \emph{Ullmann}\cite{Ullmann} which is a commonly used depth-first
224 search based algorithm with a complex heuristic for reducing the
225 number of visited states. A major problem is its $\Theta(n^3)$ space
226 complexity, which makes it impractical in the case of big sparse
229 In a recent paper, Ullmann\cite{UllmannBit} presents an
230 improved version of this algorithm based on a bit-vector solution for
231 the binary Constraint Satisfaction Problem.
233 The \emph{Nauty} algorithm\cite{Nauty} transforms the two graphs to
234 a canonical form before starting to check for the isomorphism. It has
235 been considered as one of the fastest graph isomorphism algorithms,
236 although graph categories were shown in which it takes exponentially
237 many steps. This algorithm handles only the graph isomorphism problem.
239 The \emph{LAD} algorithm\cite{Lad} uses a depth-first search
240 strategy and formulates the matching as a Constraint Satisfaction
241 Problem to prune the search tree. The constraints are that the mapping
242 has to be injective and edge-preserving, hence it is possible to
243 handle new matching types as well.
245 The \emph{RI} algorithm\cite{RI} and its variations are based on a
246 state space representation. After reordering the nodes of the graphs,
247 it uses some fast executable heuristic checks without using any
248 complex pruning rules. It seems to run really efficiently on graphs
249 coming from biology, and won the International Contest on Pattern
250 Search in Biological Databases\cite{Content}.
252 The currently most commonly used algorithm is the
253 \emph{VF2}\cite{VF2}, the improved version of \emph{VF}\cite{VF}, which was
254 designed for solving pattern matching and computer vision problems,
255 and has been one of the best overall algorithms for more than a
256 decade. Although, it can't be up to new specialized algorithms, it is
257 still widely used due to its simplicity and space efficiency. VF2 uses
258 a state space representation and checks some conditions in each state
259 to prune the search tree.
261 Meanwhile, another variant called \emph{VF2 Plus}\cite{VF2Plus} has
262 been published. It is considered to be as efficient as the RI
263 algorithm and has a strictly better behavior on large graphs. The
264 main idea of VF2 Plus is to precompute a heuristic node order of the
265 small graph, in which the VF2 works more efficiently.
267 This paper introduces \emph{VF2++}, a new further improved algorithm
268 for the graph and (induced)subgraph isomorphism problem, which uses
269 efficient cutting rules and determines a node order in which VF2 runs
270 significantly faster on practical inputs.
272 This project was initiated and sponsored by QuantumBio
273 Inc.\cite{QUANTUMBIO} and the implementation --- along with a source
274 code --- has been published as a part of LEMON\cite{LEMON} open source
277 \section{Problem Statement}
278 This section provides a formal description of the problems to be
280 \subsection{Definitions}
282 Throughout the paper $G_{1}=(V_{1}, E_{1})$ and
283 $G_{2}=(V_{2}, E_{2})$ denote two undirected graphs.
286 $\mathcal{L}: (V_{1}\cup V_{2}) \longrightarrow K$ is a \textbf{node
287 label function}, where K is an arbitrary set. The elements in K
288 are the \textbf{node labels}. Two nodes, u and v are said to be
289 \textbf{equivalent} if $\mathcal{L}(u)=\mathcal{L}(v)$.
292 For the sake of simplicity, in this paper the graph, subgraph and induced subgraph isomorphisms are defined in a more general way.
294 \begin{definition}\label{sec:ismorphic}
295 $G_{1}$ and $G_{2}$ are \textbf{isomorphic} (by the node label $\mathcal{L}$) if $\exists \mathfrak{m}:
296 V_{1} \longrightarrow V_{2}$ bijection, for which the
299 $\forall u\in{V_{1}} : \mathcal{L}(u)=\mathcal{L}(\mathfrak{m}(u))$ and\\
300 $\forall u,v\in{V_{1}} : (u,v)\in{E_{1}} \Leftrightarrow (\mathfrak{m}(u),\mathfrak{m}(v))\in{E_{2}}$
305 $G_{1}$ is a \textbf{subgraph} of $G_{2}$ (by the node label $\mathcal{L}$) if $\exists \mathfrak{m}:
306 V_{1}\longrightarrow V_{2}$ injection, for which the
309 $\forall u\in{V_{1}} : \mathcal{L}(u)=\mathcal{L}(\mathfrak{m}(u))$ and\\
310 $\forall u,v \in{V_{1}} : (u,v)\in{E_{1}} \Rightarrow (\mathfrak{m}(u),\mathfrak{m}(v))\in E_{2}$
315 $G_{1}$ is an \textbf{induced subgraph} of $G_{2}$ (by the node label $\mathcal{L}$) if $\exists
316 \mathfrak{m}: V_{1}\longrightarrow V_{2}$ injection, for which the
319 $\forall u\in{V_{1}} : \mathcal{L}(u)=\mathcal{L}(\mathfrak{m}(u))$ and
321 $\forall u,v \in{V_{1}} : (u,v)\in{E_{1}} \Leftrightarrow
322 (\mathfrak{m}(u),\mathfrak{m}(v))\in E_{2}$
327 \subsection{Common problems}\label{sec:CommProb}
329 The focus of this paper is on two extensively studied topics, the
330 subgraph isomorphism and its variations. However, the following
331 problems also appear in many applications.
333 The \textbf{subgraph matching problem} is the following: is
334 $G_{1}$ isomorphic to any subgraph of $G_{2}$ by a given node
337 The \textbf{induced subgraph matching problem} asks the same about the
338 existence of an induced subgraph.
340 The \textbf{graph isomorphism problem} can be defined as induced
341 subgraph matching problem where the sizes of the two graphs are equal.
343 In addition to existence, it may be needed to show such a subgraph, or
344 it may be necessary to list all of them.
346 It should be noted that some authors misleadingly refer to the term
347 \emph{subgraph isomorphism problem} as an \emph{induced subgraph
348 isomorphism problem}.
350 The following sections give the descriptions of VF2, VF2++, VF2 Plus
351 and a particular comparison.
353 \section{The VF2 Algorithm}
354 This algorithm is the basis of both the VF2++ and the VF2 Plus. VF2
355 is able to handle all the variations mentioned in Section
356 \ref{sec:CommProb}. Although it can also handle directed graphs,
357 for the sake of simplicity, only the undirected case will be
361 \subsection{Common notations}
362 \indent Assume $G_{1}$ is searched in $G_{2}$. The following
363 definitions and notations will be used throughout the whole paper.
365 An injection $\mathfrak{m} : D \longrightarrow V_2$ is called (partial) \textbf{mapping}, where $D\subseteq V_1$.
369 $\mathfrak{D}(f)$ and $\mathfrak{R}(f)$ denote the domain and the range of a function $f$, respectively.
373 Mapping $\mathfrak{m}$ \textbf{covers} a node $u\in V_1\cup V_2$ if $u\in \mathfrak{D}(\mathfrak{m})\cup \mathfrak{R}(\mathfrak{m})$.
377 A mapping $\mathfrak{m}$ is $\mathbf{whole\ mapping}$ if $\mathfrak{m}$ covers all the
378 nodes of $V_{1}$, i.e. $\mathfrak{D}(\mathfrak{m})=V_1$.
382 Let \textbf{extend}$(\mathfrak{m},(u,v))$ denote the function $f : \mathfrak{D}(\mathfrak{m})\cup\{u\}\longrightarrow\mathfrak{R}(\mathfrak{m})\cup\{v\}$, for which $\forall w\in \mathfrak{D}(\mathfrak{m}) : \mathfrak{m}(w)=f(w)$ and $f(u)=v$ holds. Where $u\in V_1\setminus\mathfrak{D}(\mathfrak{m})$ and $v\in V_2\setminus\mathfrak{R}(\mathfrak{m})$, otherwise $extend(\mathfrak{m},(u,v))$ is undefined.
386 Throughout the paper, $\mathbf{PT}$ denotes a generic problem type
387 which can be substituted by any of the $\mathbf{ISO}$, $\mathbf{SUB}$
388 and $\mathbf{IND}$ problems.
392 Let $\mathfrak{m}$ be a mapping. A logical function $\mathbf{Cons_{PT}}$ is a
393 \textbf{consistency function by } $\mathbf{PT}$ if the following
394 holds. If there exists a whole mapping $w$ satisfying the requirements of $PT$, for which $\mathfrak{m}$ is exactly $w$ restricted to $\mathfrak{D}(\mathfrak{m})$.
398 Let $\mathfrak{m}$ be a mapping. A logical function $\mathbf{Cut_{PT}}$ is a
399 \textbf{cutting function by } $\mathbf{PT}$ if the following
400 holds. $\mathbf{Cut_{PT}(\mathfrak{m})}$ is false if there exists a sequence of extend operations, which results in a whole mapping satisfying the requirements of $PT$.
404 $\mathfrak{m}$ is said to be \textbf{consistent mapping by} $\mathbf{PT}$ if
405 $Cons_{PT}(\mathfrak{m})$ is true.
408 $Cons_{PT}$ and $Cut_{PT}$ will often be used in the following form.
410 Let $\mathbf{Cons_{PT}(p, \mathfrak{m})}:=Cons_{PT}(extend(\mathfrak{m},p))$, and
411 $\mathbf{Cut_{PT}(p, \mathfrak{m})}:=Cut_{PT}(extend(\mathfrak{m},p))$, where
412 $p\in{V_{1}\backslash\mathfrak{D}(\mathfrak{m}) \!\times\!V_{2}\backslash\mathfrak{R}(\mathfrak{m})}$.
415 $Cons_{PT}$ will be used to check the consistency of the already
416 covered nodes, while $Cut_{PT}$ is for looking ahead to recognize if
417 no whole consistent mapping can contain the current mapping.
419 \subsection{Overview of the algorithm}
420 VF2 uses a state space representation of mappings, $Cons_{PT}$ for
421 excluding inconsistency with the problem type and $Cut_{PT}$ for
422 pruning the search tree.
424 Algorithm~\ref{alg:VF2Pseu} is a high level description of
425 the VF2 matching algorithm. Each state of the matching process can
426 be associated with a mapping $\mathfrak{m}$. The initial state
427 is associated with a mapping $\mathfrak{m}$, for which
428 $\mathfrak{D}(\mathfrak{m})=\emptyset$, i.e. it starts with an empty mapping.
432 \algtext*{EndIf}%ne nyomtasson end if-et
434 \algtext*{EndProcedure}%ne nyomtasson ..
435 \caption{\hspace{0.5cm}$A\ high\ level\ description\ of\ VF2$}\label{alg:VF2Pseu}
436 \begin{algorithmic}[1]
438 \Procedure{VF2}{Mapping $\mathfrak{m}$, ProblemType $PT$}
439 \If{$\mathfrak{m}$ covers
440 $V_{1}$} \State Output($\mathfrak{m}$)
442 \State Compute the set $P_\mathfrak{m}$ of the pairs candidate for inclusion
443 in $\mathfrak{m}$ \ForAll{$p\in{P_\mathfrak{m}}$} \If{Cons$_{PT}$($p,\mathfrak{m}$) $\wedge$
444 $\neg$Cut$_{PT}$($p,\mathfrak{m}$)}
446 VF2($extend(\mathfrak{m},p)$, $PT$) \EndIf \EndFor \EndIf \EndProcedure
451 For the current mapping $\mathfrak{m}$, the algorithm computes $P_\mathfrak{m}$, the set of
452 candidate node pairs for adding to the current mapping $\mathfrak{m}_s$.
454 For each pair $p$ in $P_\mathfrak{m}$, $Cons_{PT}(p,\mathfrak{m})$ and
455 $Cut_{PT}(p,\mathfrak{m})$ are evaluated. If the former is true and
456 the latter is false, the whole process is recursively applied to
457 $extend(\mathfrak{m},p)$. Otherwise, $extend(\mathfrak{m},p)$ is not consistent by $PT$, or it
458 can be proved that $\mathfrak{m}$ can not be extended to a whole mapping.
460 In order to make sure of the correctness, see
462 Through consistent mappings, only consistent whole mappings can be
463 reached, and all the consistent whole mappings are reachable through
467 Note that a mapping may be reached in exponentially many different ways, since the
468 order of extensions does not influence the nascent mapping.
470 However, one may observe
473 \label{claim:claimTotOrd}
474 Let $\prec$ be an arbitrary total ordering relation on $V_{1}$. If
475 the algorithm ignores each $p=(u,v) \in P_\mathfrak{m}$, for which
477 $\exists (\tilde{u},\tilde{v})\in P_\mathfrak{m}: \tilde{u} \prec u$,
479 then no mapping can be reached more than once, and each whole mapping remains reachable.
482 Note that the cornerstone of the improvements to VF2 is a proper
483 choice of a total ordering.
485 \subsection{The candidate set}
486 \label{candidateComputingVF2}
487 Let $P_\mathfrak{m}$ be the set of the candidate pairs for inclusion in $\mathfrak{m}$.
490 Let $\mathbf{T_{1}(\mathfrak{m})}:=\{u \in V_{1}\backslash\mathfrak{D}(\mathfrak{m}) : \exists \tilde{u}\in{\mathfrak{D}(\mathfrak{m}): (u,\tilde{u})\in E_{1}}\}$, and
491 $\mathbf{T_{2}(\mathfrak{m})} := \{v \in V_{2}\backslash\mathfrak{R}(\mathfrak{m}) : \exists\tilde{v}\in{\mathfrak{R}(\mathfrak{m}):(v,\tilde{v})\in E_{2}}\}$.
494 The set $P_\mathfrak{m}$ includes the pairs of uncovered neighbours of covered
495 nodes, and if there is not such a node pair, all the pairs containing
496 two uncovered nodes are added. Formally, let
500 T_{1}(\mathfrak{m})\times T_{2}(\mathfrak{m})&\hspace{-0.15cm}\text{if }
501 T_{1}(\mathfrak{m})\!\neq\!\emptyset\ \text{and }T_{2}(\mathfrak{m})\!\neq
502 \emptyset,\\ (V_{1}\!\setminus\!\mathfrak{D}(\mathfrak{m}))\!\times\!(V_{2}\!\setminus\!\mathfrak{R}(\mathfrak{m}))
503 &\hspace{-0.15cm}\text{otherwise}.
507 \subsection{Consistency}
508 Suppose $p=(u,v)$, where $u\in V_{1}$ and $v\in V_{2}$, $\mathfrak{m}$ is a consistent mapping by
509 $PT$. $Cons_{PT}(p,\mathfrak{m})$ checks whether
510 including pair $p$ into $\mathfrak{m}$ leads to a consistent mapping by $PT$.
512 For example, the consistency function of induced subgraph isomorphism is as follows.
514 Let $\mathbf{\Gamma_{1} (u)}:=\{\tilde{u}\in V_{1} :
515 (u,\tilde{u})\in E_{1}\}$, and $\mathbf{\Gamma_{2}
516 (v)}:=\{\tilde{v}\in V_{2} : (v,\tilde{v})\in E_{2}\}$, where $u\in V_{1}$ and $v\in V_{2}$.
519 $extend(\mathfrak{m},(u,v))$ is a consistent mapping by $IND$ $\Leftrightarrow
520 (\forall \tilde{u}\in \mathfrak{D}(\mathfrak{m}): (u,\tilde{u})\in E_{1}
521 \Leftrightarrow (v,\mathfrak{m}(\tilde{u}))\in E_{2})$. The
522 following formulation gives an efficient way of calculating
525 $Cons_{IND}((u,v),\mathfrak{m}):=\mathcal{L}(u)\!\!=\!\!\mathcal{L}(v)\wedge(\forall \tilde{v}\in \Gamma_{2}(v)\cap\mathfrak{R}(\mathfrak{m}):(u,\mathfrak{m}^{-1}(\tilde{v}))\in E_{1})\wedge
526 (\forall \tilde{u}\in \Gamma_{1}(u)
527 \cap \mathfrak{D}(\mathfrak{m}):(v,\mathfrak{m}(\tilde{u}))\in E_{2})$ is a
528 consistency function in the case of $IND$.
531 \subsection{Cutting rules}
532 $Cut_{PT}(p,\mathfrak{m})$ is defined by a collection of efficiently
533 verifiable conditions. The requirement is that $Cut_{PT}(p,\mathfrak{m})$ can
534 be true only if it is impossible to extend $extend(\mathfrak{m},p)$ to a
537 As an example, the cutting function of induced subgraph isomorphism is presented.
539 Let $\mathbf{\tilde{T}_{1}}(\mathfrak{m}):=(V_{1}\backslash
540 \mathfrak{D}(\mathfrak{m}))\backslash T_{1}(\mathfrak{m})$, and
541 \\ $\mathbf{\tilde{T}_{2}}(\mathfrak{m}):=(V_{2}\backslash
542 \mathfrak{R}(\mathfrak{m}))\backslash T_{2}(\mathfrak{m})$.
546 $Cut_{IND}((u,v),\mathfrak{m}):= |\Gamma_{2} (v)\ \cap\ T_{2}(\mathfrak{m})| <
547 |\Gamma_{1} (u)\ \cap\ T_{1}(\mathfrak{m})| \vee |\Gamma_{2}(v)\cap
548 \tilde{T}_{2}(\mathfrak{m})| < |\Gamma_{1}(u)\cap
549 \tilde{T}_{1}(\mathfrak{m})|$ is a cutting function by $IND$.
552 \section{The VF2++ Algorithm}
553 Although any total ordering relation makes the search space of VF2 a
554 tree, its choice turns out to dramatically influence the number of
555 visited states. The goal is to determine an efficient one as quickly
558 The main reason for VF2++' superiority over VF2 is twofold. Firstly,
559 taking into account the structure and the node labeling of the graph,
560 VF2++ determines a state order in which most of the unfruitful
561 branches of the search space can be pruned immediately. Secondly,
562 introducing more efficient --- nevertheless still easier to compute
563 --- cutting rules reduces the chance of going astray even further.
565 In addition to the usual subgraph isomorphism, specialized versions
566 for induced subgraph isomorphism and for graph isomorphism have been
567 designed. VF2++ has gained a runtime improvement of one order of
568 magnitude respecting induced subgraph isomorphism and a better
569 asymptotical behaviour in the case of graph isomorphism problem.
571 Note that a weaker version of the cutting rules and the more efficient
572 candidate set calculating were described in \cite{VF2Plus}, too.
574 It should be noted that all the methods described in this section are
575 extendable to handle directed graphs and edge labels as well.\newline
577 The basic ideas and the detailed description of VF2++ are provided in
580 The goal is to find a matching order in which the algorithm is able to
581 recognize inconsistency or prune the infeasible branches on the
582 highest levels and goes deep only if it is needed.
585 Let $\mathbf{Conn_{H}(u)}:=|\Gamma_{1}(u)\cap H\}|$, that is the
586 number of neighbours of u which are in H, where $u\in V_{1} $ and
590 The principal question is the following. Suppose a mapping $\mathfrak{m}$ is
591 given. For which node of $T_{1}(\mathfrak{m})$ is the hardest to find a
592 consistent pair in $G_{2}$? The more covered neighbours a node in
593 $T_{1}(\mathfrak{m})$ has --- i.e. the largest $Conn_{\mathfrak{D}(\mathfrak{m})}$ it has
594 ---, the more rarely satisfiable consistency constraints for its pair
597 In biology, most of the graphs are sparse, thus several nodes in
598 $T_{1}(\mathfrak{m})$ may have the same $Conn_{\mathfrak{D}(\mathfrak{m})}$, which makes
599 reasonable to define a secondary and a tertiary order between them.
600 The observation above proves itself to be as determining, that the
601 secondary ordering prefers nodes with the most uncovered neighbours
602 among which have the same $Conn_{\mathfrak{D}(\mathfrak{m})}$ to increase
603 $Conn_{\mathfrak{D}(\mathfrak{m})}$ of uncovered nodes so much, as possible. The
604 tertiary ordering prefers nodes having the rarest uncovered labels.
606 Note that the secondary ordering is the same as the ordering by $deg$,
607 which is a static data in front of the above used.
609 These rules can easily result in a matching order which contains the
610 nodes of a long path successively, whose nodes may have low $Conn$ and
611 is easily matchable into $G_{2}$. To avoid that, a BFS order is
612 used, which provides the shortest possible paths.
615 In the following, some examples on which the VF2 may be slow are
616 described, although they are easily solvable by using a proper
620 Suppose $G_{1}$ can be mapped into $G_{2}$ in many ways
621 without node labels. Let $u\in V_{1}$ and $v\in V_{2}$.
623 $\mathcal{L}(u):=black$
625 $\mathcal{L}(v):=black$
627 $\mathcal{L}(\tilde{u}):=red \ \forall \tilde{u}\in (V_{1}\backslash
630 $\mathcal{L}(\tilde{v}):=red \ \forall \tilde{v}\in (V_{2}\backslash
634 Now, any mapping by $\mathcal{L}$ must contain $(u,v)$, since
635 $u$ is black and no node in $V_{2}$ has a black label except
636 $v$. If unfortunately $u$ were the last node which will get covered,
637 VF2 would check only in the last steps, whether $u$ can be matched to
640 However, had $u$ been the first matched node, u would have been
641 matched immediately to v, so all the mappings would have been
642 precluded in which node labels can not correspond.
646 Suppose there is no node label given, $G_{1}$ is a small graph and
647 can not be mapped into $G_{2}$ and $u\in V_{1}$.
649 Let $G'_{1}:=(V_{1}\cup
650 \{u'_{1},u'_{2},..,u'_{k}\},E_{1}\cup
651 \{(u,u'_{1}),(u'_{1},u'_{2}),..,(u'_{k-1},u'_{k})\})$, that is,
652 $G'_{1}$ is $G_{1}\cup \{ a\ k$ long path, which is disjoint
653 from $G_{1}$ and one of its starting points is connected to $u\in
656 Is there a subgraph of $G_{2}$, which is isomorph with
659 If unfortunately the nodes of the path were the first $k$ nodes in the
660 matching order, the algorithm would iterate through all the possible k
661 long paths in $G_{2}$, and it would recognize that no path can be
662 extended to $G'_{1}$.
664 However, had it started by the matching of $G_{1}$, it would not
665 have matched any nodes of the path.
668 These examples may look artificial, but the same problems also appear
669 in real-world instances, even though in a less obvious way.
671 \subsection{Preparations}
673 \label{claim:claimCoverFromLeft}
674 The total ordering relation uniquely determines a node order, in which
675 the nodes of $V_{1}$ will be covered by VF2. From the point of
676 view of the matching procedure, this means, that always the same node
677 of $G_{1}$ will be covered on the d-th level.
681 An order $(u_{\sigma(1)},u_{\sigma(2)},..,u_{\sigma(|V_{1}|)})$ of
682 $V_{1}$ is \textbf{matching order} if exists $\prec$ total
683 ordering relation, s.t. the VF2 with $\prec$ on the d-th level finds
684 pair for $u_{\sigma(d)}$ for all $d\in\{1,..,|V_{1}|\}$.
687 \begin{claim}\label{claim:MOclaim}
688 A total ordering is matching order iff the nodes of every component
689 form an interval in the node sequence, and every node connects to a
690 previous node in its component except the first node of each component.
693 To summing up, a total ordering always uniquely determines a matching
694 order, and every matching order can be determined by a total ordering,
695 however, more than one different total orderings may determine the
698 \subsection{Total ordering}
699 The matching order will be searched directly.
701 Let \textbf{F$_\mathcal{M}$(l)}$:=|\{v\in V_{2} :
702 l=\mathcal{L}(v)\}|-|\{u\in V_{1}\backslash \mathcal{M} : l=\mathcal{L}(u)\}|$ ,
703 where $l$ is a label and $\mathcal{M}\subseteq V_{1}$.
706 \begin{definition}Let $\mathbf{arg\ max}_{f}(S) :=\{u\in S : f(u)=max_{v\in S}\{f(v)\}\}$ and $\mathbf{arg\ min}_{f}(S) := arg\ max_{-f}(S)$, where $S$ is a finite set and $f:S\longrightarrow \mathbb{R}$.
711 \algtext*{EndProcedure}
714 \caption{\hspace{0.5cm}$The\ method\ of\ VF2++\ for\ determining\ the\ node\ order$}\label{alg:VF2PPPseu}
715 \begin{algorithmic}[1]
716 \Procedure{VF2++order}{} \State $\mathcal{M}$ := $\emptyset$
717 \Comment{matching order} \While{$V_{1}\backslash \mathcal{M}
718 \neq\emptyset$} \State $r\in$ arg max$_{deg}$ (arg
719 min$_{F_\mathcal{M}\circ \mathcal{L}}(V_{1}\backslash
720 \mathcal{M})$)\label{alg:findMin} \State Compute $T$, a BFS tree with
721 root node $r$. \For{$d=0,1,...,depth(T)$} \State $V_d$:=nodes of the
722 $d$-th level \State Process $V_d$ \Comment{See Algorithm
723 \ref{alg:VF2PPProcess1}} \EndFor
724 \EndWhile \EndProcedure
730 \algtext*{EndProcedure}%ne nyomtasson ..
732 \caption{\hspace{.5cm}$The\ method\ for\ processing\ a\ level\ of\ the\ BFS\ tree$}\label{alg:VF2PPProcess1}
733 \begin{algorithmic}[1]
734 \Procedure{VF2++ProcessLevel}{$V_{d}$} \While{$V_d\neq\emptyset$}
735 \State $m\in$ arg min$_{F_\mathcal{M}\circ\ \mathcal{L}}($ arg max$_{deg}($arg
736 max$_{Conn_{\mathcal{M}}}(V_{d})))$ \State $V_d:=V_d\backslash m$
737 \State Append node $m$ to the end of $\mathcal{M}$ \State Refresh
738 $F_\mathcal{M}$ \EndWhile \EndProcedure
742 Algorithm~\ref{alg:VF2PPPseu} is a high level description of the
743 matching order procedure of VF2++. It computes a BFS tree for each
744 component in ascending order of their rarest node labels and largest $deg$,
745 whose root vertex is the component's minimal
746 node. Algorithm~\ref{alg:VF2PPProcess1} is a method to process a level of the BFS tree, which appends the nodes of the current level in descending
747 lexicographic order by $(Conn_{\mathcal{M}},deg,-F_\mathcal{M})$ separately
748 to $\mathcal{M}$, and refreshes $F_\mathcal{M}$ immediately.
750 Claim~\ref{claim:MOclaim} shows that Algorithm~\ref{alg:VF2PPPseu}
751 provides a matching order.
754 \subsection{Cutting rules}
755 \label{VF2PPCuttingRules}
756 This section presents the cutting rules of VF2++, which are improved by using extra information coming from the node labels.
758 Let $\mathbf{\Gamma_{1}^{l}(u)}:=\{\tilde{u} : \mathcal{L}(\tilde{u})=l
759 \wedge \tilde{u}\in \Gamma_{1} (u)\}$ and
760 $\mathbf{\Gamma_{2}^{l}(v)}:=\{\tilde{v} : \mathcal{L}(\tilde{v})=l \wedge
761 \tilde{v}\in \Gamma_{2} (v)\}$, where $u\in V_{1}$, $v\in
762 V_{2}$ and $l$ is a label.
765 \subsubsection{Induced subgraph isomorphism}
767 \[LabCut_{IND}((u,v),\mathfrak{m}):=\bigvee_{l\ is\ label}|\Gamma_{2}^{l} (v) \cap T_{2}(\mathfrak{m})|\!<\!|\Gamma_{1}^{l}(u)\cap T_{1}(\mathfrak{m})|\ \vee\]\[\bigvee_{l\ is\ label} \newline |\Gamma_{2}^{l}(v)\cap \tilde{T}_{2}(\mathfrak{m})| < |\Gamma_{1}^{l}(u)\cap \tilde{T}_{1}(\mathfrak{m})|\] is a cutting function by IND.
769 \subsubsection{Graph isomorphism}
771 \[LabCut_{ISO}((u,v),\mathfrak{m}):=\bigvee_{l\ is\ label}|\Gamma_{2}^{l} (v) \cap T_{2}(\mathfrak{m})|\!\neq\!|\Gamma_{1}^{l}(u)\cap T_{1}(\mathfrak{m})|\ \vee\]\[\bigvee_{l\ is\ label} \newline |\Gamma_{2}^{l}(v)\cap \tilde{T}_{2}(\mathfrak{m})| \neq |\Gamma_{1}^{l}(u)\cap \tilde{T}_{1}(\mathfrak{m})|\] is a cutting function by ISO.
774 \subsubsection{Subgraph isomorphism}
776 \[LabCut_{SU\!B}((u,v),\mathfrak{m}):=\bigvee_{l\ is\ label}|\Gamma_{2}^{l} (v) \cap T_{2}(\mathfrak{m})|\!<\!|\Gamma_{1}^{l}(u)\cap T_{1}(\mathfrak{m})|\] is a cutting function by SUB.
781 \section{Implementation details}
782 This section provides a detailed summary of an efficient
783 implementation of VF2++.
784 \subsubsection{Storing a mapping}
785 After fixing an arbitrary node order ($u_0, u_1, ..,
786 u_{|G_{1}|-1}$) of $G_{1}$, an array $M$ is usable to store
787 the current mapping in the following way.
791 v & if\ (u_i,v)\ is\ in\ the\ mapping\\ INV\!ALI\!D &
792 if\ no\ node\ has\ been\ mapped\ to\ u_i,
795 where $i\in\{0,1, ..,|G_{1}|-1\}$, $v\in V_{2}$ and $INV\!ALI\!D$
797 \subsubsection{Avoiding the recurrence}
798 The recursion of Algorithm~\ref{alg:VF2Pseu} can be realized
799 as a \textit{while loop}, which has a loop counter $depth$ denoting the
800 all-time depth of the recursion. Fixing a matching order, let $M$
801 denote the array storing the all-time mapping. Based on Claim~\ref{claim:claimCoverFromLeft},
802 $M$ is $INV\!ALI\!D$ from index $depth$+1 and not $INV\!ALI\!D$ before
803 $depth$. $M[depth]$ changes
804 while the state is being processed, but the property is held before
805 both stepping back to a predecessor state and exploring a successor
808 The necessary part of the candidate set is easily maintainable or
809 computable by following
810 Section~\ref{candidateComputingVF2}. A much faster method
811 has been designed for biological- and sparse graphs, see the next
814 \subsubsection{Calculating the candidates for a node}
815 Being aware of Claim~\ref{claim:claimCoverFromLeft}, the
816 task is not to maintain the candidate set, but to generate the
817 candidate nodes in $G_{2}$ for a given node $u\in V_{1}$. In
818 case of any of the three problem types and a mapping $\mathfrak{m}$, if a node $v\in
819 V_{2}$ is a potential pair of $u\in V_{1}$, then $\forall
820 u'\in \mathfrak{D}(\mathfrak{m}) : (u,u')\in
821 E_{1}\Rightarrow (v,\mathfrak{m}(u'))\in
822 E_{2}$. That is, each covered neighbour of $u$ has to be mapped to
823 a covered neighbour of $v$.
825 Having said that, an algorithm running in $\Theta(deg)$ time is
826 describable if there exists a covered node in the component containing
827 $u$, and a linear one otherwise.
830 \subsubsection{Determining the node order}
831 This section describes how the node order preprocessing method of
832 VF2++ can efficiently be implemented.
834 For using lookup tables, the node labels are associated with the
835 numbers $\{0,1,..,|K|-1\}$, where $K$ is the set of the labels. It
836 enables $F_\mathcal{M}$ to be stored in an array. At first, the node order
837 $\mathcal{M}=\emptyset$, so $F_\mathcal{M}[i]$ is the number of nodes
838 in $V_{1}$ having label $i$, which is easy to compute in
839 $\Theta(|V_{1}|)$ steps.
841 Representing $\mathcal{M}\subseteq V_{1}$ as an array of
842 size $|V_{1}|$, both the computation of the BFS tree, and processing its levels by Algorithm~\ref{alg:VF2PPProcess1} can be done inplace by swapping nodes.
844 \subsubsection{Cutting rules}
845 In Section~\ref{VF2PPCuttingRules}, the cutting rules were
846 described using the sets $T_{1}$, $T_{2}$, $\tilde T_{1}$
847 and $\tilde T_{2}$, which are dependent on the all-time mapping
848 (i.e. on the all-time state). The aim is to check the labeled cutting
849 rules of VF2++ in $\Theta(deg)$ time.
851 Firstly, suppose that these four sets are given in such a way, that
852 checking whether a node is in a certain set takes constant time,
853 e.g. they are given by their 0-1 characteristic vectors. Let $L$ be an
854 initially zero integer lookup table of size $|K|$. After incrementing
855 $L[\mathcal{L}(u')]$ for all $u'\in \Gamma_{1}(u) \cap T_{1}(\mathfrak{m})$ and
856 decrementing $L[\mathcal{L}(v')]$ for all $v'\in\Gamma_{2} (v) \cap
857 T_{2}(s)$, the first part of the cutting rules is checkable in
858 $\Theta(deg)$ time by considering the proper signs of $L$. Setting $L$
859 to zero takes $\Theta(deg)$ time again, which makes it possible to use
860 the same table through the whole algorithm. The second part of the
861 cutting rules can be verified using the same method with $\tilde
862 T_{1}$ and $\tilde T_{2}$ instead of $T_{1}$ and
863 $T_{2}$. Thus, the overall complexity is $\Theta(deg)$.
865 Another integer lookup table storing the number of covered neighbours
866 of each node in $G_{2}$ gives all the information about the sets
867 $T_{2}$ and $\tilde T_{2}$, which is maintainable in
868 $\Theta(deg)$ time when a pair is added or substracted by incrementing
869 or decrementing the proper indices. A further improvement is that the
870 values of $L[\mathcal{L}(u')]$ in case of checking $u$ are dependent only on
871 $u$, i.e. on the size of the mapping, so for each $u\in V_{1}$ an
872 array of pairs (label, number of such labels) can be stored to skip
873 the maintaining operations. Note that these arrays are at most of size
876 Using similar techniques, the consistency function can be evaluated in
877 $\Theta(deg)$ steps, as well.
879 \section{Experimental results}
880 This section compares the performance of VF2++ and VF2 Plus. According to
881 our experience, both algorithms run faster than VF2 with orders of
882 magnitude, thus its inclusion was not reasonable.
884 The algorithms were implemented in C++ using the open source
885 LEMON graph and network optimization library\cite{LEMON}. The test were carried out on a linux based system with an Intel i7 X980 3.33 GHz CPU and 6 GB of RAM.
886 \subsection{Biological graphs}
887 The tests have been executed on a recent biological dataset created
888 for the International Contest on Pattern Search in Biological
889 Databases\cite{Content}, which has been constructed of molecule,
890 protein and contact map graphs extracted from the Protein Data
891 Bank\cite{ProteinDataBank}.
893 The molecule dataset contains small graphs with less than 100 nodes
894 and an average degree of less than 3. The protein dataset contains
895 graphs having 500-10 000 nodes and an average degree of 4, while the
896 contact map dataset contains graphs with 150-800 nodes and an average
899 In the following, both the induced subgraph isomorphism and the graph
900 isomorphism will be examined.
902 This dataset provides graph pairs, between which all the induced subgraph isomorphisms have to be found. For runtime results, please see Figure~\ref{fig:bioIND}.
904 In an other experiment, the nodes of each graph in the database had been
905 shuffled, and an isomorphism between the shuffled and the original
906 graph was searched. The solution times are shown on Figure~\ref{fig:bioISO}.
912 \begin{subfigure}[b]{0.55\textwidth}
914 \begin{tikzpicture}[trim axis left, trim axis right]
915 \begin{axis}[title=Molecules IND,xlabel={target size},ylabel={time (ms)},legend entries={VF2 Plus,VF2++},grid
916 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
917 west},scaled x ticks = false,x tick label style={/pgf/number
918 format/1000 sep = \thinspace}]
919 %\addplot+[only marks] table {proteinsOrig.txt};
920 \addplot table {Orig/Molecules.32.txt}; \addplot[mark=triangle*,mark
921 size=1.8pt,color=red] table {VF2PPLabel/Molecules.32.txt};
924 \caption{In the case of molecules, the algorithms have
925 similar behaviour, but VF2++ is almost two times faster even on such
926 small graphs.} \label{fig:INDMolecule}
930 \begin{subfigure}[b]{0.55\textwidth}
932 \begin{tikzpicture}[trim axis left, trim axis right]
933 \begin{axis}[title=Contact maps IND,xlabel={target size},ylabel={time (ms)},legend entries={VF2 Plus,VF2++},grid
934 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
935 west},scaled x ticks = false,x tick label style={/pgf/number
936 format/1000 sep = \thinspace}]
937 %\addplot+[only marks] table {proteinsOrig.txt};
938 \addplot table {Orig/ContactMaps.128.txt};
939 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
940 {VF2PPLabel/ContactMaps.128.txt};
943 \caption{On contact maps, VF2++ runs almost in constant time, while VF2
944 Plus has a near linear behaviour.} \label{fig:INDContact}
950 \begin{subfigure}[b]{0.55\textwidth}
952 \begin{tikzpicture}[trim axis left, trim axis right]
953 \begin{axis}[title=Proteins IND,xlabel={target size},ylabel={time (ms)},legend entries={VF2 Plus,VF2++},grid
954 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
955 west},scaled x ticks = false,x tick label style={/pgf/number
956 format/1000 sep = \thinspace}] %\addplot+[only marks] table
957 {proteinsOrig.txt}; \addplot[mark=*,mark size=1.2pt,color=blue]
958 table {Orig/Proteins.256.txt}; \addplot[mark=triangle*,mark
959 size=1.8pt,color=red] table {VF2PPLabel/Proteins.256.txt};
962 \caption{Both the algorithms have linear behaviour on protein
963 graphs. VF2++ is more than 10 times faster than VF2
964 Plus.} \label{fig:INDProt}
969 \caption{\normalsize{Induced subgraph isomorphism on biological graphs}}\label{fig:bioIND}
976 \begin{subfigure}[b]{0.55\textwidth}
978 \begin{tikzpicture}[trim axis left, trim axis right]
979 \begin{axis}[title=Molecules ISO,xlabel={target size},ylabel={time (ms)},legend entries={VF2 Plus,VF2++},grid
980 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
981 west},scaled x ticks = false,x tick label style={/pgf/number
982 format/1000 sep = \thinspace}]
983 %\addplot+[only marks] table {proteinsOrig.txt};
984 \addplot table {Orig/moleculesIso.txt}; \addplot[mark=triangle*,mark
985 size=1.8pt,color=red] table {VF2PPLabel/moleculesIso.txt};
988 \caption{In the case of molecules, there is not such a significant
989 difference, but VF2++ seems to be faster as the number of nodes
990 increases.}\label{fig:ISOMolecule}
994 \begin{subfigure}[b]{0.55\textwidth}
996 \begin{tikzpicture}[trim axis left, trim axis right]
997 \begin{axis}[title=Contact maps ISO,xlabel={target size},ylabel={time (ms)},legend entries={VF2 Plus,VF2++},grid
998 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
999 west},scaled x ticks = false,x tick label style={/pgf/number
1000 format/1000 sep = \thinspace}]
1001 %\addplot+[only marks] table {proteinsOrig.txt};
1002 \addplot table {Orig/contactMapsIso.txt}; \addplot[mark=triangle*,mark
1003 size=1.8pt,color=red] table {VF2PPLabel/contactMapsIso.txt};
1006 \caption{The results are closer to each other on contact maps, but
1007 VF2++ still performs consistently better.}\label{fig:ISOContact}
1013 \begin{subfigure}[b]{0.55\textwidth}
1015 \begin{tikzpicture}[trim axis left, trim axis right]
1016 \begin{axis}[title=Proteins ISO,xlabel={target size},ylabel={time (ms)},legend entries={VF2 Plus,VF2++},grid
1017 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1018 west},scaled x ticks = false,x tick label style={/pgf/number
1019 format/1000 sep = \thinspace}]
1020 %\addplot+[only marks] table {proteinsOrig.txt};
1021 \addplot table {Orig/proteinsIso.txt}; \addplot[mark=triangle*,mark
1022 size=1.8pt,color=red] table {VF2PPLabel/proteinsIso.txt};
1025 \caption{On protein graphs, VF2 Plus has a super linear time
1026 complexity, while VF2++ runs in near constant time. The difference
1027 is about two order of magnitude on large graphs.}\label{fig:ISOProt}
1032 \caption{\normalsize{Graph isomorphism on biological graphs}}\label{fig:bioISO}
1038 \subsection{Random graphs}
1039 This section compares VF2++ with VF2 Plus on random graphs of a large
1040 size. The node labels are uniformly distributed. Let $\delta$ denote
1041 the average degree. For the parameters of problems solved in the
1042 experiments, please see the top of each chart.
1043 \subsubsection{Graph isomorphism}
1044 To evaluate the efficiency of the algorithms in the case of graph
1045 isomorphism, random connected graphs of less than 20 000 nodes have been
1046 considered. Generating a random graph and shuffling its nodes, an
1047 isomorphism had to be found. Figure \ref{fig:randISO} shows the runtime results
1048 on graph sets of various density.
1056 \begin{subfigure}[b]{0.55\textwidth}
1059 \begin{axis}[title={Random ISO, $\delta = 5$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1060 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1061 west},scaled x ticks = false,x tick label style={/pgf/number
1062 format/1000 sep = \space}]
1063 %\addplot+[only marks] table {proteinsOrig.txt};
1064 \addplot table {randGraph/iso/vf2pIso5_1.txt};
1065 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1066 {randGraph/iso/vf2ppIso5_1.txt};
1072 \begin{subfigure}[b]{0.55\textwidth}
1075 \begin{axis}[title={Random ISO, $\delta = 10$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1076 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1077 west},scaled x ticks = false,x tick label style={/pgf/number
1078 format/1000 sep = \space}]
1079 %\addplot+[only marks] table {proteinsOrig.txt};
1080 \addplot table {randGraph/iso/vf2pIso10_1.txt};
1081 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1082 {randGraph/iso/vf2ppIso10_1.txt};
1089 \begin{subfigure}[b]{0.55\textwidth}
1092 \begin{axis}[title={Random ISO, $\delta = 15$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1093 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1094 west},scaled x ticks = false,x tick label style={/pgf/number
1095 format/1000 sep = \space}]
1096 %\addplot+[only marks] table {proteinsOrig.txt};
1097 \addplot table {randGraph/iso/vf2pIso15_1.txt};
1098 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1099 {randGraph/iso/vf2ppIso15_1.txt};
1104 \begin{subfigure}[b]{0.55\textwidth}
1107 \begin{axis}[title={Random ISO, $\delta = 35$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1108 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1109 west},scaled x ticks = false,x tick label style={/pgf/number
1110 format/1000 sep = \space}]
1111 %\addplot+[only marks] table {proteinsOrig.txt};
1112 \addplot table {randGraph/iso/vf2pIso35_1.txt};
1113 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1114 {randGraph/iso/vf2ppIso35_1.txt};
1119 \begin{subfigure}[b]{0.55\textwidth}
1122 \begin{axis}[title={Random ISO, $\delta = 45$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1123 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1124 west},scaled x ticks = false,x tick label style={/pgf/number
1125 format/1000 sep = \space}]
1126 %\addplot+[only marks] table {proteinsOrig.txt};
1127 \addplot table {randGraph/iso/vf2pIso45_1.txt};
1128 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1129 {randGraph/iso/vf2ppIso45_1.txt};
1134 \begin{subfigure}[b]{0.55\textwidth}
1136 \begin{axis}[title={Random ISO, $\delta = 100$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1137 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1138 west},scaled x ticks = false,x tick label style={/pgf/number
1139 format/1000 sep = \thinspace}]
1140 %\addplot+[only marks] table {proteinsOrig.txt};
1141 \addplot table {randGraph/iso/vf2pIso100_1.txt};
1142 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1143 {randGraph/iso/vf2ppIso100_1.txt};
1148 \caption{IND on graphs having an average degree of
1149 5.}\label{fig:randISO}
1153 \subsubsection{Induced subgraph isomorphism}
1154 This section presents a comparison of VF2++ and VF2 Plus in the case
1155 of induced subgraph isomorphism. In addition to the size of the large
1156 graph, that of the small graph dramatically influences the hardness of
1157 a given problem too, so the overall picture is provided by examining
1158 small graphs of various size.
1160 For each chart, a number $0<\rho< 1$ has been fixed, and the following
1161 has been executed 150 times. Generating a large graph $G_{2}$ of an average degree of $\delta$,
1162 choose 10 of its induced subgraphs having $\rho\ |V_{2}|$ nodes,
1163 and for all the 10 subgraphs find a mapping by using both the graph
1164 matching algorithms. The $\delta = 5, 10, 35$ and $\rho = 0.05, 0.1,
1165 0.3, 0.6, 0.8, 0.95$ cases have been examined, see
1166 Figure~\ref{fig:randIND5}, \ref{fig:randIND10} and
1167 \ref{fig:randIND35}.
1176 \begin{subfigure}[b]{0.55\textwidth}
1179 \begin{axis}[title={Random IND, $\delta = 5$, $\rho = 0.05$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1180 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1181 west},scaled x ticks = false,x tick label style={/pgf/number
1182 format/1000 sep = \space}]
1183 %\addplot+[only marks] table {proteinsOrig.txt};
1184 \addplot table {randGraph/ind/vf2pInd5_0.05.txt};
1185 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1186 {randGraph/ind/vf2ppInd5_0.05.txt};
1191 \begin{subfigure}[b]{0.55\textwidth}
1194 \begin{axis}[title={Random IND, $\delta = 5$, $\rho = 0.1$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1195 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1196 west},scaled x ticks = false,x tick label style={/pgf/number
1197 format/1000 sep = \space}]
1198 %\addplot+[only marks] table {proteinsOrig.txt};
1199 \addplot table {randGraph/ind/vf2pInd5_0.1.txt};
1200 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1201 {randGraph/ind/vf2ppInd5_0.1.txt};
1207 \begin{subfigure}[b]{0.55\textwidth}
1210 \begin{axis}[title={Random IND, $\delta = 5$, $\rho = 0.3$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1211 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1212 west},scaled x ticks = false,x tick label style={/pgf/number
1213 format/1000 sep = \space}]
1214 %\addplot+[only marks] table {proteinsOrig.txt};
1215 \addplot table {randGraph/ind/vf2pInd5_0.3.txt};
1216 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1217 {randGraph/ind/vf2ppInd5_0.3.txt};
1222 \begin{subfigure}[b]{0.55\textwidth}
1225 \begin{axis}[title={Random IND, $\delta = 5$, $\rho = 0.6$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1226 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1227 west},scaled x ticks = false,x tick label style={/pgf/number
1228 format/1000 sep = \space}]
1229 %\addplot+[only marks] table {proteinsOrig.txt};
1230 \addplot table {randGraph/ind/vf2pInd5_0.6.txt};
1231 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1232 {randGraph/ind/vf2ppInd5_0.6.txt};
1237 \begin{subfigure}[b]{0.55\textwidth}
1240 \begin{axis}[title={Random IND, $\delta = 5$, $\rho = 0.8$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1241 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1242 west},scaled x ticks = false,x tick label style={/pgf/number
1243 format/1000 sep = \space}]
1244 %\addplot+[only marks] table {proteinsOrig.txt};
1245 \addplot table {randGraph/ind/vf2pInd5_0.8.txt};
1246 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1247 {randGraph/ind/vf2ppInd5_0.8.txt};
1252 \begin{subfigure}[b]{0.55\textwidth}
1254 \begin{axis}[title={Random IND, $\delta = 5$, $\rho = 0.95$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1255 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1256 west},scaled x ticks = false,x tick label style={/pgf/number
1257 format/1000 sep = \thinspace}]
1258 %\addplot+[only marks] table {proteinsOrig.txt};
1259 \addplot table {randGraph/ind/vf2pInd5_0.95.txt};
1260 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1261 {randGraph/ind/vf2ppInd5_0.95.txt};
1266 \caption{IND on graphs having an average degree of
1267 5.}\label{fig:randIND5}
1274 \begin{subfigure}[b]{0.55\textwidth}
1278 \begin{axis}[title={Random IND, $\delta = 10$, $\rho = 0.05$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1279 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1280 west},scaled x ticks = false,x tick label style={/pgf/number
1281 format/1000 sep = \space}]
1282 %\addplot+[only marks] table {proteinsOrig.txt};
1283 \addplot table {randGraph/ind/vf2pInd10_0.05.txt};
1284 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1285 {randGraph/ind/vf2ppInd10_0.05.txt};
1290 \begin{subfigure}[b]{0.55\textwidth}
1294 \begin{axis}[title={Random IND, $\delta = 10$, $\rho = 0.1$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1295 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1296 west},scaled x ticks = false,x tick label style={/pgf/number
1297 format/1000 sep = \space}]
1298 %\addplot+[only marks] table {proteinsOrig.txt};
1299 \addplot table {randGraph/ind/vf2pInd10_0.1.txt};
1300 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1301 {randGraph/ind/vf2ppInd10_0.1.txt};
1307 \begin{subfigure}[b]{0.55\textwidth}
1310 \begin{axis}[title={Random IND, $\delta = 10$, $\rho = 0.3$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1311 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1312 west},scaled x ticks = false,x tick label style={/pgf/number
1313 format/1000 sep = \space}]
1314 %\addplot+[only marks] table {proteinsOrig.txt};
1315 \addplot table {randGraph/ind/vf2pInd10_0.3.txt};
1316 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1317 {randGraph/ind/vf2ppInd10_0.3.txt};
1322 \begin{subfigure}[b]{0.55\textwidth}
1325 \begin{axis}[title={Random IND, $\delta = 10$, $\rho = 0.6$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1326 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1327 west},scaled x ticks = false,x tick label style={/pgf/number
1328 format/1000 sep = \space}]
1329 %\addplot+[only marks] table {proteinsOrig.txt};
1330 \addplot table {randGraph/ind/vf2pInd10_0.6.txt};
1331 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1332 {randGraph/ind/vf2ppInd10_0.6.txt};
1338 \begin{subfigure}[b]{0.55\textwidth}
1340 \begin{axis}[title={Random IND, $\delta = 10$, $\rho = 0.8$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1341 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1342 west},scaled x ticks = false,x tick label style={/pgf/number
1343 format/1000 sep = \space}]
1344 %\addplot+[only marks] table {proteinsOrig.txt};
1345 \addplot table {randGraph/ind/vf2pInd10_0.8.txt};
1346 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1347 {randGraph/ind/vf2ppInd10_0.8.txt};
1351 \begin{subfigure}[b]{0.55\textwidth}
1353 \begin{axis}[title={Random IND, $\delta = 10$, $\rho = 0.95$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1354 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1355 west},scaled x ticks = false,x tick label style={/pgf/number
1356 format/1000 sep = \thinspace}]
1357 %\addplot+[only marks] table {proteinsOrig.txt};
1358 \addplot table {randGraph/ind/vf2pInd10_0.95.txt};
1359 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1360 {randGraph/ind/vf2ppInd10_0.95.txt};
1365 \caption{IND on graphs having an average degree of
1366 10.}\label{fig:randIND10}
1374 \begin{subfigure}[b]{0.55\textwidth}
1377 \begin{axis}[title={Random IND, $\delta = 35$, $\rho = 0.05$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1378 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1379 west},scaled x ticks = false,x tick label style={/pgf/number
1380 format/1000 sep = \space}]
1381 %\addplot+[only marks] table {proteinsOrig.txt};
1382 \addplot table {randGraph/ind/vf2pInd35_0.05.txt};
1383 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1384 {randGraph/ind/vf2ppInd35_0.05.txt};
1389 \begin{subfigure}[b]{0.55\textwidth}
1392 \begin{axis}[title={Random IND, $\delta = 35$, $\rho = 0.1$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1393 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1394 west},scaled x ticks = false,x tick label style={/pgf/number
1395 format/1000 sep = \space}]
1396 %\addplot+[only marks] table {proteinsOrig.txt};
1397 \addplot table {randGraph/ind/vf2pInd35_0.1.txt};
1398 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1399 {randGraph/ind/vf2ppInd35_0.1.txt};
1405 \begin{subfigure}[b]{0.55\textwidth}
1408 \begin{axis}[title={Random IND, $\delta = 35$, $\rho = 0.3$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1409 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1410 west},scaled x ticks = false,x tick label style={/pgf/number
1411 format/1000 sep = \space}]
1412 %\addplot+[only marks] table {proteinsOrig.txt};
1413 \addplot table {randGraph/ind/vf2pInd35_0.3.txt};
1414 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1415 {randGraph/ind/vf2ppInd35_0.3.txt};
1420 \begin{subfigure}[b]{0.55\textwidth}
1423 \begin{axis}[title={Random IND, $\delta = 35$, $\rho = 0.6$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1424 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1425 west},scaled x ticks = false,x tick label style={/pgf/number
1426 format/1000 sep = \space}]
1427 %\addplot+[only marks] table {proteinsOrig.txt};
1428 \addplot table {randGraph/ind/vf2pInd35_0.6.txt};
1429 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1430 {randGraph/ind/vf2ppInd35_0.6.txt};
1436 \begin{subfigure}[b]{0.55\textwidth}
1438 \begin{axis}[title={Random IND, $\delta = 35$, $\rho = 0.8$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1439 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1440 west},scaled x ticks = false,x tick label style={/pgf/number
1441 format/1000 sep = \space}]
1442 %\addplot+[only marks] table {proteinsOrig.txt};
1443 \addplot table {randGraph/ind/vf2pInd35_0.8.txt};
1444 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1445 {randGraph/ind/vf2ppInd35_0.8.txt};
1449 \begin{subfigure}[b]{0.55\textwidth}
1451 \begin{axis}[title={Random IND, $\delta = 35$, $\rho = 0.95$},width=7.2cm,height=6cm,xlabel={target size},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1452 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1453 west},scaled x ticks = false,x tick label style={/pgf/number
1454 format/1000 sep = \thinspace}]
1455 %\addplot+[only marks] table {proteinsOrig.txt};
1456 \addplot table {randGraph/ind/vf2pInd35_0.95.txt};
1457 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1458 {randGraph/ind/vf2ppInd35_0.95.txt};
1463 \caption{IND on graphs having an average degree of
1464 35.}\label{fig:randIND35}
1468 Based on these experiments, VF2++ is faster than VF2 Plus and able to
1469 handle really large graphs in milliseconds. Note that when $IND$ was
1470 considered and the small graphs had proportionally few nodes ($\rho =
1471 0.05$, or $\rho = 0.1$), then VF2 Plus produced some inefficient node
1472 orders (e.g. see the $\delta=10$ case on
1473 Figure~\ref{fig:randIND10}). If these instances had been excluded, the
1474 charts would have seemed to be similar to the other ones.
1475 Unsurprisingly, as denser graphs are considered, both VF2++ and VF2
1476 Plus slow slightly down, but remain practically usable even on graphs
1477 having 10 000 nodes.
1483 \section{Conclusion}
1484 This paper presented VF2++, a new graph matching algorithm based on VF2, called VF2++, and analyzed it from a practical viewpoint.
1486 Recognizing the importance of the node order and determining an
1487 efficient one, VF2++ is able to match graphs of thousands of nodes in
1488 near practically linear time including preprocessing. In addition to
1489 the proper order, VF2++ uses more efficient consistency and cutting
1490 rules which are easy to compute and make the algorithm able to prune
1491 most of the unfruitful branches without going astray.
1493 In order to show the efficiency of the new method, it has been
1494 compared to VF2 Plus\cite{VF2Plus}, which is the best contemporary algorithm.
1497 The experiments show that VF2++ consistently outperforms VF2 Plus on
1498 biological graphs. It seems to be asymptotically faster on protein and
1499 on contact map graphs in the case of induced subgraph isomorphism,
1500 while in the case of graph isomorphism, it has definitely better
1501 asymptotic behaviour on protein graphs.
1503 Regarding random sparse graphs, not only has VF2++ proved itself to be
1504 faster than VF2 Plus, but it also has a practically linear behaviour both
1505 in the case of induced subgraph- and graph isomorphism.
1509 %% The Appendices part is started with the command \appendix;
1510 %% appendix sections are then done as normal sections
1516 %% If you have bibdatabase file and want bibtex to generate the
1517 %% bibitems, please use
1519 \bibliographystyle{elsarticle-num} \bibliography{bibliography}
1521 %% else use the following coding to input the bibitems directly in the
1524 %% \begin{thebibliography}{00}
1526 %% %% \bibitem{label}
1527 %% %% Text of bibliographic item
1531 %% \end{thebibliography}
1536 %% End of file `elsarticle-template-num.tex'.