2 %% Copyright 2007, 2008, 2009 Elsevier Ltd
4 %% This file is part of the 'Elsarticle Bundle'.
5 %% ---------------------------------------------
7 %% It may be distributed under the conditions of the LaTeX Project Public
8 %% License, either version 1.2 of this license or (at your option) any
9 %% later version. The latest version of this license is in
10 %% http://www.latex-project.org/lppl.txt
11 %% and version 1.2 or later is part of all distributions of LaTeX
12 %% version 1999/12/01 or later.
14 %% The list of all files belonging to the 'Elsarticle Bundle' is
15 %% given in the file `manifest.txt'.
18 %% Template article for Elsevier's document class `elsarticle'
19 %% with numbered style bibliographic references
22 \documentclass[preprint,12pt]{elsarticle}
24 %% Use the option review to obtain double line spacing
25 %% \documentclass[authoryear,preprint,review,12pt]{elsarticle}
27 %% Use the options 1p,twocolumn; 3p; 3p,twocolumn; 5p; or 5p,twocolumn
28 %% for a journal layout:
29 %% \documentclass[final,1p,times]{elsarticle}
30 %% \documentclass[final,1p,times,twocolumn]{elsarticle}
31 %% \documentclass[final,3p,times]{elsarticle}
32 %% \documentclass[final,3p,times,twocolumn]{elsarticle}
33 %% \documentclass[final,5p,times]{elsarticle}
34 %% \documentclass[final,5p,times,twocolumn]{elsarticle}
36 %% For including figures, graphicx.sty has been loaded in
37 %% elsarticle.cls. If you prefer to use the old commands
38 %% please give \usepackage{epsfig}
40 %% The amssymb package provides various useful mathematical symbols
42 %% The amsthm package provides extended theorem environments
43 %% \usepackage{amsthm}
45 %% The lineno packages adds line numbers. Start line numbering with
46 %% \begin{linenumbers}, end it with \end{linenumbers}. Or switch it on
47 %% for the whole article with \linenumbers.
48 %% \usepackage{lineno}
51 %% \usepackage[pdftex]{graphicx}
54 \pgfplotsset{width=9cm}
55 \pgfplotsset{compat=1.8}
58 \usepackage{subcaption}
60 \usepackage{algorithm}
61 \usepackage{algpseudocode}
64 \usepackage{amsthm,amssymb}
65 \renewcommand{\qedsymbol}{\rule{0.7em}{0.7em}}
67 \newtheorem{theorem}{Theorem}[subsection]
68 \newtheorem{corollary}{Corollary}[theorem]
69 \newtheorem{claim}[theorem]{Claim}
71 \newtheorem{definition}{Definition}[subsection]
72 \newtheorem{notation}{Notation}[subsection]
73 \newtheorem{example}{Example}[subsection]
74 \usetikzlibrary{decorations.markings}
75 \let\oldproofname=\proofname
76 %% \renewcommand{\proofname}{\rm\bf{Proof:}}
78 \captionsetup{font=normalsize}
80 \journal{Discrete Applied Mathematics}
86 %% Title, authors and addresses
88 %% use the tnoteref command within \title for footnotes;
89 %% use the tnotetext command for theassociated footnote;
90 %% use the fnref command within \author or \address for footnotes;
91 %% use the fntext command for theassociated footnote;
92 %% use the corref command within \author for corresponding author footnotes;
93 %% use the cortext command for theassociated footnote;
94 %% use the ead command for the email address,
95 %% and the form \ead[url] for the home page:
96 %% \title{Title\tnoteref{label1}}
97 %% \tnotetext[label1]{}
98 %% \author{Name\corref{cor1}\fnref{label2}}
99 %% \ead{email address}
100 %% \ead[url]{home page}
103 %% \address{Address\fnref{label3}}
106 \title{VF2++ --- An Improved Subgraph Isomorphism Algorithm}
108 %% use optional labels to link authors explicitly to addresses:
109 %% \author[label1,label2]{}
110 %% \address[label1]{}
111 %% \address[label2]{}
113 \author[egres,elte]{Alp{\'a}r J{\"u}ttner}
114 \ead{alpar@cs.elte.hu}
115 \author[elte]{P{\'e}ter Madarasi}
116 \ead{madarasip@caesar.elte.hu}
117 \address[egres]{MTA-ELTE Egerv{\'a}ry Research Group, Budapest, Hungary.}
118 \address[elte]{Department of Operations Research, E{\"o}tv{\"o}s Lor{\'a}nd University, Budapest, Hungary.}
122 This paper presents a largely improved version of the VF2 algorithm
123 for the \emph{Subgraph Isomorphism Problem}. The improvements are
124 twofold. Firstly, it is based on a new approach for determining the
125 matching order of the nodes, and secondly, more efficient -
126 nevertheless easier to compute - cutting rules significantly
127 reducing the search space are applied.
129 In addition to the usual \emph{Subgraph Isomorphism Problem}, the paper also
130 presents specialized algorithms for the \emph{Induced Subgraph
131 Isomorphism} and for the \emph{Graph Isomorphism Problems}.
133 Finally, an extensive experimental evaluation is provided using a
134 wide range of inputs, including both real-life biological and
135 chemical datasets and standard randomly generated graph series. The
136 results show major and consistent running time improvements over the
139 The C++ implementations of the algorithms are available open-source as
140 part of the LEMON graph and network optimization library.
145 Computational Biology, Subgraph Isomorphism Problem
146 %% keywords here, in the form: keyword \sep keyword
148 %% PACS codes here, in the form: \PACS code \sep code
150 %% MSC codes here, in the form: \MSC code \sep code
151 %% or \MSC[2008] code \sep code (2000 is the default)
160 \section{Introduction}
163 In the last decades, combinatorial structures, and especially graphs
164 have been considered with ever increasing interest, and applied to the
165 solution of several new and revised questions. The expressiveness,
166 the simplicity and the studiedness of graphs make them practical for
167 modelling and appear constantly in several seemingly independent
168 fields, such as bioinformatics and chemistry.
170 Complex biological systems arise from the interaction and cooperation
171 of plenty of molecular components. Getting acquainted with such
172 systems at the molecular level is of primary importance, since
173 protein-protein interaction, DNA-protein interaction, metabolic
174 interaction, transcription factor binding, neuronal networks, and
175 hormone signalling networks can be understood this way.
177 Many chemical and biological structures can easily be modelled
178 as graphs, for instance, a molecular structure can be
179 considered as a graph, whose nodes correspond to atoms and whose
180 edges to chemical bonds. The similarity and dissimilarity of
181 objects corresponding to nodes are incorporated to the model
182 by \emph{node labels}. Understanding such networks basically
183 requires finding specific subgraphs, thus it calls for efficient
184 graph matching algorithms.
186 Other real-world fields related to some
187 variants of graph matching include pattern recognition
188 and machine vision~\cite{HorstBunkeApplications}, symbol recognition~\cite{CordellaVentoSymbolRecognition}, and face identification~\cite{JianzhuangYongFaceIdentification}. \\
190 Subgraph and induced subgraph matching problems are known to be
191 NP-Complete~\cite{SubgraphNPC}, while the graph isomorphism problem is
192 one of the few problems in NP neither known to be in P nor
193 NP-Complete. Although polynomial-time isomorphism algorithms are known
194 for various graph classes, like trees and planar
195 graphs~\cite{PlanarGraphIso}, bounded valence
196 graphs~\cite{BondedDegGraphIso}, interval graphs~\cite{IntervalGraphIso}
197 or permutation graphs~\cite{PermGraphIso}. Furthermore, an FPT algorithm has also been presented for the coloured hypergraph isomorphism problem in~\cite{ColoredHiperGraphIso}.
199 In the following, some algorithms based on other approaches are
200 summarized, which do not need any restrictions on the graphs. Even though,
201 an overall polynomial behaviour is not expectable from such an
202 alternative, it may often have good practical performance, in fact,
203 it might be the best choice even on a graph class for which polynomial
206 The first practically usable approach was due to
207 \emph{Ullmann}~\cite{Ullmann}, which is a commonly used algorithm based on depth-first
208 search with a complex heuristic for reducing the
209 number of visited states. A major problem is its $\Theta(n^3)$ space
210 complexity, which makes it impractical in the case of big sparse
213 In a recent paper, Ullmann~\cite{UllmannBit} presents an
214 improved version of this algorithm based on a bit-vector solution for
215 the binary Constraint Satisfaction Problem.
217 The \emph{Nauty} algorithm~\cite{Nauty} transforms the two graphs to
218 a canonical form before starting to check for isomorphism. It has
219 been considered as one of the fastest graph isomorphism algorithms,
220 although graph categories were shown in which it takes exponentially
221 many steps. This algorithm handles only the graph isomorphism problem.
223 The \emph{LAD} algorithm~\cite{Lad} uses a depth-first search
224 strategy and formulates the matching as a Constraint Satisfaction
225 Problem to prune the search tree. The constraints are that the mapping
226 has to be injective and edge-preserving, hence it is possible to
227 handle new matching types as well.
229 The \emph{RI} algorithm~\cite{RI} and its variations are based on a
230 state space representation. After reordering the nodes of the graphs,
231 it uses some fast executable heuristic checks without using any
232 complex pruning rules. It seems to run really efficiently on graphs
233 coming from biology, and won the International Contest on Pattern
234 Search in Biological Databases~\cite{Content}.
236 Currently, the most commonly used algorithm is the
237 \emph{VF2}~\cite{VF2}, the improved version of \emph{VF}~\cite{VF}, which was
238 designed for solving pattern matching and computer vision problems,
239 and has been one of the best overall algorithms for more than a
240 decade. Although, it is not as fast as some of the new specialized algorithms, it is still widely used due to its simplicity and space efficiency. VF2 uses
241 a state space representation and checks some conditions in each state
242 to prune the search tree.
244 Meanwhile, another variant called \emph{VF2~Plus}~\cite{VF2Plus} has
245 been published. It is considered to be as efficient as the RI
246 algorithm and has a strictly better behaviour on large graphs. The
247 main idea of VF2~Plus is to precompute a heuristic node order of the graph to be embedded, in which VF2 works more efficiently.
249 This paper introduces \emph{VF2++}, a new further improved algorithm
250 for the graph and (induced) subgraph isomorphism problems, which uses
251 efficient cutting rules and determines a node order in which VF2 runs
252 significantly faster on practical inputs.
254 The rest of the paper is structured as
255 follows. Section~\ref{sec:ProbStat} defines the exact problems to be
256 solved, Section~\ref{sec:VF2Alg} provides a description of VF2. Based
257 on that, Section~\ref{sec:VF2ppAlg} introduces VF2++. Some technical
258 details necessary for an efficient implementation are discussed in
259 Section~\ref{sec:VF2ppImpl}. Finally, Section~\ref{sec:ExpRes}
260 provides a detailed experimental evaluation of VF2++ and its comparison
261 to the state-of-the-art algorithm.
263 It must also be mentioned that the C++ implementations of the
264 algorithms have been made available for evaluation and use under an
265 open-source license as a part of LEMON~\cite{LEMON} graph
268 \section{Problem Statement}\label{sec:ProbStat}
269 This section provides a formal description of the problems to be
271 \subsection{Definitions}
273 Throughout the paper $G_{1}=(V_{1}, E_{1})$ and
274 $G_{2}=(V_{2}, E_{2})$ denote two undirected graphs.
277 $\mathcal{L}: (V_{1}\cup V_{2}) \longrightarrow K$ is a \textbf{node
278 label function}, where K is an arbitrary set. The elements in K
279 are the \textbf{node labels}. Two nodes, u and v are said to be
280 \textbf{equivalent} if $\mathcal{L}(u)=\mathcal{L}(v)$.
283 For the sake of simplicity, in this paper the graph, subgraph and induced subgraph isomorphisms are defined in a more general way.
285 \begin{definition}\label{sec:ismorphic}
286 $G_{1}$ and $G_{2}$ are \textbf{isomorphic} (by the node label $\mathcal{L}$) if $\exists \mathfrak{m}:
287 V_{1} \longrightarrow V_{2}$ bijection, for which the
290 $\forall u\in{V_{1}} : \mathcal{L}(u)=\mathcal{L}(\mathfrak{m}(u))$ and\\
291 $\forall u,v\in{V_{1}} : (u,v)\in{E_{1}} \Leftrightarrow (\mathfrak{m}(u),\mathfrak{m}(v))\in{E_{2}}$
296 $G_{1}$ is a \textbf{subgraph} of $G_{2}$ (by the node label $\mathcal{L}$) if $\exists \mathfrak{m}:
297 V_{1}\longrightarrow V_{2}$ injection, for which the
300 $\forall u\in{V_{1}} : \mathcal{L}(u)=\mathcal{L}(\mathfrak{m}(u))$ and\\
301 $\forall u,v \in{V_{1}} : (u,v)\in{E_{1}} \Rightarrow (\mathfrak{m}(u),\mathfrak{m}(v))\in E_{2}$
306 $G_{1}$ is an \textbf{induced subgraph} of $G_{2}$ (by the node label $\mathcal{L}$) if $\exists
307 \mathfrak{m}: V_{1}\longrightarrow V_{2}$ injection, for which the
310 $\forall u\in{V_{1}} : \mathcal{L}(u)=\mathcal{L}(\mathfrak{m}(u))$ and
312 $\forall u,v \in{V_{1}} : (u,v)\in{E_{1}} \Leftrightarrow
313 (\mathfrak{m}(u),\mathfrak{m}(v))\in E_{2}$
318 \subsection{Common problems}\label{sec:CommProb}
320 The focus of this paper is on the following problems appearing in many applications.
322 The \textbf{subgraph isomorphism problem} is the following: is
323 $G_{1}$ isomorphic to any subgraph of $G_{2}$ by a given node
326 The \textbf{induced subgraph isomorphism problem} asks the same about the
327 existence of an induced subgraph.
329 The \textbf{graph isomorphism problem} can be defined as induced
330 subgraph matching problem where the sizes of the two graphs are equal.
332 In addition, one may want to find a \textbf{single} embedding or \textbf{enumerate} all of them.
334 \section{The VF2 Algorithm}\label{sec:VF2Alg}
335 This algorithm is the basis of both the VF2++ and the VF2~Plus. VF2
336 is able to handle all the variations mentioned in Section~\ref{sec:CommProb}. Although it can also handle directed graphs,
337 for the sake of simplicity, only the undirected case will be
341 \subsection{Common notations}
342 \indent Assume $G_{1}$ is searched in $G_{2}$. The following
343 definitions and notations will be used throughout this paper.
345 An injection $\mathfrak{m} : D \longrightarrow V_2$ is called (partial) \textbf{mapping}, where $D\subseteq V_1$.
349 $\mathfrak{D}(f)$ and $\mathfrak{R}(f)$ denote the domain and the range of a function $f$, respectively.
353 Mapping $\mathfrak{m}$ \textbf{covers} a node $u\in V_1\cup V_2$ if $u\in \mathfrak{D}(\mathfrak{m})\cup \mathfrak{R}(\mathfrak{m})$.
357 A mapping $\mathfrak{m}$ is $\mathbf{whole\ mapping}$ if $\mathfrak{m}$ covers all the
358 nodes of $V_{1}$, i.e. $\mathfrak{D}(\mathfrak{m})=V_1$.
362 Let \textbf{extend}$(\mathfrak{m},(u,v))$ denote the function $f : \mathfrak{D}(\mathfrak{m})\cup\{u\}\longrightarrow\mathfrak{R}(\mathfrak{m})\cup\{v\}$, for which $\forall w\in \mathfrak{D}(\mathfrak{m}) : f(w)=\mathfrak{m}(w)$ and $f(u)=v$ holds, where $u\in V_1\setminus\mathfrak{D}(\mathfrak{m})$ and $v\in V_2\setminus\mathfrak{R}(\mathfrak{m})$; otherwise $extend(\mathfrak{m},(u,v))$ is undefined.
366 Throughout the paper, $\mathbf{PT}$ denotes a generic problem type
367 which can be substituted by any of the $\mathbf{SUB}$, $\mathbf{IND}$
368 and $\mathbf{ISO}$ problems, which stand for the the problems mentioned in Section~\ref{sec:CommProb} respectively.
372 Let $\mathfrak{m}$ be a mapping. The \textbf{consistency function for } $\mathbf{PT}$ is a logical function, for which $\mathbf{Cons_{PT}}(\mathfrak{m})$ is true if and only if $\mathfrak{m}$ satisfies the requirements of $\mathbf{PT}$ considering the subgraphs of $G_{1}$ and $G_{2}$ induced by $\mathfrak{D}(\mathfrak{m})$ and $\mathfrak{R}(\mathfrak{m})$, respectively.
374 %$\mathbf{Cons_{PT}}(\mathfrak{m})$ is true if there exists a whole mapping $w$ satisfying the requirements of $PT$, for which $\mathfrak{m}$ is exactly $w$ restricted to $\mathfrak{D}(\mathfrak{m})$.
378 Let $\mathfrak{m}$ be a mapping. A logical function $\mathbf{Cut_{PT}}$ is a
379 \textbf{cutting function for } $\mathbf{PT}$ if the following
380 holds. $\mathbf{Cut_{PT}(\mathfrak{m})}$ is false if there exists a sequence of extend operations, which results in a whole mapping satisfying the requirements of $PT$.
384 A mapping $\mathfrak{m}$ is said to be \textbf{consistent mapping by} $\mathbf{PT}$ if
385 $Cons_{PT}(\mathfrak{m})$ is true.
388 $Cons_{PT}$ and $Cut_{PT}$ will often be used in the following form.
390 Let $\mathbf{Cons_{PT}(p, \mathfrak{m})}:=Cons_{PT}(extend(\mathfrak{m},p))$, and
391 $\mathbf{Cut_{PT}(p, \mathfrak{m})}:=Cut_{PT}(extend(\mathfrak{m},p))$, where
392 $p\in{V_{1}\backslash\mathfrak{D}(\mathfrak{m}) \!\times\!V_{2}\backslash\mathfrak{R}(\mathfrak{m})}$.
395 $Cons_{PT}$ will be used to check the consistency of the already
396 covered nodes, while $Cut_{PT}$ is for looking ahead to recognize if
397 no whole consistent mapping can contain the current mapping.
399 \subsection{Overview of the algorithm}
401 VF2 begins with an empty mapping and gradually extends it with respect to the consistency and cutting functions until a whole mapping is reached.
403 Algorithm~\ref{alg:VF2Pseu} is a high-level description of
404 the VF2 algorithm. Each state of the matching process can
405 be associated with a mapping $\mathfrak{m}$. The initial state
406 is associated with a mapping $\mathfrak{m}$, for which
407 $\mathfrak{D}(\mathfrak{m})=\emptyset$, i.e. it starts with an empty mapping.
411 \algtext*{EndIf}%ne nyomtasson end if-et
413 \algtext*{EndProcedure}%ne nyomtasson ..
414 \caption{\hspace{0.5cm}$A\ high\ level\ description\ of\ VF2$}\label{alg:VF2Pseu}
415 \begin{algorithmic}[1]
417 \Procedure{VF2}{Mapping $\mathfrak{m}$, ProblemType $PT$}
418 \If{$\mathfrak{m}$ covers
419 $V_{1}$} \State Output($\mathfrak{m}$)
421 \State Compute the set $P_\mathfrak{m}$ of the candidate pairs for extending $\mathfrak{m}$ \ForAll{$p\in{P_\mathfrak{m}}$} \If{Cons$_{PT}$($p,\mathfrak{m}$) $\wedge$
422 $\neg$Cut$_{PT}$($p,\mathfrak{m}$)}
424 VF2($extend(\mathfrak{m},p)$, $PT$) \EndIf \EndFor \EndIf \EndProcedure
429 For the current mapping $\mathfrak{m}$, the algorithm computes $P_\mathfrak{m}$, the set of
430 candidate node pairs for extending the current mapping $\mathfrak{m}$.
432 For each pair $p$ in $P_\mathfrak{m}$, $Cons_{PT}(p,\mathfrak{m})$ and
433 $Cut_{PT}(p,\mathfrak{m})$ are evaluated. If the former is true and
434 the latter is false, the whole process is recursively applied to
435 $extend(\mathfrak{m},p)$. Otherwise, $extend(\mathfrak{m},p)$ is not consistent by $PT$, or it
436 can be proved that $\mathfrak{m}$ can not be extended to a whole mapping.
438 In order to make sure of the correctness, see Claim~\ref{claim:consMapps}.
439 \begin{claim}\label{claim:consMapps}
440 Through consistent mappings, only consistent whole mappings can be
441 reached, and all the consistent whole mappings are reachable through
445 Note that a mapping may be reached in exponentially many different ways, since the
446 order of extensions does not influence the nascent mapping.
448 However, one may make the following observations.
451 %\label{claim:claimTotOrd}
452 %Let $\prec$ be an arbitrary total ordering relation on $V_{1}$. If
453 %the algorithm ignores each $p=(u,v) \in P_\mathfrak{m}$, for which
455 %$\exists (\tilde{u},\tilde{v})\in P_\mathfrak{m}: \tilde{u} \prec u$,
457 %then no mapping can be reached more than once, and each whole mapping %remains reachable.
461 A total order $(u_{\sigma(1)},u_{\sigma(2)},..,u_{\sigma(|V_{1}|)})$ of
462 $V_{1}$ is \textbf{matching order} if VF2 can cover $u_{\sigma(d)}$ on the $d$-th level for all $d\in\{1,..,|V_{1}|\}$.
466 \label{claim:claimTotOrd}
467 If VF2 is prescribed to cover the nodes of $G_1$ according to a matching order, then no mapping can be reached more than once and each whole mapping remains reachable.
470 Note that the cornerstone of the improvements to VF2 is to choose a proper
473 \subsection{The candidate set}
474 \label{candidateComputingVF2}
475 Let $P_\mathfrak{m}$ be the set of the candidate pairs for inclusion in $\mathfrak{m}$.
478 Let $\mathbf{T_{1}(\mathfrak{m})}:=\{u \in V_{1}\backslash\mathfrak{D}(\mathfrak{m}) : \exists \tilde{u}\in{\mathfrak{D}(\mathfrak{m}): (u,\tilde{u})\in E_{1}}\}$, and
479 $\mathbf{T_{2}(\mathfrak{m})} := \{v \in V_{2}\backslash\mathfrak{R}(\mathfrak{m}) : \exists\tilde{v}\in{\mathfrak{R}(\mathfrak{m}):(v,\tilde{v})\in E_{2}}\}$.
482 The set $P_\mathfrak{m}$ contains the pairs of uncovered neighbours of covered
483 nodes, and if there is not such a node pair, all the pairs containing
484 two uncovered nodes are added. Formally, let
488 T_{1}(\mathfrak{m})\times T_{2}(\mathfrak{m})&\hspace{-0.15cm}\text{if }
489 T_{1}(\mathfrak{m})\!\neq\!\emptyset\ \text{and }T_{2}(\mathfrak{m})\!\neq
490 \emptyset,\\ (V_{1}\!\setminus\!\mathfrak{D}(\mathfrak{m}))\!\times\!(V_{2}\!\setminus\!\mathfrak{R}(\mathfrak{m}))
491 &\hspace{-0.15cm}\text{otherwise}.
495 \subsection{Consistency}
496 Let $p=(u,v)\in V_{1}\times V_{2}$, and suppose $\mathfrak{m}$ is a consistent mapping by
497 $PT$. $Cons_{PT}(p,\mathfrak{m})$ checks whether
498 adding pair $p$ into $\mathfrak{m}$ leads to a consistent mapping by $PT$.
500 For example, the consistency function of the induced subgraph isomorphism problem is the following.
502 Let $\mathbf{\Gamma_{1} (u)}:=\{\tilde{u}\in V_{1} :
503 (u,\tilde{u})\in E_{1}\}$, and $\mathbf{\Gamma_{2}
504 (v)}:=\{\tilde{v}\in V_{2} : (v,\tilde{v})\in E_{2}\}$, where $u\in V_{1}$ and $v\in V_{2}$. That is, $\mathbf{\Gamma_{i} (w)}$ denotes the set of neighbours of node $w$ in $G_i$ $(i=1,2)$.
508 $extend(\mathfrak{m},(u,v))$ is a consistent mapping by $IND$ if and only if $\mathfrak{m}$ is consistent and $(\forall \tilde{u}\in \mathfrak{D}(\mathfrak{m}): (u,\tilde{u})\in E_{1}
509 \Leftrightarrow (v,\mathfrak{m}(\tilde{u}))\in E_{2})$.
512 The following formulation gives an efficient way of calculating $Cons_{IND}$.
515 $Cons_{IND}((u,v),\mathfrak{m}):=Cons_{IND}(\mathfrak{m})\wedge\mathcal{L}(u)\!\!=\!\!\mathcal{L}(v)\wedge(\forall \tilde{v}\in \Gamma_{2}(v)\cap\mathfrak{R}(\mathfrak{m}):(u,\mathfrak{m}^{-1}(\tilde{v}))\in E_{1})\wedge
516 (\forall \tilde{u}\in \Gamma_{1}(u)
517 \cap \mathfrak{D}(\mathfrak{m}):(v,\mathfrak{m}(\tilde{u}))\in E_{2})$ is the consistency function for $IND$.
520 \subsection{Cutting rules}
521 $Cut_{PT}(p,\mathfrak{m})$ is defined by a collection of efficiently
522 verifiable conditions. The requirement is that $Cut_{PT}(p,\mathfrak{m})$ can
523 be true only if it is impossible to extend $extend(\mathfrak{m},p)$ to a
526 As an example, a cutting function of induced subgraph isomorphism problem is presented.
528 Let $\mathbf{\tilde{T}_{1}}(\mathfrak{m}):=(V_{1}\backslash
529 \mathfrak{D}(\mathfrak{m}))\backslash T_{1}(\mathfrak{m})$, and
530 \\ $\mathbf{\tilde{T}_{2}}(\mathfrak{m}):=(V_{2}\backslash
531 \mathfrak{R}(\mathfrak{m}))\backslash T_{2}(\mathfrak{m})$.
535 $Cut_{IND}((u,v),\mathfrak{m}):= |\Gamma_{2} (v)\ \cap\ T_{2}(\mathfrak{m})| <
536 |\Gamma_{1} (u)\ \cap\ T_{1}(\mathfrak{m})| \vee |\Gamma_{2}(v)\cap
537 \tilde{T}_{2}(\mathfrak{m})| < |\Gamma_{1}(u)\cap
538 \tilde{T}_{1}(\mathfrak{m})|$ is a cutting function for $IND$.
541 \section{The VF2++ Algorithm}\label{sec:VF2ppAlg}
542 Although any matching order makes the search space of VF2 a
543 tree, its choice turns out to dramatically influence the number of
544 visited states. The goal is to determine an efficient one as quickly
547 The main reason for the superiority of VF2++ over VF2 is twofold. Firstly,
548 taking into account the structure and the node labelling of the graph,
549 VF2++ determines a matching order in which most of the unfruitful
550 branches of the search space can be pruned immediately. Secondly,
551 introducing more efficient --- nevertheless still easier to compute
552 --- cutting rules reduces the chance of going astray even further.
554 In addition to the usual subgraph isomorphism problem, specialized versions
555 for induced subgraph and graph isomorphism problems have been
558 Note that a weaker version of the cutting rules of VF2++ and an efficient
559 candidate set calculation method were described in~\cite{VF2Plus}.
561 It should be noted that all the methods described in this section are
562 extendable to handle directed graphs and edge labels as well.
563 The basic ideas and the detailed description of VF2++ are provided in
564 the following.\newline
566 The goal is to find a matching order in which the algorithm is able to
567 recognize inconsistency or prune the infeasible branches on the
568 highest levels and goes deep only if it is needed.
571 Let $\mathbf{Conn_{H}(u)}:=|\Gamma_{1}(u)\cap H|$, that is the
572 number of neighbours of u which are in H, where $u\in V_{1} $ and
576 The principal question is the following. Suppose a mapping $\mathfrak{m}$ is
577 given. For which node of $T_{1}(\mathfrak{m})$ is the hardest to find a
578 consistent pair in $G_{2}$? The more covered neighbours a node in
579 $T_{1}(\mathfrak{m})$ has --- i.e. the largest $Conn_{\mathfrak{D}(\mathfrak{m})}$ it has
580 ---, the more rarely satisfiable consistency constraints for its pair
583 Most of the graphs of biological and chemical structures are sparse, thus several nodes in
584 $T_{1}(\mathfrak{m})$ may have the same $Conn_{\mathfrak{D}(\mathfrak{m})}$, which makes
585 reasonable to define a secondary and a tertiary order between them.
586 The observation above proves itself to be as determining, that the
587 secondary ordering prefers nodes with the most uncovered neighbours
588 among which have the same $Conn_{\mathfrak{D}(\mathfrak{m})}$ to increase
589 $Conn_{\mathfrak{D}(\mathfrak{m})}$ of uncovered nodes as much, as possible. The tertiary ordering prefers nodes having the rarest uncovered labels in $G_2$.
591 Note that the secondary ordering is the same as ordering by degrees,
592 which is a static data in front of the above used.
594 These rules can easily result in a matching order which contains the
595 nodes of a long path successively, whose nodes may have low $Conn$ and
596 is easily matchable into $G_{2}$. To try to avoid that, a Breadth-first-search order is used, and on each of its levels, the ordering procedure described above is applied.
599 In the following, some examples on which the VF2 may be slow are
600 described, although they are easily solvable by using a proper
604 Suppose $G_{1}$ can be mapped into $G_{2}$ in many ways
605 without node labels. Let $u\in V_{1}$ and $v\in V_{2}$.
607 $\mathcal{L}(u):=black$
609 $\mathcal{L}(v):=black$
611 $\mathcal{L}(\tilde{u}):=red \ \forall \tilde{u}\in V_{1}\backslash
614 $\mathcal{L}(\tilde{v}):=red \ \forall \tilde{v}\in V_{2}\backslash
617 Now, any mapping by $\mathcal{L}$ must contain $(u,v)$, since
618 $u$ is black and no node in $V_{2}$ has a black label except
619 $v$. If unfortunately $u$ were the last node which will get covered,
620 VF2 would check only in the last steps, whether $u$ can be matched to
623 However, had $u$ been the first matched node, u would have been
624 matched immediately to v, so all the mappings would have been
625 precluded in which node labels can not correspond.
629 Suppose there is no node label given, and $G_{1}$ is a small graph that can not be mapped into $G_{2}$ and $u\in V_{1}$.
631 Let $G'_{1}:=(V_{1}\cup
632 \{u'_{1},u'_{2},..,u'_{k}\},E_{1}\cup
633 \{(u,u'_{1}),(u'_{1},u'_{2}),..,(u'_{k-1},u'_{k})\})$, that is,
634 $G'_{1}$ is $G_{1}\cup \{ a\ k$ long path, which is disjoint
635 from $G_{1}$ and one of its starting points is connected to $u\in
638 If unfortunately the nodes of the path were the first $k$ nodes in the
639 matching order, the algorithm would iterate through all the possible k
640 long paths in $G_{2}$, and it would recognize that no path can be
641 extended to $G'_{1}$.
643 However, had it started by the matching of $G_{1}$, it would not
644 have matched any nodes of the path.
647 These examples may look artificial, but the same problems also appear
648 in real-world instances, even though in a less obvious way.
650 %\subsection{Preparations}
652 %\label{claim:claimCoverFromLeft}
653 %The total ordering relation uniquely determines a node order, in which
654 %the nodes of $V_{1}$ will be covered by VF2. From the point of
655 %view of the matching procedure, this means, that always the same node
656 %of $G_{1}$ will be covered on the $d$-th level.
660 %An order $(u_{\sigma(1)},u_{\sigma(2)},..,u_{\sigma(|V_{1}|)})$ of
661 %$V_{1}$ is \textbf{matching order} if there exists $\prec$ total
662 %ordering relation, s.t. the VF2 with $\prec$ on the d-th level finds
663 %pair for $u_{\sigma(d)}$ for all $d\in\{1,..,|V_{1}|\}$.
666 %\begin{claim}\label{claim:MOclaim}
667 %A total ordering is matching order iff the nodes of every component
668 %form an interval in the node sequence, and every node connects to a
669 %previous node in its component except the first node of each component.
672 %In summary, a total ordering always uniquely determines a matching
673 %order, and every matching order can be determined by a total ordering,
674 %however, more than one different total orderings may determine the
675 %same matching order.
677 \subsection{Matching order}
679 Let \textbf{F$_\mathcal{M}$(l)}$:=|\{v\in V_{2} :
680 l=\mathcal{L}(v)\}|-|\{u\in \mathcal{M} : l=\mathcal{L}(u)\}|$,
681 where $l$ is a label and $\mathcal{M}\subseteq V_{1}$.
684 \begin{definition}Let $\mathbf{arg\ max}_{f}(S) :=\{u\in S : f(u)=max_{v\in S}\{f(v)\}\}$ and $\mathbf{arg\ min}_{f}(S) := arg\ max_{(-f)}(S)$, where $S$ is a finite set and $f:S\longrightarrow \mathbb{R}$.
688 Let $deg(v)$ denote the degree of node $v$.
693 \algtext*{EndProcedure}
696 \caption{\hspace{0.5cm}$The\ method\ of\ VF2++\ for\ determining\ the\ node\ order$}\label{alg:VF2PPPseu}
697 \begin{algorithmic}[1]
698 \Procedure{VF2++order}{} \State $\mathcal{M}$ := $\emptyset$
699 \Comment{matching order} \While{$V_{1}\backslash \mathcal{M}
700 \neq\emptyset$} \State $r\in$ arg max$_{deg}$ (arg
701 min$_{F_\mathcal{M}\circ \mathcal{L}}(V_{1}\backslash
702 \mathcal{M})$)\label{alg:findMin} \State Compute $T$, a BFS tree with
703 root node $r$. \For{$d=0,1,...,depth(T)$} \State $V_d$:=nodes of the
704 $d$-th level \State Process $V_d$ \Comment{See Algorithm~\ref{alg:VF2PPProcess1}} \EndFor
705 \EndWhile \EndProcedure
711 \algtext*{EndProcedure}%ne nyomtasson ..
713 \caption{\hspace{.5cm}$The\ method\ for\ processing\ a\ level\ of\ the\ BFS\ tree$}\label{alg:VF2PPProcess1}
714 \begin{algorithmic}[1]
715 \Procedure{VF2++ProcessLevel}{$V_{d}$} \While{$V_d\neq\emptyset$}
716 \State $m\in$ arg min$_{F_\mathcal{M}\circ\ \mathcal{L}}($ arg max$_{deg}($arg
717 max$_{Conn_{\mathcal{M}}}(V_{d})))$ \State $V_d:=V_d\backslash m$
718 \State Append node $m$ to the end of $\mathcal{M}$ \State Refresh
719 $F_\mathcal{M}$ \EndWhile \EndProcedure
723 Algorithm~\ref{alg:VF2PPPseu} is a high-level description of the
724 matching order procedure of VF2++. It computes a BFS tree for each
725 component in ascending order of their rarest node labels and largest $deg$,
726 whose root vertex is the minimal node of its component. Algorithm~\ref{alg:VF2PPProcess1} is a method to process a level of the BFS tree, which appends the nodes of the current level in descending
727 lexicographic order by $(Conn_{\mathcal{M}},deg,-F_\mathcal{M})$ separately
728 to $\mathcal{M}$, and refreshes $F_\mathcal{M}$ immediately.
731 Algorithm~\ref{alg:VF2PPPseu} provides a matching order.
735 \subsection{Cutting rules}
736 \label{VF2PPCuttingRules}
737 This section presents the cutting rules of VF2++, which are improved by using extra information coming from the node labels.
739 Let $\mathbf{\Gamma_{1}^{l}(u)}:=\{\tilde{u} : \mathcal{L}(\tilde{u})=l
740 \wedge \tilde{u}\in \Gamma_{1} (u)\}$ and
741 $\mathbf{\Gamma_{2}^{l}(v)}:=\{\tilde{v} : \mathcal{L}(\tilde{v})=l \wedge
742 \tilde{v}\in \Gamma_{2} (v)\}$, where $u\in V_{1}$, $v\in
743 V_{2}$ and $l$ is a label.
746 \begin{claim}[Cutting function for ISO]
747 \[LabCut_{ISO}((u,v),\mathfrak{m}):=\bigvee_{l\ is\ label}|\Gamma_{2}^{l} (v) \cap T_{2}(\mathfrak{m})|\!\neq\!|\Gamma_{1}^{l}(u)\cap T_{1}(\mathfrak{m})|\ \vee\]\[\bigvee_{l\ is\ label} \newline |\Gamma_{2}^{l}(v)\cap \tilde{T}_{2}(\mathfrak{m})| \neq |\Gamma_{1}^{l}(u)\cap \tilde{T}_{1}(\mathfrak{m})|\] is a cutting function for ISO.
750 \begin{claim}[Cutting function for IND]
751 \[LabCut_{IND}((u,v),\mathfrak{m}):=\bigvee_{l\ is\ label}|\Gamma_{2}^{l} (v) \cap T_{2}(\mathfrak{m})|\!<\!|\Gamma_{1}^{l}(u)\cap T_{1}(\mathfrak{m})|\ \vee\]\[\bigvee_{l\ is\ label} \newline |\Gamma_{2}^{l}(v)\cap \tilde{T}_{2}(\mathfrak{m})| < |\Gamma_{1}^{l}(u)\cap \tilde{T}_{1}(\mathfrak{m})|\] is a cutting function for IND.
754 \begin{claim}[Cutting function for SUB]
755 \[LabCut_{SU\!B}((u,v),\mathfrak{m}):=\bigvee_{l\ is\ label}|\Gamma_{2}^{l} (v) \cap T_{2}(\mathfrak{m})|\!<\!|\Gamma_{1}^{l}(u)\cap T_{1}(\mathfrak{m})|\] is a cutting function for SUB.
760 \section{Implementation details}\label{sec:VF2ppImpl}
761 This section provides a detailed summary of an efficient
762 implementation of VF2++.
764 Let $\Delta_1$ and $\Delta_2$ denote the largest degree in $G_1$ and $G_2$, respectively, and let $\Delta=\max\{\Delta_1,\Delta_2\}$.
766 \subsection{Storing a mapping}
767 After fixing an arbitrary node order ($u_0, u_1, ..,
768 u_{|V_{1}|-1}$) of $G_{1}$, an array $M$ can be used to store
769 the current mapping in the following way.
773 v & if\ (u_i,v)\ is\ in\ the\ mapping\\ INV\!ALI\!D &
774 if\ no\ node\ has\ been\ mapped\ to\ u_i,
777 where $i\in\{0,1, ..,|V_{1}|-1\}$, $v\in V_{2}$ and $INV\!ALI\!D$
779 \subsection{Avoiding the recurrence}
780 The recursion of Algorithm~\ref{alg:VF2Pseu} can be realized
781 as a \textit{while loop}, which has a loop counter $depth$ denoting the
782 current depth of the recursion. Fixing a matching order, let $M$
783 denote the array storing the current mapping. Observe that
784 $M$ is $INV\!ALI\!D$ from index $depth$+1 and not $INV\!ALI\!D$ before
785 $depth$. $M[depth]$ changes
786 while the state is being processed, but the property is held before
787 both stepping back to a predecessor state and exploring a successor
790 The necessary part of the candidate set is easily maintainable or
791 computable by following
792 the steps described in Section~\ref{candidateComputingVF2}. A much faster method
793 has been designed for biological and sparse graphs, see the next
796 \subsection{Calculating the candidates for a node}
797 The task is not to maintain the candidate set, but to generate the
798 candidate nodes in $G_{2}$ for a given node $u\in V_{1}$. In
799 case of any of the three problem types and a mapping $\mathfrak{m}$, if a node $v\in
800 V_{2}$ is a potential pair of $u\in V_{1}$, then $\forall
801 u'\in \mathfrak{D}(\mathfrak{m}) : (u,u')\in
802 E_{1}\Rightarrow (v,\mathfrak{m}(u'))\in
803 E_{2}$. That is, each covered neighbour of $u$ has to be mapped to
804 a covered neighbour of $v$, i.e. selecting arbitrarily a covered neighbour $u'$ of $u$, all of the admissible candidates for $u$ are among the neighbours of $\mathfrak{m}(u')$.
806 Having said that, an algorithm running in $\Theta(\Delta_2)$ time is
807 describable if there exists a covered node in the component containing
808 $u$, and a linear one otherwise.
811 \subsection{Determining the node order}
812 This section describes how the node order preprocessing method of
813 VF2++ can efficiently be implemented.
815 For using lookup tables, the node labels are associated with the
816 numbers $\{0,1,..,|K|-1\}$, where $K$ is the set of the labels. It
817 enables $F_\mathcal{M}$ to be stored in an array. At first, the node order
818 $\mathcal{M}=\emptyset$, so $F_\mathcal{M}[i]$ is the number of nodes
819 in $V_{2}$ having label $i$, which is easy to compute in
820 $\Theta(|V_{2}|)$ steps.
822 Representing $\mathcal{M}\subseteq V_{1}$ as an array of
823 size $|V_{1}|$, both the computation of the BFS tree, and processing its levels by Algorithm~\ref{alg:VF2PPProcess1} can be done in-place by swapping nodes.
825 \subsection{Cutting rules}
826 In Section~\ref{VF2PPCuttingRules}, the cutting rules were
827 described using the sets $T_{1}$, $T_{2}$, $\tilde T_{1}$
828 and $\tilde T_{2}$, which are dependent on the current mapping. The aim is to check the labelled cutting
829 rules of VF2++ in $\Theta(\Delta)$ time.
831 Firstly, suppose that these four sets are given in such a way, that
832 checking whether a node is in a certain set takes constant time,
833 e.g. they are given by their 0-1 characteristic vectors. Let $L$ be an
834 initially zero integer lookup table of size $|K|$. After incrementing
835 $L[\mathcal{L}(u')]$ for all $u'\in \Gamma_{1}(u) \cap T_{1}(\mathfrak{m})$ and
836 decrementing $L[\mathcal{L}(v')]$ for all $v'\in\Gamma_{2} (v) \cap
837 T_{2}(\mathfrak{m})$, the first part of the cutting rules is checkable in
838 $\Theta(\Delta)$ time by considering the proper signs of $L$. Setting $L$
839 to zero takes $\Theta(\Delta)$ time again, which makes it possible to use
840 the same table through the whole algorithm. The second part of the
841 cutting rules can be verified using the same method with $\tilde
842 T_{1}$ and $\tilde T_{2}$ instead of $T_{1}$ and
843 $T_{2}$. Thus, the overall time complexity is $\Theta(\Delta)$.
845 To maintain the sets $T_{1}$, $T_{2}$, $\tilde T_{1}$
846 and $\tilde T_{2}$, two other integer lookup tables storing the number of covered neighbours of the nodes of the two graphs can be used. This representation allows constant-time membership checking, furthermore it is maintainable in $\Theta(\Delta)$ time whenever a node pair is added or subtracted by incrementing
847 or decrementing the proper indices. A further improvement is that the
848 values of $L[\mathcal{L}(u')]$ in case of checking $u$ are dependent only on
849 $u$, i.e. on the current depth of the recursion, so for each $u\in V_{1}$, an array of pairs \textit{(label, number of such labels)} can store $L$. Note that these arrays are at most of size
850 $\Delta_1$ if pairs with non-appearing node labels are discarded.
852 Using similar techniques, the consistency function can be evaluated in
853 $\Theta(\Delta)$ steps, as well.
855 \section{Experimental results}\label{sec:ExpRes}
856 This section compares the performance of VF2++ and VF2~Plus. According to
857 our experience, both algorithms run faster than VF2 with orders of
858 magnitude, thus its inclusion was not reasonable.
860 The algorithms were implemented in C++ using the open-source
861 LEMON graph and network optimization library~\cite{LEMON}. The tests were carried out on a Linux-based system with an Intel i7 X980 3.33 GHz CPU and 6 GB of RAM.
862 \subsection{Biological graphs}
863 The tests have been executed on a recent biological dataset created
864 for the International Contest on Pattern Search in Biological
865 Databases~\cite{Content}, which has been constructed of molecule,
866 protein and contact map graphs extracted from the Protein Data
867 Bank~\cite{ProteinDataBank}.
869 The molecule dataset contains small graphs with less than 100 nodes
870 and an average degree of less than 3. The protein dataset contains
871 graphs having 500-10 000 nodes and an average degree of 4, while the
872 contact map dataset contains graphs with 150-800 nodes and an average
875 In the following, both the induced subgraph and the graph
876 isomorphism problems will be examined.
877 This dataset provides graph pairs, between which all the induced subgraph isomorphisms have to be found. For runtime results, please see Figure~\ref{fig:bioIND}.
879 In an other experiment, the nodes of each graph in the database had been
880 shuffled, and an isomorphism between the shuffled and the original
881 graph was searched. The running times are shown on Figure~\ref{fig:bioISO}.
887 \begin{subfigure}[b]{0.55\textwidth}
889 \begin{tikzpicture}[trim axis left, trim axis right]
890 \begin{axis}[title=Molecules IND,xlabel={$|V_2|$},ylabel={time (ms)},legend entries={VF2 Plus,VF2++},grid
891 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
892 west},scaled x ticks = false,x tick label style={/pgf/number
893 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
894 format/1000 sep = \kern 0.08em}]
895 %\addplot+[only marks] table {proteinsOrig.txt};
896 \addplot table {Orig/Molecules.32.txt}; \addplot[mark=triangle*,mark
897 size=1.8pt,color=red] table {VF2PPLabel/Molecules.32.txt};
900 \caption{In the case of molecules, the algorithms have
901 similar behaviour, but VF2++ is almost two times faster even on such
902 small graphs.} \label{fig:INDMolecule}
906 \begin{subfigure}[b]{0.55\textwidth}
908 \begin{tikzpicture}[trim axis left, trim axis right]
909 \begin{axis}[title=Contact maps IND,xlabel={$|V_2|$},ylabel={time (ms)},legend entries={VF2 Plus,VF2++},grid
910 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
911 west},scaled x ticks = false,x tick label style={/pgf/number
912 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
913 format/1000 sep = \kern 0.08em}]
914 %\addplot+[only marks] table {proteinsOrig.txt};
915 \addplot table {Orig/ContactMaps.128.txt};
916 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
917 {VF2PPLabel/ContactMaps.128.txt};
920 \caption{On contact maps, VF2++ runs almost in constant time, while VF2
921 Plus has a near-linear behaviour.} \label{fig:INDContact}
927 \begin{subfigure}[b]{0.55\textwidth}
929 \begin{tikzpicture}[trim axis left, trim axis right]
930 \begin{axis}[title=Proteins IND,xlabel={$|V_2|$},ylabel={time (ms)},legend entries={VF2 Plus,VF2++},grid
931 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
932 west},scaled x ticks = false,x tick label style={/pgf/number
933 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
934 format/1000 sep = \kern 0.08em}] %\addplot+[only marks] table
935 {proteinsOrig.txt}; \addplot[mark=*,mark size=1.2pt,color=blue]
936 table {Orig/Proteins.256.txt}; \addplot[mark=triangle*,mark
937 size=1.8pt,color=red] table {VF2PPLabel/Proteins.256.txt};
940 \caption{Both of the algorithms have linear behaviour on protein
941 graphs. VF2++ is more than 10 times faster than VF2
942 Plus.} \label{fig:INDProt}
947 \caption{\normalsize{Induced subgraph isomorphism problem on biological graphs}}\label{fig:bioIND}
954 \begin{subfigure}[b]{0.55\textwidth}
956 \begin{tikzpicture}[trim axis left, trim axis right]
957 \begin{axis}[title=Molecules ISO,xlabel={$|V_2|$},ylabel={time (ms)},legend entries={VF2 Plus,VF2++},grid
958 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
959 west},scaled x ticks = false,x tick label style={/pgf/number
960 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
961 format/1000 sep = \kern 0.08em}]
962 %\addplot+[only marks] table {proteinsOrig.txt};
963 \addplot table {Orig/moleculesIso.txt}; \addplot[mark=triangle*,mark
964 size=1.8pt,color=red] table {VF2PPLabel/moleculesIso.txt};
967 \caption{The results are close to each other on contact maps, but VF2++ seems to be slightly faster as the number of nodes increases.
968 }\label{fig:ISOMolecule}
972 \begin{subfigure}[b]{0.55\textwidth}
974 \begin{tikzpicture}[trim axis left, trim axis right]
975 \begin{axis}[title=Contact maps ISO,xlabel={$|V_2|$},ylabel={time (ms)},legend entries={VF2 Plus,VF2++},grid
976 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
977 west},scaled x ticks = false,x tick label style={/pgf/number
978 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
979 format/1000 sep = \kern 0.08em}]
980 %\addplot+[only marks] table {proteinsOrig.txt};
981 \addplot table {Orig/contactMapsIso.txt}; \addplot[mark=triangle*,mark
982 size=1.8pt,color=red] table {VF2PPLabel/contactMapsIso.txt};
985 \caption{In the case of molecules, there is no significant
986 difference, but VF2++ performs consistently better.}\label{fig:ISOContact}
992 \begin{subfigure}[b]{0.55\textwidth}
994 \begin{tikzpicture}[trim axis left, trim axis right]
995 \begin{axis}[title=Proteins ISO,xlabel={$|V_2|$},ylabel={time (ms)},legend entries={VF2 Plus,VF2++},grid
996 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
997 west},scaled x ticks = false,x tick label style={/pgf/number
998 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
999 format/1000 sep = \kern 0.08em}]
1000 %\addplot+[only marks] table {proteinsOrig.txt};
1001 \addplot table {Orig/proteinsIso.txt}; \addplot[mark=triangle*,mark
1002 size=1.8pt,color=red] table {VF2PPLabel/proteinsIso.txt};
1005 \caption{On protein graphs, VF2~Plus has a super linear time
1006 complexity, while VF2++ runs in near constant time. The difference
1007 is about two orders of magnitude on large graphs.}\label{fig:ISOProt}
1012 \caption{\normalsize{Graph isomorphism problem on biological graphs}}\label{fig:bioISO}
1018 \subsection{Random graphs}
1019 This section compares VF2++ with VF2~Plus on random graphs of large
1020 size. The node labels are uniformly distributed. Let $\delta$ denote
1021 the average degree. For the parameters of problems solved in the
1022 experiments, please see the top of each chart.
1023 \subsubsection{Graph isomorphism problem}
1024 To evaluate the efficiency of the algorithms in the case of graph
1025 isomorphism problem, random connected graphs of less than 20 000 nodes have been
1026 considered. Generating a random graph and shuffling its nodes, an
1027 isomorphism had to be found. Figure~\ref{fig:randISO} shows the runtime results
1028 on graph sets of various density.
1036 \begin{subfigure}[b]{0.55\textwidth}
1039 \begin{axis}[title={Random ISO, $\delta = 5$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1040 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1041 west},scaled x ticks = false,x tick label style={/pgf/number
1042 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1043 format/1000 sep = \kern 0.08em}]
1044 %\addplot+[only marks] table {proteinsOrig.txt};
1045 \addplot table {randGraph/iso/vf2pIso5_1.txt};
1046 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1047 {randGraph/iso/vf2ppIso5_1.txt};
1053 \begin{subfigure}[b]{0.55\textwidth}
1056 \begin{axis}[title={Random ISO, $\delta = 10$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1057 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1058 west},scaled x ticks = false,x tick label style={/pgf/number
1059 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1060 format/1000 sep = \kern 0.08em}]
1061 %\addplot+[only marks] table {proteinsOrig.txt};
1062 \addplot table {randGraph/iso/vf2pIso10_1.txt};
1063 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1064 {randGraph/iso/vf2ppIso10_1.txt};
1071 \begin{subfigure}[b]{0.55\textwidth}
1074 \begin{axis}[title={Random ISO, $\delta = 15$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1075 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1076 west},scaled x ticks = false,x tick label style={/pgf/number
1077 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1078 format/1000 sep = \kern 0.08em}]
1079 %\addplot+[only marks] table {proteinsOrig.txt};
1080 \addplot table {randGraph/iso/vf2pIso15_1.txt};
1081 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1082 {randGraph/iso/vf2ppIso15_1.txt};
1087 \begin{subfigure}[b]{0.55\textwidth}
1090 \begin{axis}[title={Random ISO, $\delta = 100$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1091 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1092 west},scaled x ticks = false,x tick label style={/pgf/number
1093 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1094 format/1000 sep = \kern 0.08em}]
1095 %\addplot+[only marks] table {proteinsOrig.txt};
1096 \addplot table {randGraph/iso/vf2pIso100_1.txt};
1097 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1098 {randGraph/iso/vf2ppIso100_1.txt};
1104 \caption{Graph isomorphism problem on random graphs
1105 }\label{fig:randISO}
1109 \subsubsection{Induced subgraph isomorphism problem}
1110 This section presents a comparison of VF2++ and VF2~Plus in the case
1111 of induced subgraph isomorphism problem. In addition to the size of graph $G_2$, that of $G_1$ dramatically influences the hardness of
1112 a given problem too, so the overall picture is provided by examining graphs to be embedded of various size.
1114 For each chart, a number $0<\rho< 1$ has been fixed, and the following
1115 has been executed 150 times. Generating a large graph $G_{2}$ of an average degree of $\delta$,
1116 choose 10 of its induced subgraphs having $\rho|V_{2}|$ nodes,
1117 and for all the 10 subgraphs find a mapping by using both graph
1118 matching algorithms. The $\delta = 5, 10, 35$ and $\rho = 0.05, 0.1,
1119 0.3, 0.8$ cases have been examined, see
1120 Figure~\ref{fig:randIND5},~\ref{fig:randIND10}~and~\ref{fig:randIND35}.
1129 \begin{subfigure}[b]{0.55\textwidth}
1132 \begin{axis}[title={Random IND, $\delta = 5$, $\rho = 0.05$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1133 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1134 west},scaled x ticks = false,x tick label style={/pgf/number
1135 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1136 format/1000 sep = \kern 0.08em}]
1137 %\addplot+[only marks] table {proteinsOrig.txt};
1138 \addplot table {randGraph/ind/vf2pInd5_0.05.txt};
1139 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1140 {randGraph/ind/vf2ppInd5_0.05.txt};
1145 \begin{subfigure}[b]{0.55\textwidth}
1148 \begin{axis}[title={Random IND, $\delta = 5$, $\rho = 0.1$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1149 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1150 west},scaled x ticks = false,x tick label style={/pgf/number
1151 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1152 format/1000 sep = \kern 0.08em}]
1153 %\addplot+[only marks] table {proteinsOrig.txt};
1154 \addplot table {randGraph/ind/vf2pInd5_0.1.txt};
1155 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1156 {randGraph/ind/vf2ppInd5_0.1.txt};
1162 \begin{subfigure}[b]{0.55\textwidth}
1165 \begin{axis}[title={Random IND, $\delta = 5$, $\rho = 0.3$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1166 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1167 west},scaled x ticks = false,x tick label style={/pgf/number
1168 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1169 format/1000 sep = \kern 0.08em}]
1170 %\addplot+[only marks] table {proteinsOrig.txt};
1171 \addplot table {randGraph/ind/vf2pInd5_0.3.txt};
1172 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1173 {randGraph/ind/vf2ppInd5_0.3.txt};
1178 \begin{subfigure}[b]{0.55\textwidth}
1181 \begin{axis}[title={Random IND, $\delta = 5$, $\rho = 0.8$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1182 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1183 west},scaled x ticks = false,x tick label style={/pgf/number
1184 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1185 format/1000 sep = \kern 0.08em}]
1186 %\addplot+[only marks] table {proteinsOrig.txt};
1187 \addplot table {randGraph/ind/vf2pInd5_0.8.txt};
1188 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1189 {randGraph/ind/vf2ppInd5_0.8.txt};
1195 \caption{Induced subgraph isomorphism problem on random graphs having an average degree of
1196 5}\label{fig:randIND5}
1203 \begin{subfigure}[b]{0.55\textwidth}
1207 \begin{axis}[title={Random IND, $\delta = 10$, $\rho = 0.05$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1208 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1209 west},scaled x ticks = false,x tick label style={/pgf/number
1210 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1211 format/1000 sep = \kern 0.08em}]
1212 %\addplot+[only marks] table {proteinsOrig.txt};
1213 \addplot table {randGraph/ind/vf2pInd10_0.05.txt};
1214 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1215 {randGraph/ind/vf2ppInd10_0.05.txt};
1220 \begin{subfigure}[b]{0.55\textwidth}
1224 \begin{axis}[title={Random IND, $\delta = 10$, $\rho = 0.1$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1225 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1226 west},scaled x ticks = false,x tick label style={/pgf/number
1227 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1228 format/1000 sep = \kern 0.08em}]
1229 %\addplot+[only marks] table {proteinsOrig.txt};
1230 \addplot table {randGraph/ind/vf2pInd10_0.1.txt};
1231 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1232 {randGraph/ind/vf2ppInd10_0.1.txt};
1238 \begin{subfigure}[b]{0.55\textwidth}
1241 \begin{axis}[title={Random IND, $\delta = 10$, $\rho = 0.3$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1242 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1243 west},scaled x ticks = false,x tick label style={/pgf/number
1244 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1245 format/1000 sep = \kern 0.08em}]
1246 %\addplot+[only marks] table {proteinsOrig.txt};
1247 \addplot table {randGraph/ind/vf2pInd10_0.3.txt};
1248 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1249 {randGraph/ind/vf2ppInd10_0.3.txt};
1254 \begin{subfigure}[b]{0.55\textwidth}
1257 \begin{axis}[title={Random IND, $\delta = 10$, $\rho = 0.8$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1258 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1259 west},scaled x ticks = false,x tick label style={/pgf/number
1260 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1261 format/1000 sep = \kern 0.08em}]
1262 %\addplot+[only marks] table {proteinsOrig.txt};
1263 \addplot table {randGraph/ind/vf2pInd10_0.8.txt};
1264 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1265 {randGraph/ind/vf2ppInd10_0.8.txt};
1271 \caption{Induced subgraph isomorphism problem on random graphs having an average degree of
1272 10}\label{fig:randIND10}
1280 \begin{subfigure}[b]{0.55\textwidth}
1283 \begin{axis}[title={Random IND, $\delta = 35$, $\rho = 0.05$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1284 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1285 west},scaled x ticks = false,x tick label style={/pgf/number
1286 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1287 format/1000 sep = \kern 0.08em}]
1288 %\addplot+[only marks] table {proteinsOrig.txt};
1289 \addplot table {randGraph/ind/vf2pInd35_0.05.txt};
1290 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1291 {randGraph/ind/vf2ppInd35_0.05.txt};
1296 \begin{subfigure}[b]{0.55\textwidth}
1299 \begin{axis}[title={Random IND, $\delta = 35$, $\rho = 0.1$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1300 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1301 west},scaled x ticks = false,x tick label style={/pgf/number
1302 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1303 format/1000 sep = \kern 0.08em}]
1304 %\addplot+[only marks] table {proteinsOrig.txt};
1305 \addplot table {randGraph/ind/vf2pInd35_0.1.txt};
1306 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1307 {randGraph/ind/vf2ppInd35_0.1.txt};
1313 \begin{subfigure}[b]{0.55\textwidth}
1316 \begin{axis}[title={Random IND, $\delta = 35$, $\rho = 0.3$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1317 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1318 west},scaled x ticks = false,x tick label style={/pgf/number
1319 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1320 format/1000 sep = \kern 0.08em}]
1321 %\addplot+[only marks] table {proteinsOrig.txt};
1322 \addplot table {randGraph/ind/vf2pInd35_0.3.txt};
1323 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1324 {randGraph/ind/vf2ppInd35_0.3.txt};
1329 \begin{subfigure}[b]{0.55\textwidth}
1332 \begin{axis}[title={Random IND, $\delta = 35$, $\rho = 0.8$},width=7.2cm,height=6cm,xlabel={$|V_2|$},ylabel={time (ms)},ylabel near ticks,legend entries={VF2 Plus,VF2++},grid
1333 =major,mark size=1.2pt, legend style={at={(0,1)},anchor=north
1334 west},scaled x ticks = false,x tick label style={/pgf/number
1335 format/1000 sep = \kern 0.08em},y tick label style={/pgf/number
1336 format/1000 sep = \kern 0.08em}]
1337 %\addplot+[only marks] table {proteinsOrig.txt};
1338 \addplot table {randGraph/ind/vf2pInd35_0.8.txt};
1339 \addplot[mark=triangle*,mark size=1.8pt,color=red] table
1340 {randGraph/ind/vf2ppInd35_0.8.txt};
1346 \caption{Induced subgraph isomorphism problem on random graphs having an average degree of
1347 35}\label{fig:randIND35}
1351 Based on these experiments, VF2++ is faster than VF2~Plus and able to
1352 handle really large graphs in milliseconds. Note that when $IND$ was
1353 considered and the graph to be embedded had proportionally few nodes ($\rho =
1354 0.05$, or $\rho = 0.1$), then VF2~Plus produced some inefficient node
1355 orders (e.g. see the $\delta=10$ case on
1356 Figure~\ref{fig:randIND10}). If these instances had been excluded, the
1357 charts would have looked similarly to the other ones.
1358 Unsurprisingly, as denser graphs are considered, both VF2++ and VF2
1359 Plus slow slightly down, but remain practically usable even on graphs
1360 having 10 000 nodes.
1366 \section{Conclusion}
1367 This paper presented VF2++, a new graph matching algorithm based on VF2, and analysed it from a practical viewpoint.
1369 Recognizing the importance of the node order and determining an
1370 efficient one, VF2++ is able to match graphs of thousands of nodes in
1371 near practically linear time including preprocessing. In addition to
1372 the proper order, VF2++ uses more efficient cutting
1373 rules which are easy to compute and make the algorithm able to prune
1374 most of the unfruitful branches without going astray.
1376 In order to show the efficiency of the new method, it has been
1377 compared to VF2~Plus~\cite{VF2Plus}, which is the best contemporary algorithm.
1379 The experiments show that VF2++ consistently outperforms VF2~Plus on
1380 biological graphs. It seems to be asymptotically faster on protein and
1381 on contact map graphs in the case of induced subgraph isomorphism problem,
1382 while in the case of graph isomorphism problem, it has definitely better
1383 asymptotic behaviour on protein graphs.
1385 Regarding random sparse graphs, not only has VF2++ proved itself to be
1386 faster than VF2~Plus, but it also has a practically linear behaviour both
1387 in the case of induced subgraph and graph isomorphism problems.
1390 \section*{Acknowledgement} \label{sec:ack}
1392 This research project was initiated and sponsored by QuantumBio
1393 Inc.~\cite{QUANTUMBIO}.
1395 The authors were supported by the Hungarian Scientific Research Fund -
1396 OTKA, K109240 and by the J\'anos Bolyai Research Fellowship program of
1397 the Hungarian Academy of Sciences.
1400 %% The Appendices part is started with the command \appendix;
1401 %% appendix sections are then done as normal sections
1407 %% If you have bibdatabase file and want bibtex to generate the
1408 %% bibitems, please use
1410 \bibliographystyle{elsarticle-num} \bibliography{bibliography}
1412 %% else use the following coding to input the bibitems directly in the
1415 %% \begin{thebibliography}{00}
1417 %% %% \bibitem{label}
1418 %% %% Text of bibliographic item
1422 %% \end{thebibliography}
1427 %% End of file `elsarticle-template-num.tex'.