# A tree-based method for the rapid screening of chemical fingerprints

- Thomas G Kristensen (corresponding author)
- Jesper Nielsen (corresponding author)
- Christian NS Pedersen

**5**:9

**DOI: **10.1186/1748-7188-5-9

© Kristensen et al. 2010

**Received: **28 July 2009

**Accepted: **4 January 2010

**Published: **4 January 2010

## Abstract

### Background

The fingerprint of a molecule is a bitstring based on its structure, constructed such that structurally similar molecules will have similar fingerprints. Molecular fingerprints can be used in an initial phase of drug development for identifying novel drug candidates by screening large databases for molecules with fingerprints similar to a query fingerprint.

### Results

In this paper, we present a method which efficiently finds all fingerprints in a database with Tanimoto coefficient to the query fingerprint above a user defined threshold. The method is based on two novel data structures for rapid screening of large databases: the *k*D grid and the Multibit tree. The *k*D grid is based on splitting the fingerprints into *k* shorter bitstrings and utilising these to compute bounds on the similarity of the complete bitstrings. The Multibit tree uses hierarchical clustering and similarity within each cluster to compute similar bounds. We have implemented our method and tested it on a large real-world data set. Our experiments show that our method yields approximately a three-fold speed-up over previous methods.

### Conclusions

Using the novel *k*D grid and Multibit tree significantly reduces the time needed for searching databases of fingerprints. This will allow researchers to (1) perform more searches than previously possible and (2) easily search large databases.

## 1 Introduction

When developing novel drugs, researchers are faced with the task of selecting a subset of all commercially available molecules for further experiments. There are more than 8 million such molecules available [1], and it is not feasible to perform computationally expensive calculations on each one. Therefore, the need arises for fast screening methods for identifying the molecules that are most likely to have an effect on a given disease. It is often the case that a molecule with some effect is already known, e.g. from an already existing drug. An obvious initial screening method presents itself, namely to identify the molecules which are similar to this known molecule. To implement this screening method one must decide on a representation of the molecules and a similarity measure between representations of molecules. Several representations and similarity measures have been proposed [2–4]. We focus on *molecular fingerprints*. A fingerprint for a given molecule is a bitstring of size *N* which summarises structural information about the molecule [3]. Fingerprints should be constructed such that if two fingerprints are very similar, so are the molecules which they represent. There are several ways of measuring the similarity between fingerprints [4]. We focus on the *Tanimoto coefficient*, which is a normalised measure of how many bits two fingerprints share. It is 1.0 when the fingerprints are the same, and strictly smaller than 1.0 when they are not. Molecular fingerprints in combination with the Tanimoto coefficient have been used successfully in previous studies [5].

We focus on the screening problem of finding all fingerprints in a database with Tanimoto coefficient to a query fingerprint above a given threshold, e.g. 0.9. Previous attempts have been made to improve the query time. One approach is to reduce the number of fingerprints in the database for which the Tanimoto coefficient to the query fingerprint has to be computed explicitly. This includes storing the fingerprints in the database in a vector of bins [6], or in a trie-like structure [7], such that searching certain bins, or parts of the trie, can be avoided based on an upper bound on the Tanimoto coefficient between the query fingerprint and all fingerprints in individual bins or subtries. Another approach is to store an XOR summary, i.e. a shorter bitstring, of each fingerprint in the database, and use these as rough upper bounds on the maximal Tanimoto coefficients achievable, before calculating the exact coefficients [8].

In this paper, we present an efficient method for the screening problem, which is based on an extension of an upper bound given in [6] and two novel tree based data structures for storing and retrieving fingerprints. To further reduce the query time we also utilise the XOR summary strategy [8]. We have implemented our method and tested it on a realistic data set. Our experiments clearly demonstrate that it is superior to previous strategies, as it yields a three-fold speed-up over the previous best method.

## 2 Methods

A fingerprint is a bitstring of length $N$. Let $A$ and $B$ be bitstrings, and let $|A|$ denote the number of 1-bits in $A$. Let $A \wedge B$ denote the *logical and* of $A$ and $B$, that is, $A \wedge B$ is the bitstring that has 1-bits in exactly those positions where both $A$ and $B$ do. Likewise, let $A \vee B$ denote the *logical or* of $A$ and $B$, that is, $A \vee B$ is the bitstring that has 1-bits in exactly those positions where either $A$ or $B$ does. With this notation the Tanimoto coefficient becomes:

$$S_T(A, B) = \frac{|A \wedge B|}{|A \vee B|}$$

In the following, we present a method for finding all fingerprints $B$ in a database of fingerprints with a Tanimoto coefficient above some query-specific threshold $S_{\min}$ to a query fingerprint $A$. The method is based on two novel data structures, the *k*D grid and the Multibit tree, for storing the database of fingerprints.
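The definition above translates directly into a few bit operations. The following is a minimal sketch, assuming fingerprints are stored as Python integers; the helper names `popcount` and `tanimoto`, and the convention that two all-zero fingerprints have coefficient 1.0, are illustrative choices and not part of the original method.

```python
def popcount(x: int) -> int:
    """Number of 1-bits in a non-negative integer."""
    return bin(x).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient |A and B| / |A or B| of two bitstrings held as ints."""
    union = popcount(a | b)
    if union == 0:
        return 1.0  # assumption: two all-zero fingerprints count as identical
    return popcount(a & b) / union


A = 0b101101
B = 0b100111
# A and B share 3 one-bits; 5 positions are set in at least one of them.
print(tanimoto(A, B))  # 0.6
```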

### 2.1 *k*D grid

Swamidass *et al*. showed in [6] that if $|A|$ and $|B|$ are known, $S_T(A, B)$ can be upper-bounded by

$$S_{\max} = \frac{\min(|A|, |B|)}{\max(|A|, |B|)}$$

This bound can be used to speed up the search, by storing the database of fingerprints in $N + 1$ buckets such that bitstring $B$ is stored in the $|B|$th bucket. When searching for bitstrings similar to a query bitstring $A$ it is sufficient to examine the buckets where $S_{\max} \geq S_{\min}$.
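The bucket strategy can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation of [6]: fingerprints are Python integers, the names `build_buckets` and `search` are hypothetical, and a hit is any fingerprint whose coefficient is at least `s_min`.

```python
from collections import defaultdict


def popcount(x):
    return bin(x).count("1")


def tanimoto(a, b):
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def build_buckets(fingerprints):
    """Group fingerprints by their number of 1-bits, |B|."""
    buckets = defaultdict(list)
    for fp in fingerprints:
        buckets[popcount(fp)].append(fp)
    return buckets


def search(buckets, query, s_min):
    """Return all stored fingerprints with Tanimoto coefficient >= s_min."""
    a = popcount(query)
    hits = []
    for b, bucket in buckets.items():
        # Bound min(|A|, |B|) / max(|A|, |B|) holds for the whole bucket.
        s_max = min(a, b) / max(a, b) if max(a, b) else 1.0
        if s_max < s_min:
            continue  # no fingerprint in this bucket can reach the threshold
        hits.extend(fp for fp in bucket if tanimoto(query, fp) >= s_min)
    return hits


db = [0b1111, 0b1110, 0b0001, 0b11111111, 0b1011]
print(sorted(search(build_buckets(db), 0b1111, 0.7)))  # [11, 14, 15]
```

Only the buckets that survive the bound are scanned, so the exact coefficient is computed for a fraction of the database when the threshold is high.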

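Choose a number $k$ and split the bitstrings into $k$ equally sized fragments such that

$$A = A_1 \cdot A_2 \cdot \ldots \cdot A_k, \qquad B = B_1 \cdot B_2 \cdot \ldots \cdot B_k$$

where $X \cdot Y$ is the concatenation of bitstrings $X$ and $Y$.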

The values $|A_1|, |A_2|, \ldots, |A_k|$ and $|B_1|, |B_2|, \ldots, |B_k|$ can be used to obtain a tighter bound than $S_{\max}$. Let $N_i$ be the length of $A_i$ and $B_i$. The *k*D grid is a $k$-dimensional cube of size $(N_1 + 1) \times (N_2 + 1) \times \ldots \times (N_k + 1)$. Each grid point is a bucket and the fingerprint $B$ is stored in the bucket at coordinates $(n_1, n_2, \ldots, n_k)$, where $n_i = |B_i|$. An example of such a grid is illustrated in Fig. 2. By comparing the partial coordinates $(n_1, n_2, \ldots, n_i)$ of a given bucket to $|A_1|, |A_2|, \ldots, |A_i|$, where $i \leq k$, it is possible to upper-bound the Tanimoto coefficient between $A$ and every $B$ in that bucket. By looking at the partial coordinates $(n_1, n_2, \ldots, n_{i-1})$, we can use this to quickly identify those partial coordinates $(n_1, n_2, \ldots, n_i)$ that may contain fingerprints $B$ with a Tanimoto coefficient above $S_{\min}$.

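Assume the algorithm is visiting a node at a given level $i$ in the data structure. The indices $n_1, n_2, \ldots, n_{i-1}$ are known, but we need to compute which $n_i$ to visit at this level. The entries to be visited further down the data structure, $n_{i+1}, \ldots, n_k$, are, of course, unknown at this point. A bound can be calculated in the following manner.

$$S_T(A, B) \leq \frac{\sum_{j=1}^{k} \min(|A_j|, n_j)}{\sum_{j=1}^{k} \max(|A_j|, n_j)} \leq \frac{\sum_{j=1}^{i-1} \min(|A_j|, n_j) + \min(|A_i|, n_i) + \sum_{j=i+1}^{k} |A_j|}{\sum_{j=1}^{i-1} \max(|A_j|, n_j) + \max(|A_i|, n_i) + \sum_{j=i+1}^{k} |A_j|}$$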

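The $n_i$'s to visit lie in an interval and it is thus sufficient to compute the upper and lower indices of this interval, $n_u$ and $n_l$ respectively. Setting the bound equal to $S_{\min}$, isolating $n_i$ and ensuring that the result is an integer in the range $0 \ldots N_i$ gives:

$$n_l = \max\left(\left\lceil S_{\min}\left(A_{\vee} + |A_i| + A_{\text{last}}\right) - \left(A_{\wedge} + A_{\text{last}}\right)\right\rceil, 0\right)$$

$$n_u = \min\left(\left\lfloor \frac{A_{\wedge} + |A_i| + A_{\text{last}}}{S_{\min}} - \left(A_{\vee} + A_{\text{last}}\right)\right\rfloor, N_i\right)$$

where $A_{\wedge} = \sum_{j=1}^{i-1} \min(|A_j|, n_j)$ is a bound on the number of 1-bits in the *logical and* in the first part of the bitstrings. $A_{\vee} = \sum_{j=1}^{i-1} \max(|A_j|, n_j)$ is a bound for the *logical or* in the first part of the bitstrings.

Similarly, $A_{\text{last}} = \sum_{j=i+1}^{k} |A_j|$ is a bound on the last part.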
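The interval computation can be illustrated with a short sketch. It assumes the bound takes the form (a_and + min(|A_i|, n_i) + a_last) / (a_or + max(|A_i|, n_i) + a_last), where a_and and a_or sum min(|A_j|, n_j) and max(|A_j|, n_j) over the levels already fixed and a_last sums |A_j| over the levels still to come. The function name `visit_interval` and its parameters are hypothetical, levels are 0-based, and `s_min` must be greater than zero. Exact rational arithmetic is used so that rounding errors cannot exclude a valid index.

```python
import math
from fractions import Fraction


def visit_interval(a_counts, prefix, n_i_max, s_min):
    """Inclusive range (n_l, n_u) of coordinates worth visiting at the next level.

    a_counts: 1-bit counts of the query fragments, one per grid dimension
    prefix:   coordinates already fixed on the path from the root
    n_i_max:  length N_i of the fragment at the current level
    s_min:    similarity threshold, 0 < s_min <= 1
    """
    i = len(prefix)                      # current level (0-based)
    s = Fraction(str(s_min))             # exact value of the threshold
    a_and = sum(min(a, n) for a, n in zip(a_counts, prefix))
    a_or = sum(max(a, n) for a, n in zip(a_counts, prefix))
    a_last = sum(a_counts[i + 1:])
    a_i = a_counts[i]
    n_l = max(math.ceil(s * (a_or + a_i + a_last) - (a_and + a_last)), 0)
    n_u = min(math.floor((a_and + a_i + a_last) / s - (a_or + a_last)), n_i_max)
    return n_l, n_u


# Query fragments with 3 and 2 one-bits, fragment length 4, threshold 0.5.
print(visit_interval([3, 2], [], 4, 0.5))   # (1, 4)
print(visit_interval([3, 2], [1], 4, 0.5))  # (2, 3)
```

An empty interval is reported as a pair with `n_l > n_u`, in which case the whole subgrid can be skipped.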

In the case where $k = 1$ this data structure simply becomes the list presented by Swamidass *et al*. [6], and in the case where $k = N$ the data structure becomes the binary trie presented by Smellie [7]. We have implemented the *k*D grid as a list of lists, where any list containing no fingerprints is omitted. See Fig. 3 for an example of a 4D grid containing four bitstrings. The fingerprints stored in a single bucket in the *k*D grid can be organised in a number of ways. The most naive approach is to store them in a simple list which has to be searched linearly. We propose to store them in tree structures as explained below.
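A sparse grid with simple lists as buckets might look like the sketch below. It is an illustration under assumptions, not the authors' Java code: nested dictionaries stand in for the list of lists with empty entries omitted, fingerprints are Python integers whose width is divisible by $k$, and the names `fragment_counts`, `insert` and `candidates` are hypothetical. Instead of computing the interval explicitly, the traversal evaluates the partial bound for each coordinate that actually occurs.

```python
def popcount(x):
    return bin(x).count("1")


def fragment_counts(fp, n_bits, k):
    """1-bit counts of the k equally sized fragments, most significant first."""
    width = n_bits // k
    mask = (1 << width) - 1
    return [popcount((fp >> (width * (k - 1 - j))) & mask) for j in range(k)]


def insert(grid, fp, n_bits, k):
    """Store fp in the bucket addressed by its fragment counts."""
    coords = fragment_counts(fp, n_bits, k)
    node = grid
    for n in coords[:-1]:
        node = node.setdefault(n, {})    # inner levels are created on demand
    node.setdefault(coords[-1], []).append(fp)


def candidates(grid, a_counts, s_min, level=0, a_and=0, a_or=0):
    """Yield fingerprints from every bucket whose partial bound reaches s_min."""
    a_i = a_counts[level]
    a_last = sum(a_counts[level + 1:])
    for n_i, child in grid.items():
        num = a_and + min(a_i, n_i) + a_last
        den = a_or + max(a_i, n_i) + a_last
        if den and num / den < s_min:
            continue                     # prune this coordinate and all below it
        if level == len(a_counts) - 1:
            yield from child             # child is the bucket list
        else:
            yield from candidates(child, a_counts, s_min, level + 1,
                                  a_and + min(a_i, n_i), a_or + max(a_i, n_i))


grid = {}
for fp in [0b11110000, 0b11100001, 0b00001111, 0b11111111, 0b11010000]:
    insert(grid, fp, 8, 2)
query_counts = fragment_counts(0b11110000, 8, 2)
print(sorted(candidates(grid, query_counts, 0.6)))  # [208, 225, 240]
```

The fingerprints that are yielded are only candidates; each one still needs an exact Tanimoto computation before it is reported.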

### 2.2 Singlebit tree

The *Singlebit tree* is a binary tree which stores the fingerprints of a single bucket from a *k*D grid. At each node in the tree a position in the bitstring is chosen. All fingerprints with a zero at that position are stored in the left subtree while all those with a one are stored in the right subtree. This division is continued recursively until all the fingerprints in a given node are the same. When searching for a query bitstring $A$ in the tree it now becomes possible, by comparing $A$ to the path from the root of the tree to a given node, to compute an upper bound $S_{\max}^{\text{Singlebit}}$ on $S_T(A, B)$ for every fingerprint $B$ in the subtree of that given node. Given two bitstrings $A$ and $B$ let $M_{ij}$ be the number of positions where $A$ has an $i$ and $B$ has a $j$. There are four possible combinations of $i$ and $j$, namely $M_{00}$, $M_{01}$, $M_{10}$ and $M_{11}$. With this notation, $S_T(A, B) = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$.

The path from the root of a tree to a node defines lower limits $m_{ij}$ on $M_{ij}$ for every fingerprint in the subtree of that node. Let $u_{ij}$ denote the unknown difference between $M_{ij}$ and $m_{ij}$, that is $u_{ij} = M_{ij} - m_{ij}$.

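Remember that $|B| = \sum_{i=1}^{k} n_i$ is known when processing a given bucket.

An upper bound on the Tanimoto coefficient for any fingerprint $B$ in the subtree can then be calculated as

$$S_{\max}^{\text{Singlebit}} = \frac{\min(|A| - m_{10}, |B| - m_{01})}{\max(|A| + m_{01}, |B| + m_{10})}$$

This follows because $M_{11} = |A| - M_{10} = |B| - M_{01}$ and $|A \vee B| = |A| + M_{01} = |B| + M_{10}$, with $M_{01} \geq m_{01}$ and $M_{10} \geq m_{10}$.

The Singlebit tree can also be used to store all the fingerprints in the database without a *k*D grid. In this case, however, $|B|$ is no longer available and thus the $S_{\max}^{\text{Singlebit}}$ bound cannot be used. A less tight bound can be formulated, but experiments, not included in this paper, indicate that this is a poor strategy.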
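As an illustration of how the path counts enter such a bound, the sketch below counts the lower limits along a root-to-node path and evaluates one bound that follows from the definitions: the number of shared 1-bits is at most min(|A| - m_10, |B| - m_01), and the number of bits set in either string is at least max(|A| + m_01, |B| + m_10). The function names and the representation of a path as `(position, bit)` pairs are assumptions made for this example.

```python
def path_counts(query_bits, path):
    """Count the lower limits m_ij over the decisions on a root-to-node path.

    query_bits: the query as a string of '0'/'1' characters
    path:       list of (position, bit) pairs, bit being the value every
                fingerprint below the node has at that position
    """
    m = {"00": 0, "01": 0, "10": 0, "11": 0}
    for pos, bit in path:
        m[f"{query_bits[pos]}{bit}"] += 1
    return m


def singlebit_bound(a_count, b_count, m01, m10):
    """Upper bound on the Tanimoto coefficient for fingerprints below a node."""
    numerator = min(a_count - m10, b_count - m01)
    denominator = max(a_count + m01, b_count + m10)
    return numerator / denominator if denominator else 1.0


query = "110100"                       # |A| = 3
m = path_counts(query, [(0, 1), (1, 0), (2, 1)])
print(m)                               # {'00': 0, '01': 1, '10': 1, '11': 1}
print(singlebit_bound(3, 3, m["01"], m["10"]))  # 0.5
```

A subtree is visited only if the value returned by `singlebit_bound` is at least the threshold.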

### 2.3 Multibit tree

The experiments in Sec. 3 unfortunately show that using the *k*D grid combined with Singlebit trees decreases performance compared to using the *k*D grid and simple lists. The fingerprints used in our experiments have a length of 1024 bits. In our experiments no Singlebit tree was observed to contain more than 40,000 fingerprints. This implies that the expected height of the Singlebit trees is no more than 15 (as we aim for balanced trees, cf. above). Consequently, the algorithm will only obtain information about 15 out of 1024 bits before reaching the fingerprints. A strategy for obtaining more information is to store a list of bit positions, along with an annotation of whether each bit is zero or one, in each node. The bits in this list are called the *match-bits*.

The *Multibit tree* is an extension of the Singlebit tree, where we no longer demand that all children of a given node are split according to the value of a single bit. In fact we only demand that the data is arranged in *some* binary tree. The match-bits of a given node are computed as all bits that are not a match-bit in any ancestor and for which all fingerprints in the leaves of the node have the same value. Note that a node could easily have no match-bits. When searching through the Multibit tree, the query bitstring $A$ is compared to the match-bits of each visited node and $m_{00}$, $m_{01}$, $m_{10}$ and $m_{11}$ are updated accordingly. $S_{\max}^{\text{Multibit}}$ is computed the same way as $S_{\max}^{\text{Singlebit}}$ and only branches for which $S_{\max}^{\text{Multibit}} \geq S_{\min}$ are visited.

… $l$ children. Based on initial experiments, not included in this paper, $l$ is chosen as 6, which reduces memory consumption by more than a factor of two and has no significant impact on speed. An obvious alternative way to build the tree would be to base it on some hierarchical clustering method, such as Neighbour Joining [9].

## 3 Experiments

We have implemented the *k*D grid and the Singlebit and Multibit trees in Java. The implementation, along with all test data, is available at http://www.birc.au.dk/~tgk/TanimotoQuery/.

Using these implementations, we have constructed several search methods corresponding to the different combinations of the data structures. We have examined the *k*D grid for *k* = 1, 2, 3 and 4, where the fingerprints in the buckets are stored in a simple list, a Singlebit tree or a Multibit tree. For purposes of comparison, we have implemented a linear search strategy that simply examines all fingerprints in the database. We have also implemented the strategy of "pruning using the bit-bound approach first, followed by pruning using the difference of the number of 1-bits in the XOR-compressed vectors, followed by pruning using the XOR approach" from [8]. This strategy will hereafter simply be known as *Baldi*. A trick of comparing the XOR-folded bitstrings [8] immediately before computing the true Tanimoto coefficient is used in all our strategies to improve performance. The length of the XOR summary is set to 128, as suggested in [8]. An experiment, not included in this paper, confirmed that this is indeed the optimal size of the XOR fingerprint. We have chosen to reimplement related methods in order to make an unbiased comparison of the running times independent of programming language differences.
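The XOR folding trick can be illustrated with a small sketch. It is one plausible reading of the approach in [8], not a reproduction of it: the fingerprint is cut into chunks of the summary length and the chunks are XOR-ed together. Since folding can only cancel differing bits, the folded strings differ in at most as many positions as the originals, and the identity $S_T = (|A| + |B| - d) / (|A| + |B| + d)$, with $d$ the number of differing positions, turns that into an upper bound. The names `xor_fold` and `xor_bound` are hypothetical.

```python
def popcount(x):
    return bin(x).count("1")


def xor_fold(fp, n_bits, m_bits):
    """Fold an n_bits-wide fingerprint to m_bits by XOR-ing consecutive chunks."""
    mask = (1 << m_bits) - 1
    folded = 0
    for shift in range(0, n_bits, m_bits):
        folded ^= (fp >> shift) & mask
    return folded


def xor_bound(a_count, b_count, folded_a, folded_b):
    """Upper bound on the Tanimoto coefficient from the folded fingerprints."""
    d = popcount(folded_a ^ folded_b)    # never more than popcount(A xor B)
    total = a_count + b_count
    return (total - d) / (total + d) if total + d else 1.0


A, B = 0b11010010, 0b11000110
fa, fb = xor_fold(A, 8, 4), xor_fold(B, 8, 4)
print(bin(fa), bin(fb))                            # 0b1111 0b1010
print(xor_bound(popcount(A), popcount(B), fa, fb)) # 0.6
```

If the bound falls below the threshold, the exact coefficient on the full-length fingerprints does not need to be computed.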

The experiments were performed on an Intel Core 2 Duo running at 2.5 GHz and with 2 GB of RAM. Fingerprints were generated using the CDK fingerprint generator [10] which has a standard fingerprint size *N* of 1024. One molecule timed out and did not generate a fingerprint. We have performed our tests on different sizes of the data set, from 100,000 to 2,000,000 fingerprints in 100,000 increments. For each data set size, the entire data structure is created. Next, the first 100 fingerprints in the database are used for queries. We measure the query time and the space consumption.

## 4 Results

Figure 7 shows the average query time for the different strategies and different values of *k* plotted against the database size. We note that the Multibit tree in a 1D grid is best for all sizes. Surprisingly, the simple list, for an appropriately high value of *k*, is faster than the Singlebit tree, yet slower than the Multibit tree. This is probably due to the fact that the Singlebit trees are too small to contain sufficient information for an efficient pruning: the entire tree is traversed, which is slower than traversing the corresponding list implementation. All three approaches (List, Singlebit- and Multibit trees) are clearly superior to the Baldi approach, which in turn is better than a simple linear search (with the XOR folding trick).

… *k*. This trend is further investigated in Fig. 8, which indicates that a *k* of three or four seems optimal. As *k* grows the grid becomes larger and more time consuming to traverse while the lists in the buckets become shorter. For sufficiently large values of *k*, the time spent pruning buckets exceeds the time that would be spent visiting buckets containing superfluous fingerprints. The Singlebit tree data in Fig. 7b indicates that the optimal value of *k* is three. It seems the trees become too small to contain enough information for an efficient pruning when *k* reaches four. In Fig. 7c we see the Multibit tree. Again, too large a *k* will actually slow down the data structure. This can be explained with arguments similar to those for the Singlebit tree. Surprisingly, it seems a *k* as low as one is optimal.

… *k*'s. In the worst case, where all buckets contain fingerprints, the memory consumption per fingerprint, for the grid alone, becomes proportional to $\frac{1}{n} \prod_{i=1}^{k} (N_i + 1)$, the number of buckets divided by the number of fingerprints, where *n* is the number of fingerprints in the database. Thus we are not surprised by our actual results.

The reason why linear search is not constant time for a constant data set is that, while it will always visit all fingerprints, the time for visiting a given fingerprint is not constant due to the XOR folding trick.

The results seem to be consistent with the average query time presented in Fig. 10.

## 5 Conclusion

In this paper we have presented a method for finding all fingerprints in a database with Tanimoto coefficient to a query fingerprint above a user defined threshold. Our method is based on a generalisation of the bounds developed in [6] to multiple dimensions. Our generalisation results in a tighter bound, and experiments indicate that this results in a performance increase. Furthermore, we have examined the possibility of utilising trees as secondary data structures in the buckets. Again, our experiments clearly demonstrate that this leads to a significant performance increase.

Our methods allow researchers to search larger databases faster than previously possible. The use of larger databases should increase the likelihood of finding relevant matches. The faster query times decrease the effort and time needed to do a search. This allows more searches to be done, either for more molecules or with different thresholds $S_{\min}$ on the Tanimoto coefficient. Both of these features increase the usefulness of fingerprint-based searches for the researcher in the laboratory.

Our method is currently limited by the rather large memory consumption of the Multibit tree. Another implementation might remedy this situation somewhat. Otherwise we suggest an I/O efficient implementation where the tree is kept on disk.

To increase the speed of our method further we are aware of two approaches. Firstly, the best way to construct the Multibit trees remains uninvestigated. Secondly, a tighter coupling between the Multibit tree and the *k*D grid would allow us to use grid information in the Multibit tree: in the *k*D grid we have information about each fragment of the fingerprints which is not used in the current tree bounds.

## References

1. Irwin JJ, Shoichet BK: **ZINC: A Free Database of Commercially Available Compounds for Virtual Screening.** *Journal of Chemical Information and Modeling* 2005, **45**: 177–182.
2. Gillet VJ, Willett P, Bradshaw J: **Similarity Searching Using Reduced Graphs.** *Journal of Chemical Information and Computer Sciences* 2003, **43**(2): 338–345.
3. Leach AR, Gillet VJ: *An Introduction to Chemoinformatics*. Kluwer Academic Publishers, Dordrecht, The Netherlands, rev. ed; 2007.
4. Willett P: **Similarity-based approaches to virtual screening.** *Biochem Soc Trans* 2003, **31**(Pt 3): 603–606.
5. Willett P, Barnard JM, Downs GM: **Chemical Similarity Searching.** *Journal of Chemical Information and Computer Sciences* 1998, **38**(6): 983–996.
6. Swamidass SJ, Baldi P: **Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sublinear Time.** *Journal of Chemical Information and Modeling* 2007, **47**(2): 302–317.
7. Smellie A: **Compressed Binary Bit Trees: A New Data Structure For Accelerating Database Searching.** *Journal of Chemical Information and Modeling* 2009, **49**(2): 257–262.
8. Baldi P, Hirschberg DS, Nasr RJ: **Speeding Up Chemical Database Searches Using a Proximity Filter Based on the Logical Exclusive OR.** *Journal of Chemical Information and Modeling* 2008, **48**(7): 1367–1378.
9. Saitou N, Nei M: **The neighbor-joining method: a new method for reconstructing phylogenetic trees.** *Mol Biol Evol* 1987, **4**(4): 406–425.
10. Steinbeck C, Han Y, Kuhn S, Horlacher O, Luttmann E, Willighagen E: **The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics.** *Journal of Chemical Information and Computer Sciences* 2003, **43**(2): 493–500.

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.