A tree-based method for the rapid screening of chemical fingerprints

Kristensen, Thomas G; Nielsen, Jesper; Pedersen, Christian NS

doi:10.1186/1748-7188-5-9

linear search results

Andrew Dalke, Dalke Scientific

25 May 2010

The linear search implementation uses several layers of overhead on top of the CDK Tanimoto calculation code, which in turn creates an intermediate bitset object. I expect that quite a bit of the compute time is spent just in function call and object management overhead.

The linear search data structure uses a lot of memory. Based on figure 9, the implementation uses about 330 bytes for a 1024-bit fingerprint, which only needs 128 bytes. Adding perhaps 10 bytes for the identifier and a few bytes for overhead leads to a memory requirement of about 150 bytes, which is less than half of what the implementation uses.
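The arithmetic behind that estimate can be written out as a short sketch. The 12 bytes of per-record overhead is an assumed stand-in for "a few bytes", chosen only so the total lands near 150; the 330-byte figure is the value read off figure 9.

```python
FP_BITS = 1024
ID_BYTES = 10          # rough identifier size, from the estimate above
OVERHEAD_BYTES = 12    # assumption: stands in for "a few bytes" of overhead
OBSERVED_BYTES = 330   # approximate per-fingerprint cost read from figure 9

fp_bytes = FP_BITS // 8                          # raw fingerprint storage
estimate = fp_bytes + ID_BYTES + OVERHEAD_BYTES  # estimated bytes per record
ratio = estimate / OBSERVED_BYTES                # fraction of the observed cost

print(fp_bytes, estimate, round(ratio, 2))  # prints: 128 150 0.45
```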

I have implemented the linear search code using Python for file I/O and memory management and C for search speed. By using a single byte array, with the fingerprints stored as byte-based bitstrings identified by offset into the array, the memory requirements are only a little higher than the raw data. My hardware is about the same as used in the paper, and my performance was about 8x to 10x faster.
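To illustrate the storage layout, here is a minimal pure-Python sketch of a linear Tanimoto search over a single byte array. It is not the code used for the timings above (that search loop was in C), and the names `FP_BYTES`, `pack`, `tanimoto` and `linear_search` are hypothetical. It also assumes that two all-zero fingerprints score 0.0, which is a convention rather than something stated in the paper.

```python
FP_BYTES = 128  # 1024-bit fingerprint stored as 128 bytes

# Lookup table: number of set bits for every possible byte value.
POPCOUNT = bytes(bin(i).count("1") for i in range(256))


def pack(fingerprints):
    """Concatenate equal-length fingerprints into one contiguous bytearray."""
    arena = bytearray()
    for fp in fingerprints:
        if len(fp) != FP_BYTES:
            raise ValueError("fingerprint must be %d bytes" % FP_BYTES)
        arena += fp
    return arena


def tanimoto(query, arena, offset):
    """Tanimoto score between query and the fingerprint starting at offset."""
    both = either = 0
    for i in range(FP_BYTES):
        a = query[i]
        b = arena[offset + i]
        both += POPCOUNT[a & b]     # bits set in both fingerprints
        either += POPCOUNT[a | b]   # bits set in at least one fingerprint
    # Assumed convention: two empty fingerprints have similarity 0.0.
    return both / either if either else 0.0


def linear_search(query, arena, threshold):
    """Return (index, score) for every stored fingerprint at or above threshold."""
    hits = []
    for offset in range(0, len(arena), FP_BYTES):
        score = tanimoto(query, arena, offset)
        if score >= threshold:
            hits.append((offset // FP_BYTES, score))
    return hits


# Tiny usage example with three synthetic fingerprints.
fp_a = bytes([0xFF]) + bytes(FP_BYTES - 1)   # 8 bits set
fp_b = bytes([0x0F]) + bytes(FP_BYTES - 1)   # 4 bits set, all shared with fp_a
fp_c = bytes(FP_BYTES - 1) + bytes([0x01])   # 1 bit set, none shared with fp_a
arena = pack([fp_a, fp_b, fp_c])
hits = linear_search(fp_a, arena, 0.5)
```

Each fingerprint is addressed only by its offset, so the array costs exactly 128 bytes per record and there is no per-fingerprint object. Identifiers would live in a separate parallel list indexed by `offset // FP_BYTES`.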

A similar implementation is possible in Java. I don't know if Java can come close to C performance, but I would expect it to be within a factor of 2, which is still 4x to 5x faster than the authors reported.

The same optimizations apply to the Baldi implementation, so in figure 7 the performance line for the Baldi method should have about the same slope as the multibit tree.

The multibit code may have its own set of speedups, but I have not analyzed that. I write to point out that the graph for the linear searches reflects a rather high implementation overhead.

Competing interests

I am a consultant in computational chemistry and make my earnings by developing software and implementing algorithms in this field, including fingerprint searches. I develop and have helped others develop fingerprint algorithms including those available commercially. This is the only thing which may be construed as being a competing interest.



Algorithms for Molecular Biology
