linear search results
25 May 2010
Andrew Dalke, Dalke Scientific

The linear search implementation puts several layers of overhead on top of the CDK Tanimoto calculation code, which in turn creates an intermediate bitset object. I expect that quite a bit of the compute time is spent just in function-call and object-management overhead.

The linear search data structure also uses a lot of memory. Based on figure 9, the implementation uses about 330 bytes for a 1024-bit fingerprint, which needs only 128 bytes. Adding perhaps 10 bytes for the identifier and a few bytes for overhead leads to a memory requirement of about 150 bytes, which is less than half of what the implementation uses.

I have implemented the linear search code using Python for file I/O and memory management and C for search speed. By using a single byte array, with the fingerprints stored as byte-based bitstrings identified by their offset into the array, the memory requirements are only a little higher than the size of the raw data. My hardware is about the same as that used in the paper, and my implementation was about 8x to 10x faster.

A similar implementation is possible in Java. I don't know if Java can come close to C performance, but I would expect it to be within a factor of 2, which is still 4x to 5x faster than the authors reported.

The same optimizations apply to the Baldi implementation, so in figure 7 the performance line for the Baldi implementation should have about the same slope as that of the multibit tree. The multibit code may have its own set of speedups, but I have not analyzed that.

I write to point out that the graph for the linear searches reflects a rather high implementation overhead.

Competing interests

I am a consultant in computational chemistry and make my earnings by developing software and implementing algorithms in this field, including fingerprint searches. I develop and have helped others develop fingerprint algorithms, including those available commercially. This is the only thing which may be construed as a competing interest.
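As an illustration of the single-byte-array layout described in the comment above, here is a minimal pure-Python sketch. It is not the commenter's code (that used Python for I/O and C for the search loop), and it is far too slow for benchmarking; it only shows how fingerprints can be packed back to back in one `bytearray` and addressed by offset. The names `FingerprintArena`, `tanimoto`, and `linear_search` are hypothetical, chosen for this example.

```python
# Illustrative sketch only: all names here are hypothetical, not from the
# commenter's implementation or from the CDK.

NUM_BYTES = 1024 // 8  # a 1024-bit fingerprint needs 128 bytes


class FingerprintArena:
    """Stores fixed-size fingerprints contiguously in one byte array."""

    def __init__(self, num_bytes=NUM_BYTES):
        self.num_bytes = num_bytes
        self.data = bytearray()  # fingerprint i lives at offset i * num_bytes
        self.ids = []            # identifier for fingerprint i

    def add(self, identifier, fingerprint):
        if len(fingerprint) != self.num_bytes:
            raise ValueError("fingerprint has the wrong length")
        self.ids.append(identifier)
        self.data += fingerprint

    def get(self, index):
        start = index * self.num_bytes
        return bytes(self.data[start:start + self.num_bytes])


def popcount(value):
    """Number of set bits in a non-negative integer."""
    return bin(value).count("1")


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two equal-length byte strings."""
    a = int.from_bytes(fp_a, "big")
    b = int.from_bytes(fp_b, "big")
    union = popcount(a | b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return popcount(a & b) / union


def linear_search(arena, query, threshold):
    """Return (identifier, score) for every fingerprint scoring >= threshold."""
    hits = []
    for index, identifier in enumerate(arena.ids):
        score = tanimoto(query, arena.get(index))
        if score >= threshold:
            hits.append((identifier, score))
    return hits
```

With this layout the storage per record is the 128 bytes of fingerprint data plus whatever the identifier costs, which is the source of the roughly 150-byte estimate given in the comment. A C version would replace the per-fingerprint loop with a scan over the same contiguous buffer using a byte- or word-level popcount.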