
Table 1 Filters used to curate protein datasets for use in training PPIS predictors, including the reasoning behind their use, the methods and specific software used to implement them, as well as references detailing the predictors making use thereof

From: Algorithmic approaches to protein-protein interaction site prediction

| Filter | Reason | Method | Software | Used By |
| --- | --- | --- | --- | --- |
| Exclusion of non-biological complexes | Avoid training on complexes not present in vivo | Check against other database | PQS [81], PISA [82,83] | [7,12,32,42] |
| Resolution | Low resolution structures may be inaccurate | PDB filtering | In-House | [7,20,29,37] |
| Canonical AAs | Most programs cannot handle non-canonical amino acids | | | [7,12,37,42] |
| Redundancy | Reduce overfitting | Sequence similarity cutoff | BLAST [84], PISCES [85,86], CD-HIT [87,88] | [7,12,20,29,37,38,42] |
| | | Removal of members of same superfamily | SCOP [89,90] | [40] |
| | | Similarity clustering with representative structure | In-House | [15,31,32] |
| Specialized databases | Pre-filtered databases are more reliable | Use of database | ProtInDB [91], Piccolo [92], Negatome [93], iPfam [94,95], 3did [96] | [12,38] |
| Chain Length | Ensure removal of fragments and peptides | PDB filtering; UniPROT [97] annotation and mapping to PDB [98] | In-House | [7,12,20,29,37,38,42] |
| Only X-ray Crystal Structures | NMR are harder to validate, less precise, and more difficult to process [99-101] | PDB filtering | In-House | [37,42] |
| No antibody-antigen interactions | Ag-Ab complexes bind on different principles than PPIs [9,16,102] | | | [37,40,103] |
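
As an illustration of how the per-structure filters in the table (experimental method, resolution, chain length, canonical amino acids) might be chained together, the following is a minimal sketch, not the procedure of any cited predictor. The `ChainRecord` type, the function names, and the default thresholds (3.0 Å, 30 residues) are hypothetical choices made for this example; the table does not specify cutoff values. Redundancy reduction and the check for non-biological complexes are omitted because they rely on external tools such as those listed in the Software column.

```python
from dataclasses import dataclass
from typing import List, Optional

# The 20 canonical amino acids in one-letter code.
CANONICAL_AAS = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass
class ChainRecord:
    """Hypothetical metadata for one protein chain (not a real PDB API type)."""
    pdb_id: str
    chain_id: str
    method: str                  # e.g. "X-RAY DIFFRACTION" or "SOLUTION NMR"
    resolution: Optional[float]  # in angstroms; None when not reported
    sequence: str                # one-letter residue codes


def passes_filters(rec: ChainRecord,
                   max_resolution: float = 3.0,  # assumed cutoff, illustrative only
                   min_length: int = 30) -> bool:  # assumed cutoff, illustrative only
    """Return True if the chain survives every per-structure filter."""
    # Only X-ray crystal structures.
    if rec.method != "X-RAY DIFFRACTION":
        return False
    # Resolution: drop structures with missing or poor (high-valued) resolution.
    if rec.resolution is None or rec.resolution > max_resolution:
        return False
    # Chain length: drop fragments and peptides.
    if len(rec.sequence) < min_length:
        return False
    # Canonical AAs: drop chains containing any non-canonical residue code.
    if not set(rec.sequence) <= CANONICAL_AAS:
        return False
    return True


def curate(records: List[ChainRecord], **thresholds) -> List[ChainRecord]:
    """Keep only the records that pass all filters, preserving input order."""
    return [r for r in records if passes_filters(r, **thresholds)]
```

Applying the filters in order of increasing cost (metadata checks before sequence checks) keeps the sketch cheap to run; in practice the surviving chains would then be passed to a redundancy-reduction step.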