From: Algorithmic approaches to protein-protein interaction site prediction
Filter | Reason | Method | Software | Used By |
---|---|---|---|---|
Exclusion of non-biological complexes | Avoid training on complexes not present in vivo | Check against other databases | ||
Resolution | Low resolution structures may be inaccurate | PDB filtering | In-House | |
Canonical AAs | Most programs cannot handle non-canonical amino acids | | | |
Redundancy | Reduce overfitting | Sequence similarity cutoff | ||
 |  | Removal of members of same superfamily | [40] | |
 |  | Similarity clustering with representative structure | In-House | |
Specialized databases | Pre-filtered databases are more reliable | Use of database | ProtInDB [91], Piccolo [92], Negatome [93], iPfam [94,95], 3did [96] | |
Chain Length | Ensure removal of fragments and peptides | PDB filtering; UniProt [97] annotation and mapping to PDB [98] | In-House | |
Only X-ray Crystal Structures | NMR structures are harder to validate, less precise, and more difficult to process [99-101] | PDB filtering | In-House | |
No antibody-antigen interactions | Ag-Ab complexes bind according to different principles than other PPIs [9,16,102] | | | |
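
Several of the filters in the table above (experimental method, resolution, chain length, canonical amino acids, and redundancy) can be combined into a simple in-house pipeline. The following is a minimal sketch, not the procedure used by any of the cited works. The `ChainRecord` type, the function names, and the thresholds (3.0 Å, 30 residues, 0.3 similarity) are hypothetical choices made for illustration. The similarity measure is a crude stand-in based on Python's `difflib`; a real pipeline would use alignment-based sequence identity.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional

# The 20 canonical amino acids in one-letter code.
CANONICAL_AAS = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass
class ChainRecord:
    """Hypothetical per-chain metadata, e.g. parsed from PDB entries."""
    pdb_id: str
    method: str                  # e.g. "X-RAY DIFFRACTION", "SOLUTION NMR"
    resolution: Optional[float]  # in angstroms; None if not reported
    sequence: str                # one-letter residue codes


def passes_filters(rec: ChainRecord,
                   max_resolution: float = 3.0,  # assumed cutoff
                   min_length: int = 30) -> bool:  # assumed cutoff
    """Apply the method, resolution, chain length, and canonical AA filters."""
    if rec.method != "X-RAY DIFFRACTION":
        return False  # keep X-ray crystal structures only
    if rec.resolution is None or rec.resolution > max_resolution:
        return False  # drop low-resolution structures
    if len(rec.sequence) < min_length:
        return False  # drop fragments and peptides
    if not set(rec.sequence) <= CANONICAL_AAS:
        return False  # drop chains with non-canonical residues
    return True


def similarity(a: str, b: str) -> float:
    """Rough similarity in [0, 1]; a placeholder for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_redundant(records: List[ChainRecord],
                     max_similarity: float = 0.3) -> List[ChainRecord]:
    """Greedily keep a chain only if it is dissimilar to every chain kept so far."""
    kept: List[ChainRecord] = []
    for rec in records:
        if all(similarity(rec.sequence, k.sequence) <= max_similarity for k in kept):
            kept.append(rec)
    return kept
```

A dataset would then be built by applying `passes_filters` to each record and passing the survivors to `remove_redundant`. The greedy pass depends on input order, so sorting records by resolution first would favor the best-resolved representative of each group.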