Table 1 Filters used to curate protein datasets for training PPIS predictors, including the reasoning behind their use, the methods and specific software used to implement them, and references to the predictors that use them

From: Algorithmic approaches to protein-protein interaction site prediction

| Filter | Reason | Method | Software | Used By |
| --- | --- | --- | --- | --- |
| Exclusion of non-biological complexes | Avoid training on complexes not present in vivo | Check against other database | PQS [81], PISA [82,83] | [7,12,32,42] |
| Resolution | Low resolution structures may be inaccurate | PDB filtering | In-House | [7,20,29,37] |
| Canonical AAs | Most programs cannot handle non-canonical amino acids | | | [7,12,37,42] |
| Redundancy | Reduce overfitting | Sequence similarity cutoff | BLAST [84], PISCES [85,86], CD-HIT [87,88] | [7,12,20,29,37,38,42] |
| | | Removal of members of same superfamily | SCOP [89,90] | [40] |
| | | Similarity clustering with representative structure | In-House | [15,31,32] |
| Specialized databases | Pre-filtered databases are more reliable | Use of database | ProtInDB [91], Piccolo [92], Negatome [93], iPfam [94,95], 3did [96] | [12,38] |
| Chain Length | Ensure removal of fragments and peptides | PDB filtering; UniPROT [97] annotation and mapping to PDB [98] | In-House | [7,12,20,29,37,38,42] |
| Only X-ray Crystal Structures | NMR structures are harder to validate, less precise, and more difficult to process [99-101] | PDB filtering | In-House | [37,42] |
| No antibody-antigen interactions | Ag-Ab complexes bind on different principles than PPIs [9,16,102] | | | [37,40,103] |
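As an illustration of how several of the per-structure filters in Table 1 (X-ray only, resolution, chain length, canonical amino acids) might be combined in an in-house script, here is a minimal Python sketch. The `Chain` record, the `passes_filters` function, the method string `"X-RAY DIFFRACTION"`, and the cutoffs of 3.0 Å and 50 residues are all assumptions made for this example; the table does not specify thresholds, and individual predictors choose their own. Redundancy reduction is left out, since it would normally be delegated to tools such as those listed in the table.

```python
from dataclasses import dataclass
from typing import Optional

# The 20 canonical amino acids in one-letter code.
CANONICAL_AAS = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass
class Chain:
    """Hypothetical record for one protein chain (fields are assumptions)."""
    pdb_id: str
    method: str                  # experimental method, e.g. "X-RAY DIFFRACTION"
    resolution: Optional[float]  # in angstroms; None when not reported
    sequence: str                # one-letter codes; non-canonical residues as "X"


def passes_filters(chain, max_resolution=3.0, min_length=50):
    """Return True if the chain survives all four illustrative filters.

    The default cutoffs are placeholders, not values taken from Table 1.
    """
    # Only X-ray crystal structures.
    if chain.method != "X-RAY DIFFRACTION":
        return False
    # Resolution: discard structures with missing or poor resolution.
    if chain.resolution is None or chain.resolution > max_resolution:
        return False
    # Chain length: discard fragments and peptides.
    if len(chain.sequence) < min_length:
        return False
    # Canonical AAs: discard chains containing any other residue code.
    if not set(chain.sequence) <= CANONICAL_AAS:
        return False
    return True
```

A dataset would then be curated with something like `[c for c in chains if passes_filters(c)]`, with the cutoffs adjusted to match whichever predictor's protocol is being reproduced.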