Algorithmic approaches to protein-protein interaction site prediction

Table 1 Filters used to curate protein datasets for training PPIS predictors, with the rationale for each filter, the methods and software used to implement it, and references to the predictors that use it

Filter	Reason	Method	Software	Used By
Exclusion of non-biological complexes	Avoid training on complexes not present in vivo	Check against other database	PQS [81], PISA [82,83]	[7,12,32,42]
Resolution	Low resolution structures may be inaccurate	PDB filtering	In-House	[7,20,29,37]
Canonical AAs	Most programs cannot handle non-canonical amino acids			[7,12,37,42]
Redundancy	Reduce overfitting	Sequence similarity cutoff	BLAST [84], PISCES [85,86], CD-HIT [87,88]	[7,12,20,29,37,38,42]
		Removal of members of same superfamily	SCOP [89,90]	[40]
		Similarity clustering with representative structure	In-House	[15,31,32]
Specialized databases	Pre-filtered databases are more reliable	Use of database	ProtInDB [91], Piccolo [92], Negatome [93], iPfam [94,95], 3did [96]	[12,38]
Chain Length	Ensure removal of fragments and peptides	PDB filtering; UniProt [97] annotation and mapping to PDB [98]	In-House	[7,12,20,29,37,38,42]
Only X-ray Crystal Structures	NMR structures are harder to validate, less precise, and more difficult to process [99-101]	PDB filtering	In-House	[37,42]
No antibody-antigen interactions	Antibody-antigen (Ag-Ab) complexes bind according to different principles than other PPIs [9,16,102]			[37,40,103]
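
Several of the filters in Table 1 (experimental method, resolution, chain length, canonical amino acids, and redundancy) can be chained into a simple curation pass. The following is a minimal sketch of that idea, not the procedure of any predictor cited above. The `Chain` record, the function names, and the thresholds (3.0 Å, 40 residues, 0.3 similarity) are hypothetical and chosen only for illustration. The similarity measure uses Python's `difflib` as a crude stand-in; a real pipeline would rely on an alignment-based tool such as BLAST, PISCES, or CD-HIT.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional

# The 20 canonical amino acids in one-letter code.
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass
class Chain:
    """Hypothetical record for one protein chain (illustration only)."""
    pdb_id: str
    method: str                  # e.g. "X-RAY DIFFRACTION", "SOLUTION NMR"
    resolution: Optional[float]  # in angstroms; None if not reported
    sequence: str                # one-letter amino acid codes


def passes_filters(chain: Chain,
                   max_resolution: float = 3.0,  # assumed cutoff
                   min_length: int = 40) -> bool:  # assumed cutoff
    """Apply the per-chain filters: method, resolution, length, canonical AAs."""
    if chain.method != "X-RAY DIFFRACTION":
        return False
    if chain.resolution is None or chain.resolution > max_resolution:
        return False
    if len(chain.sequence) < min_length:
        return False
    return set(chain.sequence) <= CANONICAL


def crude_similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; NOT an alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_redundancy(chains: List[Chain], cutoff: float = 0.3) -> List[Chain]:
    """Greedily keep a chain only if it is below the cutoff to all kept chains."""
    kept: List[Chain] = []
    for chain in chains:
        if all(crude_similarity(chain.sequence, k.sequence) < cutoff
               for k in kept):
            kept.append(chain)
    return kept


def curate(chains: List[Chain]) -> List[Chain]:
    """Per-chain filters first, then redundancy reduction on the survivors."""
    return remove_redundancy([c for c in chains if passes_filters(c)])
```

Applying the inexpensive per-chain filters before the pairwise redundancy step keeps the number of sequence comparisons down. Filters that depend on external resources, such as checks against PQS or PISA for non-biological complexes, are omitted from this sketch.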

ISSN: 1748-7188