Constructing phylogenetic networks via cherry picking and machine learning

Table 3 Trained random forest models on different datasets for different combinations of \(\max L\) (maximum number of leaves per network) and M (number of networks)

\(\max L\)	M	Accuracy	Num. data	Training (min)	Data gen. (hour/core)
(a) Normal
20	5	1.0	840	00:00	00:00:12
	10	0.994	1804	00:00	00:00:22
	100	0.998	17,388	00:03	00:04:19
	500	0.994	73,168	00:16	00:15:18
	1000	0.993	151,308	00:42	00:29:49
50	5	0.994	3580	00:00	00:01:21
	10	0.997	7860	00:01	00:02:22
	100	0.996	53,988	00:11	00:18:07
	500	0.997	268,552	01:04	01:31:18
	1000	0.998	535,624	04:01	02:56:21
100	5	1.0	4944	00:00	00:01:13
	10	0.999	12,444	00:01	00:04:05
	100	0.999	128,824	00:25	00:41:54
	500	0.999	676,768	04:21	04:15:49
	1000	0.999	1,362,220	12:10	08:08:58

\(\max L\)	M	Accuracy	Num. data	Training (min)	Data gen. (hour/core)
(b) LGT
20	5	0.974	768	00:01	00:00:19
	10	0.994	1548	00:02	00:00:41
	100	0.976	12,244	00:09	00:04:20
	500	0.975	58,900	00:24	00:19:13
	1000	0.975	118,104	00:27	00:35:38
50	5	0.997	2952	00:01	00:00:43
	10	0.995	3796	00:03	00:01:01
	100	0.995	44,116	00:23	00:14:01
	500	0.994	219,472	01:39	01:06:45
	1000	0.994	421,204	02:45	02:10:45
100	5	0.996	5080	00:06	00:01:23
	10	0.996	7540	00:05	00:01:58
	100	0.998	114,900	00:31	00:34:25
	500	0.998	605,652	04:44	02:54:15
	1000	0.998	1,175,628	10:23	05:31:13

Each row in the table represents one model. For each model, the testing accuracy is given under “Accuracy”, and the total number of data points retrieved from all M networks is given under “Num. data”. Each dataset is split into a training set and a testing set (90% and 10%, respectively). The training time of the random forest is given in the column “Training”, and the time needed to generate the training data is given in the column “Data gen.”, in hours per core (we used 16 cores in total).
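
To make the 90/10 split and the “Accuracy” column concrete, here is a minimal sketch in plain Python. It is not the authors' pipeline. The helper names `split_90_10` and `accuracy` are hypothetical, and the shuffle with a fixed seed is an assumption, since the table note does not say how the data points were assigned to the two sets. The example uses the smallest dataset size in panel (a), 840 data points.

```python
import random


def split_90_10(data, seed=0):
    """Shuffle a copy of `data` and split it into 90% training / 10% testing.

    Assumption: a simple random shuffle; the source does not state how
    the split was drawn.
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 10  # 10% of the data, rounded down
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of predicted labels that equal the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Placeholder data points standing in for the 840 points of the first row.
points = list(range(840))
train, test = split_90_10(points)
print(len(train), len(test))  # prints "756 84"
```

Under these assumptions, the value reported under “Accuracy” corresponds to calling `accuracy` on the model's predictions for the 10% held-out points only.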

ISSN: 1748-7188