Table 1 Theoretical computational requirements

From: Efficient algorithms for training the parameters of hidden Markov models using stochastic expectation maximization (EM) training and Viterbi training

**Training one parameter at a time**

| type of training | algorithm | time | memory | reference |
|---|---|---|---|---|
| Viterbi | Viterbi | O(T_max L M) | O(M L) | [17] |
| | Lam-Meyer | O(T_max L M) | O(M) | this paper |
| Baum-Welch | Baum-Welch | O(T_max L M) | O(M L) | [13] |
| | checkpointing | O(T_max L M log(L)) | O(M log(L)) | [34] |
| | linear-memory | O(T_max L M) | O(M) | [29] |
| stochastic EM | forward & back-tracing | O(T_max L (M + K)) | O(M L) | [32] |
| | Lam-Meyer | O(T_max L M K) | O(M K + T_max) | this paper |

**Training P of Q parameters at the same time, with P ∈ {1, ..., Q} and Q/P ∈ ℕ**

| type of training | algorithm | time | memory | reference |
|---|---|---|---|---|
| Viterbi | Viterbi | O(T_max L M Q/P) | O(M L) | [17] |
| | Lam-Meyer | O(T_max L M Q/P) | O(M P) | this paper |
| Baum-Welch | Baum-Welch | O(T_max L M Q/P) | O(M L + P) | [13] |
| | checkpointing | O(T_max L M log(L) Q/P) | O(M log(L)) | [34] |
| | linear-memory | O(T_max L M Q/P) | O(M) | [29] |
| stochastic EM | forward & back-tracing | O(T_max L (M + K) Q/P) | O(M L) | [32] |
| | Lam-Meyer | O(T_max L M K Q/P) | O(M K P + T_max) | this paper |

  1. Overview of the theoretical time and memory requirements for Viterbi training, Baum-Welch training and stochastic EM training for an HMM with M states, a connectivity of T_max and Q free parameters. K denotes the number of state paths sampled in each iteration for every training sequence in stochastic EM training. The time and memory requirements above are the requirements per iteration for a single training sequence of length L. It is up to the user to decide whether to train the Q free parameters of the model sequentially, i.e. one at a time, or in parallel in groups; the two tables above cover all possibilities.
  2. In the general case we are dealing with a training set X = {X_1, X_2, ..., X_N} of N sequences, where training sequence X_i has length L_i. If training involves the entire training set, i.e. all training sequences simultaneously, L in the formulae above needs to be replaced by ∑_{i=1}^{N} L_i for the memory requirements and by max_i{L_i} for the time requirements. If, on the other hand, training considers one training sequence at a time, L in the formulae above needs to be replaced by ∑_{i=1}^{N} L_i for the time requirements and by max_i{L_i} for the memory requirements.
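To make the substitution rule in footnote 2 concrete, here is a minimal Python sketch (not from the paper; all function and variable names are illustrative) that evaluates the Table 1 expressions for Baum-Welch training [13] under both training regimes. Constant factors hidden by the O-notation are ignored.

```python
# Illustrative sketch of the Table 1 bounds for Baum-Welch training [13].
# Names and example values are hypothetical; O(.) constants are dropped.

def baum_welch_time(T_max: int, L: int, M: int, Q: int, P: int) -> int:
    """O(T_max * L * M * Q/P) time per iteration (Table 1, [13])."""
    return T_max * L * M * Q // P

def baum_welch_memory(L: int, M: int, P: int) -> int:
    """O(M * L + P) memory per iteration (Table 1, [13])."""
    return M * L + P

lengths = [400, 250, 600]      # L_i for a training set of N = 3 sequences
M, T_max, Q, P = 20, 5, 12, 3  # Q/P = 4 groups of parameters

# All sequences simultaneously (footnote 2):
#   time uses L -> max_i{L_i}, memory uses L -> sum_i L_i.
t_par = baum_welch_time(T_max, max(lengths), M, Q, P)
m_par = baum_welch_memory(sum(lengths), M, P)

# One sequence at a time (footnote 2):
#   time uses L -> sum_i L_i, memory uses L -> max_i{L_i}.
t_seq = baum_welch_time(T_max, sum(lengths), M, Q, P)
m_seq = baum_welch_memory(max(lengths), M, P)

print(f"simultaneous: time ~ {t_par}, memory ~ {m_par}")
print(f"sequential:   time ~ {t_seq}, memory ~ {m_seq}")
```

Note that the two regimes trade time against memory: processing the whole training set at once stores a dynamic-programming table per sequence, while sequential processing reuses one table of size governed by the longest sequence.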