Working with big, dense transition matrices poses two connected problems: on the one hand, the storage size of the matrix may considerably slow down calculations or be altogether too big for the computer, on the other hand, the relevant information about the model may be difficult to extract from the great amount of data contained in the matrix. Visualization techniques for the interpretation of matrix data can, however, also help to find matrix properties which allow reducing the storage size, such as partial symmetry or the occurrence of many nearzero transition probabilities. We therefore start by describing the visualization techniques in the first part, and then move on to storage size reduction by sparse approximation in the second part of the results.
Visualization
An intuitive first step in analyzing the transient behavior of a Markov chain model is a diagnostic visualization of the transition matrix. By summarizing results in an accessible way, the resulting diagram may ideally also provide a basis for direct biological interpretation. With one exception (landscape plot), all the following visualization methods are available using the functions histogrid, histo3d and networkplot (with its support function percolation) in the mamoth module; an example for the runtime of each method is given in Additional file 1.
Heat map
A heat map or histogram of the transition matrix, where the transition probabilities p are symbolized by color/ shade or height, is perhaps the easiest way to visualize it (Fig. 1). The resolution may be enhanced by an appropriate transformation of the range of values for p, for example by using a negative logarithm (\([0;1] \rightarrow [0; \infty ]\)) or a logit transformation (\([0;1] \rightarrow [\infty ; \infty ]\)).
For big matrices, heat maps can be costly to produce (memory size) and are often still not very clear, due to the large number of cases. However, they may help to recognize basic patterns (symmetries, groups of similar/more strongly connected states etc.) of potential value for finding more adapted visualizations/numerical methods.

In our example, the heat map shows that many of the transition probabilities in the matrix are, though not equal, very close to zero. After reordering the states, the partial symmetry of the matrix also becomes visible.
Network display
The duality between matrices and graphs (e.g. [7, 25]) provides an alternative for the visualization and mathematical analysis of either structure. In a graph \(\mathcal {G(V, E)}\), the states of a Markov chain are thus represented as nodes/vertices \(\mathcal {V}\) and the transitions as (weighted and directed) edges \(\mathcal {E}\) connecting them, which is especially useful for sparse transition matrices.
For big, dense matrices, the number of edges in the resulting complete multidigraph (of edge multiplicity two) equals the number of entries in the transition matrix and is thus too big for easy interpretation. Concepts from network theory can be used to selectively display edges and summarize information about each state of the model system on the nodes. This leads to a variety of very clear synthetic representations constructed with different parameters and taking into account different time scales: from one generation (based on M) across t generations (based on \(M^{t}\)) up to the longtime equilibrium (dominant eigenvector of M, v).
To facilitate biological interpretation, we arranged the nodes of the network according to biological “meta data”. For our model example where states represent distributions of individuals on three genotypes (aa, aA, AA) under a constant population size (compositional data), we placed the nodes in a de Finetti diagram ([17], see Fig. 2), a specialized ternary plot for population genetics.
The following visualization techniques are based on selectively displaying the network’s edges:
Most probable neighbor This is the analog to a nearest neighbor if distances (edge weights) represent probabilities. For each state i, there are one or several states j which have the highest probability to be the destination of a transition in the next time step; tracing these connections gives the expectation for the onestep transient behavior of the model.

In our example, the most likely state for the next generation (Fig. 2) is always on or very near to the HardyWeinberg Equilibrium, which is represented by the continuous black curve going through (1/4; 1/2; 1/4) in the diagram in Fig. 2a.
Most probable path This is the counterpart of a shortest path if distances (edge weights) represent probabilities. For each noncommutative pair of states i and j, there exists at least one series of consecutive edges connecting i to j along which the product of the edge weights is maximal. It can be determined by using an “ordinary” shortest path algorithm (e.g. [26, 27]) on a negative log transform of the transition matrix. The most probable path is the most likely trajectory of the model system to get from one state to another.
Flow threshold Using the smallest probability along the most likely path between two nodes i and j as a threshold, very rare transitions can be excluded.

In our example (Additional file 3), horizontal transitions along the base of the triangle, where no heterozygotes are produced despite of two homozygous genotypes being present in the population, would be excluded.
The following visualization techniques are based on changing the appearance of the network’s nodes:
Degree For each node in a graph representing a dense matrix, the number of incoming (indegree) and outgoing (outdegree) edges is normally (approximately) equal to the number of nodes (matrix rows/columns). This method should therefore be used in connection with selective edge plotting and interpreted according to context.

In our example (Fig. 2), the nodes with the highest indegree are nearest neighbors to the largest number of nodes; if all states were equally likely at the current generation, those next to (0.25; 0.5; 0.25) on the HardyWeinberg curve would be the most likely in the next generation.
Betweennesscentrality Based on the same concept as the most probable path, this can be redefined as the number of most probable paths passing through each node when connections between each pair of nodes are considered. It can be derived in a similar way as the most probable path, by applying a standard algorithm developed for additive distances to a negative log transform of the multiplicative probabilities in M. Nodes with a high betweennesscentrality represent frequent transient states.
Probabilities For each state i in the Markov chain model, several probabilities can be calculated—and displayed on the nodes—to describe both the transient and limiting behavior:

\(p_{\text {stay}}\)
:

probability to stay for one time step
\(p_{\text {stay}}(i) = p_{i,i}\), the probabilities on the matrix diagonal; for each state i this is the probability that the system remains at state i for the next time step (“stickiness”). This probability allows the easy detection of (near)absorptive states.
In population genetics, the fixation states \(\lbrace (N;0;0),\)
\((0;0;N)\rbrace\) are typical examples (Fig. 2).

\(p_{\text {out}}\)
:

probability to leave in one time step
\(p_{\text {out}}(i) = 1p_{i,i},\) the column sums of the matrix without the diagonal; for each state i this is the probability that the system changes state at the next time step (“conductivity”). Being the opposite of \(p_{stay}\), this probability allows the detection of states which are rarely occupied for consecutive time steps. In our example, these are the states where the population consists of an approximately even mixture of both homozygotes (central basis of the triangle) or only of heterozygotes (top of the triangle; Additional file 3). In contrast, the row sums of a leftstochastic matrix may exceed one and are thus not probabilities. As a result of the Markov property, a probability to arrive always depends on the state at the previous time step, which results in a number of possible definitions.

p(ij) :

probability to arrive from state j in one time step
\(p(ij) = p_{j, i}, j \in S\), all probabilities in one column of the transition matrix; the probability distribution (mean, variance, skew according to arrangement of nodes) for transitions starting from one particular state. This allows the prediction of the most likely states for the next time step. In our example, the variance around the fixation states is much more limited than at the interior states of the triangle (Additional file 4).

\(p_{\text {in}}\)
:

probability to arrive in one time step
\(p_{\text {in}}(i) = 1/(S1) \cdot \sum _j p_{j, i}\) for \(i \ne j\), the row sums of the matrix divided by the number of other states; probability to arrive at state i if all previous states are equally likely. This shows states which are generally very likely destinations for onestep transitions. In our example, these are the states around the HardyWeinberg curve (additional file 3).

\(p_{\text {in}}^{\infty }\)
:

probability to arrive in an infinite run
\(p_{\text {in}}^{\infty }(i) = \sum _j p_{j, i} \cdot v_{j}\) for \(i \ne j\), the sum over the elementwise product of eigenvector and matrix row, without the diagonal; probabilities to arrive at state i if the likelihood of the previous states is distributed according to the limiting distribution. This shows the states which are the most frequent destination of transitions in an infinite run of the model. In our example, these are the two states next to the fixation states where there is exactly one “foreign” allele (Additional file 4).

\(p^{\infty }\)
:

limiting distribution/eigenvectorcentrality
\(p^{\infty }(i) = v_{i}\), the eigenvector; probability to find the system at state i after infinitely many time steps, or proportion of time spent in each state averaged over infinitely many time steps (limiting distribution). This is the prediction for the most likely states independently of the start state. As is well known for our example, these are the fixation states (Additional file 3).
Expected time to first passage To calculate the expected time to arrive at a certain (group of) states from any other, the “target” states are considered absorptive (first passage time, [7]). Based on the submatrix \(M'\) including only the transition probabilities between nontarget states, the times \(t_{\text {target}}\) are
$$\begin{aligned} t_{\text {target}} = \mathbf {1}(IM')^{1} \end{aligned}$$
where \(\mathbf {1}\) is a row vector of ones matching the dimension of \(M'\) and I is the corresponding unit matrix. The first passage times of the target states are zero.
Landscape plot
Combining length and direction of the transitions in the most probable neighbor plot (Fig. 2) gives a three dimensional “landscape” illustrating the most probable dynamics of the Markov chain, similar to the “gravity well” plots known from physics. The expected changes in the genotype frequencies are thus represented in a more intuitive fashion, by imagining the population as a small ball rolling on a “landscape” from “hills” to “valleys”. Elevations h are derived from the equality of potential and kinetic energy, which resolves to
$$\begin{aligned} h=d^2 \cdot 0.05 \end{aligned}$$
for a single time step, approximating gravitational acceleration by 10. For each model state/node, the distances d are given by the changes in genotype frequencies when moving to the most probable neighbor
$$\begin{aligned} d = \sqrt{(\Delta aa)^{2}+(\Delta aA)^{2}+(\Delta AA)^{2}}. \end{aligned}$$
The “landscape” is subsequently drawn as a triangular grid, using the elevation at each state/node as support. To improve readability, h can be rescaled by a constant factor and the landscape colored according to the relative elevation (taking the center of each triangle as reference). The resulting “landscape” shows only the (deterministic) expected dynamics of the Markov chain one could imagine the accompanying stochastic effects as an “earthquake”.
Note: because of its dependence on a function or matrix specifying the distances between states, and on the triangular gridlike structure of the state space, this method is not included in the mamoth source code.
Approximation
One major drawback of staterich Markov chain models is that the transition matrix in its full form takes up a lot of memory (Table 1). Beside switching to one of the alternative model types mentioned in the introduction (diffusion approximation, coalescence process), there are multiple computational approaches to addressing this issue while keeping the original state and time discrete framework, including:

External memory: the whole matrix is stored on a (sufficiently large) hard drive, only parts are loaded into active storage when needed (analogous to [28])

Iterative/selective matrix creation: the whole matrix is never stored, only parts are created when needed (e.g. in combination with algorithms such as [29])

Lumping states based on model properties: if a group of states has the same (sum of) transition probabilities leading into it and out of it to any other (group of) states and the same analytical meaning (e.g. same value of \(F_{IS}\)) they can be combined into one ([30, 31]); other algorithms of state aggregation, such as [32], lead to an approximation of the original matrix

Sparse approximation: turning a dense matrix into a sparse matrix by approximating very small matrix elements to zero (e.g. as in [33, 34])
Which of the first two options is more appropriate depends both on the available hardware and the nature of the task: if the whole matrix is needed repeatedly, storing it will save the time to recalculate despite increased memory access times, but if calculating the matrix elements is fast, the matrix is needed only once or only some parts of the matrix (e.g. the most probable neighbor of each state) are needed, storing the matrix as a whole would be an unnecessary effort.
Because of the symmetry between the two allele frequencies in our model example, almost half of all states could be pairwise lumped, thus reducing matrix size to a little over a quarter of the original. The exception are the states on the symmetry axis of the de Finetti diagram (compare Figs. 1, 2), which do not have a “lumping partner”. Symmetry with respect to the allele frequencies is often found in population genetics models [3]. However, because of this dependency on model structure a size reduction algorithm based on lumping would not be applicable to nonsymmetric extensions of the original model, e.g. with an asymmetric mutation rate or directional selection. Allele frequencies would have to be analyzed jointly, as the new states retain only the ratio of both; once lumped, “unpacking” the states becomes difficult.
The high number of very small values in the Markov chain transition matrix (Fig. 1) of our model example suggests that sparse approximation would be very effective. Moreover, as each column of the matrix corresponds to a probability distribution (constant sum of one) which becomes less uniform as the number of states/population size increases (the expected convergence to a multinormal distribution with variance proportional to 1/N is the underlying principle of the wellknown diffusion approximation), the proportion of very small transition probabilities is likely to augment as the matrix size increases. While sparse approximation is independent of modelspecific properties such as symmetry and does not change the states as such, it has the disadvantage of changing the actual content of the transition matrix, potentially leading to the loss of relevant properties such as leftstochasticity or irreducibility.
The sparse approximation algorithm we propose ensures that the resulting sparse matrix still has all the properties relevant to its function in the Markov chain model. Additionally, it can be executed iteratively so that the complete dense matrix need not be stored. The algorithm iterates over all columns of the transition matrix M and excludes (almost) all values which, in total, contribute less than a threshold value \(s \in [0,1]\) to the column sum:
for all columns \(C^i = M_{1\ldots S,i}\) with \(i \in [1, S]\):

1.
Create a permutation R of the row indices so that the corresponding entries are ranked according to size: \(R \leftarrow \text {ordinalrank}(j\, \, 1 \ge C^i_{j} \ge 0)\)

2.
Find the minimal rank (index of R) so the corresponding entries sum at least to the threshold value s: \(r \leftarrow \text {min}(k)\) for \(\sum _{R_1}^{R_k} C^i_{R_k} \ge s\)

3.
Keep at least the two biggest values per column: \(r \leftarrow \text {max}(2,r)\)

4.
Keep all values of equal rank: while \(C^i_{R_{r+1}} = C^i_{R_{r}}\) : \(r \leftarrow r+1\)

5.
Round all values with ranks greater then r to zero, but keep those on the main diagonal and the first lower and first upper diagonals: \(C^i_{R_k} \leftarrow 0\) for all k with \(k > r \wedge R_k \notin \{(i1, i, i+1) \, \text {mod} \, S\}\)

6.
Rescale the column to sum to 1: \(C^i \leftarrow C^i/\text {sum}(C^i)\).
The first two steps, together with the rounding in step five, form the core of the algorithm (compare Fig. 3), steps three and four prevent distortions and steps five and six ensure the continued validity of essential Markov chain transition matrix properties: Irreducibility is assured by keeping at least one outgoing and one incoming transition probability per state in such a way that all states remain connected (step five, first lower and first upper diagonal), aperiodicity by keeping all probabilities to stay at the same state (step five, main diagonal), and the rescaling of each column ensures leftstochasticity of the matrix (step six). In contrast, the property that onestep transitions are possible between all states is deliberately given up. The sparse approximation algorithm is available as the appromatrix function in the mamoth module.
Both the efficiency, i.e. the density or memory use of the resulting matrix, and the bias vary according to the value of s and the distribution of values in the original matrix. If s is low or the probability distribution in the column is far from uniform, more values will be discarded (compare Fig. 3). An appropriate value for s has to be determined heuristically by testing successively increasing values, up to the point where the bias due to the approximation no longer interferes with the interpretability of the model results. The sum of the differences between the entries of the approximate and original matrices has a theoretical upper limit of \((1s) \cdot S\), but the effect of this perturbation on the model output may be more complex.
In our model example, we analysed the effect of sparse approximation on the equilibrium \(F_{IS}\) distribution derived from the dominant eigenvector of the transition matrix. The dominant eigenvector of either a sparse or dense matrix can be calculated with the eigenone function in mamoth, while a comparison between two vectors by a GTest (correctly omitting infinity values from the test statistic) is implemented in the testvector function. A direct comparison between the “original” and “sparse approximate” equilibrium \(F_{IS}\) distributions (Fig. 4) shows a very close fit which does not obscure the biologically relevant changes due to different rates of asexual reproduction. To test if the method gives similarly good results over a wider range of parameters (population size, mutation rate, rate of asexuality and approximation threshold), we performed a Global Sensitivity Analysis (GSA) [37, 38] using different divergence statistics to compare the limiting distribution of \(F_{IS}\) derived from original and sparse approximate matrix [35, 39] and the density of the sparse matrix.
The results of the GSA show that all four model parameters may generally have nonlinear/interacting effects on the quality of the approximation, but in the mean these effects are not very strong (Fig. 5). Memory reduction is highly efficient as the mean density of the sparse matrices was only \(\approx 0.11\). Individual densities ranged from \(\approx 0.42\) (small matrix, high threshold) to \(\approx 0.03\) (big matrix, low threshold), varying most strongly with the population size, though all four parameters have a significant influence. On our reference system (Intel Core i73930K 3.2 GHz processor with 64 GB RAM), calculating the sparse approximation based on the original matrix took on average 1.7 s for \(N=50\) (14.6 s to construct the original), and 31.3 s for \(N=100\) (221.7 s to construct the original). Finding the dominant eigenvector of sparse approximate and original matrix took on average 0.1 s (sparse) versus 51.7 s (original) for \(N=50\) and 2.4 s (sparse) versus 7869.1 s (2 h, 11 min, 9.1 s, original) for \(N=100\), so that in both cases less than one percent of the original runtime was needed with the sparse approximate matrix.
The overall similarity of the original and approximate equilibrium \(F_{IS}\) distributions, measured with different divergence statistics (total distance, KullbackLeibler divergence, power divergence statistics [40]; Fig. 5), is very high: e.g. the mean for the total distance \(\sum \text {abs} (f_{orig}f_{approx})\) is \(\approx 0.06\). It is largely independent of the rate of asexual reproduction and depends most strongly on the approximation threshold and the mutation rate. In contrast, the maximal difference (KolmogorovSmirnov twosample test statistic) between classes of the original and approximate equilibrium \(F_{IS}\) distribution is hardly affected by the mutation rate, but rather by approximation threshold (high mean effect) and rate of asexual reproduction (strong nonlinearity/interaction). Though on average not significant, the KolmogorovSmirnov test gave pvalues below 0.05 in \(20~\%\) of the parameter sets sampled. Consequently, the same approximation threshold can be used to compare the overall shape of the distributions across the whole range of rates of asexual reproduction, but it may have to be adapted if mutation rate and population size differ strongly between the modeled scenarios. Care must be taken when individual classes within the distribution (e.g. longterm fixation probability) shall be compared as the probabilities derived from a sparse approximate matrix may then be significantly different from the original.
In conclusion, sparse approximation using our algorithm has the advantage of being easily applicable to all transition matrices independently of the properties of the underlying model, and is well suited to provide an overview of the equilibrium \(F_{IS}\) distribution under different rates of asexual reproduction in our model example. However, it needs an initial effort to verify the model results derived from the approximate matrix and to estimate their final bias. For finescale analyses, lumping states may provide an approximationfree alternative, but is not always possible as it depends on the model structure.