First, the techniques of Chaos Game Representation (CGR) [1, 3] and its bidirectional generalization, the Universal Sequence Map (USM) [14], will be revisited and illustrated for a small nucleotide sequence. The original report is referenced for the detailed rationale behind the critical advantage of the bidirectional implementation over the preceding unidirectional solution: all units of a pattern common to two sequences are equidistant in the map, regardless of their individual positions within each sequence. In addition, the USM procedure, more exactly its initialization, will be slightly adjusted to represent motifs in a fashion that is independent of the length of neighboring sequences. Second, the proposed discrete density kernel will be described and illustrated with the same collections of promoter regions of *Bacillus subtilis* used in the motivating entropy study [15].

### Illustrating iterative map positioning

Universal Sequence Mapping is an iterative procedure that populates a unitary hypercube bijectively: each sequence corresponds to a position in the map, and each map position corresponds to a unique sequence. For nucleotide sequences the hypercube has log_{2}(4) = 2 dimensions, that is, it is a unit square. In that case, the original USM procedure in each direction is exactly equal to CGR. The same exercise for a sequence of amino acids would produce a hypercube with 5 dimensions [14], which is the upper integer of log_{2}(20). The vertices (corners) of the hypercube correspond to the units of the alphabet that compose the sequence, and each position is found by moving half the distance between the previous position and the vertex corresponding to the unit at the position in the sequence being considered. This procedure, which was formally detailed in a previous report [3], is illustrated in Figure 1 for the sequence ACTGCC. The full USM procedure implements two such mappings, one in the forward and the other in the reverse direction [14].
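The half-distance rule can be sketched in a few lines of Python for the conventional forward (CGR) iteration with the mid-square seed. The corner assignment used below is an assumed convention for illustration only; Figure 1 may lay out the four nucleotides differently.

```python
# Forward (CGR) iteration: move half the distance from the current
# position toward the corner of each successive sequence unit.
# The corner assignment is an assumed convention for illustration.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_forward(seq, seed=(0.5, 0.5)):
    """Return the list of map positions visited while reading seq."""
    x, y = seed
    path = []
    for unit in seq:
        cx, cy = CORNERS[unit]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0   # half-distance step
        path.append((x, y))
    return path

for unit, pos in zip("ACTGCC", cgr_forward("ACTGCC")):
    print(unit, pos)
```

Each successive position encodes the entire suffix read so far at ever finer dyadic resolution, which is the property the density kernel below exploits.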

### Seeding the iterative USM function

The iterative USM procedure described graphically in the previous section and in Figure 1 is formally defined by Equation 1 for an arbitrary sequence of *N* units built from an alphabet with *M* possible symbols.

\[
\left\{
\begin{array}{l}
u_{j}^{f(0)} = u_{j}^{b(2)} \\
u_{j}^{f(i)} = u_{j}^{f(i-1)} + \frac{1}{2}\left(U_{j}^{(i)} - u_{j}^{f(i-1)}\right) = \frac{1}{2}u_{j}^{f(i-1)} + \frac{1}{2}U_{j}^{(i)} \\
u_{j}^{b(N+1)} = u_{j}^{f(N-1)} \\
u_{j}^{b(i)} = \frac{1}{2}u_{j}^{b(i+1)} + \frac{1}{2}U_{j}^{(i)} \\
U_{j}^{(i)} \in \{0, 1\} \\
i = 1, 2, \ldots, N \\
j = 1, 2, \ldots, D
\end{array}
\right.
\tag{1}
\]

Each of the unique *M* units of the alphabet is represented by a unique binary vector which, graphically, positions it at a unique vertex of a unitary hypercube with *D* = ⌈log₂(*M*)⌉ dimensions [14]. The reason why the CGR/USM procedure is revisited here is to highlight the novel seeding procedure: $u_{j}^{b(2)}$ seeds the forward coordinate iteration and $u_{j}^{f(N-1)}$ seeds the backward coordinate iteration.

#### Why not seeding at 1/2

In the original CGR proposition [1] the mid coordinate, 1/2, is invariably used as the initial position. Because this position cannot be mapped back to a real sequence, this at first appeared to be a reasonable proposition, even if not fundamentally superior to any of the boundary positions such as 0 and 1. However, seeding all iterations equally causes an artifactual conservation of the beginning of the sequence, which will bias sequence entropy calculations based on map coordinates [15], particularly for small sequences: the first iteration can only produce two coordinates, 1/4 or 3/4; the second iteration will produce one of 4 possibilities: 1/8, 3/8, 5/8 or 7/8; and so on. This causes some degree of artifactual high density at those positions.

#### Other approaches to seeding iterative maps

A possible solution to seed within the domain of possible sequences would be to start with a position randomly drawn from a uniform distribution, as indeed used in the original USM paper [14]. However, that too will cause a bias, this time towards missing conservation of the initial units of a sequence when such conservation exists. A negligible few false negatives may be an acceptable outcome for pattern recognition and would have no effect elsewhere in the sequence. However, it falls short of what is required for a kernel generating a truly scale independent density distribution of patterns.

#### The solution proposed here

The solution proposed by Equation 1 is to seed the iterative mapping with the reverse coordinates: to seed the first forward coordinate with the next-to-last backward coordinate for the same dimension, and vice versa. Note that the first forward coordinate, $u_{j}^{f(1)}$, and the last backward coordinate, $u_{j}^{b(1)}$, to be iterated both correspond to the first unit of the sequence, i.e. *i* = 1. Similarly, the last forward coordinate and the first backward coordinate are assigned to the last unit of the sequence, *i* = *N*. Therefore, the new seeding solution can be interpreted as considering that each sequence is preceded and succeeded by its mirror images for the purpose of studying local properties. If the sequence is long enough that the numerical resolution of $u_{j}^{f(N)}$ is insensitive to the seed value, then the seed value can be determined in practice by simply iterating the last few tens of units of the reverse sequence starting with an arbitrary value. For very short sequences, however, Equation 1 has to go through more than one circular iteration, starting from an arbitrary seed value, until the coordinate values converge. This solution causes each unique sequence to have a unique, scale independent distribution of patterns whose statistical characteristics can be studied with no need to rebuild the original sequence. This also implies that the coordinates of iterative maps of sequences, as defined by Equation 1, are, fundamentally, steady state solutions. A simple, dramatic example where this is of consequence is the positioning of the sequence "A", or "AA", in Figure 1. In the conventional CGR procedure they would be positioned at coordinates (1/4, 1/4) and (1/8, 1/8), which would place them next to very different, much more heterogeneous sequences. On the contrary, the seeding solution described by Equation 1 correctly produces the coordinate (0, 0).
Similarly, a sequence with a regular alternation of two units, say "ABABABABAB", should produce well defined density peaks at only two positions, 1/3 and 2/3, which is in fact the steady state solution produced by Equation 1. On the contrary, both CGR and the randomly seeded USM would produce two trails of values converging to those solutions but never quite reaching them. The fully self-referenced nature of the modified USM construction is also reflected in the observation that the steady state solutions invariably produce $u_{j}^{f(1)} = u_{j}^{b(1)}$ and $u_{j}^{f(N)} = u_{j}^{b(N)}$. However, exploring the bidirectional density distributions is beyond the scope of this report.
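The circular seeding of Equation 1 can be sketched numerically for the one-dimensional binary case by alternating forward and backward passes, each pass reseeding the other, until the coordinates converge. The symbol assignment (A → 0, B → 1) and the helper name `usm_seeded` are illustrative choices; the sketch reproduces the steady states discussed above ("AA" at 0, "ABABABABAB" alternating between 1/3 and 2/3).

```python
# Sketch of Equation 1 for a one-dimensional binary alphabet (A -> 0,
# B -> 1).  Forward and backward passes alternate, each seeding the
# other (u^f(0) = u^b(2), u^b(N+1) = u^f(N-1)), until convergence.
# Assumes len(seq) >= 2.
def usm_seeded(seq, tol=1e-12, max_cycles=500):
    u = [0.0 if c == "A" else 1.0 for c in seq]
    n = len(u)
    seed_f, prev_uf = 0.5, None        # starting seed value is arbitrary
    for _ in range(max_cycles):
        uf, p = [], seed_f             # forward pass (2nd line of Eq. 1)
        for ui in u:
            p = 0.5 * p + 0.5 * ui
            uf.append(p)
        ub, p = [0.0] * n, uf[-2]      # backward pass seeded with u^f(N-1)
        for i in range(n - 1, -1, -1):
            p = 0.5 * p + 0.5 * u[i]
            ub[i] = p
        seed_f = ub[1]                 # reseed the forward pass with u^b(2)
        if prev_uf and max(abs(a - b) for a, b in zip(prev_uf, uf)) < tol:
            return uf, ub
        prev_uf = uf
    return prev_uf, ub

uf, _ = usm_seeded("AA")
print(uf)                  # both coordinates converge to 0
uf, _ = usm_seeded("ABABABABAB")
print(uf)                  # coordinates alternate between 1/3 and 2/3
```

Each full cycle contracts the seeding error geometrically, so even very short sequences converge in a handful of passes.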

### Construction of density kernel

The shape of the density kernel should match the fractal nature of the iterative USM function itself. The solution reported here will first be described for a single USM coordinate, and illustrated for an arbitrary coordinate of the map, say the horizontal dimension of the forward map in Figure 1. The value, *K*, of the proposed kernel function (Equation 2) at map coordinate position *u* has two user-defined parameters, memory length, *L*, and smoothing, *S*, which is the ratio between the areas assigned to two consecutive Markov orders (e.g. *S* = 2 implies that the kernel density area assigned to order *i* ≤ *L*-1 is twice the area assigned to order *i*-1).

\[
K(u) = \sum_{j=1}^{N}\sum_{i=1}^{L}
\begin{cases}
H(i,D,L,S) & \text{if } LB(i,x_{j}) < u < UB(i,x_{j}) \\
0 & \text{otherwise}
\end{cases}
\tag{2}
\]

The parameter *D* is the number of dimensions of the unitary USM hypercube (e.g. *D* = 2 for the example in Figure 1). The expression in Equation 2 simply states that the kernel density value at position *u* is obtained by adding the values of the scale dependent height function *H*, for each of the orders up to *L*-1, over the elements of the kernel training dataset, *x*, that are positioned within a scale dependent neighborhood confined by lower and upper boundaries, *LB* and *UB*, respectively. The choice of memory length, *L*, of the kernel sets the resolution of the density function. This is graphically reflected by the finer grain of the density distribution for higher values of *L* in Figure 2 and Figure 3.

As will be shown next, the kernel volume defined by this surface is equal to the number of points (sequence units), *N*, of the kernel-training dataset *x*. This result, strictly considered, disqualifies *K* as a kernel density function, as kernel density volumes are unitary by definition. There are, however, a number of reasons why having a volume equal to the number of sequence units is desirable, particularly when sequences of different lengths are being compared. A compliant alternative definition of *K* is in any case obtained by dividing the expression in Equation 2 by the total length of the training sequences, *N*. This alternative will not be explicitly explored here because the scale alteration is so straightforward that it can easily be applied to any of the results reported here. The 2D density plots are offered without a scale in the z-axis to highlight the inconsequence of the correction. On the other hand, when multiple sequences are plotted together, as in Figure 4, the effect is that the same motif in two sequences is represented with the same density height (Equation 2), even if the two sequences have very different lengths.

The kernel density definition in Equation 2 is completed by two more expressions, Equation 3 and Equation 4, where the height function and its boundaries are detailed. The kernel density height function, Equation 3, establishes the step height added at each memory length smaller than or equal to the value of *L*. It is useful to recall that the memory length is one unit larger than the Markovian order, e.g. for nucleotide sequences, memory length one corresponds to mononucleotide frequencies, memory length two corresponds to dinucleotide frequencies, which populate a first order Markov transition table, and so on.

\[
H(i,D,L,S) = \frac{\left(2^{D}\right)^{i}\, S^{\,i-L}}{\sum_{r=0}^{L-1} S^{-r}}
\tag{3}
\]
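A quick numerical sanity check of the height function, assuming the reconstructed form H(i, D, L, S) = (2^D)^i · S^(i−L) / Σ_{r=0}^{L−1} S^(−r): the area each training point contributes, summed over all orders, should equal exactly 1, which is what makes the total kernel volume equal to *N* as stated above.

```python
# Verify that the height function integrates to a unit area per
# training point: sum over orders i of H(i) * (2^-i)^D should be 1.
def H(i, D, L, S):
    return (2 ** D) ** i * S ** (i - L) / sum(S ** (-r) for r in range(L))

for D in (1, 2):
    for L in (1, 3, 5):
        for S in (0.5, 1.0, 2.0, 10.0):
            area = sum(H(i, D, L, S) * 2.0 ** (-i * D) for i in range(1, L + 1))
            assert abs(area - 1.0) < 1e-9
print("per-point kernel area is 1 for every tested (D, L, S)")
```

With this form the areas assigned to consecutive orders keep the ratio *S*, and letting *S* grow without bound concentrates all of the area at the finest order, *L*.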

The boundary values set by the functions *LB* and *UB*, Equation 4, define the neighborhood of a training sequence unit, that is, the neighborhood of its USM position, *x*, which will have the corresponding value of *H*, Equation 3, added to the kernel density height, as detailed in Equation 2.

\[
\begin{array}{l}
LB(i,x) = \dfrac{\lfloor x \cdot 2^{i} \rfloor}{2^{i}} \\[2mm]
UB(i,x) = \dfrac{\lfloor x \cdot 2^{i} \rfloor + 1}{2^{i}}
\end{array}
\tag{4}
\]
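A minimal one-dimensional sketch combining Equations 2–4 follows, assuming the height-function reconstruction H(i, D, L, S) = (2^D)^i · S^(i−L) / Σ_{r=0}^{L−1} S^(−r); the training coordinates in `x_train` are hypothetical values chosen only to exercise the code, not data from the study.

```python
import math

# One-dimensional sketch of the kernel density of Equations 2-4.
def H(i, D, L, S):
    # Assumed reconstruction of Equation 3 (normalized step heights).
    return (2 ** D) ** i * S ** (i - L) / sum(S ** (-r) for r in range(L))

def LB(i, x):                       # Equation 4, lower boundary
    return math.floor(x * 2 ** i) / 2 ** i

def UB(i, x):                       # Equation 4, upper boundary
    return (math.floor(x * 2 ** i) + 1) / 2 ** i

def K(u, x_train, L=3, S=2.0, D=1):
    """Equation 2: add H for every training point whose order-i dyadic
    cell contains the query coordinate u."""
    total = 0.0
    for x in x_train:
        for i in range(1, L + 1):
            if LB(i, x) < u < UB(i, x):
                total += H(i, D, L, S)
    return total

x_train = [0.1, 0.15, 0.6]          # hypothetical training coordinates
print(K(0.12, x_train), K(0.9, x_train))
```

Points sharing fine dyadic cells with the query accumulate the tall, narrow steps of high orders, while more distant points contribute only the broad low-order steps.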

Before illustrating the calculation of the kernel density for a multi-dimensional USM hypercube, it is useful to illustrate the procedure for the one-dimensional example of a binary sequence such as 'ABABAAA'. The corresponding USM forward coordinates would be [0.3138, 0.6569, 0.3284, 0.6642, 0.3321, 0.1661, 0.0830], and the corresponding kernel density, Equation 2, for all positions in the one-dimensional USM map is shown in Figure 2 for different values of memory length, *L*, and smoothing, *S*.
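The forward coordinates quoted above can be reproduced with a short script implementing the circular seeding of Equation 1 on one dimension (A → 0, B → 1); the function name and the convergence tolerance are illustrative choices.

```python
# Reproduce the forward coordinates of 'ABABAAA' with the circular
# seeding of Equation 1 (one dimension, A -> 0, B -> 1).
def usm_forward_seeded(seq, tol=1e-12, max_cycles=500):
    u = [0.0 if c == "A" else 1.0 for c in seq]
    n = len(u)
    seed_f, prev_uf = 0.5, None        # arbitrary starting seed
    for _ in range(max_cycles):
        uf, p = [], seed_f
        for ui in u:                   # forward pass
            p = 0.5 * p + 0.5 * ui
            uf.append(p)
        ub, p = [0.0] * n, uf[-2]      # backward pass seeded with u^f(N-1)
        for i in range(n - 1, -1, -1):
            p = 0.5 * p + 0.5 * u[i]
            ub[i] = p
        seed_f = ub[1]                 # reseed the forward pass with u^b(2)
        if prev_uf and max(abs(a - b) for a, b in zip(prev_uf, uf)) < tol:
            return uf
        prev_uf = uf
    return prev_uf

print([round(c, 4) for c in usm_forward_seeded("ABABAAA")])
# -> [0.3138, 0.6569, 0.3284, 0.6642, 0.3321, 0.1661, 0.083]
```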

Figure 2 illustrates how the choice of parameters sets both the resolution and the detail of the pattern representation. If smoothing is set to +∞ then the kernel density will be distributed between the different fractions exactly as it would be in a Markov transition matrix with the same memory length. This becomes clearer when a two-dimensional example is used, such as the more familiar representation of nucleotide sequences. To illustrate this procedure, Equation 2 was applied to the forward map of the small nucleotide sequence represented in Figure 1, which results in the density distribution represented in Figure 3.