Skip to main content

Metric multidimensional scaling for large single-cell datasets using neural networks


Metric multidimensional scaling is one of the classical methods for embedding data into low-dimensional Euclidean space. It creates the low-dimensional embedding by approximately preserving the pairwise distances between the input points. However, current state-of-the-art approaches only scale to a few thousand data points. For larger data sets such as those occurring in single-cell RNA sequencing experiments, the running time becomes prohibitively large and thus alternative methods such as PCA are widely used instead. Here, we propose a simple neural network-based approach for solving the metric multidimensional scaling problem that is orders of magnitude faster than previous state-of-the-art approaches, and hence scales to data sets with up to a few million cells. At the same time, it provides a non-linear mapping between high- and low-dimensional space that can place previously unseen cells in the same embedding.


Single-cell RNA sequencing (scRNA-seq) experiments provide quantitative measurements for thousands of genes across tens to hundreds of thousands or even millions of cells. The high-dimensionality as well as the sheer size of scRNA-seq data sets pose particular challenges for downstream analysis methods such as clustering and trajectory inference methods. An essential step in single-cell data processing is the reduction of data dimensionality to remove noise in gene expression measurements [1]. One of the most popular methods for dimensionality reduction of single-cell data is principle component analysis (PCA). PCA aims to maximize the variance in the reduced space and can be computed efficiently by a singular value decomposition. The existence of efficient implementations [2] has contributed to its routine application to large single-cell data sets. PCA is used, for example, in state-of-the-art clustering methods Seurat [3], SC3 [4], and CIDR [5], and cell lineage inference methods such as TSCAN [6] and Waterfall [7].

Metric Multidimensional Scaling (metric MDS), on the other hand, aims to find an embedding that preserves pairwise distances between points, which can improve the accuracy of various types of downstream analyses of single-cell data compared to, e.g., PCA. In a comprehensive comparison of 18 dimensionality reduction methods on 30 scRNA-seq datasets, MDS often outperformed other methods or achieved comparable results [1]. The evaluation in Sun et al. [1] included approaches developed specifically for scRNA-seq, such as deep-learning based methods scScope [8] and DCA [9]. MDS yielded accurate and highly stable clusterings for both low and high number of embedding dimensions, and performed well when inferring lineages using Monocle3 [10].

The high computational cost of metric MDS has hindered its wide application to single-cell data sets. In PHATE [11], for example, metric MDS is paired with a sampling-based approach to cope with its computational complexity. In the experiments in Sun et al. [1], however, most dimensionality reduction methods, including MDS, showed a performance loss when combined with a sub-sampling procedure.

Surprisingly, no algorithm is known that can solve metric MDS efficiently with more than a few thousand cells. Here, we provide the first such algorithm. Our contributions are two-fold: First, we provide a simple two-layer neural network approach that can solve the metric MDS problem for (single-cell) data sets with up to a few million data points (cells). This is orders of magnitude larger than current state-of-the-art methods can handle. Second, our approach for the first time learns a non-linear mapping of the high-dimensional points into the low-dimensional space, which can be used to place previously unseen cells in the same low-dimensional embedding.


MDS comes in three different versions: (1) classical MDS, (2) metric MDS, and (3) non-metric MDS, aka ordinal scaling. While we focus here on metric MDS, it is important to understand all three methods and their differences.

Suppose we are given n data points \(x_i\in \textbf{R}^m\) that we want to embed into \(\textbf{R}^k\) where \(k<m\). Let \(y_i\) be the corresponding point of \(x_i\) in the low-dimensional space \(\textbf{R}^k\).

Classical MDS Classical MDS tries to map these data points into \(\textbf{R}^k\) while trying to preserve the pairwise inner products \(\langle x_i, x_j\rangle\). Specifically, it solves the optimization problem

$$\min _{{y_{1} ,y_{2} , \ldots ,y_{n} \in {\mathbf{R}}^{k} }} \sum\limits_{{i,j}} {\left( {\left\langle {x_{i} ,x_{j} } \right\rangle - \left\langle {y_{i} ,y_{j} } \right\rangle } \right)} ^{2}$$

Metric MDS Metric MDS tries to preserve the pairwise distances between the points, i.e., it solves the optimization problem

$$\begin{array}{*{20}c} {\min _{{y_{1} ,y_{2} , \ldots ,y_{n} \in {\mathbf{R}}^{k} }} } & {\sum\limits_{{i,j}} {w_{{ij}} } \left( {\left\| {x_{i} - x_{j} } \right\| - \left\| {y_{i} - y_{j} } \right\|} \right)^{2} ,} \\ \end{array}$$

where \(\Vert .\Vert\) denotes the Euclidean norm and \(w_{ij}\ge 0\) are some given weights.

Non-metric MDS Non-metric MDS embeds the data into low-dimensional Euclidean space by preserving only the relative distance ordering, i.e, it solves the optimization problem

$$\begin{array}{*{20}c} {\min _{{y_{1} ,y_{2} , \ldots ,y_{n} \in {\mathbf{R}}^{k} ,f}} } & {\sum\limits_{{i,j}} {w_{{ij}} } (f\left( {\left\| {x_{i} - x_{j} } \right\| - \left\| {y_{i} - y_{j} } \right\|} \right)^{2} ,} \\ \end{array}$$

where f is a monotonically increasing function. Note, only the ordering of the pairwise distances is important here which should be preserved and not the actual distances.

All three formulations differ only in the objective functions that are minimized. While this seems like a minor difference, it has a substantial impact on its computability. It can be shown that classical MDS is equivalent to PCA when the input points are given explicitly and hence can be solved efficiently by a singular value decomposition. Thus, it can be solved efficiently for large data sets, having millions of cells. The computational complexity of metric MDS is fundamentally different. It has been shown that metric MDS is NP-hard when the target dimension is one [12] and it is believed that it is NP-hard in general. Hence, no efficient algorithm is likely to exist for solving this problem optimally. However, even finding a local minima is very time consuming. The most popular algorithm for solving this problem is the SMACOF algorithm [13]. However, its running time grows quadratically in the number of data points n. It can only be applied to solve this problem with up to a few thousand data points.

There is one more important difference between classical MDS and metric MDS; it can be shown that in classical MDS the optimal solution corresponds to a linear mapping of the high-dimensional space \(\textbf{R}^n\) into the low-dimensional Euclidean space \(\textbf{R}^k\). This is not true for metric MDS. The optimal solution for metric MDS does not correspond to a linear mapping. Asking for a linear mapping leads to suboptimal solutions. Having a mapping from the input space \(\textbf{R}^m\) into the output space \(\textbf{R}^k\) is important for new, unseen data points. For instance, one can compute the mapping on a training set and apply the same mapping to the test set as it is common practice for other low-dimensional embedding and preprocessing methods like PCA. The SMACOF algorithm does not provide such a mapping.

Often, the input points are not given explicitly, but instead, their pairwise distances or pairwise scalar products are given. In this case, such a mapping cannot be provided.

Related work

Multidimensional scaling has a long history [14]. Classical MDS was first studied by Torgerson [15] and independently by Gower [16]. They used an eigenvector decomposition to solve the problem. Later, Kruskal [17] defined the problem of metric MDS as an optimization problem and used a steepest descent approach for solving it. Leeuw [13] improved the running time of Kruskal’s algorithm by using an iterative majorization approach. This algorithm is referred to as SMACOF algorithm. Surprisingly, it still represents the state-of-the-art for solving the metric MDS problem. The non-metric MDS problem was introduced by Kruskal [17] and Guttman [18]. It is mainly used in the psychometric area.

The research on MDS can be split up into two branches; improving statistical performance and improving speed. Here, we focus on the latter one. The books by Cox and Cox [19], and Borg and Groenen [20] provide an in-depth coverage on the statistical properties and applications of MDS. See also the book by Burges [21] for a comparison of MDS to other embedding techniques.

When the input points are given explicitly, then classical MDS can be shown to be equivalent to PCA. Hence, computational speed is no issue in this case. When the input is instead a distance matrix, Qu and Cai [22] and Yang et al. [23] used a divide-and-conquer approach for scaling up classical MDS to larger data sets. However, their approach only works for the classical MDS problem.

A technique called landmark MDS was introduced by Silva and Tenenbaum [24]. The idea behind this approach is to select only a subset of the input points, called landmarks, compute the distance matrix between these points, and use only these points for the embedding. Hence, it can scale to larger data sets at the expense of ignoring the majority of the input points in the embedding process. This approach is also used in Isomap by Tenenbaum et al. [25]. Williams [26] provided a similarity between kernel PCA and metric MDS. However, they are both similar but not equivalent due to their computational complexity (P vs. NP).

As already stated, classical MDS can be solved to optimality via a linear mapping. This is not true for the metric MDS. Sammon [27] set the weights \(w_{ij}= 1/\Vert x_i - x_j\Vert\) in the metric MDS problem and used a steepest descent algorithm for embedding the data. However, unlike the title suggests, it does not provide a mapping from the input space \(\textbf{R}^m\) to the target space \(\textbf{R}^k\).

There have also been some early attempts on solving the metric MDS problem using neural networks. This includes works by Mao and Jain [28] and Ridder and Duin [29]. However, their approaches scaled to very small data sets with up to a few hundred data points only. While the neural network approach later was used for other non-linear embedding and dimensionality reduction methods and gave rise to autoencoders, see, e.g., the seminal work by Hinto and Salakhutdinov [30], they have never been used successfully for solving reasonably-sized metric MDS problems. Wezel et al. [31] used a neural network approach for classical MDS and considered non-metric MDS in Wezel and Kosters [32]. However, they did not consider the metric MDS problem.

Multidimensional scaling and linear mappings

In this section we will introduce projected metric MDS as an intermediate version of MDS that combines the optimization objective of metric MDS with the linear mapping obtained from classical MDS.

When applying MDS to a given input set, it is often important to also obtain the corresponding mapping, i.e., the mapping that maps the whole input space \(\textbf{R}^m\) to the output space \(\textbf{R}^k\). This is for instance necessary when MDS is used as a preprocessing step. One usually computes the embedding on the given training data set and applies the same mapping to the test data set, i.e., to new, unseen data points without recomputing the whole problem again. This is common practice in preprocessing steps like PCA and also necessary in order to not induce a shift in the distribution of the test data. Hence, it is important to obtain such a mapping along with the actual embedding.

Let \(X\in \textbf{R}^{n\times m}\) be the matrix where the ith row is the ith input point \(x_i\in \textbf{R}^m\). We define the distance matrix \(D_X\in \textbf{R}^{n\times n}\) as \((D_X)_{ij} = \Vert x_i-x_j\Vert\). Let \(D_X^2\) be the elementwise squared distance matrix, i.e., \((D_X^2)_{ij} = \Vert x_i-x_j\Vert ^2\). Let Y and \(D_Y\) be defined accordingly for the output points \(y_i\).

Classical MDS minimizes the error on the pairwise scalar products, i.e., \(\sum _{i, j} (\langle x_i, x_j\rangle - \langle y_i, y_j\rangle )^2\). It has been shown that the optimal solution leads to a linear mapping which is obtained by the top-k eigenvectors of the Gram matrix \(X^\top X\). The following computation shows that classical MDS can be translated into an optimization problem where approximately the error on the squared distances is minimized.

The objective function of classical MDS can be rewritten in matrix notation as

$$\begin{aligned} \min _Y \Vert X^\top X - Y^\top Y\Vert ^2_F. \end{aligned}$$

It holds that \(X^\top X = -\frac{1}{2} HD_X^2H\), where \(H=\textbf{I}-\frac{1}{n}ee^\top\) is a centering matrix with \(\textbf{I}\) being the identity matrix and \(e\in \textbf{R}^{n}\) the all-ones vector.

Hence, the objective function of classical MDS can be rewritten as

$$\min _{Y} \frac{1}{2}\left\| {H\left( {D_{X}^{2} - D_{Y}^{2} } \right)H} \right\|_{F}^{2} .{\text{ }}$$

Note, this is very similar, however not equivalent to

$$\begin{aligned} \min _Y \frac{1}{2}\Vert D_X^2-D_Y^2\Vert ^2_F, \end{aligned}$$

i.e., to the problem of minimizing the error on the squared distances. Compare this to the metric MDS problem that solves

$$\begin{aligned} \min _Y \frac{1}{2}\Vert D_X-D_Y\Vert ^2_F. \end{aligned}$$

Hence, classical MDS can be seen as approximately minimizing the error of the squared distances while metric MDS tries to minimize the error of the distances. Thus, classical MDS can be a bit more sensitive to outliers in the data. One could then ask for an intermediate approach between classical and metric MDS; one could try to minimize the error on the distances while asking for a linear mapping. This gives rise to the projected metric MDS problem.

Definition 1

(projected metric MDS) Given some input data \(X\in \textbf{R}^{n\times m}\), projected metric MDS solves the following optimization problem

$$\begin{aligned} \begin{array}{ll} \displaystyle \min _{P, Y} &{} \Vert D_{X} - D_{Y}\Vert _F^2\\ \text {st.} &{} Y= XP.\nonumber \end{array} \end{aligned}$$

Matrix \(P\in \textbf{R}^{m\times d}\) along with the constraint forces the mapping to be linear. The following theorem relates the optimal solution of projected metric MDS to the optimal solution of metric MDS.

Theorem 1

Let \(X\in \textbf{R}^{n\times m}\) be an input data matrix with centered rows, \(Y^*\in \textbf{R}^{n\times k}\) the solution of metric MDS and \({P}^*\in \textbf{R}^{m\times k}\) the solution of the projected metric MDS. Then, the following inequality holds:

$$\begin{aligned} \frac{\Vert D_{X} - D_{X{P^*}}\Vert _F - \Vert D_{X} - D_{Y^*}\Vert _F}{\Vert D_{Y^*}\Vert _F} \le \sqrt{n-r + 2}, \end{aligned}$$

where \(r=\textrm{rank}(X)\).

In order to prove the theorem, we will need two technical lemmas (Lemma 1 and Lemma 2) and Proposition 1, where we show how to compute a projection matrix P such that the projected points are close to the given ones in a least-square sense. We show that this optimization problem has a closed formula solution and use this result as a way to generate a feasible solution for the projected metric MDS instance. After this we are ready to state the proof of Theorem 1.

Definition 2

We say that rows of data matrix \(X\) are centered if

$$\begin{aligned} \sum _{i=1}^n x_{ij} = 0, \quad \forall j\in \{1,\ldots ,d\}. \end{aligned}$$

Lemma 1

If \(X\in \textbf{R}^{n\times m}\) is centered, then for each \(P\in \textbf{R}^{m\times k}\) matrix XP is centered.


Since \(X\) is centered, we have

$$\begin{aligned}1^TX= 0,\end{aligned}$$

where \(1\in \textbf{R}^{n}\) is vector of ones. From associativity of matrix multiplication we have that

$$0 = 1^{T} X = \left( {1^{T} X} \right)P = 1^{T} \left( {XP} \right),$$

which implies that \(XP\) is centered. \(\square\)

Lemma 2

Let \(X\in \textbf{R}^{n\times m}\) be a matrix with centered rows. Then

$$\begin{aligned}\Vert D_{X}\Vert _F^2 = 2n\Vert X\Vert _F^2.\end{aligned}$$



$$\begin{aligned} (D_{X} )_{{ij}}^{2} = & \left( {x_{i} - x_{j} } \right)^{T} \left( {x_{i} - x_{j} } \right) \\ = & {\mkern 1mu} \,x_{i}^{T} x_{i} + x_{j}^{T} x_{j} - 2x_{i}^{T} x_{j} \\ \end{aligned}$$

and rows of \(X\) are centered, we have

$$\begin{aligned} \sum\limits_{{i = 1}}^{n} {\sum\limits_{{j = 1}}^{n} {(D_{X} )_{{ij}}^{2} } } = & \sum\limits_{{i = 1}}^{n} {\sum\limits_{{j = 1}}^{n} {\left( {x_{i}^{T} x_{i} + x_{j}^{T} x_{j} - 2x_{i}^{T} x_{j} } \right)} } \\ = & \sum\limits_{{i = 1}}^{n} {\sum\limits_{{j = 1}}^{n} {x_{i}^{T} } } x_{i} + \sum\limits_{{i = 1}}^{n} {\sum\limits_{{j = 1}}^{n} {x_{j}^{T} } } x_{j} \\ & - {\mkern 1mu} 2\left( {\sum\limits_{{i = 1}}^{n} {x_{i} } } \right)^{T} \left( {\sum\limits_{{j = 1}}^{n} {x_{j} } } \right) \\ = & 2n\left\| X \right\|_{F}^{2} . \\ \end{aligned}$$


Proposition 1

Let \(X\in \textbf{R}^{n\times m}\) be an input data matrix, \(\tilde{Y}\in \textbf{R}^{n\times k}\) for some \(k<m\), and a singular value decomposition of \(X\) given by the following:

$$\begin{aligned} X= \left[ \begin{array}{cc} U_1&U_2 \end{array} \right] \left[ \begin{array}{cc} \tilde{\Sigma } &{} 0\\ 0&{} 0\end{array}\right] \left[ \begin{array}{c} V_1^T\\ V_2^T \end{array} \right] \end{aligned}$$

where \(U= [U_1\,\,U_2]\), \(U_1\in \textbf{R}^{n\times r}\), \(U_2\in \textbf{R}^{n\times (n-r)}\) and \(V= [V_1\,\,V_2]\), \(V_1\in \textbf{R}^{m\times r}\), \(V_2\in \textbf{R}^{m\times (m-r)}\) are orthogonal matrices. \(\tilde{\Sigma }\in \textbf{R}^{r\times r}\) is a diagonal matrix that contains r non-zero singular values such that \(\sigma _1\ge \sigma _2\ge \ldots \ge \sigma _r > 0\), where r is a rank of matrix \(X\).

The solution of the following optimization problem

$$\begin{aligned} \min _{P}\Vert \tilde{Y} - XP\Vert _F^2 \end{aligned}$$

is given by

$$\begin{aligned} \tilde{P} = V_1\tilde{\Sigma }^{-1}U_1^T\tilde{Y}, \end{aligned}$$

whose objective value is

$$\begin{aligned} \Vert \tilde{Y} - X\tilde{P}\Vert _F^2 = \Vert U_2^T\tilde{Y}\Vert _F^2. \end{aligned}$$


The objective function in (3) can be rewritten as follows:

$$\begin{aligned} \left\| {\tilde{Y} - XP} \right\|_{F}^{2} & = \left\| {\tilde{Y} - U\Sigma V^{T} P} \right\|_{F}^{2} \\ & = \left\| {U^{T} \left( {\tilde{Y} - U\Sigma V^{T} P} \right)} \right\|_{F}^{2} \\ \end{aligned}$$
$$\begin{aligned}&=\Vert U^T\tilde{Y} - \Sigma V^TP\Vert _F^2, \end{aligned}$$

where last equality follows from unitary invariance of Frobenius norm. We introduce substitution

$$\begin{aligned} \left[ \begin{array}{cc} Z_1\\ Z_2 \end{array} \right] = \left[ \begin{array}{cc} V_1^TP\\ V_2^TP\\ \end{array}\right] , \end{aligned}$$

such that the objective in (7) is

$$\begin{aligned} \Vert \tilde{Y} - XP\Vert _F^2&=\left\| \left[ \begin{array}{cc} U_1^T\tilde{Y}\\ U_2^T\tilde{Y}\\ \end{array}\right] - \left[ \begin{array}{c} \tilde{\Sigma }Z_1\\ 0\end{array} \right] \right\| _F^2\\&=\left\| \left[ \begin{array}{cc} U_1^T\tilde{Y} - \tilde{\Sigma }Z_1\\ U_2^T\tilde{Y}\\ \end{array}\right] \right\| _F^2,\\ \end{aligned}$$

where \(U_1^T\tilde{Y} - \tilde{\Sigma }Z_1\) can be set to zero matrix when \(Z_1\) is a solution of the following equation:

$$\begin{aligned}\tilde{\Sigma }Z_1 = U_1^T\tilde{Y}.\end{aligned}$$

Indeed, \(Z_1 = \tilde{\Sigma }^{-1}U_1^T\tilde{Y}\), while \(Z_2\) can be set to \(0\) since it does not affect the objective value. Substitution of \(Z_1\) from (8) gives the solution (4) whose objective value is given in (5).


Now we are ready to state the proof of Theorem 1

Proof of Theorem 1

In order to prove the inequality, we first bound \((D_{\tilde{Y}} - D_{X\tilde{P}})_{ij}^2\).

$$\begin{aligned} \left( {D_{{\tilde{Y}}} - D_{{X\tilde{P}}} } \right)_{{ij}}^{2} & = \left( {\left( {D_{{\tilde{Y}}} } \right)_{{ij}} - \left( {D_{{X\tilde{P}}} } \right)_{{ij}} } \right)^{2} \\ & = \left( {D_{{\tilde{Y}}} } \right)_{{ij}}^{2} - 2\left( {D_{{\tilde{Y}}} } \right)_{{ij}} \left( {D_{{X\tilde{P}}} } \right)_{{ij}} + \left( {D_{{X\tilde{P}}} } \right)_{{ij}}^{2} \\ & = \left\| {\tilde{y}_{i} } \right\|_{2}^{2} {\mkern 1mu} - 2\tilde{y}_{i}^{T} \tilde{y}_{j} + \left\| {\tilde{y}_{j} } \right\|_{2}^{2} \\ & + \left\| {\tilde{P}^{T} x_{i} } \right\|_{2}^{2} - 2x_{i}^{T} \tilde{P}\tilde{P}^{T} x_{j} + \left\| {\tilde{P}^{T} x_{j} } \right\|_{2}^{2} \\ & - 2\sqrt {\left\| {\tilde{y}_{i} } \right\|_{2}^{2} {\mkern 1mu} - 2\tilde{y}_{i}^{T} \tilde{y}_{j} + \left\| {\tilde{y}_{j} } \right\|_{2}^{2} } \cdot \sqrt {\left\| {\tilde{P}^{T} x_{i} } \right\|_{2}^{2} - 2x_{i}^{T} \tilde{P}\tilde{P}^{T} x_{j} + \left\| {\tilde{P}^{T} x_{j} } \right\|_{2}^{2} } \\ \end{aligned}$$

Now, from Cauchy-Schwartz inequality follows that \(\tilde{y}_i^T\tilde{y}_j\le |\tilde{y}_i^T\tilde{y}_j|\le \Vert \tilde{y}_i\Vert _2\Vert \tilde{y}_j\Vert _2\) which gives

$$\begin{aligned} \left\| {\tilde{y}_{i} } \right\|_{2}^{2} - 2\tilde{y}_{i}^{T} \tilde{y}_{j} + \left\| {\tilde{y}_{j} } \right\|_{2}^{2} \ge & \left\| {\tilde{y}_{i} } \right\|_{2}^{2} + \left\| {\tilde{y}_{j} } \right\|_{2}^{2} \\ & - \,2\left\| {\tilde{y}_{i} } \right\|_{2} \left\| {\tilde{y}_{j} } \right\|_{2} \\ = & \left( {\left\| {\tilde{y}_{i} } \right\|_{2} - \left\| {\tilde{y}_{j} } \right\|_{2} } \right)^{2} , \\ \end{aligned}$$

and analogously

$$\left\| {\tilde{P}^{T} x_{i} } \right\|_{2}^{2} - 2x_{i}^{T} \tilde{P}\tilde{P}^{T} x_{j} + \left\| {\tilde{P}^{T} x_{j} } \right\|_{2}^{2} \ge \left( {\left\| {\tilde{P}^{T} x_{i} } \right\|_{2} - \left\| {\tilde{P}^{T} x_{j} } \right\|_{2} } \right)^{2} .$$

From (9), (10), and (11) follows that

$$\begin{aligned} \left( {D_{{\tilde{Y}}} - D_{{X\tilde{P}}} } \right)_{{ij}}^{2} \le & \left\| {\tilde{y}_{i} } \right\|_{2}^{2} - 2\tilde{y}_{i}^{T} \tilde{y}_{j} + \left\| {\tilde{y}_{j} } \right\|_{2}^{2} + \left\| {\tilde{P}^{T} x_{i} } \right\|_{2}^{2} - 2x_{i}^{T} \tilde{P}\tilde{P}^{T} x_{j} \\ & + \left\| {\tilde{P}^{T} x_{j} } \right\|_{2}^{2} - 2\sqrt {\left( {\left\| {\tilde{y}_{i} } \right\|_{2} - \left\| {\tilde{y}_{j} } \right\|_{2} } \right)^{2} } \sqrt {\left( {\left\| {\tilde{P}^{T} x_{i} } \right\|_{2} - \left\| {\tilde{P}^{T} x_{j} } \right\|_{2} } \right)^{2} } \\ = & \left\| {\tilde{y}_{i} } \right\|_{2}^{2} - 2\tilde{y}_{i}^{T} \tilde{y}_{j} + \left\| {\tilde{y}_{j} } \right\|_{2}^{2} + \left\| {\tilde{P}^{T} x_{i} } \right\|_{2}^{2} - 2x_{i}^{T} \tilde{P}\tilde{P}^{T} x_{j} \\ & + \left\| {\tilde{P}^{T} x_{j} } \right\|_{2}^{2} - 2\left| {\left\| {\tilde{y}_{i} } \right\|_{2} - \left\| {\tilde{y}_{j} } \right\|_{2} } \right|\left| {\left\| {\tilde{P}^{T} x_{i} } \right\|_{2} - \left\| {\tilde{P}^{T} x_{j} } \right\|_{2} } \right| \\ \le & \,\left\| {\tilde{y}_{i} } \right\|_{2}^{2} - 2\tilde{y}_{i}^{T} \tilde{y}_{j} + \left\| {\tilde{y}_{j} } \right\|_{2}^{2} + \left\| {\tilde{P}^{T} x_{i} } \right\|_{2}^{2} - 2x_{i}^{T} \tilde{P}\tilde{P}^{T} x_{j} \\ & + \left\| {\tilde{P}^{T} x_{j} } \right\|_{2}^{2} - 2\left( {\left\| {\tilde{y}_{i} } \right\|_{2} - \left\| {\tilde{y}_{j} } \right\|_{2} } \right)\left( {\left\| {\tilde{P}^{T} x_{i} } \right\|_{2} - \left\| {\tilde{P}^{T} x_{j} } \right\|_{2} } \right) \\ \le & \left\| {\tilde{y}_{i} } \right\|_{2}^{2} - 2\left\| {\tilde{y}_{i} } \right\|_{2} \left\| {\tilde{P}^{T} x_{i} } \right\|_{2} + \left\| {\tilde{P}^{T} x_{i} } \right\|_{2}^{2} \\ & + \left\| {\tilde{y}_{j} } \right\|_{2}^{2} - 2\left\| {\tilde{y}_{j} } \right\|_{2} \left\| {\tilde{P}^{T} x_{j} } \right\|_{2} + \left\| {\tilde{P}^{T} x_{j} } \right\|_{2}^{2} - 2\tilde{y}_{i}^{T} \tilde{y}_{j} - 2x_{i}^{T} \tilde{P}\tilde{P}^{T} x_{j} \\ & + 2\left\| {\tilde{y}_{j} } \right\|_{2} \left\| {\tilde{P}^{T} x_{i} } \right\|_{2} + 2\left\| {\tilde{y}_{i} } \right\|_{2} \left\| {\tilde{P}^{T} x_{j} } \right\|_{2} \\ = & \left( {\left\| {\tilde{y}_{i} } \right\|_{2} - \left\| {\tilde{P}^{T} x_{i} } \right\|_{2} } \right)^{2} + \left( {\left\| {\tilde{y}_{j} } \right\|_{2} - \left\| {\tilde{P}^{T} x_{j} } \right\|_{2} } \right)^{2} \\ & - 2\tilde{y}_{i}^{T} \tilde{y}_{j} - 2x_{i}^{T} \tilde{P}\tilde{P}^{T} x_{j} \\ & + 2\left\| {\tilde{y}_{j} } \right\|_{2} \left\| {\tilde{P}^{T} x_{i} } \right\|_{2} + 2\left\| {\tilde{y}_{i} } \right\|_{2} \left\| {\tilde{P}^{T} x_{j} } \right\|_{2} \\ \end{aligned}$$

which after summing up both sides gives

$$\begin{aligned} \sum\limits_{{i = 1}}^{n} {\sum\limits_{{j = 1}}^{n} {\left( {D_{{\tilde{Y}}} - D_{{X\tilde{P}}} } \right)_{{ij}}^{2} } } \le & \sum\limits_{{i = 1}}^{n} {\sum\limits_{{j = 1}}^{n} {\left( {\left\| {\tilde{y}_{i} } \right\|_{2} - \left\| {\tilde{P}^{T} x_{i} } \right\|_{2} } \right)^{2} } } \\ + & \sum\limits_{{i = 1}}^{n} {\sum\limits_{{j = 1}}^{n} {\left( {\left\| {\tilde{y}_{j} } \right\|_{2} - \left\| {\tilde{P}^{T} x_{j} } \right\|_{2} } \right)^{2} } } \\ - & 2\left( {\sum\limits_{{i = 1}}^{n} {\tilde{y}_{i} } } \right)^{T} \left( {\sum\limits_{{j = 1}}^{n} {\tilde{y}_{j} } } \right) - 2\left( {\sum\limits_{{i = 1}}^{n} {\tilde{P}^{T} } x_{i} } \right)^{T} \left( {\sum\limits_{{j = 1}}^{n} {\tilde{P}^{T} } x_{j} } \right) \\ + & 2\left( {\sum\limits_{{i = 1}}^{n} {\left\| {\tilde{P}^{T} x_{i} } \right\|_{2} } } \right)\left( {\sum\limits_{{j = 1}}^{n} {\left\| {\tilde{y}_{j} } \right\|_{2} } } \right) \\ + & 2\left( {\sum\limits_{{i = 1}}^{n} {\left\| {\tilde{y}_{i} } \right\|_{2} } } \right)\left( {\sum\limits_{{j = 1}}^{n} {\left\| {\tilde{P}^{T} x_{j} } \right\|_{2} } } \right). \\ \end{aligned}$$

Now, from \(|\Vert x\Vert - \Vert y\Vert | \le \Vert x-y\Vert\) and Lemma 1 we obtain the following bound

$$\begin{aligned} \left\| {D_{{\tilde{Y}}} - D_{{X\tilde{P}}} } \right\|_{F}^{2} \le & \sum\limits_{{i = 1}}^{n} {\sum\limits_{{j = 1}}^{n} {\left\| {\tilde{y}_{i} - \tilde{P}^{T} x_{i} } \right\|_{2}^{2} } } + \sum\limits_{{i = 1}}^{n} {\sum\limits_{{j = 1}}^{n} {\left\| {\tilde{y}_{j} - \tilde{P}^{T} x_{j} } \right\|_{2}^{2} } } \\ & + 2\sqrt n \left\| {X\tilde{P}} \right\|_{F} \sqrt n \left\| {\tilde{Y}} \right\|_{F} + 2\sqrt n \left\| {X\tilde{P}} \right\|_{F} \sqrt n \left\| {\tilde{Y}} \right\|_{F} \\ =\, & 2n\left\| {\tilde{Y} - X\tilde{P}} \right\|_{F}^{2} + 4n\left\| {X\tilde{P}} \right\|_{F} \left\| {\tilde{Y}} \right\|_{F} \\ \end{aligned}$$

Now, from Proposition 1 we conclude that

$$\begin{aligned} \Vert X\tilde{P}\Vert _F = \Vert U_1\tilde{\Sigma }V_1^TV_1\tilde{\Sigma }^{-1}U_1^T\tilde{Y}\Vert _F = \Vert \tilde{Y}\Vert _F \end{aligned}$$


$$\begin{aligned} \Vert D_{\tilde{Y}} - D_{X\tilde{P}}\Vert _F^2&\le 2n\Vert U_2^T\tilde{Y}\Vert _F^2 + 4n\Vert \tilde{Y}\Vert _F^2\nonumber \\&\quad \le 2n\Vert U_2\Vert _F^2 \Vert \tilde{Y}\Vert _F^2 + 4n\Vert \tilde{Y}\Vert _F^2\nonumber \\&\quad =2n(n-r)\Vert \tilde{Y}\Vert _F^2 + 4n\Vert \tilde{Y}\Vert _F^2 \end{aligned}$$

Inequality (16) and Lemma 2 imply

$$\begin{aligned} \Vert D_{\tilde{Y}} - D_{X\tilde{P}}\Vert _F^2 \le (n-r+2)\Vert D_{\tilde{Y}}\Vert _F^2. \end{aligned}$$

Finally, from triangle inequality and (16) we conclude that

$$\begin{aligned} \left\| {D_{X} - D_{{X\tilde{P}}} } \right\|_{F} \le \, & \left\| {D_{X} - D_{{\tilde{Y}}} } \right\|_{F} \\ & + \left\| {D_{{\tilde{Y}}} - D_{{X\tilde{P}}} } \right\|_{F} \\ \le & \left\| {D_{X} - D_{{\tilde{Y}}} } \right\|_{F} \\ & + \sqrt {n - r + 2} \left\| {D_{{\tilde{Y}}} } \right\|_{F} \\ \end{aligned}$$

which completes the proof. \(\square\)

While the theorem proves that the objective function values of projected metric MDS and metric MDS do not differ too much, in practice they still might provide substantially different solutions. See Fig. 1 for an example. Hence, we conclude that it is really necessary to have a nonlinear mapping between the input and the output space. This motivates the neural network approach introduced below.

Fig. 1
figure 1

We use an open box example (a) in order to illustrate the power of nonlinear mapping, such as metric MDS (d), over the linear mapping, such as PCA (b) and projected metric MDS (c)

The neural network approach

Here we will describe the neural network architecture that we used as well as the algorithm for computing the metric MDS mapping. We use a simple fully-connected neural network with a single hidden layer and a \(\tanh\) activation function. The size of the input layer corresponds to the dimension of the input data n, while the size of the output layer corresponds to the dimension k of the output data. The size of the hidden layer is chosen as an estimate of the intrinsic dimension of the input data set. In practice, we estimated the intrinsic dimension by computing a SVD of the input data and selecting the top k singular values that retain at least \(95\%\) of the data variance.

The motivation for our simple neural network architecture comes from the fact that any feedforward neural network with only a single hidden layer and any infinitely differentiable sigmoidal activation function, i.e., a function that retains the “S” shape can uniformly approximate any continuous function on a compact set, see, e.g., [33]. Furthermore, any mapping from the input points to the output points can be extended to a continuous mapping from \(\textbf{R}^n\) to \(\textbf{R}^k\), provided that the set of input points is finite and no two input points share the same coordinates, i.e., \(x_i\ne x_j\) for \(i\ne j\). Hence, a neural network should be able to approximate that mapping arbitrarily well. Due to the simplicity, differentiability, and the small number of parameters compared to the number of input points and features, there is little chance of overfitting. As can be seen in the experiments, the found mapping generalizes well on test data.

Algorithm 1
figure a

NN approach for metric MDS problem

We adopt the standard batch stochastic gradient descent approach and partition the input data set into a set of batches. We optimize the weights in our simple neural network following the idea of the Siamese neural network approach, see, e.g., [34]. The basic idea of this approach is that in each learning step the neural net is shown two points, say \(x_i\) and \(x_j\). The outputs are stored for both points, say \(y_i\) and \(y_j\) respectively and the distance \(\Vert y_i-y_j\Vert\) between the output vectors are calculated. A loss function is defined in terms of the squared difference of this distance and the distance between the points in the input space \(\Vert x_i-x_j\Vert\), for all pairs of input points in the current batch. The weights of the neural network are updated using the Adam optimizer. Since the data is shuffled after each epoch, we sample the input distance matrix uniformly at random which provides an unbiased estimate of it. We set the batch size to 256 points which provides a good tradeoff between memory consumption and number of iterations needed to converge. All the steps of our simple but efficient approach are summarized in Algorithm 1.


We implemented the neural network (NN) approach and compared it to other methods for solving MDS.Footnote 1 Specifically, we compared it to the SMACOF algorithm which still represents the state-of-the-art for solving the metric MDS problem. However, due to its inherent quadratic runtime and space complexity, it is prohibitive to run it on data sets with more than a few thousand data points. To still assess the quality of our neural network approach, we also compared it to a random projection (RP) approach. In the random projection approach, a random Gaussian matrix is used for projecting the points into a low-dimensional Euclidean space. The well-known Johnson-Lindenstrauss lemma [35] states that such a random projection can embed any n-point data set into a low-dimensional Euclidean space of dimension at most \(O\left( \frac{\log n}{\varepsilon ^2}\right)\) while incurring a multiplicative distortion error of no more than \(1+\varepsilon\) for any small \(\varepsilon > 0\) in the worst case. Dasgupta [36] showed that in general this approach works surprisingly well when trying to embed a data set while preserving inter-cluster distances. We also compared our neural network approach to PCA to demonstrate that the preservation of distances as explicitly optimized for in mMDS cannot be obtained as a by-product of a much simpler optimization model (here classical MDS). Finally, we also compared it to the projected metric MDS problem that we have defined above. We solved the projected metric MDS problem using a quasi-Newton method combined with a smoothing technique. Note that this approach also does not scale well to large data sets.

We ran the experiments on a Ryzen 9 3900X CPU with 12 cores running at 3.8 GHz, 64 GB DDR4 RAM and a RTX 2080 graphics card using an Ubuntu 19.10. operating system. All implementations were done in Python 3.7. We used PyTorch 1.5.1 for the neural network approach. We used the implementation of the random projection, PCA, and SMACOF algorithm from scikit-learn 0.22.1.

Fig. 2
figure 2

The loss of the metric MDS problem for different values of the target dimension for train and test data sets. The loss function is displayed on a logarithmic scale. Due to its quadratic running time, SMACOF was run only on the smallest USPS data set

Fig. 3
figure 3

Comparison of the loss function of the metric MDS problem for random projections (RP), PCA, and our neural network (NN) approach

Comparison of loss and running time of metric MDS

Figure 2 shows the metric MDS loss achieved by the different methods on the USPS, MNIST, CIFAR10, and SVHN data sets for different embedding dimension k. For each data set, we computed the embedding on the train data set and then applied the mapping of the input space to the lower-dimensional output space provided by each method (except SMACOF) to new, unseen data points from the test data set. Since the SMACOF algorithm does not provide such a mapping we recomputed the embedding of combined train data and test data and reported the loss of SMACOF on the test data set in Fig. 2b. On the smallest USPS data set our neural network approach is typically at least as good as the SMACOF algorithm, which we could not run on other (larger) data sets due to its prohibitive running time. As expected, both methods yielded substantially smaller loss across data sets than the remaining methods that do not explicitly optimize the same objective. Note that the metric MDS loss function is plotted on a logarithmic scale.

Figure 3 shows consistent results on three word embedding data sets that represent words as vectors that geometrically capture the semantics of the words. More precisely, the Glove data set consists of 2 data sets: glove.6B learned word vectors on Wikipedia 2014 dump (size 400 K), and glove.840B learned word vectors on Common Crawl corpus (size 2.2.M). The FastText data set wiki-news-300d-1 M holds 1 M word vectors trained on Wikipedia 2017 dump, UMBC webbase corpus and news dataset. We had to exclude SMACOF from this comparison due to its quadratic running time.

We also report the running time of our approach in Table 1 and compare it to the running time of the SMACOF algorithm when embedding the different data sets into dimension \(k=12\). We observed that neither the running time of SMACOF nor of our approach depends on the dimension k. Since the SMACOF algorithm can handle only small data sets we subsampled the MNIST data set. We always ran our approach for 1000 epochs on all data sets. Table 1 shows that the running time of the SMACOF algorithm grows quadratically in the number of data points. In contrast, our approach shows an approximately linear dependence, which allows it to be applied to large data sets where it is orders of magnitude faster than the current state-of-the-art approach.

Table 1 Running times of SMACOF and our neural network based approach on data sets of different size

Metric MDS based clustering of scRNA-seq data

A comprehensive comparison of method for clustering single-cell RNA sequencing (scRNA-seq) data based on generic as well as scRNA-seq specific dimensionality reduction methods, including classical MDS, has been performed in Sun et al. [1]. Here, we demonstrate the utility of metric MDS for the clustering of scRNA-seq data. The unsupervised clustering of scRNA-seq data allows to identify known as well as novel cell types based on the cell’s transcriptomes. Seurat [37] is the most widely used computational method for clustering of scRNA-seq. It is based on the Louvain clustering algorithm and relies on a prior preprocessing of the data that includes, among others, a dimensionality reduction step using principle component analysis (PCA). A major advantage of metric MDS over PCA is its flexibility with respect to the distance metric that is used in the underlying optimization problem. We therefore compared the standard Seurat clustering pipeline to a pipeline in which we replaced the PCA step by our metric MDS approach but kept all other computational steps identical. In metric MDS we experimented with 3 different distance metrics, the Euclidean, cosine, and correlation based distance.

We compared the PCA and metric MDS based clustering approaches on all but one real data sets that were used in Duò et al. [38] to benchmark clustering methods using cell phenotypes defined independently of scRNA-seq. Following [38], we labeled cell types based on FACS sorting in the Koh data set, and grouped cells according to genetic perturbation and growth medium in the Kumar data set. In data set Zhengmix4eq (Zhengmix4uneq), the authors in Duò et al. [38] randomly mixed equal (unequal) proportions of presorted B cells, naive cytotoxic T cells, CD14 monocytes, and regulatory T cells. Data set Zhengmix8eq additionally included equal proportions of CD56 NK cells, naive T cells, memory T cells, and CD4 T helper cells. We excluded a single data set in which ground truth labels correspond to collection time points that all methods in Duò et al. [38] failed to reconstruct. In agreement with recent clustering benchmarks [38, 39] we used the Adjusted Rand Index (ARI) [40] and Normalized Mutual Information (NMI) [41] to quantify the similarity of inferred to ground truth clusterings.

Table 2 Average scores of ARI and NMI metrics across 5 real data sets
Fig. 4
figure 4

Comparison of metric MDS and PCA

On average, metric MDS yielded more accurate clusterings than when applying PCA, independent of the specific distance metric used (Table 2). Clusterings obtained from embeddings computed by metric MDS using correlation or cosine based distances were most accurate, and achieved a substantial improvement compared to PCA on the three most difficult (with respect to PCA performance) data sets (Fig. 4a). Performance metric NMI provided a consistent picture of method performance. A visualization of the two embeddings highlights the better separation of cell types by metric MDS compared to PCA on data set Zhengmix4eq (Fig. 4b), especially between naive cytotoxic and regulatory T cells. Note that this is consistent with findings in Sun et al. [1] where MDS performed well in clustering visualization compared to, e.g., PCA and t-SNE.

mMDS can improve spatial domain detection

In addition to measuring gene expression in single cells, spatially resolved transcriptomics (SRT) provides information about the relative location of cells in a tissue section [42]. Compared to clustering (non-spatial) scRNA-seq data, spatial clustering can reveal higher-order tissue structures, i.e. spatial domains, which inform many downstream tasks such as marker gene detection and the comparison of health and disease states [43]. Many existing computational approaches to identify spatial domains rely on PCA to pre-process SRT data. Here, we used 12 tissue sections from the human dorsolateral prefrontal cortex assayed with the 10x Visium platform [44] to illustrate the utility of mMDS compared to commonly used PCA in pre-processing SRT data. We compared the spatial domains inferred by methods CCST [45] and SpaGCN [46] to the six cortical layers and white matter that were manually annotated in the original publication. Both methods rely on PCA for initial dimensionality reduction, which we have replaced in a separate run by mMDS using identical target dimensions (200 for CCST, 50 for SpaGCN). Figure 5 quantifies the similarity of inferred and true clusterings using the Adjusted Rand Index. Both methods detect cortical layers more accurately when using our mMDS approach in place of PCA.

Fig. 5
figure 5

Accuracy of spatial domain detection by methods CCST and SpaGCN when pre-processing 10x Visium data either using metric MDS or PCA


We presented a two-layer neural network approach for solving the metric multidimensional scaling problem. Our approach provides two advantages over previous state-of-the-art approaches; it is orders of magnitude faster and scales to much larger data sets with up to a few million data points and may thus represent a viable alternative to the widely used PCA in single-cell analysis. At the same time it provides a mapping of the input space to the output space. This allows to apply the same embedding to new, unseen data, which prevents inducing a shift in the data distribution for test data.

On datasets used in a previous benchmark study, we demonstrated the utility and flexibility of metric MDS in the analysis of scRNA-seq data. Our algorithm allows for the first time to apply metric MDS to large scRNA-seq data sets produced by the highest-throughput sequencing technologies. It thus provides a powerful alternative to methods routinely used to represent noisy gene expression measurements in a low-dimensional subspace useful for the exploration and analysis of scRNA-seq data.

Data availability

All datasets used in this manuscript are publicly available and have been cited appropriately.

Code availability

The source code is publicly available at


  1. The source code is publicly available via


  1. Sun S, Zhu J, Ma Y, Zhou X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 2019;20(1):269.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  2. Baglama J, Reichel L. Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM J Sci Comput. 2005;27(1):19–42.

    Article  Google Scholar 

  3. Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat Biotechnol. 2018;36(5):411–20.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Kiselev VY, Kirschner K, Schaub MT, Andrews T, Yiu A, Chandra T. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods. 2017;14:483.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Lin P, Troup M, Ho JWK. CIDR: ultrafast and accurate clustering through imputation for single-cell RNA-seq data. Genome Biol. 2017;18(1):59.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Ji Z, Ji H. TSCAN: pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res. 2016;44(13):e117-7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Shin J, Berg DA, Zhu Y, Shin JY, Song J, Bonaguidi MA, et al. Single-cell RNA-seq with waterfall reveals molecular cascades underlying adult neurogenesis. Cell Stem Cell. 2015;17(3):360–72.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  8. Deng Y, Bao F, Dai Q, Wu LF, Altschuler SJ. Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat Methods. 2019;16(4):311–4.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Eraslan G, Simon LM, Mircea M, Mueller NS, Single-cell Theis FJ. RNA-seq denoising using a deep count autoencoder. Nat Commun. 2019;10(1):390.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Cao J, Spielmann M, Qiu X, Huang X, Ibrahim DM, Hill AJ, et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature. 2019;566(7745):496–502.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  11. Moon KR, van Dijk D, Wang Z, Gigante S, Burkhardt DB, Chen WS, et al. Visualizing structure and transitions in high-dimensional biological data. Nat Biotechnol. 2019;37(12):1482–92.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Pliner V. Metric unidimensional scaling and global optimization. J Classif. 1996;13(1):3–18.

    Article  Google Scholar 

  13. de Leeuw J, et al. Applications of convex analysis to multidimensional scaling. In: Barra JR, et al., editors. Recent developments in statistics. Amsterdam: North Holland Publishing Company; 1977. p. 133–46.

    Google Scholar 

  14. Groenen PJ, Borg I. Past, present, and future of multidimensional scaling. In: Blasius J, Greenacre M, editors. Visualization and verbalization of data. Boca Raton: CRC Press; 2014. p. 95–117.

    Google Scholar 

  15. Torgerson WS. Theory and methods of scaling. New York: Wiley; 1958.

    Google Scholar 

  16. Gower JC. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika. 1966;53(3–4):325–38.

    Article  Google Scholar 

  17. Kruskal JB. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika. 1964;29(1):1–27.

    Article  Google Scholar 

  18. Guttman L. A general nonmetric technique for finding the smallest coordinate space for a configuration of points. Psychometrika. 1968;33(4):469–506.

    Article  Google Scholar 

  19. Cox T, Cox M. Multidimensional scaling. 2nd ed. Boca Raton: Chapman Hall; 2001.

    Google Scholar 

  20. Borg I, Groenen PJ. Modern multidimensional scaling: theory and applications. Berlin: Springer Science & Business Media; 2005.

    Google Scholar 

  21. Burges CJ. Dimension reduction: a guided tour. Hanover: Now Publishers Inc; 2010.

    Google Scholar 

  22. Qu TG, Cai ZX. A divide-and-conquer based multidimensional scaling algorithm. Pattern Recognit Artif Intell. 2014;27:961–9.

    Google Scholar 

  23. Yang T, Liu J, Mcmillan L, Wang W. A fast approximation to multidimensional scaling. In: Proceedings of the IEEE Workshop on Computation Intensive Methods for Computer Vision. 2006.

  24. de Silva V, Tenenbaum JB. Sparse multidimensional scaling using landmark points. Stanford: Stanford University; 2004.

    Google Scholar 

  25. Tenenbaum JB, De Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science. 2000;290(5500):2319–23.

    Article  CAS  PubMed  Google Scholar 

  26. Williams CK. On a connection between kernel PCA and metric multidimensional scaling. In: Advances in Neural Information Processing Systems (NIPS); 2001. p. 675–81.

  27. Sammon JW. A nonlinear mapping for data structure analysis. IEEE Trans Comput. 1969;100(5):401–9.

    Article  Google Scholar 

  28. Mao J, Jain AK. Artificial neural networks for feature extraction and multivariate data projection. IEEE Trans Neural Netw. 1995;6(2):296–317.

    Article  CAS  PubMed  Google Scholar 

  29. Ridder DD, Duin RP. Sammon’s mapping using neural networks: a comparison. Pattern Recognit Lett. 1997;18:1307–16.

    Article  Google Scholar 

  30. Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006;313(5786):504–7.

    Article  CAS  PubMed  Google Scholar 

  31. Wezel M, Kok J, Kosters W. Two Neural Network Methods for Multidimensional Scaling. In: European Symposium on Artificial Neural Networks (ESANN); 1997.

  32. Wezel M, Kosters W. Nonmetric multidimensional scaling: neural networks versus traditional techniques. Intell Data Anal. 2004;8:601–13.

    Article  Google Scholar 

  33. Cybenko G. Approximation by superpositions of a sigmoidal function. Math Control Signal Syst. 1989;2(4):303–14.

    Article  Google Scholar 

  34. Chopra S, Hadsell R, LeCun Y. Learning a similarity metric discriminatively, with application to face verification. In: Conference on Computer Vision and Pattern Recognition (CVPR’05); 2005. p. 539–46.

  35. Johnson WB, Lindenstrauss J. Extensions of Lipschitz maps into a Hilbert space. Contemp Math. 1984;26:189–206.

    Article  Google Scholar 

  36. Dasgupta S. Experiments with random projection. In: Conference on Uncertainty in Artificial Intelligence (UAI); 2000. p. 143–51.

  37. Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol. 2015;33(5):495–502.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  38. Duò A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Research. 2018;7:1141.

    Article  PubMed  Google Scholar 

  39. Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data. F1000Research. 2018;7:1297.

    Article  PubMed  PubMed Central  Google Scholar 

  40. Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218.

    Article  Google Scholar 

  41. Studholme C, Hill DLG, Hawkes DJ. An overlap invariant entropy measure of 3D medical image alignment. Pattern Recognit. 1999;32(1):71–86.

    Article  Google Scholar 

  42. Moffitt JR, Lundberg E, Heyn H. The emerging landscape of spatial profiling technologies. Nat Rev Genet. 2022;23(12):741–59.

    Article  CAS  PubMed  Google Scholar 

  43. Yuan Z, Zhao F, Lin S, Zhao Y, Yao J, Cui Y, et al. Benchmarking spatial clustering methods with spatially resolved transcriptomics data. Nat Methods. 2024;21(4):712–22.

    Article  CAS  PubMed  Google Scholar 

  44. Maynard KR, Collado-Torres L, Weber LM, Uytingco C, Barry BK, Williams SR, et al. Transcriptome-scale spatial gene expression in the human dorsolateral prefrontal cortex. Nat Neurosci. 2021;24(3):425–36.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Li J, Chen S, Pan X, Yuan Y, Shen HB. Cell clustering for spatial transcriptomics data with graph neural networks. Nat Comput Sci. 2022;2(6):399–408.

    Article  CAS  PubMed  Google Scholar 

  46. Hu J, Li X, Coleman K, Schroeder A, Ma N, Irwin DJ, et al. SpaGCN: integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat Methods. 2021;18(11):1342–51.

    Article  CAS  PubMed  Google Scholar 

Download references


V. H. D. thanks the Vietnam Institute for Advanced Study in Mathematics (VIASM) for the financial support.


Open Access funding enabled and organized by Projekt DEAL. This work was supported by the Deutsche Forschungsgemeinschaft SFB-TRR 338/1 2021-452881907 (to S.C.).

Author information

Authors and Affiliations



All authors conceived the algorithm, contributed to the interpretation of results and the writing of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Stefan Canzar.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Canzar, S., Do, V.H., Jelić, S. et al. Metric multidimensional scaling for large single-cell datasets using neural networks. Algorithms Mol Biol 19, 21 (2024).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: