Efficient privacy-preserving variable-length substring match for genome sequence

Nakagawa, Yoshiki; Ohata, Satsuya; Shimizu, Kana

doi:10.1186/s13015-022-00211-1

Research
Open access
Published: 26 April 2022


Yoshiki Nakagawa¹,
Satsuya Ohata² &
Kana Shimizu^1,3

Algorithms for Molecular Biology volume 17, Article number: 9 (2022)


Abstract

The development of a privacy-preserving technology is important for accelerating genome data sharing. This study proposes an algorithm that securely searches for a variable-length substring match between a query and a database sequence. Our concept hinges on a technique that efficiently applies FM-index to a secret-sharing scheme. More precisely, we developed an algorithm that can achieve a secure table lookup in such a way that $V[V[\ldots V[p_0] \ldots ]]$ is computed for a given depth of recursion where $p_0$ is an initial position, and V is a vector. We used the secure table lookup for vectors created based on FM-index. The notable feature of the secure table lookup is that time, communication, and round complexities are not dependent on the table length N, after the query input. Therefore, a substring match by reference to the FM-index-based table can also be conducted independently of the database length, and the entire search time is dramatically improved compared to previous approaches. We conducted an experiment using a human genome sequence with the length of 10 million as the database and a query with the length of 100 and found that the query response time of our protocol was at least three orders of magnitude faster than a non-indexed database search protocol under the realistic computation/network environment.

Introduction

The dramatic reduction in the cost of genome sequencing has prompted increased interest in personal genome sequencing over the last 15 years. Extensive collections of personal genome sequences have been accumulated both in academic and industrial organizations, and there is now a global demand for sharing the data to accelerate scientific research [1, 2]. As discussed in previous studies, disclosing personal genome information has a high privacy risk [3], so it is crucial to ensure that individuals’ privacy is protected upon data sharing. At present, the most popular approach for this is to formulate and enforce a privacy policy, but it is a time-consuming process to reach an agreement, especially among stakeholders with different legal backgrounds, which slows down the pace of research. Therefore, there is a strong demand for privacy-preserving technologies that can potentially compensate for or even replace the traditional policy-based approach [4, 5]. One important application that needs a privacy-preserving technology is private genome sequence search, where different stakeholders respectively hold a query sequence and a database sequence and the goal is to let the query holder know the result while simultaneously keeping the query and the database private. Many studies have addressed the problem of how to compute exact or approximate edit distance or the longest common substring (LCS) through techniques based on homomorphic encryption [6,7,8] and secure multi-party computation (MPC) [9,10,11,12,13,14,15], or how to compute sequence similarity based on private set intersection [16]. While these studies can evaluate global sequence similarity for two sequences of similar length, other studies address the problem of finding a substring between a query and a long genome sequence or a set of long genome sequences, with the aim of evaluating local sequence similarity [17,18,19,20,21,22,23]. Shimizu et al. 
proposed an approach to combine an additive homomorphic encryption and index structures such as FM-index [24] and the positional Burrows-Wheeler transform [25] to find the longest prefix of a query that matches a database (LPM) and a set-maximal match for a collection of haplotypes [17]. Sudo et al. used a similar approach and improved the time and communication complexities for LPM on a protein sequence by using a wavelet matrix [19]. Ishimaki et al. improved the round complexity of a set-maximal match, though the search time was more than one order of magnitude slower than [17] due to the heavy computational cost caused by the fully homomorphic encryption [18]. Sotiraki et al. used the Goldreich-Micali-Wigderson protocol to build a suffix tree for a set-maximal match [20]. According to experiments by [21], the search time of [20] is one order of magnitude slower than [17, 21]. Mahdi et al. [21] used a garbled circuit to build a suffix tree for substring match and a set-maximal match under a different security assumption such that the tree-traversal pattern is leaked to the cloud server. Chen et al. [22] and Popic et al. [23] found fixed-length substring matches using a one-way hash function or homomorphic encryption on a public cloud under a security assumption such that the database is a public sequence and a query is leaked to a private cloud server.

In this study, we aim to improve privacy-preserving substring match under the security assumption such that both the query and the database sequence are strictly protected. We first propose a more efficient method for finding LPM, and then extend it to find the longest maximal exact match (LMEM), which is more practically important in bioinformatics. We designed the protocol for LMEM for ease of explanation, and the protocol can be applied to similar problems such as finding all maximal exact matches (MEMs) with a small modification. To our knowledge, this is the first study to address the problem of securely finding MEMs.

Our contribution

The time complexity of the previous studies [17, 19] includes a factor of $N$, and thus they do not scale well to a large database. For a similar reason, using secure matching protocols (e.g., [26]) for the shares (or tags in searchable encryption) of all substrings in a query and database is even worse in terms of time complexity. To achieve a real-time search on an actual genome database, we propose novel secret-sharing-based protocols that do not include the factor of $N$ in the time, communication, and round complexities for the search time (i.e., the time after the input of a query until the end of the search).

The basic idea of the protocols is to represent the database string by a compressed index [24, 27] and store the index as a lookup table. LPM and MEMs are found by at most $\ell$ and $2\ell$ table lookups respectively, where $\ell$ is the length of the query. More specifically, the table $V$ is referenced in a recursive manner; i.e., one needs to obtain $V[j]$, where $j=V[i]$, given i. To ensure security, we need to compute $V[j]$ without seeing any element of $V$. The key technical contribution of this study is an efficient protocol that achieves this type of recursive reference. We named the protocol secret-shared recursive oblivious transfer (ss-ROT). While the previous studies require $O(N)$ time complexity to ensure security, the time, communication, and round complexities of ss-ROT are all $O(\ell )$ for $\ell$ recursive table lookups, except for the preparation of the table and generation of shares before the query input. Since the entire protocols mainly consist of $\ell$ table lookups for LPM, and $2\ell$ table lookups and $2\ell$ inner product computations for LMEM, the search times for LPM and LMEM do not depend on the database size. In addition to the protocols based on ss-ROT, we developed a protocol to reduce data transfer size in the initial step by using a similar approach taken in ss-ROT. The protocol offers a reasonable trade-off between the amount of reduction in data transfer in the initial step and the increase in computational cost in the later step.

We implemented the proposed protocol and tested it on substrings of a human genome sequence $10^3$ to $10^7$ in length and confirmed that the actual CPU time and data transfer overhead were in good agreement with the theoretical complexities. We also found that the search time of our protocol was three orders of magnitude faster than that of the previous method [17, 19]. For conducting further performance analysis, we designed and implemented baseline protocols using major techniques of secret-sharing-based protocols. The results showed that the search times of our protocols were at least two orders of magnitude faster than those of the baseline protocols.

Preliminaries

Secure computation based on secret sharing

Here, we explain the 2-out-of-2 additive secret sharing ((2, 2)-SS) scheme and how to securely compute arithmetic/Boolean gates (Fig. 1).

Secret sharing and secure computation In t-out-of-n secret sharing (e.g., [28]), we split the secret value x into n pieces, and can reconstruct x by combining t or more pieces. We call the split pieces “shares”. The basic security notion for secret sharing is that we cannot obtain any information about x even if we gather $t-1$ or fewer shares. In this paper, we consider a case with $(t,n) = (2,2)$. A 2-out-of-2 secret sharing ((2, 2)-SS) scheme over $\mathbb {Z}_{2^n}$ consists of two algorithms: $\mathsf {Share}$ and $\mathsf {Reconst}$. $\mathsf {Share}$ takes as input $x \in \mathbb {Z}_{2^n}$ and outputs $([\![x]\!]_0, [\![x]\!]_1) \in \mathbb {Z}_{2^n}^2$, where the bracket notation $[\![x]\!]_i$ denotes the arithmetic share of the i-th party (for $i \in \{0,1\}$). We denote $[\![x]\!] = ([\![x]\!]_0,[\![x]\!]_1)$ as their shorthand. $\mathsf {Reconst}$ takes as inputs $[\![x]\!]_0$ and $[\![x]\!]_1$ and outputs x. For arithmetic sharing $[\![x]\!]_i$ and Boolean sharing $[\![x]\!]^B_i$, we consider power-of-two integers n (e.g., $n=16$) and $n=1$, respectively.

Depending on the secret sharing scheme, we can compute arithmetic/Boolean gates over shares; that is, we can execute some kind of processing related to x without x. This means it is possible to perform some computation without violating the privacy of the secret data, and is called secure (multi-party) computation. It is known that we can execute arbitrary computation by combining basic arithmetic/Boolean gates. In the following paragraphs, we show how to concretely compute these gates over shares.

Table 1 Secure subprotocols used in this paper


Semi-honest secure two-party computation based on (2, 2)-Additive SS We use a standard (2, 2)-additive SS scheme, defined by

$\mathsf {Share}(x):$ randomly choose $r \in \mathbb {Z}_{2^n}$ and let $[\![x]\!]_0 = r$ and $[\![x]\!]_1 = x - r$.
$\mathsf {Reconst}([\![x]\!]_0, [\![x]\!]_1):$ output $[\![x]\!]_0 + [\![x]\!]_1$.

Note that one of the shares of x ($[\![x]\!]_0$ or $[\![x]\!]_1$) does not reveal any information about x. In Fig. 1, the secret value $x = 2$ is split into $[\![x]\!]_0 = 4$ and $[\![x]\!]_1 = 6$. These are valid (2, 2)-additive shares because $4 + 6 \equiv 2 \pmod 8$ holds. Even if we can see $[\![x]\!]_0 = 4$, we cannot decide the value of x since we execute a split of x uniformly at random. This means, in Fig. 1, computing nodes $P_0$ and $P_1$ cannot obtain any information about x as long as these two nodes do not collude. On the other hand, we can compute arithmetic $\textsf {ADD}/\textsf {MULT}$ gates over shares as follows:

$[\![z]\!] \leftarrow \textsf {ADD}([\![x]\!], [\![y]\!])$ can be done locally by just adding each party’s share on x and on y. In Fig. 1 (left), we show an example of secure addition. $P_0/P_1$ obtain shares 6/7 by adding their two shares. In this process, $P_0/P_1$ cannot find they are computing $2+3$.
Multiplication is more complex than addition. There are various methods for multiplication over shares, most of which require communication between computing nodes. In this paper, we use the standard method for $[\![z]\!] \leftarrow \textsf {MULT}([\![x]\!], [\![y]\!])$ based on Beaver triples (BT) [29]. Such a triple consists of $\mathsf {bt}_0 = (a_0, b_0, c_0)$ and $\mathsf {bt}_1 = (a_1, b_1, c_1)$ such that $(a_0 + a_1)(b_0 + b_1) = (c_0 + c_1)$. Hereafter, a, b, and c denote $a_0 + a_1$, $b_0 + b_1$, and $c_0 + c_1$, respectively. We use these BTs as auxiliary inputs for computing $\textsf {MULT}$. Note that we can compute them in advance (i.e., in the offline phase) since they are independent of inputs $[\![x]\!]$ and $[\![y]\!]$. We adopt a trusted initializer setting (e.g., [30, 31]); that is, BTs are generated by a party other than the two computing nodes and then distributed. In the online phase of $\textsf {MULT}$, each i-th party $P_i$ ($i \in \{0,1\}$) can compute the multiplication share $[\![z]\!] = [\![xy]\!]$ as follows:

1) $P_i$ first computes $([\![x]\!]_i - a_i)$ and $([\![y]\!]_i - b_i)$, and sends them to $P_{1-i}$.
2) $P_i$ reconstructs $x'= x - a$ and $y' = y - b$.
3) $P_0$ computes $[\![z]\!]_0 = x'y' + x'b_0 + y'a_0 + c_0$, and $P_1$ computes $[\![z]\!]_1 =x'b_1 + y'a_1 + c_1$.

Here, $[\![z]\!]_0$ and $[\![z]\!]_1$ calculated with the above procedures are valid shares of xy; that is, $\mathsf {Reconst}([\![z]\!]_0, [\![z]\!]_1) = xy$. We shorten the notations and write the $\textsf {ADD}$ and $\textsf {MULT}$ protocols simply as $[\![x]\!] + [\![y]\!]$ and $[\![x]\!] \cdot [\![y]\!]$, respectively.
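The sharing scheme and the two arithmetic gates above can be illustrated with a minimal Python sketch. It simulates both parties in a single process, so it shows only the arithmetic, not the communication or the security properties. The ring size `N_BITS = 16`, the function names, and the in-process `beaver_triple` helper (standing in for the trusted initializer) are assumptions made for this illustration.

```python
import secrets

N_BITS = 16            # assumed ring size n; shares live in Z_{2^n}
MOD = 1 << N_BITS

def share(x):
    """Share(x): pick r uniformly, return ([x]_0, [x]_1) = (r, x - r)."""
    r = secrets.randbelow(MOD)
    return r, (x - r) % MOD

def reconst(s0, s1):
    """Reconst: add the two shares modulo 2^n."""
    return (s0 + s1) % MOD

def add(x, y):
    """ADD is local: each party adds its own shares of x and y."""
    return (x[0] + y[0]) % MOD, (x[1] + y[1]) % MOD

def beaver_triple():
    """Stand-in for the trusted initializer: shares of a, b, c with c = a*b."""
    a, b = secrets.randbelow(MOD), secrets.randbelow(MOD)
    a_sh, b_sh, c_sh = share(a), share(b), share((a * b) % MOD)
    return (a_sh[0], b_sh[0], c_sh[0]), (a_sh[1], b_sh[1], c_sh[1])

def mult(x, y):
    """MULT following steps 1)-3) above, both parties simulated locally."""
    bt = beaver_triple()
    # Step 1: P_i masks its shares with a_i and b_i and sends them to P_{1-i}.
    dx = [(x[i] - bt[i][0]) % MOD for i in (0, 1)]
    dy = [(y[i] - bt[i][1]) % MOD for i in (0, 1)]
    # Step 2: both parties reconstruct x' = x - a and y' = y - b.
    xp = (dx[0] + dx[1]) % MOD
    yp = (dy[0] + dy[1]) % MOD
    # Step 3: only P_0 adds the public product x'y'.
    (a0, b0, c0), (a1, b1, c1) = bt
    z0 = (xp * yp + xp * b0 + yp * a0 + c0) % MOD
    z1 = (xp * b1 + yp * a1 + c1) % MOD
    return z0, z1
```

For example, `reconst(*mult(share(123), share(45)))` returns 5535, while each individual share looks uniformly random.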

We also write $\textsf {ADD}(\textsf {ADD}([\![x_\mathrm{A}]\!], [\![x_\mathrm{B}]\!]), [\![x_\mathrm{C}]\!])$ as $\Sigma _{c \in \{ \mathrm{A}, \mathrm{B}, \mathrm{C} \}} [\![x_c]\!]$. Note that, similarly to the $\textsf {ADD}$ protocol, we can also locally compute multiplication by constant c, denoted by $c \cdot [\![x]\!]$. We can easily extend the above protocols to Boolean gates. By converting $+$ and − into $\oplus$ in the arithmetic $\textsf {ADD}$ and $\textsf {MULT}$ protocols, we can obtain the $\textsf {XOR}$ and $\textsf {AND}$ protocols, respectively. We can construct $\textsf {NOT}$ and $\textsf {OR}$ protocols from the properties of these gates. When we compute $\textsf {NOT}([\![x]\!]^B_0, [\![x]\!]^B_1)$, $P_0$ and $P_1$ output $\lnot [\![x]\!]^B_0$ and $[\![x]\!]^B_1$, respectively. When we compute $\textsf {OR}([\![x]\!]^B, [\![y]\!]^B)$, we compute $\lnot \textsf {AND}(\lnot [\![x]\!]^B, \lnot [\![y]\!]^B)$. We shorten the notations and write $\textsf {XOR}$, $\textsf {AND}$, $\textsf {NOT}$, and $\textsf {OR}$ simply as $[\![x]\!] \oplus [\![y]\!]$, $[\![x]\!] \wedge [\![y]\!]$, $\lnot [\![x]\!]$, and $[\![x]\!] \vee [\![y]\!]$, respectively. By combining the above gates, we can securely compute higher-level protocols. The functionalities of the secure subprotocols [15] used in this paper are shown in Table 1. Due to space limits, we omit the details of their construction. Note that we can compute $\mathsf {Choose}$ by $[\![z]\!] = [\![y]\!] + [\![e]\!] \cdot ([\![x]\!] - [\![y]\!])$. In this paper, we consider the standard simulation-based security notion in the presence of semi-honest adversaries (for 2PC), as in [32]. We show the definition in Appendix 2.
Roughly speaking, this security notion guarantees the privacy of the secret under the condition that computing nodes do not deviate from the protocol; that is, although computing nodes are allowed to execute arbitrary attacks locally, they do not (maliciously) manipulate transmission data to other parties. The building blocks we adopt in this paper satisfy this security notion. Moreover, as described in [32], the composition theorem for the semi-honest model holds; that is, any protocol is privately computed as long as its subroutines are privately computed.
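The Boolean gates described above can be sketched in the same style. The snippet below is a single-process illustration over one-bit shares ($n=1$). The names `share_b`, `and_`, and `or_` are hypothetical, and the Beaver triple for $\textsf{AND}$ is generated inline as a stand-in for the trusted initializer.

```python
import secrets

def share_b(x):
    """Boolean sharing of a bit x: ([x]^B_0, [x]^B_1) = (r, x XOR r)."""
    r = secrets.randbelow(2)
    return r, x ^ r

def reconst_b(s):
    return s[0] ^ s[1]

def xor(x, y):
    """XOR is local, like arithmetic ADD."""
    return x[0] ^ y[0], x[1] ^ y[1]

def not_(x):
    """NOT: P_0 flips its share, P_1 keeps its share unchanged."""
    return x[0] ^ 1, x[1]

def and_(x, y):
    """AND: the Beaver-triple MULT with +/- replaced by XOR."""
    a, b = secrets.randbelow(2), secrets.randbelow(2)
    a_sh, b_sh, c_sh = share_b(a), share_b(b), share_b(a & b)
    # Both parties open x' = x XOR a and y' = y XOR b.
    xp = (x[0] ^ a_sh[0]) ^ (x[1] ^ a_sh[1])
    yp = (y[0] ^ b_sh[0]) ^ (y[1] ^ b_sh[1])
    z0 = (xp & yp) ^ (xp & b_sh[0]) ^ (yp & a_sh[0]) ^ c_sh[0]
    z1 = (xp & b_sh[1]) ^ (yp & a_sh[1]) ^ c_sh[1]
    return z0, z1

def or_(x, y):
    """OR via De Morgan: NOT(AND(NOT x, NOT y))."""
    return not_(and_(not_(x), not_(y)))
```

Only `and_` (and therefore `or_`) would need a round of communication in a real two-party run; `xor` and `not_` are purely local.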

Index structure for string search

Notation and definition $\Sigma$ denotes a set of ordered symbols. A string consists of symbols in $\Sigma$. We denote a lexicographical order of two strings S and $S'$ by $S \le S'$ (e.g., A < C < G < T and AAA < AAC). We denote the i-th letter of a string S by S[i] and a substring starting from the i-th letter to the j-th letter by S[i, j]. The index starts with 0. The length of S is denoted by |S|. A reverse string of S (i.e., $S[|S|-1],\ldots ,S[0]$) is denoted by $\hat{S}$. We consider a direction from the i-th position to the j-th position as rightward if $i < j$ and leftward otherwise.

Given a query $w$ and a database S, we define the longest prefix that matches a database string (LPM) by $\max _{(0, j)}\{ j | w[0,\ldots ,j] = S[k,\ldots ,l] \}$, where $0 \le j < \ell$ and $0 \le k \le l< N$, and the longest maximal exact match (LMEM) by $\max _{(i, j)}\{ j-i | w[i,\ldots ,j] = S[k,\ldots ,l] \}$, where $0 \le i \le j < \ell$ and $0 \le k \le l< N$.

FM-Index and related data structures FM-Index [24] and related data structures [27] are widely used for genome sequence search. Given a query string $w$ of length $\ell$ and a database string S of length N, [24] enables LPM to be found in $O(\ell )$ time regardless of N, and it also enables LMEM to be found in $O(\ell )$ if auxiliary data structures are used [27]. Given all the suffixes of a string S: $S[0,\ldots ,|S|-1]$, $S[1,\ldots ,|S|-1], \ldots , S[|S|-1]$, a suffix array is an array of positions $(p_0, \ldots , p_{|S|-1})$ such that $S[p_0,\ldots ,|S|-1] \le S[p_1,\ldots ,|S|-1] \le S[p_2,\ldots , |S|-1], \ldots , \le S[p_{|S|-1},\ldots , |S|-1]$. We denote the suffix array of S by SA and denote its i-th element by SA[i]. A Burrows-Wheeler transform (BWT) is a permutation of the sequence S such that its i-th letter becomes $S[SA[i] - 1]$. We denote a BWT of S by L and denote its i-th letter by L[i]. Let us define a rank of S for a letter $c\in \Sigma$ at position t by $\mathsf {Rank}_{c}(t,S) = |\{ j | S[j]=c, 0\le j < t \} |$ and a count of occurrences of letters that are lexicographically smaller than c in S by $\mathsf {CF}_{c}(S) = \sum _{r < c} \mathsf {Rank}_{r}(|S|,S)$, and the operation $\mathsf {LF}_{c}(i, S) = \mathsf {CF}_{c}(L) + \mathsf {Rank}_{c}(i, L)$. The match between $w$ and S is reported as a form of left-closed and right-open interval on SA, and the lower and upper bounds of the interval are respectively computed by $\mathsf {LF}$. Given a letter c and an interval [f, g) that corresponds to suffixes that share the prefix x (i.e., [f, g) reports the locations of the substring x in S), we can find a new interval that corresponds to all suffixes that share the prefix cx (i.e., locations of the substring cx) by

$$\begin{aligned}{}[f', g') = [\mathsf {LF}_c(f, S), \mathsf {LF}_c(g, S) ). \end{aligned}$$

(1)

The leftward extension of the match is called a backward search, which is the main functionality of FM-Index. By starting the search with the initial interval [0, N) and conducting the backward searches for $w[\ell -1], w[\ell -2], \ldots$, the longest suffix match is detected when $f=g$. $\mathsf {Rank}$ and $\mathsf {CF}$ are precomputed and stored in an efficient form that can be searched in constant time. Therefore, the longest suffix match can be computed in $O(\ell )$ time. LPM is found if the search is conducted on $\hat{S}$ and the match is extended by $w[0], w[1], \ldots , w[\ell -1]$.
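The backward search of Eq. 1 can be sketched in plain (non-secure) Python as follows. This is a naive illustration under stated assumptions: a sentinel `$` smaller than every letter is appended so that the suffix order and the wrap-around in $S[SA[i]-1]$ are well defined, the suffix array is built by direct sorting, and `rank` and `cf` are recomputed on every call instead of being precomputed for constant-time access.

```python
def build_fm(s):
    """Return the suffix array and BWT of s + '$' (sentinel is an assumption)."""
    s = s + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)   # L[i] = S[SA[i] - 1], cyclic at SA[i] = 0
    return sa, bwt

def rank(c, t, L):
    """Rank_c(t, L): occurrences of c in L[0 .. t-1]."""
    return L[:t].count(c)

def cf(c, L):
    """CF_c(L): number of letters in L that are smaller than c."""
    return sum(1 for x in L if x < c)

def lf(c, i, L):
    """LF_c(i) = CF_c(L) + Rank_c(i, L)."""
    return cf(c, L) + rank(c, i, L)

def longest_suffix_match(w, L):
    """Extend the match leftward from the last letter of w until f == g."""
    f, g = 0, len(L)
    matched = 0
    for c in reversed(w):
        f_new, g_new = lf(c, f, L), lf(c, g, L)
        if f_new == g_new:        # empty interval: the extension failed
            break
        f, g = f_new, g_new
        matched += 1
    return matched, (f, g)
```

With `sa, bwt = build_fm("GATTACA")`, the call `longest_suffix_match("CTACA", bwt)` reports a match of length 4 (the suffix TACA) on the interval [6, 7), and `sa[6]` gives its start position 3 in the database.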

Searching LMEM by repeating LPM for $w[0, \ldots , \ell -1], w[1,\ldots , \ell -1], w[2,\ldots , \ell -1], \ldots , w[\ell -1]$ takes $O(\ell ^2)$ time. We can improve it to $O(\ell )$ time by using the longest common prefix (LCP) array and related data structures [27]. The LCP array, denoted by $\mathsf {LCP}$, is an array that stores the length of the longest common prefix of $S[\mathsf {SA}[i-1] , |S|-1]$ and $S[\mathsf {SA}[i] , |S|-1]$ in $\mathsf {LCP}[i]$ for $0 < i \le N$. The lcp-interval [i, j) of lcp-value d is an interval such that it satisfies $\mathsf {LCP}[i]<d$, $\mathsf {LCP}[j]<d$, $\mathsf {LCP}[k] \ge d$ for all $k\in \{i+1,\ldots , j-1\}$, and $\mathsf {LCP}[k]=d$ for at least one $k\in \{i+1,\ldots , j-1\}$, and is denoted by $d-[i,j)$. $d-[i,j)$ corresponds to all the suffixes that share the prefix $S[SA[i],\ldots , SA[i]+d-1]$. The parent interval of $d-[i,j)$ is the lcp-interval $h-[m, n)$ such that $h<d$ and $0 \le m \le i< j \le n <N$, and there is no other lcp-interval $t-[r, s)$ such that $h<t<d$ and $0 \le m \le r \le i< j \le s \le n <N$. The parent of the lcp-interval [f, g) can be found by

$$\begin{aligned}{}[f', g')= {\left\{ \begin{array}{ll} [\mathsf {PSV}[f_i], \mathsf {NSV}[f_i]) &{} \mathsf {LCP}[g_i] \le \mathsf {LCP}[f_i] \\ {[}\mathsf {PSV}[g_i], \mathsf {NSV}[g_i]) &{} (otherwise), \end{array}\right. } \end{aligned}$$

(2)

where $\mathsf {PSV}[i] = \max \{ j | 0 \le j< i \wedge \mathsf {LCP}[j] < \mathsf {LCP}[i] \}$ and $\mathsf {NSV}[i] = \min \{ j | i \le j< N \wedge \mathsf {LCP}[j] < \mathsf {LCP}[i] \}$. By finding a parent interval using $\mathsf {PSV}$ and $\mathsf {NSV}$ whenever it fails to extend the match, we can avoid useless backward searches, and thus LMEM is found in at most $2\ell$ backward searches. $\mathsf {LCP}$, $\mathsf {PSV}$ and $\mathsf {NSV}$ are precomputed and stored in an efficient form that can be searched in constant time, so we can find LMEM in $O(\ell )$ time. See section 5.2 of [27] for more details of the data structures. Examples of the search by FM-Index, $\mathsf {LCP}$, $\mathsf {PSV}$, and $\mathsf {NSV}$ are provided in Appendix 1.
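The parent-interval computation of Eq. 2 can be illustrated with a small, non-secure Python sketch. Several conventions here are assumptions rather than the paper's exact choices: the input string already ends with a `$` sentinel, the LCP array has one entry per suffix-array boundary with $\mathsf{LCP}[0]=\mathsf{LCP}[n]=-1$ as sentinels (where $n$ is the length including `$`), $\mathsf{NSV}$ searches strictly to the right of $i$, and all arrays are computed naively rather than in constant time per query.

```python
def suffix_array(s):
    """Naive suffix array of s (s is assumed to end with the sentinel '$')."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """LCP[i] = common prefix length of suffixes SA[i-1] and SA[i]; -1 at both ends."""
    def common(a, b):
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        return k
    return [-1] + [common(sa[i - 1], sa[i]) for i in range(1, len(sa))] + [-1]

def psv(lcp, i):
    """Previous smaller value: largest j < i with LCP[j] < LCP[i]."""
    return max(j for j in range(i) if lcp[j] < lcp[i])

def nsv(lcp, i):
    """Next smaller value: smallest j > i with LCP[j] < LCP[i]."""
    return min(j for j in range(i + 1, len(lcp)) if lcp[j] < lcp[i])

def parent(lcp, f, g):
    """Parent of the lcp-interval [f, g), following Eq. 2 (not for the root)."""
    k = f if lcp[g] <= lcp[f] else g
    return psv(lcp, k), nsv(lcp, k)
```

For `s = "GATTACA$"`, the interval [6, 7) of suffixes starting with TA has parent [6, 8) (all suffixes starting with T), and [1, 4) for A has the root interval [0, 8) as its parent.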

Table 2 Summary of complexities for our protocols and related protocols


Proposed protocols

Problem setting and outline of our protocols

We assume that a query holder $\mathcal {A}$, a database holder $\mathcal {B}$, and two computing nodes $P_0$ and $P_1$ participate in the protocol. $\mathcal {A}$ holds a query string $w$ of length $\ell$ and $\mathcal {B}$ holds a database string $T$ of length $N$. After the protocol is run, only $\mathcal {A}$ knows LPM or LMEM between $w$ and $T$. $P_0$ and $P_1$ do not obtain any information about $w$ and $T$, except for $\ell$ and $N$.

Our protocol consists of offline, DB preparation, and Search phases. In the offline phase, $\mathcal {B}$ generates BTs (correlated randomness used for multiplication) and sends them to $P_0$ and $P_1$. In the DB preparation phase, $\mathcal {B}$ creates a lookup table and distributes its shares to $P_0$ and $P_1$. In the Search phase, $\mathcal {A}$ generates shares of the query and sends them to $P_0$ and $P_1$, and $P_0$ and $P_1$ jointly compute the result without obtaining any information of the lookup table. Finally, $\mathcal {A}$ obtains the results. Figure 2 shows the schematic view of our goal and model. Note that the offline and DB preparation phases do not depend on a query string, so they can be computed in advance for multiple queries.

In section "Secret-shared recursive oblivious transfer", we propose the important building block ss-ROT that enables recursive reference to a lookup table. In section "Secure LPM", we describe how to design the lookup table based on FM-Index, and propose an efficient protocol for LPM by using the lookup table and ss-ROT. In section "Secure LMEM", we describe the additional table design for auxiliary data structures, and propose the complete protocol for LMEM. Table 2 summarizes the theoretical complexities of the three protocols. For comparison, the complexities of the baseline protocols and a previous method for LPM based on an additive homomorphic encryption [17, 19] are shown. As we mentioned in section "Introduction", the baseline protocols are designed using major techniques of secret-sharing-based protocols. The detailed algorithms are described in Appendix 3.

Secret-shared recursive oblivious transfer

We define a problem called a secret-shared recursive oblivious transfer (ss-ROT) as follows.

Definition 1

We assume a database holder $\mathcal {B}$ and two computing nodes $P_0$ and $P_1$ participate in the protocol. $\mathcal {B}$ holds a vector V of length $N$ such that $0 \le V[i] < N$. Given the initial position $p_0$ and the depth of recursion $\ell$ $(2 \le \ell )$, the secret-shared recursive oblivious transfer protocol outputs shares of

$$\begin{aligned} \underbrace{V[V[\cdots V}_{\ell }[ p_0 ]\cdots ]] \end{aligned}$$

(3)

without leaking V to $P_0$ and $P_1$.

For simplicity, we denote the recursion of Eq. 3 by $V^{(\ell )}[p_0]$ (e.g., $V[V[p_0]]$ is denoted by $V^{(2)}[p_0]$). In our protocol, all the random values are uniformly generated from $\mathbb {Z}_{2^n}$.

DB preparation phase $\mathcal {B}$ generates $\ell -1$ random values $r^0,\ldots ,r^{\ell -2}$ and computes the following vectors $R^0 , \ldots , R^{\ell -1}$. Each vector $R^j$ has $N$ elements.

$$\begin{aligned} R^j[i] = {\left\{ \begin{array}{ll} (V[i]+r^j)_{\bmod {N}} &{} (j=0) \\ (V[(i-r^{j-1})_{\bmod {N}}]+r^j)_{\bmod {N}} &{} (1\le j\le \ell - 2) \\ (V[(i-r^{j-1})_{\bmod {N}}])_{\bmod {N}} &{} (j= \ell -1) \\ \end{array}\right. } \end{aligned}$$

(4)

$\mathcal {B}$ computes $\mathsf {Share}(R^j[i])$ and sends $[\![R^j[i]]\!]_0$ and $[\![R^j[i]]\!]_1$ to $P_0$ and $P_1$, for $i=0, \ldots , N-1$ and $j=0,\ldots ,\ell -1$.

Search phase The Search phase consists of two steps and is described in Lines 2–5 of Protocol 1. The input is the initial position $p_0$ and shares of R. The output is $[\![V^{(\ell )}[p_0] ]\!]$. An example of a search is illustrated in Fig. 3.
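The two phases can be simulated in plain Python to check how the masked tables of Eq. 4 chain together. This is a single-process sketch, not Protocol 1 itself: the Search loop is inferred from the description in the text (reconstruct $p_{j+1} = R^j[p_j]$ for $j=0,\ldots,\ell-2$, then output the shares of $R^{\ell-1}[p_{\ell-1}]$). As an assumption of this sketch, the offsets $r^j$ are drawn from $\mathbb{Z}_N$, and the ring size `MOD` is chosen so that $N \le 2^n$.

```python
import secrets

MOD = 1 << 16          # assumed share ring Z_{2^n}; requires N <= 2^n

def share(x):
    r = secrets.randbelow(MOD)
    return r, (x - r) % MOD

def reconst(s0, s1):
    return (s0 + s1) % MOD

def db_preparation(V, depth):
    """Build R^0..R^{depth-1} per Eq. 4 and return the two parties' share tables."""
    N = len(V)
    r = [secrets.randbelow(N) for _ in range(depth - 1)]   # r^0 .. r^{depth-2}
    R0, R1 = [], []
    for j in range(depth):
        row0, row1 = [], []
        for i in range(N):
            src = i if j == 0 else (i - r[j - 1]) % N   # undo the previous mask
            mask = r[j] if j < depth - 1 else 0         # last table is unmasked
            s0, s1 = share((V[src] + mask) % N)
            row0.append(s0)
            row1.append(s1)
        R0.append(row0)
        R1.append(row1)
    return R0, R1

def search(R0, R1, p0, depth):
    """Step 1: open the masked positions; Step 2: return shares of the result."""
    p = p0
    for j in range(depth - 1):
        p = reconst(R0[j][p], R1[j][p])     # p_{j+1}, randomized by r^j
    return R0[depth - 1][p], R1[depth - 1][p]
```

For `V = [3, 0, 4, 1, 2]`, `p0 = 1`, and depth 3, reconstructing the output of `search` gives `V[V[V[1]]] = 1`, and the number of operations in `search` depends only on the depth, not on `len(V)`.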

Security intuition

In the DB preparation phase of ss-ROT, $\mathcal {B}$ does not disclose any private values, and $P_0$ and $P_1$ receive the shares. In the Search phase, all the messages exchanged between $P_0$ and $P_1$ are shares except for the result of $\mathsf {Reconst}$ in Step 1. In the j-th step of the loop in Step 1, $p_{j+1} = R^j[p_{j}] = (V^{(j+1)}[p_0]+r^{j})_{\bmod {N}}$ is reconstructed. Since the reconstructed value is randomized by $r^{j}$, no information is leaked. Note that for each vector $R^j$, all the elements $R^j[0], \ldots , R^j[N-1]$ are randomized by the same value $r^{j}$, but only one of them is reconstructed, and different random numbers $r^{0}, \ldots , r^{\ell -2}$ are used for $R^0, \ldots , R^{\ell -2}$. In Step 2, $P_0$ and $P_1$ output a result, and no information other than the result is leaked.

Security

Theorem 1

ss-ROT is correct and secure in the semi-honest model.

Proof

Correctness and security of ss-ROT protocol are proved as follows.

Correctness. We assume the following equation.

$$\begin{aligned} p_{i} = (V^{(i)}[p_0]+r^{i-1})_{\bmod {N}} \end{aligned}$$

(5)

In Step 1, for $j=0$, the protocol computes $p_{1}$ by reconstructing $R^0[p_0]$. From the definition of $R^j[i]$ in Eq. 4,

$$\begin{aligned} p_{1} = R^0[p_0] = (V^{(1)}[p_0]+r^0)_{\bmod {N}}. \end{aligned}$$

(6)

For $j=k$, the protocol computes $p_{k+1}$ by reconstructing $R^{k}[p_k]$. From the definition of $R^j[i]$ in Eq. 4 and the assumption of Eq. 5,

$$\begin{aligned} p_{k+1} = R^{k}[p_k]= & {} (V[\, (p_k - r^{k-1})_{\bmod {N}} \,]+r^k)_{\bmod {N}} \nonumber \\= & {} (V[\, V^{(k)}[p_0] \,] +r^k)_{\bmod {N}} \nonumber \\= & {} (V^{(k+1)}[p_0]+r^k)_{\bmod {N}}. \end{aligned}$$

(7)

Eq. 5 holds for $i=1$ by Eq. 6. It also holds for $i=k+1$ under the assumption that Eq. 5 holds for $i=k$. Therefore by induction, Eq. 5 holds for $i=1,\ldots , \ell -1$.

In Step 2, $P_0$ and $P_1$ output $[\![R^{\ell -1}[p_{\ell -1}]]\!]$. Since Eq. 5 holds for $i=\ell -1$,

$$\begin{aligned} R^{\ell -1}[p_{\ell -1}] = (V[( p_{\ell -1} -r^{\ell -2})_{\bmod {N}}])_{\bmod {N}} \end{aligned}$$

is transformed into $(V^{(\ell )}[p_0])_{\bmod {N}}$ by plugging in $p_{\ell - 1} = V^{(\ell -1)}[p_0]+r^{\ell -2}$. Therefore the final output of ss-ROT becomes $(V^{(\ell )}[p_0])_{\bmod {N}}$. The above argument completes the proof of correctness of Theorem 1.

Security. Since the roles of $P_0$ and $P_1$ are symmetric, it is sufficient to consider the case when $P_0$ is corrupted. The input to $P_0$ is $p_0$ and $\ell$, and the output of $P_0$ is $V^{(\ell )}[p_0]$. The function achieved by Protocol 1 is deterministic and the protocol is correct. Therefore, to ensure the security of Protocol 1, we need to prove the existence of a probabilistic polynomial-time simulator ${\mathcal {S}}$ such that

$$\begin{aligned} \{(\mathcal {S}(p_0, \ell , V^{(\ell )}[p_0]), V^{(\ell )}[p_0])\} \equiv \{(X, V^{(\ell )}[p_0])\}, \end{aligned}$$

(8)

where X is $P_0$’s view. X consists of:

$[\![ R^j[i] ]\!]_0$ for $i=0,\ldots ,N-1$ and $j=0,\ldots ,\ell -1$ (a message from $\mathcal {B}$)
$[\![ R^j[p_j] ]\!]_1$ (j-th message from $P_1$) for $j=0,\ldots ,\ell -1$
$p_j$ (j-th value obtained by $\mathsf {Reconst}([\![R^j[p_{j}]]\!]_0, [\![R^j[p_{j}]]\!]_1)$ in Step 1) for $j=1,\ldots ,\ell -1$.

All the messages from $\mathcal {B}$ and $P_1$ are uniformly random in $\mathbb {Z}_{2^n}$, as they are generated by $\mathsf {Share}$. $p_{j+1} = \mathsf {Reconst}([\![R^j[p_{j}]]\!]_0, [\![R^j[p_{j}]]\!]_1)$ holds for $j=0,\ldots ,\ell -2$, and $V^{(\ell )}[p_0] = \mathsf {Reconst}([\![R^{\ell -1}[p_{\ell -1}]]\!]_0, [\![R^{\ell -1}[p_{\ell -1}]]\!]_1)$ holds. $p_1=R^{0}[p_{0}],\; p_2=R^{1}[p_{1}], \ldots , p_{\ell -1}=R^{\ell -2}[p_{\ell -2}]$ are uniformly random in $\mathbb {Z}_{N}$ from the definition of Eq. 4.

Let us denote a random number u chosen from a set ${\mathcal {U}}$ uniformly at random by $u{\mathop {\in }\limits ^{R}} {\mathcal {U}}$. We construct ${\mathcal {S}}$ as described in Protocol 2. The output of ${\mathcal {S}}$ is $\tilde{R_0} \in \mathbb {Z}_{2^n}^{\ell \times N}$, $\tilde{R_1} \in \mathbb {Z}_{2^n}^{\ell }$, and $\tilde{p_1},\ldots ,\tilde{p}_{\ell -1}$. In Line 6 and Line 9, $\tilde{p_1},\ldots ,\tilde{p}_{\ell -1}$ are generated such that they are uniformly random in $\mathbb {Z}_{N}$. In Line 7, $\tilde{R_0}^{0}[{p_0}]$ and $\tilde{R_1}[0]$ are generated by $\mathsf {Share}$ such that they are shares of $\tilde{p}_1$ and uniformly random in $\mathbb {Z}_{2^n}$. In Line 10, $\tilde{R_0}^{j}[{\tilde{p}_j}]$ and $\tilde{R_1}[j]$ are generated by $\mathsf {Share}$ such that they are shares of $\tilde{p}_{j+1}$ and uniformly random in $\mathbb {Z}_{2^n}$ for $j=1,\ldots ,\ell -2$. In Line 12, $\tilde{R_0}^{\ell -1}[\tilde{p}_{\ell -1}]$ and $\tilde{R_1}[\ell -1]$ are generated by $\mathsf {Share}$ such that they are shares of $V^{(\ell )}[p_0]$ and uniformly random in $\mathbb {Z}_{2^n}$. All the elements of $\tilde{R_0}$ except for $\tilde{R_0}^{0}[p_0]$ and $\tilde{R_0}^{j}[\tilde{p}_j]$ ($j=1,\ldots ,\ell -1$) are uniformly random in $\mathbb {Z}_{2^n}$ by Line 3. Therefore, Eq. 8 holds. By the above discussion, we find our ss-ROT satisfies security in the semi-honest model. $\square$

Complexities

In the DB preparation phase, $\mathcal {B}$ generates shares of $\ell$ vectors, each of length $N$. Therefore, the time and communication complexities are $O(\ell N)$. For the Search phase, $\mathsf {Reconst}$ is computed $\ell$ times in Step 1. Since the time, communication, and round complexities of $\mathsf {Reconst}$ are O(1), those of the Search phase become $O(\ell )$.

Secure LPM

Construction of lookup table The goal is to find LPM securely. To apply FM-Index for a prefix search, the reverse string of $T$ (i.e., $\hat{T}$) is used. The backward search of FM-Index is formulated by Eq. 1. If we precompute $\mathsf {LF}_c(i, \hat{T})$ for $i=0, \ldots , N$ and $c \in \{$A,T,G,C$\}$, and store them in a lookup table that consists of four vectors: $V_\mathrm{A}$, $V_\mathrm{C}$, $V_\mathrm{G}$, and $V_\mathrm{T}$ such that $V_c[i] = \mathsf {LF}_c(i, \hat{T})$, Eq. 1 is replaced by the following table lookup

$$\begin{aligned} f_{k+1}=V_{w[k]}[f_k],\qquad g_{k+1}=V_{w[k]}[g_k]. \end{aligned}$$

(9)

That is, starting with the initial interval $[f_0 = 0, g_0 = N)$, we can compute the match by recursively referring to the lookup table while $f<g$.
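The table construction and the recursive lookup of Eq. 9 can be illustrated without any cryptography. The sketch below is a plaintext illustration with assumed conventions: a `$` sentinel is appended to the reversed database, so the BWT has length `n = len(T) + 1` and each vector has `n + 1` entries, which may differ by one from the paper's $N'$. The names `build_tables` and `lpm_length` are hypothetical.

```python
def build_tables(t):
    """Precompute V_c[i] = LF_c(i, reversed t) for c in ACGT and i = 0..n."""
    s = t[::-1] + "$"                       # sentinel is an assumption
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    L = [s[i - 1] for i in sa]              # BWT of the reversed database
    n = len(L)
    tables = {}
    for c in "ACGT":
        cf = sum(1 for x in L if x < c)     # CF_c
        col, seen = [], 0
        for i in range(n + 1):
            col.append(cf + seen)           # CF_c + Rank_c(i, L)
            if i < n and L[i] == c:
                seen += 1
        tables[c] = col
    return tables, n

def lpm_length(w, tables, n):
    """Length of the longest prefix of w that occurs in the database."""
    f, g = 0, n
    for k, c in enumerate(w):
        f, g = tables[c][f], tables[c][g]   # Eq. 9: two table lookups per letter
        if f == g:                          # w[0..k] no longer matches
            return k
    return len(w)
```

With `tables, n = build_tables("GATTACA")`, `lpm_length("TTAG", tables, n)` returns 3 because TTA occurs in the database but TTAG does not. Each query letter costs exactly two lookups, which is the access pattern that ss-ROT is designed to hide.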

Protocol overview The key idea of Secure LPM is to refer to V by ss-ROT, i.e., $P_0$ and $P_1$ jointly refer to V $\ell$ times in a recursive manner. To achieve backward search, $P_0$ and $P_1$ need to select $V_x[\cdot ]$ for each reference, where x is a query letter to be searched with. This is achieved by expressing the query letter by unary code (Eq. 11) and computing the inner product of Eq. 11 and $(V_\mathrm{A}[\cdot ], V_\mathrm{C}[\cdot ], V_\mathrm{G}[\cdot ], V_\mathrm{T}[\cdot ])$. To find LPM, $P_0$ and $P_1$ need to check $f=g$ for each reference. We use the subprotocol $\mathsf {Equality}$ to check it securely. Since V is randomized with different numbers for searching f and g, the difference of the random numbers is precomputed and removed securely upon the equality check. $\mathcal {A}$ receives only the result of each equality check to know LPM. For example, LPM is the prefix of length $i-1$ when $f=g$ for the i-th reference. If $f \ne g$ for all references, LPM is the entire query.

DB preparation phase $\mathcal {B}$ creates a lookup table and generates the following $4\ell$ vectors in a similar manner to ss-ROT. For simplicity, we denote the length of $V_c$ by $N'=N+1$.

$$\begin{aligned} R_{c,f}^j[i]= {\left\{ \begin{array}{ll} (V_c[i]+r_f^j)_{\bmod {N'}} &{} (j=0)\\ (V_c[(i-r_{f}^{j-1})_{\bmod {N'}}]+r_f^j)_{\bmod {N'}} &{} (1\le j <\ell ) \end{array}\right. } \end{aligned}$$

(10)

$R_{c,f}^j[i]$ is used for computing the lower bound f of the interval [f, g). We also generate $R_{c,g}^j[i]$ for the upper bound g. R consists of $8\ell$ vectors, each of length $N'$. Since the longest match is found when $f=g$, $\mathcal {B}$ also generates a vector $r'[j]=(r_f^j-r_g^j)_{\bmod {N'}}$ that is used for equality check of f and g. Then, $\mathcal {B}$ sends shares of $R_{c,f}^j[i]$, $R_{c,g}^j[i]$, and $r'[j]$ to $P_0$ and $P_1$.

Search phase Protocol 3 describes the algorithm in detail. $\mathcal {A}$ generates four vectors ${q}_\mathrm{A}$, ${q}_\mathrm{C}$, ${q}_\mathrm{G}$, ${q}_\mathrm{T}$, each of length $\ell$, as follows.

$$\begin{aligned} q_c[j] = {\left\{ \begin{array}{ll} 1 &{} (c=w[j])\\ 0 &{} (c\ne w[j]) \end{array}\right. } \end{aligned}$$

(11)

For each j, $(q_\mathrm{A}[j], q_\mathrm{C}[j], q_\mathrm{G}[j], q_\mathrm{T}[j])$ encodes $w[j]$ (e.g., $(q_\mathrm{A}[j], q_\mathrm{C}[j], q_\mathrm{G}[j], q_\mathrm{T}[j])=(1,0,0,0)$ if $w[j]=\mathrm{A}$). The aim of the encoding is to compute $[\![ R_x[j] ]\!] = [\![ \sum _{c\in \Sigma } q_c[j] \cdot R_c[j] ]\!]$ when $w[j]=x$. Figure 4 illustrates an example of the table lookup.
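The effect of the unary encoding can be seen in a plaintext sketch: the inner product with the one-hot vector picks out the table entry for the query letter without branching on the letter. In the protocol this product is evaluated on shares with $\textsf{MULT}$; the snippet below works on clear values only, and the names `encode_query` and `select` as well as the toy `rows` are hypothetical.

```python
SIGMA = "ACGT"

def encode_query(w):
    """Eq. 11: q_c[j] = 1 if w[j] == c, else 0, for every letter c."""
    return {c: [1 if x == c else 0 for x in w] for c in SIGMA}

def select(q, j, rows, i):
    """Inner product of the one-hot code of w[j] with (R_A[i], R_C[i], R_G[i], R_T[i])."""
    return sum(q[c][j] * rows[c][i] for c in SIGMA)

# Toy tables with two positions per letter (values chosen only for illustration).
rows = {"A": [10, 11], "C": [20, 21], "G": [30, 31], "T": [40, 41]}
q = encode_query("GAT")
picked = select(q, 0, rows, 1)    # w[0] = G, so this picks rows["G"][1]
```

Because all four products are always computed, the sequence of operations is the same for every query letter.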

$\mathcal {A}$ generates shares of ${q}_\mathrm{A}$, ${q}_\mathrm{C}$, ${q}_\mathrm{G}$, ${q}_\mathrm{T}$ and distributes them to $P_0$ and $P_1$. $P_0$ and $P_1$ compute $\mathsf {LF}_{w[j]}(f',\hat{T}) + r^j_f$ and $\mathsf {LF}_{w[j]}(g',\hat{T}) + r^j_g$ in Lines 5–8 without leaking $f'$ and $g'$, where $[f', g')$ corresponds to the match of w[0, j] and $\hat{T}$. In Lines 10–13, the equality of $f'$ and $g'$ is examined for all rounds. Note that different values $r^{j-1}_f$ and $r^{j-1}_g$ are used for $f_j = (f'-r^{j-1}_f)_{\bmod {N'}}$ and $g_j = (g'-r^{j-1}_g)_{\bmod {N'}}$ in order to conceal $f'$ and $g'$. Since $f'$, $g'$, $r^{j-1}_f$, $r^{j-1}_g, r'[j-1]\in \{0,\ldots ,N'-1\}$, it is sufficient to check if $f_j - g_j - r'[j-1]$ is equal to either one of $-N', 0,$ and $N'$. In Lines 16–18, $\mathcal {A}$ receives all the results of equality checks (i.e., $[\![o[1]]\!]^B ,\ldots ,[\![o[\ell ]]\!]^B$) from $P_0$ and $P_1$, and knows LPM by reconstructing them. For example, if $w=$GCT and $o=(0,0,1)$, $\mathcal {A}$ knows that LPM is GC.
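The reason three candidate values suffice in the equality check can be verified exhaustively for a small table length. The sketch below assumes the masking convention of Eq. 10, in which a revealed index equals the true bound plus its random offset modulo $N'$; the sign convention inside Protocol 3 may differ, and in the protocol the comparison is carried out on shares by $\mathsf{Equality}$ rather than on clear values. All function names are hypothetical.

```python
def masked_equal(F, G, r_diff, n_prime):
    """True when F - G - r' is one of -N', 0, N' (the test described above)."""
    return (F - G - r_diff) in (-n_prime, 0, n_prime)

def exhaustive_check(n_prime):
    """Check that the masked test agrees with f == g for all inputs and offsets."""
    for f in range(n_prime):
        for g in range(n_prime):
            for r_f in range(n_prime):
                for r_g in range(n_prime):
                    F = (f + r_f) % n_prime          # masked lower bound
                    G = (g + r_g) % n_prime          # masked upper bound
                    r_diff = (r_f - r_g) % n_prime   # r'[j], precomputed by B
                    if masked_equal(F, G, r_diff, n_prime) != (f == g):
                        return False
    return True
```

Since `F - G - r_diff` is congruent to `f - g` modulo $N'$ and all three terms lie in $\{0,\ldots ,N'-1\}$, the difference can only be a multiple of $N'$ when `f == g`, which is what `exhaustive_check` confirms for small $N'$.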