Terminal Embeddings in Sublinear Time

Elkin, Filtser, and Neiman (2017) recently introduced the concept of a {\it terminal embedding} from one metric space $(X,d_X)$ to another $(Y,d_Y)$ with a set of designated terminals $T\subset X$. Such an embedding $f$ is said to have distortion $\rho\ge 1$ if $\rho$ is the smallest value such that there exists a constant $C>0$ satisfying \begin{equation*} \forall x\in T\ \forall q\in X,\ C d_X(x, q) \le d_Y(f(x), f(q)) \le C \rho d_X(x, q) . \end{equation*} When $X,Y$ are both Euclidean metrics with $Y$ being $m$-dimensional, Narayanan and Nelson (2019), following work of Mahabadi, Makarychev, Makarychev, and Razenshteyn (2018), recently showed that distortion $1+\epsilon$ is achievable via such a terminal embedding with $m = O(\epsilon^{-2}\log n)$ for $n := |T|$. This generalizes the Johnson-Lindenstrauss lemma, which only preserves distances within $T$ and not to $T$ from the rest of space. The downside of prior work is that evaluating their embedding on some $q\in \mathbb{R}^d$ required solving a semidefinite program with $\Theta(n)$ constraints in~$m$ variables and thus required some superlinear $\mathrm{poly}(n)$ runtime. Our main contribution in this work is to give a new data structure for computing terminal embeddings. We show how to pre-process $T$ to obtain an almost linear-space data structure that supports computing the terminal embedding image of any $q\in\mathbb{R}^d$ in sublinear time $O^* (n^{1-\Theta(\epsilon^2)} + d)$. To accomplish this, we leverage tools developed in the context of approximate nearest neighbor search.


Introduction
Definition 2.2 (Outer Extension). Given $X \subset \mathbb{R}^d$ and $f : \mathbb{R}^d \to \mathbb{R}^k$, we say that $g : \mathbb{R}^d \to \mathbb{R}^{k'}$ for $k' > k$ is an outer extension of $f$ on $X$ if $\forall x \in X: g(x) = (f(x), 0, \ldots, 0)$.
All previous approaches [EFN17, MMMR18, NN19], as well as ours, construct terminal embeddings by extending a standard distance-preserving embedding on $X$ by a single coordinate. Therefore, in all subsequent discussions, we restrict ourselves to the case $k' = k + 1$. However, [NN19] require the stronger property that the distance-preserving embedding being extended satisfies $\varepsilon$-convex hull distortion, allowing them to obtain the optimal embedding dimension of $O(\varepsilon^{-2} \log n)$:

Definition 2.3. Given $X = \{x_i\}_{i=1}^n \subset \mathbb{R}^d$ and $\varepsilon > 0$, we say that a matrix $\Pi \in \mathbb{R}^{k \times d}$ satisfies $\varepsilon$-convex hull distortion for $X$ if $\forall z \in \mathrm{Conv}(T): \left| \|\Pi z\| - \|z\| \right| \le \varepsilon$, where $T = \left\{ \frac{x - y}{\|x - y\|} : x, y \in X \right\}$.
Furthermore, [NN19] show that a matrix with i.i.d. subgaussian entries satisfies this property with high probability. We now formally describe the construction analyzed in [NN19]. Given query $q$ and $\Pi \in \mathbb{R}^{k \times d}$ satisfying $\varepsilon$-convex hull distortion for $X$, they construct a terminal embedding for $q$ by first finding $v \in \mathbb{R}^k$ satisfying the following constraints:
\begin{align*}
&\|v - \Pi\hat{x}\| \le (1 + \varepsilon)\|q - \hat{x}\| \\
&\forall x \in X:\ |\langle v - \Pi\hat{x}, \Pi(x - \hat{x})\rangle - \langle q - \hat{x}, x - \hat{x}\rangle| \le \varepsilon \|q - \hat{x}\| \|x - \hat{x}\|, \tag{Prog}
\end{align*}
where $\hat{x} = \arg\min_{x \in X} \|x - q\|$; that is, the closest neighbor of $q$ in $X$. It is shown in [NN19], building upon [MMMR18], that such a point indeed exists and, furthermore, the above set of constraints is convex, implying the existence of polynomial-time algorithms to find $v$. Given $v^*$ satisfying the above constraints, it is then shown that one can set $f(q) := z_q = (v^*, \sqrt{\|q - \hat{x}\|^2 - \|v^* - \Pi\hat{x}\|^2})$. While Prog is not the convex program we solve in our work, it is still instructive to analyze it to construct an algorithm with fast query time, albeit with large memory. We do this in two steps:
1. We assume access to a separating oracle for the set $K := \{v \in \mathbb{R}^k : v \text{ satisfies Prog}\}$ and analyze the complexity of an optimization algorithm making queries to the oracle.
2. We then construct a separating oracle O for Prog.
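To make the extension step concrete, here is a small numpy sketch (our own illustration under synthetic data, not the paper's algorithm): it uses a Gaussian sketch for $\Pi$ and the idealized candidate $v = \Pi q$, which approximately satisfies Prog when $\Pi$ preserves the relevant norms, and then forms $z_q = (v, \sqrt{\|q-\hat{x}\|^2 - \|v - \Pi\hat{x}\|^2})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 400, 2500          # k = O(eps^-2 log n) in the theory

X = rng.normal(size=(n, d))        # terminal set T (synthetic)
q = rng.normal(size=d)             # query point
Pi = rng.normal(size=(k, d)) / np.sqrt(k)   # Gaussian (subgaussian) sketch

x_hat = X[np.argmin(np.linalg.norm(X - q, axis=1))]  # nearest terminal

# Idealized candidate: v = Pi q approximately satisfies Prog because Pi
# approximately preserves the inner products <q - x_hat, x - x_hat>.
v = Pi @ q
slack = np.linalg.norm(q - x_hat) ** 2 - np.linalg.norm(v - Pi @ x_hat) ** 2
z_q = np.append(v, np.sqrt(max(slack, 0.0)))   # f(q) = (v, sqrt(...))

# Check: distances from z_q to the zero-padded terminals (Pi x, 0)
# approximate the true distances from q to the terminals.
emb = np.hstack([X @ Pi.T, np.zeros((n, 1))])  # outer extension of x -> Pi x
ratio = np.linalg.norm(emb - z_q, axis=1) / np.linalg.norm(X - q, axis=1)
assert np.all(np.abs(ratio - 1) < 0.3)         # coarse distortion check
```

Note that this sketch scans all of $X$ for the nearest neighbor and takes $v = \Pi q$ on faith; the point of the oracle machinery below is to certify or repair a candidate $v$ without such a scan.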
First, observe that $k = O(\varepsilon^{-2} \log n)$, as was shown in [NN19]. Now let $O$ be a separating oracle for the convex set $K = \{v \in \mathbb{R}^k : v \text{ satisfies Prog}\}$; that is, given $v \in \mathbb{R}^k$, $O$ either reports that $v \in K$ or outputs $u \ne 0$ such that for all $y \in K$, we have $\langle y - v, u \rangle \ge 0$. Then standard results on the Ellipsoid Algorithm [NY83, Ber16] imply that one can find a feasible $v$ by making $O^*(1)$ calls to $O$, with each call incurring additional $O^*(1)$ time. Hence, we may restrict ourselves to the task of designing a fast separating oracle for $K$.
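The oracle-driven feasibility search can be sketched with a generic central-cut ellipsoid method (a minimal illustration, not the paper's exact procedure; the ball-membership oracle below is a hypothetical stand-in for the Prog oracle):

```python
import numpy as np

def ellipsoid_feasible(oracle, k, R=10.0, max_iters=5000):
    """Find a point of a convex set K contained in B(0, R), given a separation
    oracle: oracle(v) returns None if v is in K, and otherwise some u != 0
    with <y - v, u> >= 0 for all y in K (K lies on u's positive side)."""
    c = np.zeros(k)              # ellipsoid center
    A = (R ** 2) * np.eye(k)     # shape matrix: E = {x : (x-c)^T A^{-1} (x-c) <= 1}
    for _ in range(max_iters):
        u = oracle(c)
        if u is None:
            return c             # center is feasible
        g = -u / np.sqrt(u @ A @ u)          # cut away the half-space <x - c, u> < 0
        Ag = A @ g
        c = c - Ag / (k + 1)                 # standard central-cut update
        A = (k ** 2 / (k ** 2 - 1.0)) * (A - (2.0 / (k + 1)) * np.outer(Ag, Ag))
    return None

# Toy stand-in for K: the unit ball around a hidden center v0.
v0 = np.full(5, 2.0)
def ball_oracle(v):
    if np.linalg.norm(v - v0) <= 1.0:
        return None
    return v0 - v   # for y in K: <y - v, v0 - v> >= 0 whenever ||v - v0|| >= 1

v_feas = ellipsoid_feasible(ball_oracle, k=5)
assert v_feas is not None and np.linalg.norm(v_feas - v0) <= 1.0
```

The volume of the ellipsoid shrinks by a constant factor every $O(k)$ iterations, which is the source of the $O^*(1)$ call bound above; all of the actual difficulty is concentrated in implementing the oracle quickly.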
To implement a fast separation oracle, first note that the first constraint in Prog can be checked explicitly in time $O(d)$ and, if the constraint is violated, the oracle may return $\Pi\hat{x} - v$. If the first constraint is satisfied, consider the following re-arrangement of the second constraint: by denoting
\begin{equation*}
v_y = \frac{(q - y, -(v - \Pi y))}{\|(q - y, -(v - \Pi y))\|} \quad\text{and}\quad v_{y,z} = \frac{(y - z, \Pi(y - z))}{\|(y - z, \Pi(y - z))\|},
\end{equation*}
we see that the above constraint essentially tries to enforce that $\langle v_{\hat{x}}, v_{x,\hat{x}} \rangle$ is close to $0$. Note that $v_{\hat{x}}$ and $v_{x,\hat{x}}$ are unit vectors and hence, the condition $|\langle v_{\hat{x}}, v_{x,\hat{x}} \rangle| \approx 0$ is equivalent to $\|v_{\hat{x}} \pm v_{x,\hat{x}}\| \approx \sqrt{2}$. Conversely, $|\langle v_{\hat{x}}, v_{x,\hat{x}} \rangle| \ge C\varepsilon$ implies $\|v_{\hat{x}} \pm v_{x,\hat{x}}\| \le \sqrt{2} - C'\varepsilon$ for some signing. Since the $v_{x,y}$ for $x, y \in X$ are independent of $q$, we may build a fast separating oracle by building $n$ nearest neighbor data structures, one for each $x \in X$, with the point set $\{\pm v_{y,x}\}_{y \in X}$ and, at query time, constructing $v_{\hat{x}}$ and querying the data structure corresponding to $\hat{x}$. Despite this approach yielding a fast separating oracle, it has three significant shortcomings:
1. Computing exact nearest neighbors in high dimensions is inefficient.
2. Queries may be adaptive, violating the guarantees of known randomized approximate nearest neighbor data structures.
3. The space complexity of the separating oracle is rather large.
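The reduction from inner-product checks to nearest neighbor queries rests on the identity $\|a \pm b\|^2 = 2 \pm 2\langle a, b\rangle$ for unit vectors; a small numpy sanity check, with random stand-ins for the paper's normalized vectors:

```python
import numpy as np

rng = np.random.default_rng(1)

def unit(w):
    return w / np.linalg.norm(w)

# Stand-ins for the normalized vectors v_y and v_{y,z} from the text
# (random data here purely for illustration).
q, v = rng.normal(size=6), rng.normal(size=4)
Pi = rng.normal(size=(4, 6))
y, z = rng.normal(size=6), rng.normal(size=6)
a = unit(np.concatenate([q - y, -(v - Pi @ y)]))      # v_y
b = unit(np.concatenate([y - z, Pi @ (y - z)]))       # v_{y,z}

# For unit vectors: ||a - b||^2 = 2 - 2<a,b> and ||a + b||^2 = 2 + 2<a,b>,
# so |<a,b>| ~ 0 iff both signings have norm ~ sqrt(2), while a large |<a,b>|
# forces one of the two signings to have norm well below sqrt(2).
ip = a @ b
assert np.isclose(np.linalg.norm(a - b) ** 2, 2 - 2 * ip)
assert np.isclose(np.linalg.norm(a + b) ** 2, 2 + 2 * ip)
assert np.isclose(min(np.linalg.norm(a - b), np.linalg.norm(a + b)),
                  np.sqrt(2 - 2 * abs(ip)))
```

This is exactly why a nearest neighbor query over $\{\pm v_{y,x}\}$ detects a violated inner-product constraint: a large inner product with either signing shows up as an unusually close neighbor.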
The first two points are easily resolved: the correctness guarantees straightforwardly extend to the setting where one works with an approximate nearest neighbor, and adopting the framework from [CN20], in tandem with known reductions from Approximate Nearest Neighbor to Approximate Near Neighbor, yields an adaptive approximate nearest neighbor algorithm.
For the third point, note that in the approach just described, we construct n data structures with n data points each. Hence, even if each data structure can be implemented in O(n) space, this still yields a data structure of at least quadratic space complexity. The rest of our discussion is dedicated to addressing this difficulty.
We first generalize Prog, somewhat paradoxically, by adding more constraints:
\begin{align*}
&\forall y \in X:\ \|v - \Pi y\| \le (1 + \varepsilon)\|q - y\| \\
&\forall x, y \in X:\ |\langle v - \Pi y, \Pi(x - y)\rangle - \langle q - y, x - y\rangle| \le \varepsilon \|q - y\| \|x - y\|. \tag{Gen-Prog}
\end{align*}
These constraints (approximately) imply those in Prog and are, in turn, implied by those from Prog. Hence, in some sense the two programs are equivalent, but it will be more convenient to describe our approach in terms of the generalized set of constraints. These constraints may be interpreted as a multi-centered characterization of the set of constraints in Prog. While Prog only has constraints corresponding to a centering of $q$ with respect to $\hat{x}$ and its projection, Gen-Prog instead requires $v$ to satisfy similar constraints irrespective of centering. The first key observation behind the construction of our oracle is that it is not necessary to find a point satisfying all the constraints of Gen-Prog. It suffices to construct an oracle, $O$, satisfying:
1. If $O$ outputs FAIL, $v$ can be extended to a terminal embedding for $q$; that is, $v$ may be appended with one more coordinate to form a valid distance-preserving embedding for $q$.
2. Otherwise, $O$ outputs a separating hyperplane for Gen-Prog.
This is weaker than the oracle just constructed for Prog in two significant ways: the oracle may output FAIL even if $v$ does not satisfy Prog, and the expanded set of constraints allows a greater range of candidate separating hyperplanes, hence making the work of the oracle "easier". The second key observation underlying the design of our oracle is one that allows substantially restricting the set of relevant constraints for each input. Concretely, to ensure that $v$ can be extended to a point preserving its distance to $x$, it is sufficient to satisfy the two Gen-Prog constraints centered at any single $y \in X$ satisfying $\|q - y\| = O(\|q - x\|)$. In particular, the point $y$ may be much farther from $q$ than $\hat{x}$ but still constitutes a good "centering point" for $x$. Therefore, we simply need to ensure that our oracle checks at least one constraint involving a valid centering point for $x$, for all $x \in X$. However, at this point, several difficulties remain:
1. Which constraints should we satisfy for any input query $q$?
2. How do we build a succinct data structure quickly checking these constraints?
These difficulties are further exacerbated by the fact that the precise set of constraints may depend on the query $q$, which may be chosen adaptively. In the next two subsections, we address these issues with the following strategy, where we still make use of nearest neighbor data structures over $\{v_{x,y}\}_{x,y \in X}$:
1. Each nearest neighbor data structure consists of points $\{v_{y,x}\}_{y \in S}$ for some small $S \subset X$.
2. We use a smaller number of data structures.
3. Only a small number of relevant data structures are queried when presented with a query $q$.
For each of these three choices, we will exploit recent developments in approximate near neighbor search [ALRW17]. In Subsection 2.1, we describe our approach to constructing an oracle in a fixed scale setting, where the distance to the nearest neighbor is known up to a polynomial factor, and we describe the reduction to the fixed scale setting in Subsection 2.2.
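As a reference point for what the oracle must emulate, the Gen-Prog constraints can be checked by brute force — the $O(n^2)$ scan that the data structures below are designed to avoid (a sketch on synthetic data):

```python
import numpy as np

def genprog_violation(v, q, X, Pi, eps):
    """Brute-force scan of the Gen-Prog constraints; returns a violated
    constraint (kind, indices) or None if v satisfies all of them."""
    for j, y in enumerate(X):
        if np.linalg.norm(v - Pi @ y) > (1 + eps) * np.linalg.norm(q - y):
            return ("norm", j)
        for i, x in enumerate(X):
            lhs = abs((v - Pi @ y) @ (Pi @ (x - y)) - (q - y) @ (x - y))
            if lhs > eps * np.linalg.norm(q - y) * np.linalg.norm(x - y):
                return ("inner-product", i, j)
    return None

rng = np.random.default_rng(2)
n, d, k, eps = 20, 64, 4000, 0.5
X = rng.normal(size=(n, d))
q = rng.normal(size=d)
Pi = rng.normal(size=(k, d)) / np.sqrt(k)

# With a well-conditioned Gaussian sketch, v = Pi q is (approximately)
# feasible, while a far-away v violates the norm constraints.
assert genprog_violation(Pi @ q, q, X, Pi, eps) is None
assert genprog_violation(Pi @ q + 10.0, q, X, Pi, eps)[0] == "norm"
```

The oracle constructions that follow aim to reproduce the effect of this scan while touching only a sublinear number of constraints per query.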

Fixed Scale Oracle
In this subsection, we outline the construction of a suitable separation oracle for all $q$ satisfying $r \le \|q - \hat{x}\| \le \mathrm{poly}(n) \cdot r$ for some known $r$. For the sake of illustration, in this subsection we only consider the case where our oracle has space complexity $n^{1+o(1)} d$, though our results yield faster oracles if more space is allowed (see Theorem 3.1). In order to decide which points to use to construct our nearest neighbor data structures, we will make strong use of the following randomized partitioning procedures. Given $X \subset \mathbb{R}^d$, these data structures construct a set of subsets such that for a typical input point $x$, sets in $h(x)$ contain points that are close to $x$ and exclude points far from $x$. The data structure is formally defined below:

Definition 2.4. We say a randomized data structure is a $(\rho_u, \rho_c)$-Approximate Partitioning (AP) data structure if, instantiated with a set of data points $X = \{x_i\}_{i=1}^n \subset \mathbb{R}^d$ and a radius $r > 0$, it constructs $D$, consisting of a collection of subsets $S = \{S_i\}$ of $X$ together with a map $h$ from $\mathbb{R}^d$ to subsets of $S$, satisfying:
1. With probability 1, the space complexity of $D$ and $\sum_i |S_i|$ are at most $n^{1 + \rho_u + o(1)}$.
2. With probability 1, for any $x$, $h(x)$ is computable in time $n^{\rho_c + o(1)} d$ and $|h(x)| \le n^{\rho_c + o(1)}$.
3. Each $y \in X$ appears in at most $n^{\rho_u + o(1)}$ of the sets in $S$ in expectation.
4. For any $x$, the expected number of points $y$ with $\|y - x\| \ge 2r$ appearing in $\bigcup_{S \in h(x)} S$ is at most $n^{\rho_c + o(1)}$.
5. For any $x$ and any $y \in X$ with $\|y - x\| \le r$, with probability at least $0.99$, $y \in S$ for some $S \in h(x)$,
where the probability is taken over the random decisions used to construct $D$.
The first condition in the above definition restricts the space complexity of the data structure and the sum of the number of points stored in all of the sets in S while the third condition states that each point is replicated very few times across all the sets in S in expectation. The second condition states that for any input x, h is computable quickly and maps x to not too many sets in S. Finally, the last two conditions ensure that points far from x are rarely in the sets x maps to and that points close to x are likely to be found in these sets.
Data structures satisfying the above definition have been a cornerstone of LSH-based approaches for approximate nearest neighbor search, which yield state-of-the-art results and nearly optimal time-space tradeoffs [HIM12, AI08, ALRW17]. The conditions holding with probability 1 can be ensured by truncating the construction of the data structure if its space complexity grows too large, or by truncating the execution of $h$ on $x$ if its runtime exceeds a certain threshold. Finally, the events holding with probability 0.99 can be boosted from arbitrary constant probability by repetition. In this subsection's setting of almost linear memory, any $(0, c)$-AP data structure suffices for any $c < 1$, and [ALRW17] show that a $(0, 7/16)$-AP data structure exists.
For the sake of exposition, assume that the replication condition and the last two conditions in the above definition hold deterministically; that is, assume that each data point is only replicated in $n^{\rho_u + o(1)}$ many of the $S_i$, for each $x$ only $n^{\rho_c + o(1)}$ many points farther away from $x$ than $2r$ are present in the sets mapped to by $h$, and each point in the dataset within $r$ of $x$ is present in one of the sets that $x$ maps to. In our formal proof, we show that any issues caused by the randomness are easily addressed. We now instantiate $O(\log(n)/\gamma)$ independent $(0, 7/16)$-AP data structures for the point set $X$, with $r_i = (1 + \gamma)^i r$ and $\gamma \approx 1/\log^3 n$. Note that this only results in $n^{o(1)}$ data structures in total. Now, for each $i$ and $S \in S_i$, we pick $l \approx \log n$ random points, $Z_{i,S} = \{z_j\}_{j=1}^{l}$, instantiate nearest neighbor data structures for the points $\{\pm v_{x, z_j}\}_{x \in S}$, and assign any point $y \in S$ within $4r_i$ of $z_j$ to $z_j$ for $S$. Note that these assignments are only for the set $S$; for a distinct set $S'$ such that $y \in S'$, $y$ need not be assigned to any point. Each of these data structures only stores $n^{1+o(1)}$ points in total, and the existence of near neighbor data structures with space complexity $dn^{1+o(1)}$ (and query time $dn^{1 - c\varepsilon^2 + o(1)}$) [ALRW17] completes the bound on the space complexity of the data structure at a single scale. Since we only instantiate $n^{o(1)}$ many such data structures, this completes the bound on the space complexity of the data structure. By choosing the points randomly in this way, one can show that for any $x$, the total number of unassigned points in $h_i(x)$ is at most $n^{\rho_c + o(1)}$, by partitioning $X$ into a set of points close to $x$ and those far away and analyzing their respective probabilities of being assigned in each set. Furthermore, note that the total number of points stored in all the $h_i(x)$ is trivially at most $n^{1+o(1)}$.
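The per-bucket sampling step can be sketched as follows (simplified: exact distances, a single bucket, and synthetic data; the real construction repeats this for every $S \in S_i$ at every scale $r_i$):

```python
import numpy as np

def assign_anchors(S, r, l, rng):
    """Sample l random anchors from bucket S and assign each point of S lying
    within 4r of its closest anchor to that anchor; the remaining points stay
    unassigned and are checked one by one at query time."""
    idx = rng.choice(len(S), size=min(l, len(S)), replace=False)
    anchors = S[idx]
    assigned = {j: [] for j in range(len(anchors))}
    unassigned = []
    for p, x in enumerate(S):
        dists = np.linalg.norm(anchors - x, axis=1)
        j = int(np.argmin(dists))
        (assigned[j] if dists[j] <= 4 * r else unassigned).append(p)
    return anchors, assigned, unassigned

rng = np.random.default_rng(3)
cluster = rng.normal(scale=0.1, size=(100, 8))   # one tight cluster
outliers = 100.0 * np.eye(8)[:3]                 # three mutually far points
S = np.vstack([cluster, outliers])
anchors, assigned, unassigned = assign_anchors(S, r=1.0, l=8, rng=rng)

# Only 3 points are far from the cluster, so at least 5 anchors land in the
# cluster and every cluster point is assigned (within 4r of some anchor).
assert len(unassigned) <= 3
assert sum(len(v) for v in assigned.values()) + len(unassigned) == len(S)
```

The intuition is the same as in the text: any bucket dominated by points near $x$ will, with high probability, sample an anchor near $x$, so only the (few) far points can remain unassigned.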
At query time, suppose we are given a query $q$, $\hat{x}$, and a candidate $v$, and we wish to check whether $v$ can be extended to a valid terminal embedding for $q$. While having access to the exact nearest neighbor is an optimistic assumption, extending the argument to use an approximate nearest neighbor is straightforward. We query the data structure as follows: we query each data structure $D_i$ with $\hat{x}$ and, for each $S \in h_i(\hat{x})$, we check whether each unassigned point $x$ satisfies $|\langle v_{x,\hat{x}}, v_{\hat{x}} \rangle| \approx 0$. Then, for each point $z_j \in Z_{i,S}$, we query its nearest neighbor data structure with $v_{z_j}$. If any of the data structures reports a point significantly violating the inner product condition, we return that data point as a violator of our set of constraints.
We now prove the correctness and bound the runtime of the oracle. We start by bounding the runtime of this procedure. For a single scale $r_i$, we query at most $n^{\rho_c + o(1)}$ unassigned points, and their contribution to the runtime is correspondingly bounded. Intuitively, this is true because $h(\hat{x})$ contains at most $n^{\rho_c + o(1)}$ points far from $\hat{x}$ and, if there are more than $n^{\rho_c + o(1)}$ points close to $\hat{x}$, they tend to be assigned. For assigned points, there are at most $n^{1+o(1)}$ many of them (repetitions included) spread between $n^{\rho_c + o(1)}$ many nearest neighbor data structures (as $|h_i(\hat{x})| \le n^{\rho_c + o(1)}$), and each of these data structures has query time $l^{1 - c\varepsilon^2 + o(1)}$, where $l$ is the number of points assigned to the data structure. A simple convexity argument shows that the time taken to query all of these is at most $n^{1 - c\varepsilon^2 + o(1)}$. This bounds the query time of the procedure.
To establish correctness, observe that when the algorithm outputs a hyperplane, correctness is trivial. Suppose now that the oracle outputs FAIL. Note that any point $x \in X$ very far away from $\hat{x}$ may be safely ignored (say, those $\mathrm{poly}(n) \cdot r$ away from $\hat{x}$), as an embedding that preserves distance to $\hat{x}$ also preserves distance to $x$ by the triangle inequality. We now show that any other $x$ satisfies $|\langle v_{x,y}, v_y \rangle| \approx 0$ for some $y$ with $\|q - y\| \le C\|q - x\|$. Any such $x$ must satisfy $\|x - \hat{x}\| \le 2\|q - x\|$ by the triangle inequality and the fact that $\hat{x}$ is the nearest neighbor. As a consequence, there exists $i$ such that $\|x - \hat{x}\| \le r_i$ and $\|x - q\| \ge 0.5 r_{i-1}$. For this $i$, there exists $S \in h_i(\hat{x})$ containing $x$. In the case that $x$ is not assigned, we check $|\langle v_{x,\hat{x}}, v_{\hat{x}} \rangle|$ directly and correctness is trivial. In case $x$ is assigned, it is assigned to some $y$ with $\|y - x\| \le 4r_i$ and we have:
\begin{equation*}
\|q - y\| \le \|q - x\| + \|x - y\| \le \|q - x\| + 4r_i \le \|q - x\| + 8(1 + \gamma)\|x - q\| \le 10\|x - q\|,
\end{equation*}
where the third inequality follows from the fact that $\|x - q\| \ge 0.5 r_{i-1}$. This proves that the inner product condition for $x$ is satisfied with respect to a $y$ with $\|q - y\| \le 10\|x - q\|$. This concludes the proof in the second case, where the oracle outputs FAIL.
The argument outlined in the last two paragraphs concludes the construction of our weaker oracle when an estimate of $\|q - \hat{x}\|$ is known in advance. The crucial property provided by the $(\rho_u, \rho_c)$-AP procedure is that at most $n^{1+o(1)}$ many points are used to construct the near neighbor data structures for the points $v_{x,y}$ (as opposed to $n^2$ for the previous construction). This crucially constrains us to having either a large number of near neighbor data structures with few points each or a small number with many points, but not both. However, the precise choice of how the algorithm trades off these two competing factors depends on the set of data points and the scale being considered. The savings in query time follow from the fact that at most $n^{c + o(1)}$ of these data structures are consulted for any query, for some $c < 1$.

Reduction to Fixed Scale Oracle
We reduce the general case to the fixed scale setting from the previous subsection. To define our reduction, we will need a data structure we refer to as a Partition Tree, which has previously played a crucial part in reductions from Approximate Nearest Neighbor to Approximate Near Neighbor [HIM12]. We show that the same data structure also allows us to reduce the oracle problem from the general case to the fixed scale setting. Describing the data structure and our reduction requires some definitions from [HIM12]:

Definition 2.5. Let $X = \{x_i\}_{i=1}^n \subset \mathbb{R}^d$ and $r > 0$. We will use $GG(X, r)$ to denote the graph with nodes indexed by the $x_i$ and an edge between $x_i$ and $x_j$ if $\|x_i - x_j\| \le r$. The connected components of this graph will be denoted by $CC(X, r)$; that is, $CC(X, r) = \{C_j\}_{j=1}^m$ is a partitioning of $X$ with $x \in C_j$ if and only if $\|x - y\| \le r$ for some $y \in C_j \setminus \{x\}$.
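The components $CC(X, r)$ can be computed with a brute-force union-find pass (a reference sketch; nothing in the reduction requires materializing $GG(X, r)$ this way):

```python
import numpy as np

def connected_components(X, r):
    """Return CC(X, r): connected components of the graph on X joining pairs
    at distance at most r (O(n^2 d) brute force with path-halving union-find)."""
    n = len(X)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) <= r:
                parent[find(i)] = find(j)

    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in comps.values())

X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
assert connected_components(X, 1.5) == [[0, 1, 2], [3, 4]]
# Smaller r yields a refinement of the partition at larger r.
assert connected_components(X, 0.5) == [[0], [1], [2], [3], [4]]
```

The second assertion illustrates the monotonicity that the refinement relation formalizes: shrinking $r$ can only split components, never merge them.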
Note from the above definition that $CC(X, r)$ results in increasingly fine partitions of $X$ as $r$ decreases. This notion is made precise in the following straightforward definition:

Definition 2.6. For a data set $X = \{x_i\}_{i=1}^n \subset \mathbb{R}^d$, we say that a partition $\mathcal{C}$ refines a partition $\mathcal{C}'$ if for all $C \in \mathcal{C}$, $C \subseteq C'$ for some $C' \in \mathcal{C}'$. This will be denoted by $\mathcal{C}' \sqsubseteq \mathcal{C}$.
Next, we define the median-scale radius $r_{\mathrm{med}}(X)$ of $X$ (as in [HIM12]). We are now ready to define a Partition Tree:

Definition 2.7. A Partition Tree of $X$ is a tree, $T$, whose nodes are labeled by $(Z, \{T_C\}_{C \in \mathcal{C}_{low}}, T_{rep}, \mathcal{C}_{low}, \mathcal{C}_{high}, C_{rep}, r_{apx})$, where $Z, C_{rep} \subset X$, $\{T_C\}_{C \in \mathcal{C}_{low}} \cup \{T_{rep}\}$ represent its children, $\mathcal{C}_{low}, \mathcal{C}_{high}$ are partitions of $Z$, and $r_{apx} > 0$, satisfying conditions which, in particular, ensure that each child's point set contains at most half as many points as $Z$. For the sake of notational simplicity, we will use $T' \in T$ both to refer to a node in the tree and to the subtree rooted at that node, and $\mathrm{Size}(T')$ to refer to the sum of the number of points stored in the subtree $T'$. The above conditions imply $\mathrm{Size}(T) \le O(n \log n)$ [HIM12]. While deterministic data structures with the same near-linear runtime are also known [Har01], we include, for the sake of completeness, a simple probabilistic algorithm that computes a partition tree with probability $1 - \delta$ in time $O(nd \log^2 n / \delta)$. Having defined the data structure, we now describe how it may be used to define our reduction.
At a high level, the reduction traverses the data structure starting at the root and, at each step, either terminates at the node currently being explored or proceeds to one of its children. By the definition of the data structure, the number of points in the node currently being explored drops by at least a factor of 2 in each step. Therefore, the procedure explores at most $\lceil \log n \rceil$ nodes. For any node $T' \in T$, with associated point set $Z$, currently being traversed, we will aim to enforce the following two conditions:
1. An approximate nearest neighbor of $q$ in $Z$ is also an approximate nearest neighbor in $X$.
2. A terminal embedding of $q$ for $Z$ is also valid for $X$.
For simplicity, we assume the existence of near neighbor data structures; that is, data structures which, when instantiated with $(X, r)$ and given a query $q$, output a candidate $y \in X$ such that $\|y - q\| \le O(r)$ whenever $\min_{x \in X} \|x - q\| \le r$ (we refrain from assuming access to nearest neighbor data structures here, as this reduction will also be used to construct our nearest neighbor data structures). For each node $T' = (Z, \{T_C\}_{C \in \mathcal{C}_{low}}, T_{rep}, \mathcal{C}_{low}, \mathcal{C}_{high}, C_{rep}, r_{apx})$, we first fix two thresholds $r_{low} = r_{apx}/\mathrm{poly}(n)$ and $r_{high} = \mathrm{poly}(n) \cdot r_{apx}$ and interpolate the range with roughly $m \approx \log(r_{high}/r_{low})/\gamma$ many near neighbor data structures, $\{D_i\}_{i=0}^m$, with $r_i = (1 + \gamma)^i r_{low}$, for the point set $Z$. Note that we set $\gamma \approx 1/\log^3 n$, which implies that we instantiate at most $n^{o(1)}$ many near neighbor data structures.
At query time, suppose we are at a node $T' = (Z, \{T_C\}_{C \in \mathcal{C}_{low}}, T_{rep}, \mathcal{C}_{low}, \mathcal{C}_{high}, C_{rep}, r_{apx})$ with associated near neighbor data structures $D_i$. We query each of the data structures $D_i$ with $q$, and there are three possible cases:
1. The nearest neighbor of $q$ is within distance $r_{low}$.
2. The nearest neighbor of $q$ is beyond distance $r_{high}$.
3. The nearest neighbor of $q$ is between $r_{low}$ and $r_{high}$ from $q$.
The first case occurs when $D_0$ returns a candidate nearest neighbor, the second when none of the near neighbor data structures return a candidate, and the third when $D_i$ succeeds but $D_{i-1}$ fails for some $i$. If the third case occurs, the reduction is complete. If the second case occurs, let $x \in C \in \mathcal{C}_{high}$ and $\tilde{x} \in C_{rep} \cap C$. By the triangle inequality and Definitions 2.5 and 2.7, we have $\|x - \tilde{x}\| \le \mathrm{poly}(n) \cdot r_{med} \ll r_{high}$ and hence $\|q - x\| \approx \|q - \tilde{x}\|$; we therefore recurse into $T_{rep}$, still satisfying the two conditions stated above. For the first case, let $x \in C \in \mathcal{C}_{low}$ be such that $\|q - x\| \le r_{low}$. From Definitions 2.5 and 2.7, any $x' \notin C$ satisfies $\|x - x'\| \ge r_{med}/\mathrm{poly}(n) \gg r_{low}$ and hence both conditions are again maintained, as the nearest neighbor of $q$ in $Z$ lies in $C$, and a terminal embedding for $q$ with respect to $C$ is a terminal embedding with respect to $Z$ by the triangle inequality.
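The three-way case split at a node can be sketched as follows (using exact distances as a stand-in for the near neighbor queries $D_0, \dots, D_m$; the names are ours):

```python
import numpy as np

def locate_scale(q, Z, r_low, r_high, gamma):
    """Classify a query against the scale grid r_i = (1+gamma)^i * r_low:
    'low' (within r_low of Z), 'high' (beyond r_high, so recurse into the
    representatives), or 'fixed' with the index i of the smallest scale r_i
    at or above the distance from q to Z."""
    dist = float(np.min(np.linalg.norm(Z - q, axis=1)))
    if dist <= r_low:
        return ("low", None)
    if dist > r_high:
        return ("high", None)
    i = int(np.ceil(np.log(dist / r_low) / np.log(1.0 + gamma)))
    return ("fixed", i)

Z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
r_low, r_high, gamma = 0.01, 100.0, 0.5

assert locate_scale(np.array([0.005, 0.0]), Z, r_low, r_high, gamma) == ("low", None)
assert locate_scale(np.array([500.0, 0.0]), Z, r_low, r_high, gamma) == ("high", None)
kind, i = locate_scale(np.array([3.0, 0.0]), Z, r_low, r_high, gamma)
assert kind == "fixed"
assert r_low * (1 + gamma) ** (i - 1) < 2.0 <= r_low * (1 + gamma) ** i
```

In the 'fixed' case, the returned scale index is exactly what the fixed scale oracle of Subsection 2.1 needs as its known estimate of $\|q - \hat{x}\|$.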

Runtime Improvements and Adaptivity
We conclude with other technical considerations glossed over in the previous discussion. As remarked before, we assumed access to exact nearest and near neighbor data structures that perform correctly even when faced with adaptive queries. While the arguments outlined in the previous two subsections extend straightforwardly to the setting where approximate nearest neighbors are used, the guarantees provided by previous approaches to approximate near neighbor search are not robust in the face of adaptivity.
Definition 2.8. For $\rho_u, \rho_c \ge 0$ and $c > 1$, we say a randomized data structure is a $(\rho_u, \rho_c, c)$-Approximate Near Neighbor (ANN) data structure if, instantiated with a set of data points $X = \{x_i\}_{i=1}^n$ and $r > 0$, it constructs $D$ satisfying the following: $D$ has space complexity $n^{1 + \rho_u + o(1)} d$ and, given a query $q$ with $\min_{x \in X} \|x - q\| \le r$, returns in time $n^{\rho_c + o(1)} d$ a point $y \in X$ with $\|y - q\| \le cr$, with probability at least $0.99$, where the probability is taken over both the random decisions used to construct $D$ and those used by $D$ to answer the query $q$. Additionally, $q$ is assumed to be independent of $D$.
A simple repetition argument can then be used to devise adaptive near neighbor data structures with similar guarantees [CN20]. Using these in tandem with the reduction outlined in the previous subsection yields the adaptive nearest neighbor data structures used in Subsection 2.1.
Finally, the argument outlined previously enabled computing terminal embeddings in time $dn^{\rho + o(1)}$ for some $\rho < 1$, which, while sublinear in $n$, is suboptimal in its interaction with the dimension. This is in contrast to state-of-the-art approaches to nearest neighbor search, which yield runtimes scaling as $dn^{o(1)} + n^{\rho + o(1)}$. However, all these approaches obtain this result by first projecting onto a lower dimensional space using a Johnson-Lindenstrauss (JL) projection and building the data structure in the lower dimensional space. This is not feasible in our setting, as JL projections are not robust to adaptivity, and we require an alternate strategy.
To improve our runtime, suppose $\Pi' \in \mathbb{R}^{k' \times d}$ with $k' = O(\varepsilon^{-2}\log n)$ satisfies
\begin{equation*}
\forall x \in X:\ \|\Pi'(q - x)\| = (1 \pm \varepsilon)\|q - x\|. \tag{1}
\end{equation*}
Then, to construct a terminal embedding, we may construct the vectors $v_{x,y}$ and $v_x$ using the vectors $(\Pi' x, \Pi' y, \Pi' q)$ instead of the vectors in the high dimensional space. Assuming the projections for $x \in X$ are pre-computed, we may perform this projection in $O(d)$ time, and the rest of the procedure to compute terminal embeddings takes time $n^{\rho + o(1)}$. When $q$ is independent of the data structure, a standard JL-sketch satisfies Eq. (1) with high probability, but this is not true when $q$ depends on $\Pi'$. Thankfully, we show that if one draws $O(d)$ many JL-sketches, at least 95% satisfy Eq. (1) for any query $q$, with high probability. Note that the order of the quantifiers in this statement makes the proof more challenging than in previous work using such ideas [CN20] and requires a careful gridding argument, which we carry out in Section 5.

Terminal Embeddings
In this section, we prove the main theorem of the paper. Note that in the following theorem the query $q$ can be chosen with full knowledge of the data structure. That is, conditioned on the successful creation of the data structure, the randomized construction of $z_q$ depends only on the random decisions taken at query time, not on those taken during the creation of the data structure. The main theorem of the paper is stated below:

Theorem 3.1. Given a $(\rho_3, \rho_4)$-Approximate Partitioning data structure, a $(\rho_1, \rho_2, (1 + \varepsilon^{\dagger}))$-Approximate Near Neighbor data structure for $\varepsilon^{\dagger} = c\varepsilon$ for some small enough $c > 0$, and a parameter $\rho_{rep}$, one can construct a data structure $(D, \Pi \in \mathbb{R}^{k \times d})$ satisfying the following guarantees:
1. $\Pi$ has $\varepsilon$-convex hull distortion for $X$.
2. Given $q \in \mathbb{R}^d$, $D$ produces, with probability at least $1 - 1/\mathrm{poly}(n)$, a vector $z_q \in \mathbb{R}^{k+1}$ such that $\forall x \in X:\ (1 - \varepsilon)\|x - q\| \le \|z_q - (\Pi x, 0)\| \le (1 + \varepsilon)\|x - q\|$.
Before we proceed, we instantiate the above theorem in three specific cases. We note that the state-of-the-art algorithms for approximate nearest neighbor search are implemented in terms of approximate partitioning schemes as defined here: [ALRW17] show that, for any $(\rho_u, \rho_c)$ satisfying
\begin{equation*}
c^2\sqrt{\rho_c} + (c^2 - 1)\sqrt{\rho_u} \ge \sqrt{2c^2 - 1},
\end{equation*}
there exist both a $(\rho_u, \rho_c, c)$-Approximate Near Neighbor data structure and a $(\rho_u, \rho_c)$-Approximate Partitioning scheme when $c = 2$. By instantiating this, we get a data structure to compute $\varepsilon$-terminal embeddings for (possibly different) universal constants $c, C > 0$. These instantiations capture a range of potential time-space tradeoffs for data structures computing terminal embeddings. We now move on to the proof of Theorem 3.1. As discussed in Section 2, our algorithm operates by constructing a weak separating oracle for a convex program. In Subsection 3.1, we define the convex program for which we define our oracle; Subsection 3.2 defines our weak oracle for a fixed scale; and in Subsection 3.3 we incorporate the reduction of the general case to the fixed scale scenario and complete the proof of the theorem.

Generalized Characterization of Terminal Embeddings
To start, we recall a key lemma from [NN19]. The convex program for which we will construct our oracle is defined in the following generalization of [NN19, MMMR18]. As remarked in Section 2, we generalize the convex program to an expanded set of constraints but crucially do not attempt to satisfy all of them in our algorithm. For the convergence guarantees of our weak oracle to hold (Lemma B.2), we first need to show the feasibility of the convex program.
Then, for any $q \in \mathbb{R}^d$, there exists $z \in \mathbb{R}^k$ satisfying the constraints of Gen-Prog.

Proof. Let $x^* = \arg\min_{x \in X} \|q - x\|$. As in [NN19, MMMR18], we consider a bilinear game, where the exchange of the min and max follows from von Neumann's minimax theorem. Considering the first formulation, let $\lambda$ satisfy $\|\lambda\|_1 \le 1$. For such a $\lambda$, the objective is bounded by a chain of inequalities in which the first follows from Cauchy-Schwarz and the final one follows from the fact that $\Pi$ has $\varepsilon$-convex hull distortion. From the previous result, we may conclude that there exists $w \in \mathbb{R}^k$ with the required guarantee. Now consider the vector $z = w + \Pi x^*$. For this $z$ and any $x, y \in X$, the inner product constraint follows from a chain of inequalities in which the second is due to the condition on $z$, the third is a consequence of Lemma B.3, and the second-to-last follows from the fact that $\|q - x^*\| \le \|q - x\|$. This establishes the first claim of the lemma. For the second claim, we have for any $x \in X$ a chain of inequalities in which the first follows from the fact that $\|z - \Pi x^*\| \le \|q - x^*\|$, the second from the fact that $\Pi$ has $\varepsilon$-convex hull distortion for $X$, the fourth from our condition on $z$, and the fifth from the triangle inequality and the fact that $\|q - x^*\| \le \|q - x\|$. By re-arranging and taking square roots, we obtain the norm constraint, concluding the proof of the lemma.

Fixed Scale Violator Detection
Note that, as a consequence of Theorem 4.1, we may assume that we can instantiate $(\rho_1, \rho_2, (1 + \varepsilon^{\dagger}))$-Adaptive Approximate Nearest Neighbor (AANN) data structures and $(\rho_3, \rho_4)$-Approximate Partitioning (AP) data structures. Our data structure for constructing an oracle at a fixed scale is constructed in Algorithm 1, and the query procedure is outlined in Algorithm 2. Algorithm 1 takes as input a set of data points $X$, a projection matrix $\Pi$, a memory parameter $\rho_{rep}$, and a failure probability $\delta$. When we instantiate this data structure, the point sets used to construct it will be the subsets of points corresponding to the nodes of the Partition Tree constructed in Lemma A.4. Algorithm 2 takes as input the data structure constructed by Algorithm 1, the query point $q$, a candidate solution $v$ to the convex program, an approximate nearest neighbor of $q$ in $X$, and another tolerance parameter whose role will become clear when incorporating this data structure into the reduction.
To state the correctness guarantees for Algorithm 1, we start by introducing some notation. For $x \in X$ and $r > 0$, define the local neighborhood of $x$ as follows: $N_{loc}(x, r) = \{y \in X : \|y - x\| \le 2r\}$. For $x \in X$ such that $|N_{loc}(x, r)| \ge n^{1 - \rho_{rep}}$, our success event is simply that there exists $z \in Z$ such that $\|z - x\| \le 2r$ and that $D_z$ is instantiated successfully. For $x \in X$ such that $|N_{loc}(x, r)| < n^{1 - \rho_{rep}}$, the success event is more complicated. Informally, we will require that most of the AP data structures produce appropriate partitions for $x$ and, furthermore, that each $y \in X$ with $\|y - x\| \le r$ is well represented in these data structures. This is formally described in the following lemma:

Lemma. Given $X$, $\Pi$, $\rho_{rep}$, and $\delta \in (0, 1)$, Algorithm 1 produces, with probability $1 - \delta$, a data structure $D$ with the following guarantees:
1. For $x \in X$ such that $|N_{loc}(x, r)| \ge n^{1 - \rho_{rep}}$, there exists $z \in Z$ such that $\|x - z\| \le 2r$; that is, $x$ is assigned in the first stage of the algorithm.
Proof. For the third claim, note that the total number of $D_z$ instantiated is at most $n_{rep}$. Therefore, the probability that all the $D_z$ are instantiated correctly is at least $1 - \delta/16$. As for the $D_{i,S,w}$, note that at most $l \cdot p \cdot n^{1 + \rho_3 + o(1)}$ many of these are instantiated, as there are $l$ AP data structures, each of which has at most $n^{1 + \rho_3 + o(1)}$ subsets, and each subset has at most $p$ AANN data structures. Therefore, again by the union bound, the probability that each of these is instantiated correctly is at least $1 - \delta/16$. Hence, the probability that all AANN data structures are instantiated correctly is at least $1 - \delta/8$. For the last claim, note that the space occupied by each of the $D_z$ data structures is at most $O(n^{1 + \rho_1 + o(1)})$ and there are $n_{rep}$ of these. The space required to store each of the $l$ AP data structures is at most $O(n^{1 + \rho_3 + o(1)})$. Finally, the space occupied by the $D_{i,S,w}$ data structures is bounded using the facts that $|S| \le n$ for each $S$, that $\sum_{S \in S_i} |S| \le n^{1 + \rho_3 + o(1)}$ for each $i \in [l]$, and the convexity of the function $f(x) = x^{1 + \rho_3}$. From the previously established bounds, the space complexity follows.
For the first claim, let $x \in X$ be such that $|N_{loc}(x, r)| \ge n^{1 - \rho_{rep}}$ and let $K_i = \mathbb{1}\{z_i \in N_{loc}(x, r)\}$. We have $P(K_i = 1) \ge n^{-\rho_{rep}}$. Therefore, we have that $E[K] := E[\sum_{i=1}^{n_{rep}} K_i] \ge C \log(n/\delta)$ for some large constant $C$. By noting that $\mathrm{Var}(K) \le E[K]$, we have by Bernstein's inequality that, with probability at least $1 - \delta/(16n)$, there exists $i \in [n_{rep}]$ with $K_i = 1$. Therefore, there exists $z \in Z$ such that $\|z - x\| \le 2r$. By a union bound, this establishes the first claim with probability at least $1 - \delta/16$. We now prove each of the three conclusions of the second claim of the lemma separately. First, note that from Definition 2.4, the linearity of expectation, and the fact that $|N_{loc}(x, r)| \le n^{1 - \rho_{rep}}$, we obtain for any $i \in [l]$ a bound on the expected number of far points appearing in the sets mapped to by $h_i(x)$. By an application of Markov's inequality to this bound, together with the union bound, we obtain a corresponding high-probability guarantee. Letting $L_i$ be the indicator random variable for this event for $x$ in the data structure $D_i$, we have by the Chernoff bound that $P(\sum_{i=1}^{l} L_i \ge 0.98 l) \ge 1 - \delta/(16n)$. Therefore, the first conclusion of the second claim holds for $x$ with probability at least $1 - \delta/(16n)$. A union bound now establishes the claim for all $x \in X$ with probability at least $1 - \delta/16$.
For the second conclusion of the second claim, let $i$ be such that $L_i = 1$ and consider $S \in h_i(x)$ containing more points from the local neighborhood of $x$ than far-away points. For any such $S$, the probability that $w_j \in N_{\mathrm{loc}}(x, r)$ is at least $1/2$. Therefore, by the definition of $p$, with probability at least $1 - \delta^{\ddagger}$ there exists $w \in W_{i,S}$ such that $w \in N_{\mathrm{loc}}(x, r)$. Note that all the points in $N_{\mathrm{loc}}(x, r) \cap S$ are assigned to $w$ in this case. By the union bound, the probability that this happens for all such $S \in h_i(x)$ is at least $1 - \delta/(l n^{2+(\rho_3+\rho_4)})$. Alternatively, for $S$ such that $|N_{\mathrm{loc}}(x, r) \cap S| \le |(X \setminus N_{\mathrm{loc}}(x, r)) \cap S|$, the total number of unassigned points is upper bounded by $2|(X \setminus N_{\mathrm{loc}}(x, r)) \cap S|$, and by summing over all such $S$ the conclusion follows from the definition of $L_i$. Therefore, by a union bound, the second conclusion of the second claim holds for $x \in X$ for all $i$ with $L_i = 1$ with probability at least $1 - \delta/n^{2+(\rho_3+\rho_4)}$. The conclusion for all $x \in X$ follows from another union bound with probability at least $1 - \delta/16$.
Finally, for the last conclusion of the second claim, let $x, y \in X$ be such that $\|x - y\| \le r$ and let $M_i = \mathbb{1}\{\exists S \in h_i(x) : y \in S\}$. From Definition 2.4, we have that $\mathbb{P}(M_i = 1) \ge 0.99$. Therefore, by the Chernoff bound, the conclusion holds for a specific pair $x, y \in X$ satisfying $\|x - y\| \le r$ with probability at least $1 - \delta/(16n^3)$. Through a union bound, the conclusion holds for all $x, y \in X$ with $\|x - y\| \le r$ with probability at least $1 - \delta/(16n)$. A final union bound over all the events described in the proof gives the required guarantees on the run of Algorithm 1 with probability at least $1 - \delta/2$.
We delay the analysis of the query procedure to the next subsection, where we incorporate the fixed scale data structure into a multi-scale separating oracle and conclude the proof of Theorem 3.1.

Multi-Scale Reduction and Proof of Theorem 3.1
In this subsection, we wrap up the proof of Theorem 3.1 by incorporating the fixed scale violator detection method from Subsection 3.2 into a multi-scale procedure. We first define the auxiliary data structures that we will assume access to for the rest of the proof:
1. From Lemma 3.2, we may assume $\Pi \in \mathbb{R}^{k \times d}$ satisfies $\varepsilon^{\dagger}$-convex hull distortion for $X$ with $\varepsilon^{\dagger} = c\varepsilon$ for some suitably small constant $c > 0$ and $k = O(\varepsilon^{-2} \log n)$.
2. From Lemma A.4, we may assume access to a partition tree $\mathcal{T}$ satisfying Definition 2.7.
3. We may assume access to a $(\rho_3, \rho_4, (1 + \varepsilon))$-Adaptive Approximate Nearest Neighbor data structure for $X$ built on $\mathcal{T}$ from Theorem 4.1.
The following straightforward lemma is the only guarantee we will require of Algorithm 3. We are now ready to conclude the proof of Theorem 3.1. Suppose $q \in \mathbb{R}^d$ and we are required to construct a valid terminal embedding for $q$ with respect to the point set $X$. We may assume access to $(\hat{x}, T')$, for $\hat{x}$ an approximate nearest neighbor of $q$ in $X$ and $T' \in \mathcal{T}$ satisfying the conclusion of Theorem 4.1. Also, it suffices to construct a valid terminal embedding for the set of points in $T'$. We may assume $T'$ has more than one element since, in the one-element case, any point on a sphere of radius $\|q - \hat{x}\|$ around $(\Pi \hat{x}, 0)$ suffices by Theorem 4.1. Our algorithm is based on a set of convex constraints; let $\mathcal{K}$ denote the convex subset of $\mathbb{R}^k$ satisfying them, and let $\hat{r} = \|q - \hat{x}\|$. By Lemma 3.3 and Cauchy-Schwarz, $\mathcal{K}$ is non-empty and there exists $\tilde{z} \in \mathcal{K}$ with $B(\tilde{z}, \varepsilon^{\dagger} \hat{r}) \subset \mathcal{K}$ (there exists $\tilde{z}$ satisfying the constraints with the right-hand sides replaced by $15\varepsilon^{\dagger}$ and $8\varepsilon^{\dagger}$ respectively). Also, note that $\mathcal{K} \subseteq B(\Pi \hat{x}, 2\hat{r})$ from the same set of constraints. While one can show that a feasible point in $\mathcal{K}$ can be used to construct a valid terminal embedding, we construct a slightly weaker oracle $\mathcal{O}$. From Lemma B.2 and the fact that $\mathcal{K}$ contains a ball of radius $\varepsilon^{\dagger} \hat{r}$ and is contained in a ball of radius $2\hat{r}$, there is a procedure which, with $n^{o(1)}$ queries to $\mathcal{O}$ and $n^{o(1)}$ total additional computation, outputs a point $v^*$ on which $\mathcal{O}$ outputs FAIL. Our oracle is implemented by Algorithm 4. Through the rest of this subsection, we focus on proving the correctness of the oracle. Let $T' = \{Z, \{T_C\}_{C \in C_{\mathrm{low}}}, T_{\mathrm{rep}}, C_{\mathrm{low}}, C_{\mathrm{high}}, C_{\mathrm{rep}}, r_{\mathrm{apx}}\}$ and $m = |Z|$. First, we focus on the easy case when Algorithm 4 does not output FAIL.
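The search driven by such a weak separation oracle can be sketched with a standard cutting-plane (ellipsoid) loop. The code below is an illustrative stand-in, not the paper's Algorithm 4: `toy_oracle`, `target` and all numeric parameters are hypothetical, and the oracle simply tests membership in a small ball, playing the role of a body $\mathcal{K}$ that contains a small ball and sits inside a large one.

```python
import numpy as np

def ellipsoid_search(oracle, center, R, max_iters=500):
    """Cutting-plane search: the oracle either outputs FAIL (None) on v,
    meaning v is acceptable, or returns a vector g with <g, y - v> <= 0
    for all y in the feasible body K. We maintain an ellipsoid known to
    contain K and shrink it with central cuts until the center is accepted."""
    k = len(center)
    x = np.array(center, dtype=float)
    P = (R ** 2) * np.eye(k)  # E = {y : (y - x)^T P^{-1} (y - x) <= 1}
    for _ in range(max_iters):
        g = oracle(x)
        if g is None:  # oracle outputs FAIL: return the accepted point
            return x
        g = g / np.linalg.norm(g)
        Pg = P @ g
        denom = np.sqrt(g @ Pg)
        # standard central-cut ellipsoid update keeping {y : <g, y> <= <g, x>}
        x = x - (1.0 / (k + 1)) * Pg / denom
        P = (k * k / (k * k - 1.0)) * (
            P - (2.0 / (k + 1)) * np.outer(Pg, Pg) / denom ** 2)
    return None

# Hypothetical oracle: K is the ball of radius 0.3 around (2, 1); outside K,
# v - target is a valid separating direction.
target = np.array([2.0, 1.0])
def toy_oracle(v):
    return None if np.linalg.norm(v - target) <= 0.3 else (v - target)

v_star = ellipsoid_search(toy_oracle, center=[0.0, 0.0], R=5.0)
```

Since the feasible ball has radius $0.3$ inside a starting ball of radius $5$, the volume argument guarantees acceptance after a few dozen cuts, mirroring the $n^{o(1)}$-query bound from Lemma B.2.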

Lemma 3.6. If Algorithm 4 does not output FAIL on v, it outputs a separating hyperplane for v from K.
Proof. Note that this only happens when some $x$ or pair $(x, y)$ is returned by Algorithm 2, which may occur in one of two ways. Case 1: $\|v - \Pi x\| > (1 + 10\varepsilon^{\dagger}) \|q - x\|$ for the $x$ returned by the algorithm. In this case, the correctness of the procedure trivially follows from Req.
Case 2: For $x, y \in Z$, the vectors $v_1 = v_x$ and $v_2 = v_{y,x}$ satisfy $|\langle v_1, v_2 \rangle| > 20\varepsilon^{\dagger}$. The resulting inequality again yields a violator of Req, which proves correctness in this case as well.
This concludes the proof of the lemma.
Next, we consider the alternate case where Algorithm 4 outputs FAIL, in which we show that the input $v$ can be used to construct a valid terminal embedding for $q$ with respect to $Z$ and, consequently, for $X$ from the guarantees of Theorem 4.1.

Lemma 3.7. If Algorithm 4 outputs FAIL on $v$, then $z_q$ is a valid terminal embedding of $q$.
Proof. Note that since Algorithm 2 returned FAIL for all fixed scale data structures, we must have $\|v - \Pi \hat{x}\| \le (1 + 10\varepsilon^{\dagger}) \|q - \hat{x}\|$. As a consequence, we get that $\|z_q - (\Pi \hat{x}, 0)\| \le (1 + 10\varepsilon^{\dagger}) \|q - \hat{x}\|$. Now, we need to show that $z_q$ is a valid terminal embedding of $q$ for an arbitrary $x \in Z$. We first consider the case where $\|x - \hat{x}\| \ge 0.5 r^{\mathrm{term}}_{\mathrm{high}}$ as defined in Algorithm 3 for the node $T'$. For this point, we first bound $\|x - \hat{x}\|$, where the last inequality comes from the condition on $\|q - \hat{x}\|$ from Theorem 4.1 and the definition of $r^{\mathrm{term}}_{\mathrm{high}}$. The desired bound then follows, where the last inequality follows from the inequality proved in the previous display, the fact that $\Pi$ has $\varepsilon^{\dagger}$-convex hull distortion for $X$ and the relationship between $\varepsilon$ and $\varepsilon^{\dagger}$. We now consider the alternative case where $\|x - \hat{x}\| \le 0.5 r^{\mathrm{term}}_{\mathrm{high}}$. In this case, from the definition of $p$ in Algorithm 3, let $i^*$ be the smallest $i \in [p]$ such that $\|x - \hat{x}\| \le (1 + \gamma)^{i^*} r^{\mathrm{term}}_{\mathrm{low}}$. Note that $i^*$ is finite from our condition on $\|x - \hat{x}\|$. For this $i^*$, consider the fixed scale data structure $D_{T', i^*}$ constructed in Algorithm 3. Letting $r = (1 + \gamma)^{i^*} r^{\mathrm{term}}_{\mathrm{low}}$, we prove the lemma in two cases corresponding to the structure of $D_{T', i^*}$ as described in Lemma 3.4. Case 1: $\hat{x} \in A_z$ for some $z \in Z$. Note that this necessarily happens if $|N_{\mathrm{loc}}(x, r)| \ge m^{1-\rho_{\mathrm{rep}}}$.
In either case, the following two simple claims will be crucial in bounding the error terms. Claim 3.8. For all $x \in Z$, we have the following inequality. Proof. By re-arranging the above inequality, we obtain the result.
Proof. If $i^* = 0$, the claim follows since the relevant distance is at most $(10nd)^5 r^{\mathrm{term}}_{\mathrm{low}}$ from our assumption on $\|q - \hat{x}\|$ (Theorem 4.1). When $i^* > 0$, we have $(1 + \gamma)^{-1} r \le \|x - \hat{x}\| \le r$ and the conclusion follows from Claim 3.8.
In the first case, we get that $\hat{x}$ is assigned to some $z \in Z$ with $\|\hat{x} - z\| \le 2r$, which implies by the triangle inequality $\|x - z\| \le 3r$. We now prove another claim we will use through the rest of this proof. Claim 3.10. For all $w \in Z$, let $\tilde{w} = v_{w,\hat{x}}$ and $v = v_{\hat{x}}$. If $|\langle \tilde{w}, v \rangle| \le 20\varepsilon^{\dagger}$, the stated bound holds. Proof. In the corresponding chain of inequalities, the second inequality follows from the fact that $\Pi$ satisfies $\varepsilon^{\dagger}$-convex hull distortion for $X$, our claim on $\|z_q - (\Pi \hat{x}, 0)\|$ and our condition on $\langle \tilde{w}, v \rangle$, while the final inequality follows from Claim 3.8 and the fact that $\hat{x}$ is a $(1 + \varepsilon + o(1))$-approximate nearest neighbor of $q$. Factorizing the LHS and dividing by $\|w - q\|$ yields the desired result.
Returning to our proof, recall that $\|x - z\| \le 3r$ and, from Claim 3.9, $\|q - x\| \ge \frac{5}{12} r$. We claim, since $D_{T', i^*}$ returns FAIL, that $|\langle v_{x,z}, v_z \rangle| \le 4\varepsilon$. To see this, suppose $\langle v_{x,z}, v_z \rangle > 4\varepsilon$. In this case, $D_z$ on input $v_z$ returns $y$ such that $\|y - v_z\| \le (1 + \varepsilon + o(1))(1 - 2\varepsilon)\sqrt{2}$ from the fact that $D_z$ is successfully instantiated. From this, we get that $\|y - v_z\| \le \sqrt{2}(1 - 0.8\varepsilon)$, which contradicts the assumption that $D_{T', i^*}$ returns FAIL. The proof for $\langle v_z, v_{x,z} \rangle < -4\varepsilon$ is similar, replacing $v_z$ by $-v_z$ in the above argument and using the fact that we query $D_z$ with $-v_z$ as well. Hence, we have $|\langle v_z, v_{x,z} \rangle| \le 4\varepsilon$. Finally, we bound the deviation of $\|z_q - (\Pi \hat{x}, 0)\|$ from $\|q - x\|$, where the second inequality follows from the fact that $\Pi$ has $\varepsilon^{\dagger}$-convex hull distortion for $X$, Claim 3.10 and the fact that $|\langle v_z, v_{x,z} \rangle| \le 4\varepsilon$; the fourth inequality follows from the fact that $\|x - z\| \le 3r$; and the last and second-to-last inequalities follow from the fact that $(a + b)^2 \le 2(a^2 + b^2)$ and Claim 3.9. Factorizing the LHS now establishes the lemma in this case.
We now consider the alternate case where $|N_{\mathrm{loc}}(x, r)| < m^{1-\rho_{\mathrm{rep}}}$. For the fixed scale data structure $D_{T', i^*}$, recall the guarantees from Lemma 3.4. In particular, by the union bound, there exists a set $\mathcal{J} \subset [l]$ with $|\mathcal{J}| \ge 0.9l$ satisfying all three of the above requirements. For any $j \in \mathcal{J}$, from Algorithm 2, all the subsets in $h_j(x)$ are fully explored. For some $j \in \mathcal{J}$, consider $S \in h_j(x)$ such that $x \in S$. Now, if $x \notin A_{j,S}$ or $x \in W_{j,S}$, the conclusion of the lemma follows from Claim 3.10. Hence, the only case left to consider is the one where $x \in A_{j,S}$. Let $x$ be assigned to $w \in W_{j,S}$ with $\|x - w\| \le 4r$. In this case, we proceed identically to the previous case where $\hat{x}$ is assigned to some $z \in Z$. Since $D_{T', i^*}$ returns FAIL, $|\langle v_w, v_{x,w} \rangle| \le 4\varepsilon$. To see this, assume $\langle v_w, v_{x,w} \rangle > 4\varepsilon$; then $D_{j,S,w}$ on input $v_w$ returns $y$ such that $\|y - v_w\| \le (1 + \varepsilon + o(1))(1 - 2\varepsilon)\sqrt{2}$ from the fact that $D_{j,S,w}$ is successfully instantiated. From this, we get $\|y - v_w\| \le \sqrt{2}(1 - 0.8\varepsilon)$, which contradicts the assumption that $D_{T', i^*}$ returns FAIL. The proof when $\langle v_w, v_{x,w} \rangle < -4\varepsilon$ is similar, replacing $v_w$ by $-v_w$ and using the fact that we query $D_{j,S,w}$ with $-v_w$ as well. Hence, we have $|\langle v_w, v_{x,w} \rangle| \le 4\varepsilon$. As before, we bound the deviation of $\|z_q - (\Pi \hat{x}, 0)\|$ from $\|q - x\|$, where the second inequality follows from the fact that $\Pi$ has $\varepsilon^{\dagger}$-convex hull distortion for $X$, Claim 3.10 and the fact that $|\langle v_w, v_{x,w} \rangle| \le 4\varepsilon$; the fourth inequality follows from the fact that $\|x - w\| \le 4r$; and the last and second-to-last inequalities follow from the fact that $(a + b)^2 \le 2(a^2 + b^2)$ and Claim 3.9. Factorizing the LHS establishes the lemma in this case as well, concluding the proof.
Proof. Note that we may restrict ourselves to bounding the runtime of Algorithm 2 on a single fixed scale data structure, as we query at most $O(1)$ of them. For a single fixed scale data structure $D_{T', i}$ being queried, computing the sets $h_i(x)$ takes time $d n^{\rho_4 + o(1)}$. If $\hat{x}$ is assigned to a point $z$, the nearest neighbor procedure takes time $d n^{\rho_2 + o(1)}$. Otherwise, processing the unassigned points takes time $d n^{\rho_4 + o(1)}$ and, for the assigned points, there are at most $n^{1 - \rho_{\mathrm{rep}} + \rho_3 + o(1)}$ of them in at most $n^{\rho_4 + o(1)}$ sets. From the concavity of the function $f(x) = x^{\rho_2}$, the maximum time taken to query all of these nearest neighbor data structures is at most $d n^{\rho_4 + (1 + \rho_3 - \rho_4 - \rho_{\mathrm{rep}}) \rho_2 + o(1)}$. We now discuss how to decouple the dimensionality term from the term dependent on $n$.
As explained in Subsection 2.3, it suffices for our argument to use $\Pi'(x - y)$ instead of $x - y$ as the first component in the construction of the vectors used to instantiate the data structures $D_z$ and $D_{i,S,w}$ in Algorithm 1, for any $\Pi'$ satisfying the stated condition. From Theorem 5.1, if we instantiate $m = \Omega(d)$ JL sketches $\{\Pi_i\}_{i \in [m]}$ with $\Theta(\log n \log \log n)$ rows each, the condition is satisfied for at least $95\%$ of them with probability at least $1 - n^{-10}$. To construct our final violator detection subroutine, we instantiate $m$ copies of our violator detection algorithm for the projections of the data points with respect to each of the sketches $\Pi_i$. At query time, we simply sample $\Theta(\log n)$ of these sketches for a possible violator. We then check the validity of each of the returned candidates, which can be done in time $O(d)$ per candidate. Since $95\%$ of the sketches satisfy the above condition, at least one of the sampled sketches will with probability at least $1 - n^{-10}$, and hence the procedure satisfies the guarantees required of our oracle.

Adaptive Approximate Nearest Neighbor
In this section, we prove the following theorem regarding the existence of adaptive algorithms for approximate nearest neighbor search, based on adapting ideas from [CN20] to the nearest neighbor to near neighbor reduction. These data structures will play a crucial role in designing algorithms to compute terminal embeddings. Note again that the probability of success only depends on the random choices made by the data structure at query time, and may be made arbitrarily high by repetition assuming successful instantiation of the data structure. Also, the second property of the tuple returned by the data structure is irrelevant for computing an approximate nearest neighbor but plays a crucial part in our algorithm for computing terminal embeddings.
Theorem 4.1. Let $c > 1$ and $\rho_u, \rho_c > 0$. Then, there is a randomized procedure which, when instantiated with a dataset $X = \{x_i\}_{i=1}^n \subset \mathbb{R}^d$ and a $(\rho_u, \rho_c, c)$-Approximate Near Neighbor data structure, produces a data structure $(D, \mathcal{T})$ satisfying: 1. Given any $q \in \mathbb{R}^d$, $D$ produces $(x \in X, T_{\mathrm{res}} \in \mathcal{T})$ satisfying the stated guarantees, with an additional guarantee if $|Z| > 1$, with probability at least $1 - 1/\mathrm{poly}(n)$.

2. $\mathcal{T}$ is a valid Partition Tree of $X$ (Definition 2.7).
3. The space complexity of $D$ is $d n^{1+\rho_u+o(1)} \log 1/\delta$.
4. The runtime of $D$ on any $q \in \mathbb{R}^d$ is at most $d n^{o(1)} + n^{\rho_c + o(1)}$.
We will refer to any data structure satisfying the above guarantees as a $(\rho_u, \rho_c, c)$-Adaptive Approximate Nearest Neighbor (AANN) data structure. Through the rest of the section, we prove Theorem 4.1. In Subsection 4.1, we overview the construction of our data structure given a partition tree constructed by, say, Lemma A.4, and in Subsection 4.2, we show how to query the data structure and that it suffices to construct terminal embeddings for the set of points in the node where we terminate our traversal of the Partition Tree.

Constructing the Data Structure
In this subsection, we describe how the data structure for our adaptive nearest neighbor algorithm is constructed. The procedure is outlined in Algorithm 5. It takes as input a partition tree produced by Algorithm 8 satisfying the conclusion of Lemma A.4, a failure probability $\delta \in (0, 1)$ and the total number of points $n$, and sets $r_{\mathrm{low}} \leftarrow \frac{r_{\mathrm{apx}}}{C_{\mathrm{range}} \cdot (nd)^{10}}$ and $r_{\mathrm{high}} \leftarrow C_{\mathrm{range}} \cdot (nd)^{10} r_{\mathrm{apx}}$. To state the correctness guarantees on the data structure, we will require most of the ANN data structures to be accurate on an appropriately chosen discretization of $\mathbb{R}^d$; the reason for this will become clear when defining the operation of the query procedure. To construct the discrete set of points, consider a particular node $T' = \{Z, \{T_C\}_{C \in C_{\mathrm{low}}}, T_{\mathrm{rep}}, C_{\mathrm{low}}, C_{\mathrm{high}}, C_{\mathrm{rep}}, r_{\mathrm{apx}}\} \in \mathcal{T}$, and let $G(\nu)$ be the discrete subset of $\mathbb{R}^d$ whose coordinates are integral multiples of $\nu = \frac{\gamma}{1000(nd)^{20}} \cdot r_{\mathrm{low}}$, where $\gamma = \frac{c_{\mathrm{step}}}{\log^3 n}$ as in Algorithm 5. We define the grid of points corresponding to $T'$ as: $H_{T'} = \bigcup_{x \in Z} B(x, 10^6 \cdot (nd)^{20} \cdot r_{\mathrm{high}}) \cap G(\nu)$.
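The discretization itself is elementary: a point is snapped to $G(\nu)$ by coordinate-wise rounding to the nearest multiple of $\nu$, and membership in $H_{T'}$ additionally requires proximity to $Z$. The sketch below is illustrative only; the function names and the brute-force scan over $Z$ are our own stand-ins for the paper's construction.

```python
import numpy as np

def snap_to_grid(q, nu):
    """Round each coordinate of q to the nearest integral multiple of nu,
    i.e., map q to the closest point of the grid G(nu)."""
    q = np.asarray(q, dtype=float)
    return nu * np.round(q / nu)

def in_node_grid(p, Z, r_high, nu, n, d):
    """Membership test for the grid H_{T'} attached to a node with point set Z:
    p must lie on G(nu) and within 1e6 * (n*d)**20 * r_high of some z in Z
    (the exponents follow the paper's generous poly(nd) slack)."""
    on_grid = np.allclose(p, snap_to_grid(p, nu))
    radius = 1e6 * float(n * d) ** 20 * r_high
    near_Z = any(np.linalg.norm(p - z) <= radius for z in Z)
    return on_grid and near_Z
```

For instance, `snap_to_grid([0.26, -0.13], 0.1)` returns the grid point `[0.3, -0.1]`, which then lies in the node grid of any nearby $Z$.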
That is, $H_{T'}$ corresponds to the set of points in $G(\nu)$ within $10^6 \cdot (nd)^{20} r_{\mathrm{high}}$ of some point in $Z$. Finally, the set of points on which we would like to ensure the correctness of our procedure is $\mathcal{J} = \bigcup_{T' \in \mathcal{T}} H_{T'}$. We now state the main result concerning the correctness of Algorithm 5 (Lemma 4.2): its guarantees hold with probability at least $1 - \delta$ and, furthermore, the space complexity of $D$ is $O(n^{1+\rho_u+o(1)}(d + \log 1/\delta))$.

Proof.
We start by bounding the amount of space occupied by the data structure. For the nearest neighbor data structures instantiated, note that the space utilization of a single data structure with $n$ points scales as $O(n^{1+\rho_u+f(n)})$ where $f(n) = o(1)$. Let $T$ be such that $f(n) \le 1$ for all $n \ge T$ and let $M = \max_{i \in [T]} f(i)$. To account for the space occupied by the data structure, first consider nodes in the tree with fewer than $\log n$ points. Let $T' = \{Z, \{T_C\}_{C \in C_{\mathrm{low}}}, T_{\mathrm{rep}}, C_{\mathrm{low}}, C_{\mathrm{high}}, C_{\mathrm{rep}}, r_{\mathrm{apx}}\}$ be such a node. For this node, the space occupied by the nearest neighbor data structures is at most $O(|Z|^{1+\rho_u+M}(d + \log 1/\delta))$, as we instantiate $O(d + \log 1/\delta)$ nearest neighbor data structures in each node. Since $|Z| \le \log n$ and there are at most $O(n \log n)$ such nodes (Lemma A.4), the total amount of space occupied by these nodes is at most $\widetilde{O}(n(d + \log 1/\delta))$. Now, consider the alternate case where $|Z| \ge \log n$. Through similar reasoning, the total amount of space occupied by such a node is $O(|Z|^{1+\rho_u+f(|Z|)}(d + \log 1/\delta))$. Now, summing over all the nodes in the tree and using the convexity of the function $x \mapsto x^{1+\rho}$ for $\rho > 0$, we get that the total amount of space occupied by nodes with more than $\log n$ points is at most $O(n^{1+\rho_u+o(1)}(d + \log 1/\delta))$. This completes the bound on the space complexity of the data structure produced by Algorithm 5.
We will now finally establish the correctness guarantees required of $D$. To bound the size of $\mathcal{J}$, first consider a single $H_{T'}$ in the definition of $\mathcal{J}$. For a single term in the definition of $H_{T'}$, $V = B(x, 10^6 \cdot (nd)^{20} \cdot r_{\mathrm{high}}) \cap G(\nu)$, note that $V$ is a $\nu$-packing of $B(x, (10^6 + 1) \cdot (nd)^{20} \cdot r_{\mathrm{high}})$. Therefore, from standard bounds on packing and covering numbers and the definitions of $\nu$ and $r_{\mathrm{high}}$ in terms of $r_{\mathrm{apx}}$, we get that $|V| \le (nd)^{O(d)}$ [Ver18, Section 4.2]. By taking a union bound over the $O(n \log n)$ nodes in the tree and the at most $n$ points in each node, we get that $|\mathcal{J}| \le (nd)^{O(d)}$. Now, for a particular $z \in \mathcal{J}$ and a particular node $T' \in \mathcal{T}$ with data structure $D_{T'} = \{D_{i,j}\}_{i \in \{0\} \cup [l], j \in [s]}$, the probability that $D_{i,j}$ incorrectly answers the Approximate Near Neighbor query is at most $0.01$. Therefore, we have by Hoeffding's inequality that the probability that more than $0.05s$ of the $D_{i,j}$ answer $z$ incorrectly is at most $\delta/(10 \cdot |\mathcal{T}| \cdot |\mathcal{J}|)$ from our setting of $s$ and our bounds on $|\mathcal{J}|$ and $|\mathcal{T}|$. A union bound over $\mathcal{J}$ and the nodes of the tree establishes the lemma.

Querying the Data Structure and Proof of Theorem 4.1
The procedure to query the data structure returned by Algorithm 5 is described in Algorithm 6. The procedure takes as inputs the partition tree $\mathcal{T}$ and the data structure output by Algorithm 5, which has a separate data structure for each $T' \in \mathcal{T}$. The procedure recursively explores the nodes of $\mathcal{T}$, starting at the root and moving down the tree, and stops when the approximate nearest neighbor found is within $\mathrm{poly}(n) \cdot r_{\mathrm{apx}}$ of the points in the current node. In addition, in anticipation of its application towards computing terminal embeddings, we show that it is sufficient to construct a terminal embedding for the set of points in the node where the algorithm terminates. We now state the main lemma of this subsection, which shows both that the data point $x$ returned by Algorithm 6 is an approximate nearest neighbor of $q$ and that it suffices to construct a terminal embedding for the data points in the node of the Partition Tree where the algorithm terminates.
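The descent just described can be sketched concretely. The code below is a simplified stand-in for the Algorithm-6 traversal: exact nearest neighbor search over a node's point set replaces the ANN data structures, nodes are plain dictionaries, and the routing thresholds `r_low`/`r_high` and the toy tree are hypothetical.

```python
import numpy as np

def descend(node, q):
    """Sketch of an Algorithm-6 style traversal of the partition tree.
    Far queries (beyond r_high) are routed to the representative tree;
    queries at scale r_low descend into the matching low-scale cluster;
    everything else is answered at the current node."""
    Z = node["Z"]
    dists = [float(np.linalg.norm(q - z)) for z in Z]
    best = int(np.argmin(dists))
    if len(Z) > 1 and dists[best] > node["r_high"] and node.get("T_rep"):
        return descend(node["T_rep"], q)
    if len(Z) > 1 and dists[best] < node["r_low"]:
        for child in node.get("children", []):
            # recurse into the low-scale cluster containing the near neighbor
            if any(np.array_equal(Z[best], z) for z in child["Z"]):
                return descend(child, q)
    return Z[best], node

# A toy two-level tree: two well-separated clusters under a common root.
A = {"Z": [np.array([0.0, 0.0]), np.array([0.1, 0.0])],
     "r_low": 1e-3, "r_high": 1e3, "children": []}
B = {"Z": [np.array([10.0, 10.0]), np.array([10.0, 10.1])],
     "r_low": 1e-3, "r_high": 1e3, "children": []}
root = {"Z": A["Z"] + B["Z"], "r_low": 1.0, "r_high": 100.0, "children": [A, B]}
x, node = descend(root, np.array([0.04, 0.0]))
```

On the query `[0.04, 0.0]`, the traversal descends from the root into cluster `A` and answers from it, mirroring how the real algorithm terminates in the node whose points suffice for the terminal embedding.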
Lemma 4.3. Let $\mathcal{T}$ be a valid Partition Tree of $X$ and let $D = (\mathcal{T}, \{D_{T'}\}_{T' \in \mathcal{T}})$ be a data structure satisfying the conclusion of Lemma 4.2. Then, Algorithm 6, when run with inputs $D$ and any $q \in \mathbb{R}^d$, returns a tuple $(x, T_{\mathrm{res}})$ satisfying the stated approximation guarantee. Moreover, for $T_{\mathrm{res}} = \{Z, \{T_C\}_{C \in C_{\mathrm{low}}}, T_{\mathrm{rep}}, C_{\mathrm{low}}, C_{\mathrm{high}}, C_{\mathrm{rep}}, r_{\mathrm{apx}}\}$, if $y \in \mathbb{R}^k$ satisfies the stated conditions for some $\varepsilon^{\ddagger} \in [\varepsilon^{\dagger}, 1)$, then the stated conclusions hold, with an additional guarantee if $|Z| > 1$. Additionally, Algorithm 6 runs in time $O(n^{\rho_c + o(1)}(d + \log 1/\delta))$.
Proof. We first set up some notation. Note that Algorithm 6 traverses the partition tree $\mathcal{T}$ using the data structure $D$ defined in Algorithm 5. Let $T^{(0)}, \ldots, T^{(K)}$ denote the sequence of nodes traversed by the algorithm, with corresponding radii $r^{(k)}_{\mathrm{apx}}$. Note that $K \le \lceil \log n \rceil + 1$, as the number of data points drops by at least a factor of $2$ each time the algorithm explores a new node. To prove the first claim regarding the correctness of $\hat{x}$, let $r^* = \min_{x \in X} \|q - x\|$. We now have the following claim (Claim 4.4). Proof. We will prove the claim via induction on $i$. The base case where $i = 0$ is trivially true. Now, suppose that the claim holds for all nodes up to $T^{(k)}$ and we wish to establish it for $T^{(k+1)}$. For the algorithm to reach node $T^{(k+1)}$ from node $T^{(k)}$, one of two cases must have occurred. Let $q^{(k)}$ be the discretization of $q$ when $T^{(k)}$ is being processed. We first handle the case where $q^{(k)} \notin \mathcal{J}$: here the triangle inequality yields $\min_{z \in Z^{(k)}} \|q - z\| \ge 5 \cdot 10^5 \cdot (nd)^{20} \cdot r_{\mathrm{high}}$ and the first case occurs. Now assume that $q^{(k)} \in \mathcal{J}$. If the first case occurs, we again have by the triangle inequality that $\min_{z \in Z^{(k)}} \|q - z\| \ge 0.9 r^{(k)}_{\mathrm{high}}$, by the fact that Algorithm 6 recurses on $C^{(k)}_{\mathrm{rep}}$ and the conclusion of Lemma 4.2, which ensures that $q^{(k)}$ does not have a neighbor in $Z^{(k)}$ within a distance of $r^{(k)}_{\mathrm{high}}$. Now, let $z^*$ be the closest neighbor to $q$ in $Z^{(k)}$; that is, $z^* = \arg\min_{z \in Z^{(k)}} \|q - z\|$. We know that there exists $C \in C_{\mathrm{high}}$ such that $z^* \in C$. Furthermore, the triangle inequality and the structure of $CC(Z^{(k)}, 1000 n^2 r^{(k)}_{\mathrm{apx}})$ yield the required bound for all $z \in C$. Note that $C$ has a representative $\hat{z}$ in the construction of $C_{\mathrm{rep}}$. Along with the induction hypothesis, the above fact concludes the proof of the claim in this case. Finally, for the second case, again let $z^* = \arg\min_{z \in Z^{(k)}} \|q - z\|$ and let $C \in C^{(k)}_{\mathrm{low}}$ be such that $z^* \in C$. From the definition of $C^{(k)}_{\mathrm{low}}$ and Lemma A.4, we also know that for all $z \in Z^{(k)} \setminus C$, $\|z^* - z\| \ge \frac{r_{\mathrm{apx}}}{1000 n^3}$, as $C_{\mathrm{low}} \sqsubseteq CC(Z^{(k)}, \frac{r_{\mathrm{apx}}}{1000 n^3})$.
By the triangle inequality, we obtain a lower bound on $\|q - z\|$ for all $z \in Z^{(k)} \setminus C$. Therefore, all $x \in Z^{(k)}$ such that $\|q - x\| \le c r^{(k)}_{\mathrm{low}}$ must belong to $C$. Consequently, we recurse on $T_C$ with $z^* \in C \in C_{\mathrm{low}}$, which establishes the inductive hypothesis in this case as well.
To finish the proof of the first claim, let $T^{(K)} = \{Z, \{T_C\}_{C \in C_{\mathrm{low}}}, T_{\mathrm{rep}}, C_{\mathrm{low}}, C_{\mathrm{high}}, C_{\mathrm{rep}}, r_{\mathrm{apx}}\}$ with associated data structure $D_{T^{(K)}} = \{D_{i,j}\}_{i \in \{0\} \cup [l], j \in [s]}$. If $|Z| = 1$, a direct application of Claim 4.4 establishes the lemma. Alternatively, we must have $0 < i^* \le l$ when $T^{(K)}$ is being processed by Algorithm 6. From the guarantees on the $D_{i,j}$ (Lemma 4.2), we get $\min_{z \in Z} \|z - \bar{q}\| > (1 + \gamma)^{i^* - 1} r_{\mathrm{low}}$. Letting $x \in Z$ with $\|x - \bar{q}\| \le c(1 + \gamma)^{i^*} r_{\mathrm{low}}$ be the point returned by Algorithm 6, we get the required approximation bound by another application of the triangle inequality, where the first inequality is valid as $\min_{z \in Z} \|\bar{q} - z\| - \|\bar{q} - q\| > 0$ from our setting of $\nu$ and the condition on $\min_{z \in Z} \|\bar{q} - z\|$, and the final inequality similarly follows from our setting of $\nu$. Another application of Claim 4.4, together with the fact that $K \le \lceil \log n \rceil + 1$, establishes the first claim of the lemma.
To prove the second claim of the lemma, we prove an analogue of Claim 4.4 for terminal embeddings. Claim 4.5. The stated bounds hold for all $i \in \{0, \ldots, K\}$. Proof. We will prove the claim by reverse induction on $i$. For $i = K$, the claim is implied by the assumptions on $y$. Now, suppose the claim holds for $i = k + 1$ and we wish to establish it for $i = k$. As in the proof of Claim 4.4, we have two cases when $T^{(k)}$ is being processed. When the first case occurs, we have $\min_{z \in Z^{(k)}} \|q - z\| \ge 0.9 r_{\mathrm{high}}$. By the triangle inequality, the structure of $CC(Z^{(k)}, 1000 n^2 r_{\mathrm{apx}})$, the inductive hypothesis and the assumption on $y_i, y_j$ from the lemma, we get the upper bound, where the last inequality follows from the fact that $(1 + a)^b \ge (1 + a)^{b-1} + a$ for $a \ge 0$, $b \ge 1$. The other direction follows by a similar calculation. This establishes the claim in the first case. For the second case, let $z^* = \arg\min_{z \in Z^{(k)}} \|q - z\|$. As in the proof of Claim 4.4, we have $z^* \in Z^{(k+1)} = C \in C^{(k)}_{\mathrm{low}}$, a lower bound for all $z \in Z^{(k)} \setminus C$, and, furthermore, an upper bound on $\|q - z^*\|$. Now, the claim is already established for all $x_i \in Z^{(k+1)}$. For $x_i \in Z^{(k)} \setminus Z^{(k+1)}$, letting $y^* = y_j$ for $x_j = z^*$, we obtain the upper bound, where the third inequality is due to the fact that $K \le \lceil \log n \rceil + 1$ and the fact that $(1 + a)^b \le e^{ab}$ for $a, b \ge 0$, and the final two inequalities follow from the upper bound on $\|q - z^*\|$ and the lower bound on $\|x_i - q\|$ established previously. The lower bound follows by a similar computation. The assumption that $\varepsilon^{\ddagger} \ge \varepsilon^{\dagger}$ finishes the proof of the claim by induction.
Claim 4.5 establishes the second conclusion of the lemma from the fact that $K \le \lceil \log n \rceil + 1$. When $|Z| > 1$ in $T_{\mathrm{res}}$, the additional guarantee follows from the triangle inequality, the fact that $\|q - \hat{x}\| \ge r_{\mathrm{low}}$ and the definition of $\bar{q}$ in Algorithm 6. Finally, for the bound on the runtime, note that Algorithm 6 queries at most $K l (s + 1)$ near neighbor data structures, each built on at most $n$ points. Therefore, each query takes time at most $O(n^{\rho_c + o(1)})$. This proves our bound on the runtime of Algorithm 6.
Lemmas 4.2, 4.3 and A.4 now establish Theorem 4.1, barring the decoupling of $d$ from the $n^{\rho_c}$ term in the runtime guarantee. To do this, note that we may simply use the Median-JL data structure (see Theorem 5.1 in Section 5) by instantiating $l = O(d)$ JL sketches and, for each of them, instantiating an adaptive approximate nearest neighbor data structure in the low dimensional space. At query time, we simply pick $\Omega(\log n)$ of these sketches uniformly at random, query each of the corresponding nearest neighbor data structures with the projection of $q$ and return the best answer. This yields the improved runtimes of Theorem 4.1.
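The sketch-sampling step above can be illustrated end-to-end. In the toy code below, brute-force nearest neighbor search in each sketched space stands in for the low-dimensional AANN data structures, and all parameters (`m`, `k`, `samples`, the dataset) are hypothetical; the final candidate is verified in the original $d$-dimensional space, which is the $O(d)$-time step described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_sketched_indices(X, m, k):
    """Instantiate m independent Gaussian JL sketches and store the
    projected dataset for each (stand-ins for the low-dim AANN structures)."""
    Pis = [rng.normal(0, 1 / np.sqrt(k), size=(k, X.shape[1])) for _ in range(m)]
    return [(Pi, X @ Pi.T) for Pi in Pis]

def query(X, indices, q, samples=10):
    """Sample a few sketches, take each one's nearest neighbor in sketch
    space as a candidate, then return the candidate closest to q in the
    original space (the O(d)-per-candidate verification step)."""
    chosen = rng.choice(len(indices), size=min(samples, len(indices)),
                        replace=False)
    candidates = set()
    for i in chosen:
        Pi, PX = indices[i]
        candidates.add(int(np.argmin(np.linalg.norm(PX - q @ Pi.T, axis=1))))
    return min(candidates, key=lambda j: np.linalg.norm(X[j] - q))

X = rng.normal(size=(200, 50))
indices = build_sketched_indices(X, m=20, k=12)
q = X[7] + 0.01 * rng.normal(size=50)
best = query(X, indices, q)
```

Because the planted neighbor is vastly closer than any other point, every sampled sketch almost surely nominates index 7, and the full-dimensional verification confirms it.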

Median-JL
In this section, we prove the following theorem, which will enable us to speed up our algorithms by effectively projecting onto a low dimensional subspace, essentially decoupling the terms that depend on $d$ and $n$. Note that standard techniques for dimensionality reduction may not be used in our setting, as the queries may be chosen based on the data structure itself. For example, for the Johnson-Lindenstrauss lemma, the queries may be chosen orthogonal to the rows of the projection matrix, violating its correctness guarantees. We start by defining an approximate inner product condition we will use frequently through the rest of the section.
For $x, y, z \in \mathbb{R}^d$ and $\varepsilon > 0$, we say a matrix $\Pi$ satisfies AP-IP$(\varepsilon, x, y, z)$ if the corresponding approximate inner product condition holds. For a dataset $X \subset \mathbb{R}^d$, $\Pi$ satisfies AP-IP$(\varepsilon, X)$ if it satisfies AP-IP$(\varepsilon, x, y, z)$ for all $x, y, z \in X$.
By setting x = y in AP-IP, we see that the above theorem is a generalization of the standard Johnson-Lindenstrauss condition where in addition to maintaining distances between points, Π is also required to approximately maintain relative inner products between the points in the augmented dataset. To begin the proof, we start by recalling the standard Johnson-Lindenstrauss lemma (see, for example, [Ver18]).
Lemma 5.2. Let $\Pi \in \mathbb{R}^{k \times d}$ be distributed according to $\Pi_{i,j} \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, 1/k)$, let $\delta \in (0, 1)$ and $k \ge \frac{C \log 1/\delta}{\varepsilon^2}$ for some absolute constant $C > 0$. Then, for any $v \in \mathbb{R}^d$, we have $(1 - \varepsilon)\|v\| \le \|\Pi v\| \le (1 + \varepsilon)\|v\|$ with probability at least $1 - \delta$. Via a union bound over all pairs of points in the dataset $X$, we obtain the standard JL guarantee (Corollary 5.3): for all $x, y \in X$, $(1 - \varepsilon)\|x - y\| \le \|\Pi(x - y)\| \le (1 + \varepsilon)\|x - y\|$ with probability at least $1 - \delta$.
A second corollary we will make frequent use of is the following where we show that Π also approximately preserves inner products.
Corollary 5.4. Let $\Pi \in \mathbb{R}^{k \times d}$ be distributed according to $\Pi_{i,j} \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, 1/k)$ and $k \ge \frac{C \log 1/\delta}{\varepsilon^2}$. Then, for any $x, y \in \mathbb{R}^d$, $|\langle \Pi x, \Pi y \rangle - \langle x, y \rangle| \le O(\varepsilon) \|x\| \|y\|$ with probability at least $1 - \delta$. Proof. If either $x$ or $y$ is $0$, the conclusion follows trivially, so assume $x, y \ne 0$. By scaling both sides by $\|x\| \|y\|$, we may assume that $\|x\| = \|y\| = 1$. We now have the polarization identities $4\langle \Pi x, \Pi y \rangle = \|\Pi(x + y)\|^2 - \|\Pi(x - y)\|^2$ and $4\langle x, y \rangle = \|x + y\|^2 - \|x - y\|^2$. Subtracting the two identities and applying the union bound, the triangle inequality and Lemma 5.2, we get with probability at least $1 - \delta$ that $|\langle \Pi x, \Pi y \rangle - \langle x, y \rangle| \le O(\varepsilon)$, which implies the corollary.
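The $1/\sqrt{k}$ scaling of the inner product error is easy to observe empirically. The snippet below is a numerical illustration with arbitrary test vectors and parameters of our choosing, not part of the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(1)

def jl_inner_product_error(x, y, k, trials=100):
    """Empirical check of the Corollary 5.4 phenomenon: a Gaussian sketch
    with k rows preserves <x, y> up to an additive O(eps)*||x||*||y|| term
    with eps ~ 1/sqrt(k). (The proof controls this error through the
    polarization identity and Lemma 5.2.) Returns the median relative error."""
    d = len(x)
    errs = []
    for _ in range(trials):
        Pi = rng.normal(0, 1 / np.sqrt(k), size=(k, d))
        errs.append(abs((Pi @ x) @ (Pi @ y) - x @ y))
    return float(np.median(errs)) / (np.linalg.norm(x) * np.linalg.norm(y))

x = rng.normal(size=300)
y = rng.normal(size=300)
rel_err = jl_inner_product_error(x, y, k=400)
```

With $k = 400$ rows, the typical relative error is on the order of $1/\sqrt{k} = 0.05$, far below the trivial bound of $1$.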
We start by establishing a simple lemma on the norms of the matrices $\Pi_i$.
Proof. Let $G$ be a $\gamma$-net over $S^{d-1}$ with $\gamma = \frac{c}{(nd)^{10}}$ for some small enough constant $c$; furthermore, we may assume $|G| \le (nd)^{O(d)}$. Now, for $u \in G$, define the indicator $W_i(u)$. We have from Corollary 5.4 that $\mathbb{P}(W_i(u) = 1) \ge 0.995$. Therefore, the stated bound holds by an application of Hoeffding's inequality and a union bound over $G$. We now condition on the event from the previous equation and the conclusion of Lemma 5.5. To extend from the net $G$ to the whole sphere, consider $v \in S^{d-1}$ and its nearest neighbor $u \in G$, so that $\|v - u\| \le \gamma$. The required estimates follow, and, furthermore, the corresponding bounds hold for all $x, y \in X$. Since, for any $u \in G$, at least $0.98m$ of the $\Pi_i$ satisfy $W_i(u) = 1$ and $\|\Pi_i\|_F \le O(\sqrt{d})$ with probability at least $1 - \delta/2$, the conclusion of the lemma follows.
Finally to establish Theorem 5.1, we will need to use a more intricate multi-scale gridding argument than the one used to prove Lemma 5.6. Using a single grid of resolution γ does not suffice as the dataset, X, may contain pairs of points separated by much less than γ. Bounding the error of the embedding of q in terms of its nearest neighbor in the net does not suffice in such situations. On the other hand, using a finer net whose resolution is comparable to the minimum distance between the points in the dataset leads to a choice of m dependent on the aspect ratio of X. The multi-scale argument presented here allows us to circumvent these difficulties.
To define the grid, let $r_{ij} = \|x_i - x_j\|$ for $x_i, x_j \in X$ and let $G_{ij}$ be a $\gamma$-net of $B(x_i, 2(Cnd)^{10} r_{ij})$ with $\gamma = (Cnd)^{-10} \cdot r_{ij}$ for some large enough constant $C$. The grid in our argument consists of the union of all the $G_{ij}$; that is, $G = \bigcup_{i,j \in [n]} G_{ij}$. Now, for $u \in G$, define the indicator $W_i(u)$ as before. From Corollary 5.4, we have $\mathbb{P}(W_i(u) = 1) \ge 0.995$. Noting that $|G| \le (2nd)^{O(d)}$, we have by Hoeffding's inequality and the union bound that, with probability at least $1 - \delta/4$, $\sum_{i=1}^m W_i(u) \ge 0.99m$ for all $u \in G$. For the rest of the argument, we also condition on the conclusions of Lemma 5.6; note that this joint event occurs with probability at least $1 - \delta$ by the union bound. Letting $Y_i(u, v)$ denote the indicator in the corresponding expression, we condition on the above event for the rest of the proof. Let $q \in \mathbb{R}^d$ and let $x_q = \arg\min_{x \in X} \|q - x\|$ be its closest neighbor in $X$. Note that the case $q = x_q$ is already covered by the condition on the $W_i$, so we assume $q \ne x_q$. With $v_q = \frac{q - x_q}{\|q - x_q\|}$ and $\bar{q} = \arg\min_{u \in G} \|q - u\|$, let $\mathcal{J}(q) = \{i : Y_i(\bar{q}, v_q) = 1\}$. We then prove the required estimates for all $i \in \mathcal{J}(q)$ by a case analysis, using the inequalities implied by the definition of $\mathcal{J}$. The cases thus enumerated establish the theorem assuming $|\mathcal{J}(q)| \ge 0.95m$ for all $q \in \mathbb{R}^d$. As shown before, this occurs with probability at least $1 - \delta$, concluding the proof of the theorem.
Our algorithms also do not require computing $CC(X, r)$ exactly; appropriate refinements and coarsenings suffice. We first restate a simple lemma from [HIM12]: Lemma A.1. Given $X = \{x_i\}_{i=1}^n \subset \mathbb{R}^d$ and $\delta \in (0, 1)$, there is a randomized algorithm, CompRmed, that runs in time $O(nd \log 1/\delta)$ and outputs an estimate $r_{\mathrm{apx}}$ satisfying $\mathbb{P}[r_{\mathrm{apx}} \ge r_{\mathrm{med}}(X)] = 1$ and $\mathbb{P}[r_{\mathrm{apx}} \le n \cdot r_{\mathrm{med}}(X)] \ge 1 - \delta$.
Proof. Let $C \in CC(X, r_{\mathrm{med}}(X))$ be such that $|C| \ge n/2$. Picking a point $x$ uniformly at random from $X$ lands in $C$ with probability at least $1/2$. Now, we compute the distances from $x$ to all points of $X$ and output their median, $r_{\mathrm{apx}}$. Conditioned on $x \in C$, we have by the triangle inequality that $r_{\mathrm{apx}} \le n \cdot r_{\mathrm{med}}(X)$, which proves the second claim with probability at least $1/2$. For the first claim, note that $x$ belongs to a connected component of $CC(X, r_{\mathrm{apx}})$ of size at least $n/2$; this establishes the first claim of the lemma. Repeating this procedure $\Omega(\log 1/\delta)$ times and taking the minimum of the returned estimates establishes the lemma by an application of Hoeffding's inequality.
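The estimator just described is a few lines of code. The sketch below follows the proof's recipe (random pivot, median of its distances, minimum over repetitions); the constant `C` and the clustered test data are hypothetical choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def comp_rmed(X, delta=0.1, C=10):
    """Sketch of the CompRmed estimator (Lemma A.1): repeat O(log 1/delta)
    times -- pick a uniformly random point, compute the median of its
    distances to the dataset -- and return the minimum of the medians."""
    n = len(X)
    reps = int(np.ceil(C * np.log(1 / delta)))
    estimates = []
    for _ in range(reps):
        x = X[rng.integers(n)]
        estimates.append(np.median(np.linalg.norm(X - x, axis=1)))
    return float(min(estimates))

# A dataset with a dominant tight cluster: a pivot inside the big cluster has
# a small median distance, so the minimum over repetitions is small.
X = np.vstack([rng.normal(0, 0.1, size=(80, 2)),
               rng.normal(100, 0.1, size=(20, 2))])
r_est = comp_rmed(X)
```

Since at least one of the $\approx 24$ pivots lands in the $80$-point cluster with overwhelming probability, the returned estimate reflects that cluster's small scale rather than the $\sim 140$ separation between clusters.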
Proof. The randomized algorithm is detailed in the following pseudocode.
Algorithm 7 ConstructPartition(X, r, δ) 1: Input: Point set X = {x i } n i=1 ⊂ R d , Resolution r, Failure Probability δ 2: K ← 10 log(n/δ), τ ← 10r, ν ← 1000n 2 r 3: V ← X, E ← φ 4: for k = 1 : K do i ← max(i + 1, i n ) 13: end while 14: end for 15: C ← ConnectedComponents(V, E) 16: Return: C By the definition of Algorithm 7, we see that for every (x, y) ∈ E, we must have x − y ν. Therefore, we obtain C refines CC(X, 1000n 2 r). We now show that CC(X, r) refines C. To do this, we will need the following claim: Proof. We will first assume that the functions CompRmed and ConstructPartition run successfully (that is, they satisfy the conclusions of Lemmas A.1 and A.2 respectively) in every recursive call of Algorithm 8 and then finally bound the probability of this event. Note that when CompRmed runs successfully, r apx always satisfies r med r apx nr med for every recursive call of Algorithm 8 (Lemma A.1). Together with the correctness of ConstructPartition (Lemma A.2), we get that: CC(Z, 1000n 2 r apx ) ⊑ C high ⊑ CC(Z, r apx ) ⊑ CC(Z, r med ) ⊑ CC Z, r apx 10n ⊑ C low ⊑ CC Z, r apx 1000n 3 for every node T ′ = Z, {T C } C∈C low , T rep , C low , C high , C rep , r apx ∈ T . Furthermore, for any such T ′ , we get from the fact that CC(Z, r med ) ⊑ C low , that all C ∈ C low satisfy |C| |Z|/2. In addition, the definition of r med and the fact that C high ⊑ CC(Z, r med ) yield |C rep | |Z|/2. To bound, first note that |C rep | |C low | from the fact that C high ⊑ C low and the construction of C rep . To bound Size(T ), define B(n) as follows: From the definition of B(n), we see that B(n) is monotonic in n and from this, we get that B(n) is an upper bound on Size(T ). We now recall the following claim from [HIM12]: HIM12]). For all n 3, B(n) Cn log n.
The above claim establishes the bound on $\mathrm{Size}(\mathcal{T})$. Finally, we bound the probability that any execution of CompRmed or ConstructPartition fails. We start with the first $5B(n)$ runs: from the definition of $\delta^{\dagger}$, the probability that any of the first $5B(n)$ runs of CompRmed and ConstructPartition fail is at most $\delta$ by the union bound. However, the preceding argument shows that the algorithm terminates with fewer than $B(n)$ recursive calls if none of the executions of CompRmed and ConstructPartition fail. Therefore, the probability that any execution of CompRmed or ConstructPartition fails during the run of the algorithm is at most $\delta$. This yields the previously derived conclusions with probability at least $1 - \delta$.

Algorithm 8 ConstructPartitionTree(Z, n, δ)
3: Return: $(X, \phi, \phi, \phi, \phi, \phi)$
4: end if
5: $\delta^{\dagger} \leftarrow \frac{c_{\mathrm{prob}} \delta}{n^2}$
6: $r_{\mathrm{apx}} \leftarrow \mathrm{CompRmed}(Z, \delta^{\dagger})$
7: $C_{\mathrm{low}} \leftarrow \mathrm{ConstructPartition}(Z, r_{\mathrm{apx}}/(1000n^3), \delta^{\dagger})$, $C_{\mathrm{high}} \leftarrow \mathrm{ConstructPartition}(Z, r_{\mathrm{apx}}, \delta^{\dagger})$
8: For $C \in C_{\mathrm{low}}$, let $T_C \leftarrow \mathrm{ConstructPartitionTree}(C, n, \delta)$
9: For $C \in C_{\mathrm{high}}$, pick a representative $x \in C$ and add it to $C_{\mathrm{rep}}$
10: $T_{\mathrm{rep}} \leftarrow \mathrm{ConstructPartitionTree}(C_{\mathrm{rep}}, n, \delta)$
11: Return: $(Z, \{T_C\}_{C \in C_{\mathrm{low}}}, T_{\mathrm{rep}}, C_{\mathrm{low}}, C_{\mathrm{high}}, C_{\mathrm{rep}}, r_{\mathrm{apx}})$

B Miscellaneous Results
In this section, we develop some standard tools needed for our constructions. In Appendix B.1, we recall some basic facts about the Ellipsoid algorithm for convex optimization [NY83, Ber16] and analyze it when it is instantiated with a weak oracle, as in our terminal embedding construction.
where the final inequality follows from the fact that $\frac{x + y}{2} \in \mathrm{Conv}(T)$, the fact that $\|x\|, \|y\| \le 1$ and the assumption that $\Pi$ has $\varepsilon$-convex hull distortion for $X$. A similar inequality for the second term yields $|\langle \Pi x, \Pi y \rangle - \langle x, y \rangle| \le \frac{1}{4}(12\varepsilon + 12\varepsilon) = 6\varepsilon$, concluding the proof of the lemma.