A simple polynomial-time approximation algorithm for the total variation distance between two product distributions

We give a simple polynomial-time approximation algorithm for the total variation distance between two product distributions.


Introduction
The total variation (TV) distance is a fundamental metric for measuring the difference between two distributions. It is essentially the $\ell_1$ distance. Unlike many other quantities used for similar purposes, such as the relative entropy and the $\chi^2$-divergence, the TV distance does not tensorise over product distributions. In fact, it was discovered recently that, somewhat surprisingly, exact computation of the total variation distance, even between product distributions over the Boolean domain, is #P-hard [1].
This leaves open the question of the approximation complexity of the TV distance. In [1], the authors give polynomial-time randomised approximation algorithms in two special cases over the Boolean domain: when one of the distributions has all marginals over 1/2 and dominates the other, or when one of the distributions has a constant number of distinct marginals. Their method is based on Dyer's dynamic programming algorithm for approximating the number of knapsack solutions [2].
In this note, we give a simple polynomial-time approximation algorithm for the total variation distance between two product distributions. Our algorithm is based on the Monte Carlo method and requires no further restrictions. To approximate the TV distance, the naïve Monte Carlo algorithm works well when the two distributions are sufficiently far apart. However, when the TV distance is exponentially small, naïve Monte Carlo may require exponentially many samples to return an accurate estimate.
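To make the naïve approach concrete, here is a minimal sketch for Bernoulli product distributions over $\{0,1\}^n$, using the identity $d_{TV}(P,Q) = \mathbb{E}_{x \sim P}[\max\{1 - Q(x)/P(x), 0\}]$; the function names are illustrative, not part of the note.

```python
def bernoulli_product_prob(theta, x):
    """Probability of the bit string x under the product of Bernoulli(theta_i)."""
    prob = 1.0
    for t, b in zip(theta, x):
        prob *= t if b else 1.0 - t
    return prob

def naive_tv_estimate(p, q, num_samples, rng):
    """Naive Monte Carlo estimate of d_TV(P, Q) for Bernoulli products,
    based on d_TV(P, Q) = E_{x ~ P}[max(1 - Q(x)/P(x), 0)]."""
    total = 0.0
    for _ in range(num_samples):
        # sample x from P; P(x) > 0 holds for every x sampled this way
        x = [1 if rng.random() < t else 0 for t in p]
        px = bernoulli_product_prob(p, x)
        qx = bernoulli_product_prob(q, x)
        total += max(1.0 - qx / px, 0.0)
    return total / num_samples
```

Each sample contributes a value in $[0,1]$, so the estimate concentrates quickly when $d_{TV}$ is large, but an exponentially small $d_{TV}$ would need exponentially many samples to distinguish from zero.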
Our idea is to consider a distribution that can be efficiently sampled from and yet boosts the probability that the two samples are different. Ideally, we would want to use the optimal coupling, but that is difficult to compute. Instead, we use the coordinate-wise greedy coupling as a proxy, in which each coordinate is coupled optimally and independently. We further condition on the (potentially very unlikely) event that the two samples are different. Normally, conditioning on an unlikely event is a bad move, since computational tasks become hard. Here, however, they remain easy thanks to the independence of the coordinates under the coupling. With this conditional distribution, our estimator is the ratio between the probabilities of the sampled assignment under the optimal coupling and under the greedy coupling. We show that this estimator is always bounded from above by 1 and that its expectation is at least $1/n$, where $n$ is the number of coordinates. This means that the standard Monte Carlo method succeeds with high probability using only polynomially many samples.
One remaining question is whether a deterministic approximation algorithm exists for the TV distance. The answer might be positive, because of the connection with counting knapsack solutions established by Bhattacharyya, Gayen, Meel, Myrisiotis, Pavan, and Vinodchandran [1], and the deterministic approximation algorithms for the latter problem by Gopalan, Klivans, Meka, Štefankovič, Vempala, and Vigoda [3, 4, 5].

Preliminaries
Let Ω be a (finite) state space, and let $P$ and $Q$ be two distributions over Ω. The total variation distance is defined by

    d_{TV}(P, Q) := \frac{1}{2} \sum_{x \in \Omega} |P(x) - Q(x)| = \sum_{x \in \Omega} \max\{P(x) - Q(x), 0\} = \min_{\mathcal{C}} \Pr_{\mathcal{C}}[X \neq Y],    (1)

where the minimum is over all couplings $\mathcal{C}$ of $(X, Y)$ with $X \sim P$ and $Y \sim Q$. The last equality in (1) holds because (1) for any valid coupling $\mathcal{C}$, it holds that $\Pr_{\mathcal{C}}[X = Y = x] \le \min\{P(x), Q(x)\}$; (2) to achieve the optimal coupling, every $x$ must achieve the equality. We have

    d_{TV}(P, Q) = \sum_{x \in \Omega} \left( P(x) - \min\{P(x), Q(x)\} \right).    (2)
From now on, let $P = \prod_{i=1}^{n} P_i$ and $Q = \prod_{i=1}^{n} Q_i$ be product distributions over $\Omega = \prod_{i=1}^{n} \Omega_i$. Let $\mathcal{C}$ be the coordinate-wise greedy coupling. Namely, for each coordinate $i \in [n]$ and $c \in \Omega_i$, $\Pr_{\mathcal{C}}[X_i = Y_i = c] = \min\{P_i(c), Q_i(c)\}$, and the remaining probability can be assigned arbitrarily as long as $\mathcal{C}$ is a valid coupling (but each coordinate is independent). In other words, for each $i \in [n]$, $\mathcal{C}$ couples $X_i$ and $Y_i$ optimally and independently. Note that

    \Pr_{\mathcal{C}}[X \neq Y] = 1 - \prod_{i=1}^{n} \left( 1 - d_{TV}(P_i, Q_i) \right)    (3)

can be computed exactly.
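Since the greedy coupling treats coordinates independently, the agreement probabilities multiply, so its disagreement probability has a closed form. A minimal sketch for Bernoulli marginals, where $d_{TV}(P_i, Q_i) = |p_i - q_i|$ (illustrative names):

```python
def greedy_coupling_disagreement_prob(p, q):
    """Pr[X != Y] under the coordinate-wise greedy coupling of two Bernoulli
    product distributions with marginal parameters p and q: each coordinate
    is coupled optimally and independently, so agreement probabilities multiply."""
    agree = 1.0
    for pi, qi in zip(p, q):
        # per-coordinate optimal coupling agrees with probability 1 - |p_i - q_i|
        agree *= 1.0 - abs(pi - qi)
    return 1.0 - agree
```

This quantity is exactly the normalising constant needed later, computable in $O(n)$ time.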
Consider the distribution $\nu$ such that

    \nu(x) := \frac{\Pr_{\mathcal{C}}[X = x \wedge X \neq Y]}{\Pr_{\mathcal{C}}[X \neq Y]}.    (4)

We may assume that $P$ and $Q$ are not identical, as otherwise the algorithm just outputs 0. This makes sure that the distribution $\nu$ is well-defined. We can draw random samples from $\nu$ efficiently. We show how to compute the numerator of (4) next, and the denominator can be computed similarly.
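One way to sample from $\nu$, sketched below for Bernoulli products, is to draw the disagreement indicators coordinate by coordinate, conditioned on at least one disagreement occurring, and then fill in each $X_i$ from the appropriate conditional marginal. This is an illustrative sketch (it assumes $P \neq Q$, so the conditioning event has positive probability), not the only possible implementation.

```python
def sample_from_nu(p, q, rng):
    """Draw X ~ nu: a sample from the greedy coupling of two Bernoulli product
    distributions, conditioned on the (possibly very rare) event X != Y.
    Assumes p != q so that the conditioning event has positive probability."""
    n = len(p)
    d = [abs(pi - qi) for pi, qi in zip(p, q)]  # d_TV(P_i, Q_i) for Bernoullis
    # suffix[i] = Pr[some coordinate j >= i disagrees] = 1 - prod_{j>=i} (1 - d_j)
    suffix = [0.0] * n
    agree_tail = 1.0
    for i in range(n - 1, -1, -1):
        agree_tail *= 1.0 - d[i]
        suffix[i] = 1.0 - agree_tail
    x = []
    need = True  # still waiting for the first disagreeing coordinate
    for i in range(n):
        # disagreement probability, conditioned on >= 1 disagreement overall
        pr_dis = d[i] / suffix[i] if need else d[i]
        if rng.random() < pr_dis:
            need = False
            # given X_i != Y_i, all disagreement mass sits where P_i exceeds Q_i
            x.append(1 if p[i] > q[i] else 0)
        else:
            # given X_i = Y_i, X_i = 1 with probability min(p_i, q_i) / (1 - d_i)
            x.append(1 if rng.random() < min(p[i], q[i]) / (1.0 - d[i]) else 0)
    return x
```

The sequential conditioning costs $O(n)$ per sample, regardless of how small the disagreement probability is; this is exactly what rejection sampling from the coupling cannot achieve.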

By definition,

    \Pr_{\mathcal{C}}[X = x \wedge X \neq Y] = \Pr_{\mathcal{C}}[X = x] - \Pr_{\mathcal{C}}[X = x \wedge X = Y].
In the coupling $\mathcal{C}$, every pair of $X_i$ and $Y_i$ is coupled optimally and independently. We have (by (1))

    \Pr_{\mathcal{C}}[X = x \wedge X = Y] = \prod_{i=1}^{n} \min\{P_i(x_i), Q_i(x_i)\},    (5)

and $\Pr_{\mathcal{C}}[X = x] = P(x)$. Let $X$ be a random sample from $\nu$. Now consider the following estimator:

    f(x) := \frac{P(x) - \min\{P(x), Q(x)\}}{\Pr_{\mathcal{C}}[X = x \wedge X \neq Y]},    (6)

so that

    \mathbb{E}_{\nu} f = \sum_{x \in \Omega} \nu(x) f(x) = \frac{1}{\Pr_{\mathcal{C}}[X \neq Y]} \sum_{x \in \Omega} \left( P(x) - \min\{P(x), Q(x)\} \right) = \frac{d_{TV}(P, Q)}{\Pr_{\mathcal{C}}[X \neq Y]},    (7)

where the second equality is due to (4) and the third is due to (2). This estimator is well-defined, because when $\Pr_{\mathcal{C}}[X = x \wedge X \neq Y] = 0$, $\nu(x) = 0$ as well and such $x$ will not be drawn. Since $\nu(X) > 0$, it holds that $\Pr_{\mathcal{C}}[X = X \wedge X \neq Y] > 0$ at the sampled point. Using (5), we have

    f(x) = \frac{P(x) - \min\{P(x), Q(x)\}}{P(x) - \prod_{i=1}^{n} \min\{P_i(x_i), Q_i(x_i)\}}.

We have the following:

    \frac{1}{n} \le \mathbb{E}_{\nu} f,    (8)

    0 \le f(x) \le 1 \quad \text{for all } x \text{ in the support of } \nu.    (9)
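The explicit formula for $f$ can be evaluated in linear time for Bernoulli products, since both $P(x)$ and the coordinate-wise minima factorise. A sketch with illustrative names:

```python
def estimator_f(p, q, x):
    """Evaluate f(x) = (P(x) - min{P(x), Q(x)}) /
    (P(x) - prod_i min{P_i(x_i), Q_i(x_i)}) for Bernoulli product
    distributions; well-defined whenever x lies in the support of nu."""
    px, qx, coupled = 1.0, 1.0, 1.0
    for pi, qi, b in zip(p, q, x):
        pb = pi if b else 1.0 - pi   # P_i(x_i)
        qb = qi if b else 1.0 - qi   # Q_i(x_i)
        px *= pb
        qx *= qb
        coupled *= min(pb, qb)       # factor of Pr_C[X = Y = x]
    return (px - min(px, qx)) / (px - coupled)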
For (8), note that $d_{TV}(P_i, Q_i) \le d_{TV}(P, Q)$ for every $i \in [n]$, since any coupling of $P$ and $Q$ induces a coupling of $P_i$ and $Q_i$. On the other hand, by the union bound,

    \Pr_{\mathcal{C}}[X \neq Y] \le \sum_{i=1}^{n} \Pr_{\mathcal{C}}[X_i \neq Y_i] = \sum_{i=1}^{n} d_{TV}(P_i, Q_i) \le n \cdot d_{TV}(P, Q),    (10)

which together with the identity $\mathbb{E}_{\nu} f = d_{TV}(P, Q) / \Pr_{\mathcal{C}}[X \neq Y]$ implies $\mathbb{E}_{\nu} f \ge 1/n$. For (9), the lower bound is trivial. For the upper bound, we only need to consider $x \in \Omega^{+} := \{x \in \Omega : P(x) > Q(x)\}$. In this case

    f(x) = \frac{P(x) - Q(x)}{P(x) - \prod_{i=1}^{n} \min\{P_i(x_i), Q_i(x_i)\}}.
Since $\mathcal{C}$ couples each coordinate independently,

    \prod_{i=1}^{n} \min\{P_i(x_i), Q_i(x_i)\} \le \prod_{i=1}^{n} Q_i(x_i) = Q(x),

so the denominator above is at least $P(x) - Q(x)$, and hence $f(x) \le 1$. This finishes the proof of (9).
Let $X^{(1)}, \dots, X^{(m)}$ be independent samples from $\nu$ with $m := \lceil 4n/\epsilon^2 \rceil$, and let $\bar{f} := \frac{1}{m} \sum_{j=1}^{m} f(X^{(j)})$. We claim that

    \Pr\left[ |\bar{f} - \mathbb{E}_{\nu} f| \ge \epsilon \cdot \mathbb{E}_{\nu} f \right] \le \frac{1}{4}.    (11)

Assuming that (11) holds, by the Chernoff bound, the median of $O(\log(1/\delta))$ independent copies of $\bar{f}$ is within a $(1 \pm \epsilon)$ factor of $\mathbb{E}_{\nu} f$ with probability at least $1 - \delta$. Multiplying this estimate by $\Pr_{\mathcal{C}}[X \neq Y]$, which can be computed exactly, yields a $(1 \pm \epsilon)$-approximation of $d_{TV}(P, Q) = \mathbb{E}_{\nu} f \cdot \Pr_{\mathcal{C}}[X \neq Y]$. Finally, we prove the claim (11). Note that the expectation and the variance of the random variable $\bar{f}$ satisfy $\mathbb{E} \bar{f} = \mathbb{E}_{\nu} f$ and $\mathrm{Var}\, \bar{f} = \frac{1}{m} \mathrm{Var}\, f$. Moreover, $\mathrm{Var}\, f \le \mathbb{E}[f^2] \le \mathbb{E}_{\nu} f$, since $0 \le f \le 1$ by (9). By Chebyshev's inequality,

    \Pr\left[ |\bar{f} - \mathbb{E}_{\nu} f| \ge \epsilon \cdot \mathbb{E}_{\nu} f \right] \le \frac{\mathrm{Var}\, f}{m \epsilon^2 (\mathbb{E}_{\nu} f)^2} \le \frac{1}{m \epsilon^2 \, \mathbb{E}_{\nu} f} \le \frac{n}{m \epsilon^2} \le \frac{1}{4},

where the third inequality uses (8).
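Putting the pieces together for Bernoulli product distributions, the following is a self-contained sketch of the whole algorithm: sample from the conditioned greedy coupling, average the estimator, boost with the median trick, and rescale by the exactly computed disagreement probability. The function names, the constant 4, and the number of repetitions are illustrative choices, not prescribed by the analysis beyond their asymptotics.

```python
import random
from statistics import median

def tv_mc(p, q, epsilon=0.1, reps=9, rng=None):
    """Monte Carlo (1 +/- epsilon)-approximation of d_TV for Bernoulli product
    distributions: median of `reps` sample means, each over m ~ 4n/epsilon^2
    draws of f(X) with X ~ nu, scaled by the exact Pr_C[X != Y]."""
    rng = rng or random.Random()
    n = len(p)
    d = [abs(a - b) for a, b in zip(p, q)]
    # Pr_C[X != Y] under the greedy coupling, computed exactly
    pr_diff = 1.0
    for di in d:
        pr_diff *= 1.0 - di
    pr_diff = 1.0 - pr_diff
    if pr_diff == 0.0:
        return 0.0  # P and Q are identical
    # suffix[i] = Pr[some coordinate j >= i disagrees]
    suffix = [0.0] * n
    tail = 1.0
    for i in range(n - 1, -1, -1):
        tail *= 1.0 - d[i]
        suffix[i] = 1.0 - tail

    def one_sample():
        # draw X ~ nu and evaluate f(X) in a single pass over the coordinates
        px = qx = coupled = 1.0
        need = True
        for i in range(n):
            pr_dis = d[i] / suffix[i] if need else d[i]
            if rng.random() < pr_dis:
                need = False
                b = 1 if p[i] > q[i] else 0  # forced value on disagreement
            else:
                b = 1 if rng.random() < min(p[i], q[i]) / (1.0 - d[i]) else 0
            pb = p[i] if b else 1.0 - p[i]
            qb = q[i] if b else 1.0 - q[i]
            px, qx, coupled = px * pb, qx * qb, coupled * min(pb, qb)
        return (px - min(px, qx)) / (px - coupled)

    m = int(4 * n / epsilon ** 2) + 1
    means = [sum(one_sample() for _ in range(m)) / m for _ in range(reps)]
    return pr_diff * median(means)
```

The total sample complexity is $O(n \log(1/\delta)/\epsilon^2)$ draws, each costing $O(n)$ time, so the whole procedure runs in polynomial time even when $d_{TV}(P, Q)$ is exponentially small.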