Theory of Computing Blog Aggregator

I recently finished my qualifying exam for which I surveyed worst-case to average-case reductions for problems in NP. As part of my survey, I reviewed recent results on average-case hardness in P (a.k.a. average-case fine-grained hardness). I would like to give an overview of some of these results in a series of blog posts, and I want to start by giving some motivations.

Why do we care about average-case fine-grained hardness?

(i) We hope to explain why we are struggling to find faster algorithms for problems in P such as DNA sequencing even for some random large datasets.

(ii) Average-case hard problems in P is useful to cryptography. For example, constructing one-way functions (i.e., functions that are easy to compute but hard to invert) based on worst-case complexity assumptions such as NP \not\subseteq BPP has been a long-standing open question. In this context, “easy” means polynomial-time computable. If we consider “easy” to be computable in time like O(n^{100}) instead, then a natural alternative question is whether we can construct such one-way functions based on worst-case fine-grained complexity assumptions. Another example is proof of work [DN92]. When a miner tries to mine the cryptocurrency, the miner is asked to solve a random puzzle, that is average-case hard but still tractable, and then the miner needs to prove they have indeed solved the puzzle through efficient protocols. An interesting and more recent follow-up is proof of useful work [BRSV17b], which proposes that instead of wasting computing power on a random puzzle that comes from nowhere, we can first reduce a computational task of practical interest to multiple random puzzles and then ask the miner to solve those puzzles.

The most common approach for proving average-case fine-grained hardness is arguably worst-case to average-case reduction, i.e., reducing an arbitrary instance to a number of random instances of which the distribution is polynomial-time sampable. Before I give concrete examples, I want to describe a general recipe for designing such worst-case to average-case reductions. (Some reader might notice that the step 3 of the recipe is same as the argument for proving computing permanent is self-reducible [L91], which essentially uses local decoding for Reed-Muller codes.)

General recipe for worst-case to average-case reductions:

  1. Choose our favorite hard problem L:\{0,1\}^n\to\mathbf{Z}_{\ge 0} in P.
  2. Construct a low-degree (degree-d) polynomial f_{L} on \mathbf{F}_{p}^n such that f_{L}(x)=L(x) for all x\in\{0,1\}^n.
  3. To solve L for worst-case x\in\{0,1\}^n: sample a uniformly random y\in\mathbf{F}_{p}^n, compute f_{L}(x+t_1 y),\dots,f_{L}(x+t_{d+1} y) for distinct nonzero t_1,\dots,t_{d+1} using average-case solver (note each x+t_{i} y is uniformly random), interpolate univariate polynomial g(t):=f_{L}(x+t y) using these points, and output g(0) which is f_{L}(x)=L(x). (This step can be replaced by decoding algorithms for Reed-Solomon codes to handle larger errors of average-case solver.)
  4. (The above steps already show that evaluating f_{L} for d+1 uniformly random inputs is as hard as solving L for worst-case input.) If we want to show L itself is average-case fine-grained hard, it suffices to give a reduction from computing f_{L}(x) for x\in\mathbf{F}_{p}^n back to solving L.

Let us go through a concrete example. Consider one of the flagship problems in fine-grained complexity — orthogonal vector problem (OV): Given X=\{x_1,\dots,x_n\}, Y=\{y_1,\dots,y_n\}, where each x_i,y_i\in\{0,1\}^d for d\in\omega(\log n)\cap n^{o(1)}, decide if there are x_i,y_j such that \langle x_i,y_j\rangle=0. It is known OV has no sub-quadratic algorithm assuming strong exponential-time hypothesis (SETH) [W05], and there is a generalization of OV called k-OV problem that has no O(n^{k-o(1)}) algorithm under the same assumption [W18].

Polynomial evaluation. Given an OV instance X,Y with d=n^{o(1)}, we construct a degree-2d polynomial f_{\textrm{OV}}(X,Y):=\sum_{i,j\in[n]}\prod_{t\in[d]}(1-x_{i,t}y_{i,t}) over \mathbf{F}_{p}^{2nd} for p>n^2, where x_{i,t} denotes the t-th coordinate of x_i\in X. Observe that f_{\textrm{OV}} simply enumerates all the pairs of vectors and counts the number of orthogonal pairs for the OV instance, and obviously counting is at least as hard as decision. Using the aforementioned general recipe, it follows immediately that evaluating f_{\textrm{OV}} requires O(n^{2-o(1)}) time for average case assuming randomized version of SETH. This result was shown in [BRSV17a], and analogously, they constructed a polynomial for k-OV, which implies an average-case time hierarchy assuming randomized SETH.

However, the polynomial evaluation problem is algebraic. Now let us consider a natural combinatorial problem — counting t-cliques in an n-vertices graph (for simplicity, think of t as a large constant). This problem has worst-case complexity n^{\Theta(t)} assuming ETH [CHKX06].

Counting t-cliques. It was first shown in [GR18] that there is an \widetilde{O}(n^2)-time reduction from counting t-cliques in any graph to counting t-cliques with error probability <\frac{1}{4} in some polynomial-time sampable random graph. The proof also follows our general recipe. First, we construct a degree-\binom{t}{2} polynomial f_{t\textrm{-clique}}(X):=\sum_{\textrm{size-}t\, T\subseteq [n]}\prod_{i<j\in T} X_{i,j} over \mathbf{F}_{p}^{n^2} for p>n^t, where X is the adjacency matrix. Observe that f_{t\textrm{-clique}} counts t-cliques by enumerating each size-t subset of vertices and checking whether it is a t-clique. It remains to work out the step 4 of the general recipe. This was done by a gadget reduction, that runs in \widetilde{O}(p^{t^2}n) time, from evaluating f_{t\textrm{-clique}} on Y\in\mathbf{F}_{p}^{n^2} to counting t-cliques in an \widetilde{O}(p^{t^2}n)-vertices graph [GR18].

Although this gadget reduction is nice, I will not explain it here, because later works [BBB19, DLW20] show that if the polynomial constructed at the step 2 has certain structure, then there is a general technique to reduce evaluating this polynomial back to the original problem (at the cost of requiring smaller error probability of average-case solver), which I will discuss in the next post. Finally, let me point out an issue in the previous paragraph — p is too large for the gadget reduction to be useful! We need p>n^t (note n^t is a trivial upper bound of the number of t-cliques) such that f_{t\textrm{-clique}} indeed outputs the number of t-cliques, but the gadget reduction takes \widetilde{O}(p^{t^2}n) time, and moreover, we do not know how to find a prime >n^t in \widetilde{O}(n^2) time. This issue was handled in [GR18] using Chinese remainder theorem. Specifically, we choose O(t\log n) many distinct primes p_i‘s upper bounded by O(t\log n) whose product >n^t (the existence of such primes follows from asymptotic distribution of the prime numbers). Then, we apply the whole reduction described so far to evaluating f_{t\textrm{-clique}}(X)\mod p_i for each p_i and then use Chinese remainder theorem to recover f_{t\textrm{-clique}}(X). (We can use Chinese remaindering with errors [GRS99] to handle larger error probability.)

Hopefully, the above examples give you a flavor of worst-case to average-case reductions for fine-grained hard problems. As promised, in the next post, I will continue to discuss the new technique in [BBB19, DLW20], which automatizes the step 4 of the general recipe by requiring more structural properties for the polynomial constructed in the step 2 of the general recipe.

Acknowledgements. I would like to thank my quals committee — Aviad Rubinstein, Tselil Schramm, Li-Yang Tan for valuable feedback to my quals talk.

by Junyao Zhao at July 23, 2021 01:50 PM UTC

Integrals appear everywhere in all scientific fields, and their numerical computation is an active area of research. In the playbook of approximation techniques, my personal favorite is “la méthode de Laplace”, a must-know for students that like to cut integrals into pieces, that comes with lots of applications.

We will be concerned with integrals of the form $$I(t) = \int_K h(x) e^{ \, – \, t f(x) } dx, $$ where \(K\) is compact (bounded and closed) subset of \(\mathbb{R}^d\), and \(h\) and \(f\) are two real-valued functions defined on \(K\) such that the integral is well defined for large enough \(t \in \mathbb{R}\). The goal is to obtain an asymptotic equivalent when \(t\) tends to infinity.

Within machine learning, as explained below, this is useful where computing and approximating integrals is important. This thus comes up naturally in Bayesian inference where \(t\) will be the number of observations in a statistical model. But let’s start with presenting this neat asymptotic result that Laplace discovered for a particular one-dimensional case in 1774 [1].

Laplace approximation

Let’s first state the main result. Assume that:

  • \(h\) is continuous on \(K\) and that \(f\) is twice continuously differentiable on \(K\).
  • \(f\) has a strict global minimizer \(x_\ast\) on \(K\), which is in the interior of \(K\), where the gradient \(f^\prime(x_\ast)\) is thus equal to zero, and where the Hessian \(f^{\prime \prime}(x_\ast)\) is a positive definite matrix (it is always positive semi-definite because \(x_\ast\) is a local minimizer of \(f\)); moreover, \(h(x_\ast) \neq 0\).

Then, as \(t\) tends to infinity, we have the following asymptotic equivalent: $$ I(t) \sim \frac{h(x_\ast)}{\sqrt{ \det f^{\prime \prime}(x_\ast) }} \Big( \frac{2\pi}{t}\Big)^{d/2} e^{ \, – \, t f(x_\ast) }.$$

Where does it come from? The idea is quite simple: for \(t>0\), the exponential term \(e^{ \, – \, t f(x) }\) is largest when \(x\) is equal to the minimizer \(x_\ast\). Hence only contributions close to \(x_\ast\) will count in the integral. Then we can do Taylor expansions of the two functions around \(x_\ast\), as \(h(x) \approx h(x_\ast)\) and \(f(x) \approx f(x_\ast) + \frac{1}{2} ( x – x_\ast)^\top f^{\prime \prime}(x_\ast) (x-x_\ast)\), and approximate \(I(t)\) as $$ I(t) \approx \int_K h(x_\ast) \exp\Big[ – t f(x_\ast) \ – \frac{t}{2} ( x – x_\ast)^\top f^{\prime \prime}(x_\ast) (x-x_\ast)\Big] dx.$$ We can then make a change of variable \(y = \sqrt{t} f^{\prime \prime}(x_\ast)^{1/2}( x – x_\ast)\) (where \(f^{\prime \prime}(x_\ast)^{1/2}\) is the positive square root of \(f^{\prime \prime}(x_\ast)\)), to get, with the Jacobian of the transformation leading to the term \( (\det f^{\prime \prime}(x_\ast) )^{1/2} t^{d/2} \): $$I(t) \approx \frac{ h(x_\ast) e^{ \, – \, t f(x_\ast) }}{( \det f^{\prime \prime}(x_\ast) )^{1/2}t^{d/2} } \int_{\sqrt{t}f^{\prime \prime}(x_\ast)^{1/2}( K \ – x_\ast)} \!\!\! \exp\big[ – \frac{1}{2} y^\top y \big] dy.$$ Since \(x_\ast\) is in the interior of \(K\), when \(t\) tends to infinity, the set \(\sqrt{t} f^{\prime \prime}(x_\ast)^{1/2} ( K \ – x_\ast)\) tends to \(\mathbb{R}^d\) (see illustration below), and we obtain the usual Gaussian integral that leads to the normalizing constant of the Gaussian distribution, which is equal to \((2\pi)^{d/2} \).

Left: set \(K\) with \(x_\ast\) in its interior. Right: set \(\sqrt{t} f^{\prime \prime}(x_\ast)^{1/2} ( K \ – x_\ast)\), which is a translated (by \(x_\ast\)), tilted (by \(f^{\prime\prime}(x_\ast)\)) and scaled (by \(\sqrt{t}\)) version of \(K\).

Formal proof. Let’s first make our life simpler: without loss of generality, we may assume that \(f(x_\ast) = 0\) (by subtracting the minimal value), that \(f^{\prime \prime}(x_\ast) = I\) (by a change of variable whose Jacobian is the square root of \(f^{\prime \prime}(x_\ast)\), leading to the determinant term), and that \(x_\ast = 0\). With the dominated convergence theorem (which was unfortunately unknown to me when I first learned about the method in high school, forcing students to cut integrals into multiple pieces), the proof sketch above is almost exact. We simply need a simple argument, based on the existence of a continuous function \(g: K \to \mathbb{R}\) such that $$ g(x) = \frac{f(x) } {\| x \|^2}$$ for \(x \neq 0\) and \(g(0) = \frac{1}{2}\) (here \(\| \! \cdot \! \|\) denotes the standard Euclidean norm). The function \(g\) is trivially defined and continuous on \(K \backslash \{0\}\), the value and continuity at zero is a simple consequence of the Taylor expansion $$ f(x) = f(0) + f^\prime(0)^\top x + \frac{1}{2} x^\top f^{\prime\prime}(0)x + o( \| x\|^2) = \frac{1}{2} \| x\|^2 + o( \| x\|^2). $$

We may then use the same change of variable as above to get the equality: $$I(t) = \frac{ 1 }{ t^{d/2} } \int_{\sqrt{t} K } h\big( \frac{1}{\sqrt{t}} y\big) \exp\big[ – \frac{1}{2} y^\top y \cdot g\big( \frac{1}{\sqrt{t}} y\big) \big] dy.$$ We can write the integral part of the expression above as $$J(t) = \int_{\mathbb{R}^d} a(y,t) dy, $$ with $$a(y,t) = 1_{ y \in \sqrt{t} K } h\big( \frac{1}{\sqrt{t}} y\big)\exp\big[ – \frac{1}{2} y^\top y \cdot g\big( \frac{1}{\sqrt{t}} y\big) \big].$$ We have for all \(t>0\) and \(y \in \mathbb{R}^d\), \(|a(y,t)| \leqslant \max_{z \in K} | h(z)| \exp\Big[ – \| y\|^2 \cdot \min_{ z \in K } g(z) \Big]\), which is integrable because \(h\) is continuous on the compact set \(K\) and thus bounded, and \(g\) is strictly positive on \(K\) (since \(f\) is strictly positive except at zero as \(0\) is a strict global minimum), and by continuity, its minimal value is strictly positive. Thus by the dominated convergence theorem: $$\lim_{t \to +\infty} J(t) = \int_{\mathbb{R}^d} \big( \lim_{t \to +\infty} a(y,t) \big) dy = \int_{\mathbb{R}^d}\exp\big[ – \frac{1}{2} y^\top y \big] dy = ( 2\pi)^{d/2}.$$ This leads to the desired result since \(I(t) = J(t) / t^{d/2}\).

Classical examples

Two cute examples are often mentioned as applications (adapted from [2]).

Stirling’s formula. We have, by definition of the Gamma function \(\Gamma\), and the change of variable \(u = tx\):
$$\Gamma(1+t) = \int_0^\infty \!\! e^{-u} u^{t} du = \int_0^\infty \!\! e^{-tx}t^t x^t t dx = t^{t+1} \int_0^\infty \!\! e^{-t(x-\log x)} dx.$$ Since \(x \mapsto x\, – \log x\) is minimized at \(x=1\) with second derivative equal to \(1\), we get: $$\Gamma(1+t) \sim t^{t+1} e^{-t} \sqrt{2\pi / t} = \big( \frac{t}{e} \big)^t \sqrt{ 2\pi t}.,$$ which is exactly Stirling’s formula, often used when \(t\) is an integer, and then, \(\Gamma(1+t) = t!\).

Convergence of \(L_p\)-norms to the \(L_\infty\)-norm. We consider the \(L_p\)-norm of a positive twice continuously differentiable function on the compact set \(K\), with a unique global maximizer at \(x_\ast\) in the interior of \(K\). Then we can write its \(L_p\)-norm \(\|g\|_p\) through $$\| g\|_p^p = \int_K g(x)^p dx = \int_K e^{ p \log g(x)} dx.$$ The function \(f: x \mapsto\ – \log g(x)\) has gradient \(f^\prime(x) = \ – \frac{1}{g(x)}g^\prime(x)\) and Hessian \(f^{\prime\prime}(x)=\ – \frac{1}{g(x)} g^{\prime\prime}(x) + \frac{1}{g(x)^2} g^\prime(x) g^\prime(x)^\top .\) At \(x_\ast\), we get \(f^{\prime\prime}(x_\ast) = \ – \frac{1}{g(x_\ast)} g^{\prime \prime}(x_\ast)\). Thus, using Laplace’s method, we have: $$ \|g\|_p^p = \frac{A}{p^{d/2}} g(x_\ast)^p (1 + o(1) ),$$ with \(\displaystyle A = \frac{(2\pi g(x_\ast))^{d/2}}{(\det (-g^{\prime\prime}(x_\ast)))^{1/2}}\).

Taking the power \(1/p\), we get: $$ \|g\|_p = \exp \big( \frac{1}{p} \log \|g\|_p^p\big) = \exp \Big( \frac{1}{p} \log A \ – \frac{d}{2p} \log p + \log g(x_\ast) + o(1/p) \Big).$$ This leads to, using \(\exp(u) = 1+u + o(u)\) around zero: $$ \| g\|_p = g(x_\ast) \Big( 1 – \frac{d}{2p} \log p + \frac{1}{p} \log A + o(1/p) \Big) = g(x_\ast) \Big( 1 – \frac{d}{2p} \log p + O(1/p) \Big).$$ A surprising fact is that the second-order term does not depend on anything but \(g(x_\ast)\) (beyond \(p\) and \(d\)). Note that this applies also to continuous log-sum-exp functions.

Applications in Bayesian inference

It turns out that Laplace’s motivation in deriving this general technique for approximating integrals was exactly Bayesian inference, which he in fact essentially re-discovered and extended (see an interesting account here). Let me now explain how Laplace’s method applies.

We consider a parameterized family of probability distributions with density \(p(x|\theta)\) with respect to some base measure \(\mu\) on \(x \in \mathcal{X}\), with \(\theta \in \Theta \subset \mathbb{R}^d\) a set of parameters.

As a running example (the one from Laplace in 1774), we consider the classical Bernoulli distribution for \(x \in \mathcal{X} = \{0,1\}\), and densities with respect to the counting measure, that is, the parameter \(\theta \in [0,1]\) is the probability that \(x=1\).

From frequentist to Bayesian inference. We are given \(n\) independent and identically distributed observations \(x_1,\dots,x_n \in \mathcal{X}\), from an unknown probability distribution \(q(x)\). One classical goal of statistical inference is to find the parameter \(\theta \in \Theta\) so that \(p(\cdot| \theta)\) is as close to \(q\) as possible.

In the frequentist framework, maximum likelihood estimation amounts to maximizing $$ \sum_{i=1}^n \log p(x_i | \theta)$$ with respect to \(\theta \in \Theta\). It comes with a series of guarantees, in particular (but not only) when \(q = p(\cdot | \theta)\) for a certain \(\theta \in \Theta\). For our running example, the maximum likelihood estimator \(\hat{\theta}_{\rm ML} = \frac{1}{n} \sum_{i=1}^n x_i\) is the frequency of non-zero outcomes.

In the Bayesian framework, the data are assumed to be generated by a certain \(\theta\), but now with \(\theta\) itself random with some prior probability density \(p(\theta)\) (with respect to the Lebesgue measure). Statistical inference typically does not lead to a point estimator (like the maximum likelihood estimator), but to the full posterior distribution of the parameter \(\theta\) given the observations \(x_1,\dots,x_n\).

The celebrated Bayes’ rule states that the posterior density \(p(\theta | x_1,\dots,x_n)\) can be written as: $$p(\theta | x_1,\dots,x_n) = \frac{ p(\theta) \prod_{i=1}^n p(x_i|\theta) }{p(x_1,\dots,x_n)},$$ where \(p(x_1,\dots,x_n)\) is the marginal density of the data (once the parameter has been marginalized out).

If pressured, a Bayesian will end up giving you a point estimate, but a true Bayesian will not give away the maximum a posteriori estimate (see here for some reasons why), but if you give him or her a loss function (e.g., the square loss), he or she will give away (reluctantly) the estimate that minimizes the posterior risk (e.g., the posterior mean for the square loss).

Bernoulli and Beta distributions. In our running Bernoulli example for coin tossing, it is standard to put a conjugate prior on \(\theta \in [0,1]\), which is here a Beta distribution with parameters \(\alpha\) and \(\beta\), that is, $$p(\theta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \Gamma(\beta)} \theta^{\alpha-1} ( 1- \theta)^{\beta-1}.$$ The posterior distribution is then also a Beta distribution with parameters \(\alpha + \sum_{i=1}^n x_i\) and \(\beta + \sum_{i=1}^n ( 1- x_i)\). The posterior mean is $$\mathbb{E} [ \theta | x_1,\dots,x_n ] = \frac{\alpha+ \sum_{i=1}^n x_i}{n + \alpha + \beta},$$ and corresponds to having pre-observed \(\alpha\) ones and \(\beta\) zeroes (this is sometimes referred to as Laplace smoothing). However, it is rarely possible to compute the posterior distribution in closed form, hence the need for approximations.

Laplace approximation. Thus, in Bayesian inference, integrals of the form $$ \int_{\Theta} h(\theta) p(\theta) \prod_{i=1}^n p(x_i|\theta) d\theta$$ for some function \(h: \Theta \to \mathbb{R}\), are needed. For example, computing the marginal likelihood corresponds to \(h=1\).

By taking logarithms, we can write $$\int_{\Theta} h(\theta) p(\theta) \prod_{i=1}^n p(x_i|\theta) d\theta = \int_{\Theta} h(\theta) \exp\Big( \log p(\theta) + \sum_{i=1}^n \log p(x_i|\theta) \Big) d\theta, $$ and with \(f_n(\theta) = \ – \frac{1}{n} \log p(\theta) \ – \frac{1}{n} \sum_{i=1}^n \log p(x_i|\theta),\) we have an integral in Laplace form, that is, $$\int_{\Theta} h(\theta) \exp( -n f_n(\theta))d\theta,$$ with a function \(f_n\) that now varies with \(n\). This simple variation does not matter as because of the law of large numbers, when \(n\) is large, \(f_n(\theta)\) tends to a fixed function \(\mathbb{E} \big[ \log p(x|\theta) \big]\).

The Laplace approximation thus requires to compute the minimizer of \(f_n(\theta)\), which is exactly the maximum a posteriori estimate \(\hat{\theta}_{\rm MAP}\), and use the approximation: $$ \int_{\Theta} h(\theta) \exp( -n f_n(\theta))d\theta \approx (2\pi / n)^{d/2} \frac{h( \hat{\theta}_{\rm MAP})}{(\det f_n^{\prime \prime}( \hat{\theta}_{\rm MAP}))^{1/2}} \exp( – n f_n( \hat{\theta}_{\rm MAP})).$$

Gaussian posterior approximation. Note that the Laplace approximation exactly corresponds to approximating the log-posterior density by a quadratic form and thus approximating the posterior by a Gaussian distribution with mean \(\hat{\theta}_{\rm MAP}\) and covariance matrix \(\frac{1}{n} f_n^{\prime \prime}( \hat{\theta}_{\rm MAP})^{-1}\). Note that Laplace’s method gives one natural example of such Gaussian approximation and that variational inference can be used to find better ones.

Example. We consider the Laplace approximation of a Beta random variable, that is a Gaussian with mean at the mode of the original density and variance equal to the inverse of the second derivative of the log-density. Below, the mean \(\alpha / (\alpha +\beta)\) is set to a constant, while the variance shrinks due to an increasing \(\alpha+\beta\) (which corresponds to the number of observations in the Bayesian interpretation above).

Left: densities. Right: negative log-densities translated so that they have matched first two derivatives at their minimum.

We see above, that for large variances (small \(\alpha +\beta\)), the Gaussian approximation is not tight, while it becomes tighter as the mass gets concentrated around the mode.


High-order expansion. The approximation is based on Taylor expansions of the functions \(h\) (order \(0\)) and \(f\) (order \(2\)). In order to obtain extra terms of the form \(t^{d/2+\nu}\), for \(\nu\) a positive integer, we need higher-order derivatives of \(h\) and \(f\). In more than one dimension, that quickly gets complicated (see, e.g., [3, 4]).

One particular case which is interesting in one dimension is \(h(x) \sim A ( x- x_\ast)^\alpha\), and \(f(x)-f(x_\ast) = B|x-x_\ast|^{\beta}\). Note that \(\alpha=0\) and \(\beta=2\) corresponds to the regular case. A short calculation gives the equivalent \(I(t) \sim \frac{2}{2+\beta} \frac{A \Gamma\big( \frac{\alpha+1}{\beta} \big)}{(tB)^{\frac{\alpha+1}{\beta}}}\).

Stationary phase approximation. We can also consider integrals of the form $$I(t) = \int_K h(x) \exp( – i t f(x) ) dx,$$ where \(i\) is the usual square root of \(-1\). Here, the main contribution also comes from vectors \(x\) where the gradient of \(f\) is zero. This can be further extended to more general complex integrals.

When Napoléon meets Laplace and Lagrange

As a conclusion, I cannot resist mentioning a classical (potentially not totally authentic) anecdote about encounters between Laplace and Lagrange, two mathematical heroes of mine, and Napoléon, as described in [5, page 343]:

Laplace went in state to beg Napoleon to accept a copy of his work, and the following account of the interview is well authenticated, and so characteristic of all the parties concerned that I quote it in full. Someone had told Napoleon that the book contained no mention of the name of God; Napoleon, who was fond of putting embarrassing questions, received it with the remark, “M. Laplace, they tell me you have written this large book on the system of the universe, and have never even mentioned its Creator.” Laplace, who, though the most supple of politicians, was as stiff as a martyr on every point of his philosophy, drew himself up and answered bluntly, “Je n’avais pas besoin de cette hypothèse-là.” Napoleon, greatly amused, told this reply to Lagrange, who exclaimed, “Ah! c’est une belle hypothèse; ça explique beaucoup de choses.”

W. W. Rouse Ball, A Short Account of the History of Mathematics, 1888.


[1] Pierre-Simon Laplace. Mémoire sur la probabilité des causes par les événements, Mémoires de l’Académie royale des sciences de Paris (Savants étrangers), t. VI. p. 621, 27-65, 1774.
[2] Norman Bleistein, Richard A. Handelsman. Asymptotic Expansions of Integrals. Dover
Publications, 1986.[3] Stephen M. Stigler. Laplace’s 1774 memoir on inverse probabilityStatistical Science, 1(3):359-378, 1986.
[3] Luke Tierney, Robert E. Kass, Joseph B. Kadane. Fully exponential Laplace approximations to expectations and variances of nonpositive functionsJournal of the American Statistical Association, 84(407): 710-716, 1989.
[4] Zhenming Shun, Peter McCullagh. Laplace approximation of high dimensional integralsJournal of the Royal Statistical Society: Series B (Methodological) 57(4): 749-760, 1995.
[5] W. W. Rouse Ball. A Short Account of the History of Mathematics, 1888.

by Francis Bach at July 23, 2021 06:37 AM UTC

The classic Impagliazzo--Nisan--Wigderson (INW) psesudorandom generator (PRG) (STOC `94) for space-bounded computation uses a seed of length $O(\log n \cdot \log(nwd/\varepsilon))$ to fool ordered branching programs of length $n$, width $w$, and alphabet size $d$ to within error $\varepsilon$. A series of works have shown that the analysis of the INW generator can be improved for the class of $\textit{permutation}$ branching programs or the more general $\textit{regular}$ branching programs, improving the $O(\log^2 n)$ dependence on the length $n$ to $O(\log n)$ or $\tilde{O}(\log n)$. However, when also considering the dependence on the other parameters, these analyses still fall short of the optimal PRG seed length $O(\log(nwd/\varepsilon))$. In this paper, we prove that any ``spectral analysis'' of the INW generator requires seed length $$\Omega\left(\log n\cdot \log\log(\min\{n,d\})+\log n\cdot \log(w/\varepsilon)+\log d\right)$$ to fool ordered permutation branching programs of length $n$, width $w$, and alphabet size $d$ to within error $\varepsilon$. By ``spectral analysis'' we mean an analysis of the INW generator that relies only on the spectral expansion of the graphs used to construct the generator; this encompasses all prior analyses of the INW generator. Our lower bound matches the upper bound of Braverman--Rao--Raz--Yehudayoff (FOCS 2010, SICOMP 2014) for regular branching programs of alphabet size $d=2$ except for a gap between their $O(\log n \cdot \log\log n)$ term and our $O(\log n \cdot \log\log \min\{n,d\})$ term. It also matches the upper bounds of Koucky--Nimbhorkar--Pudlak (STOC 2011), De (CCC 2011), and Steinke (ECCC 2012) for constant-width ($w=O(1)$) permutation branching programs of alphabet size $d=2$ to within a constant factor. To fool permutation branching programs in the stronger measure of $\textit{spectral norm}$, we prove that any spectral analysis of the INW generator requires a seed of length $\Omega(\log n\cdot \log\log n+\log n\cdot\log(1/\varepsilon)+\log d)$ when the width is at least polynomial in $n$ ($w=n^{\Omega(1)}$), matching the recent upper bound of Hoza--Pyne--Vadhan (ITCS `21) to within a constant factor.

at July 23, 2021 06:10 AM UTC

We introduce a new graph parameter called linear upper maximum induced matching width \textsc{lu-mim width}, denoted for a graph $G$ by $lu(G)$. We prove that the smallest size of the \textsc{obdd} for $\varphi$, the monotone 2-\textsc{cnf} corresponding to $G$, is sandwiched between $2^{lu(G)}$ and $n^{O(lu(G))}$. The upper bound is based on a combinatorial statement that might be of an independent interest. We show that the bounds in terms of this parameter are best possible. The new parameter is closely related to two existing parameters: linear maximum induced matching width (\textsc{lmim width}) and linear symmetric induced matching width (\textsc{lsim width}). We prove that \textsc{lu-mim width} lies strictly in between these two parameters, being dominated by \textsc{lsim width} and dominating \textsc{lmim width}. We conclude that neither of the two existing parameters can be used instead of \textsc{lu-mim width} to characterize the size of \textsc{obdd}s for monotone $2$-\textsc{cnf}s and this justifies introduction of the new parameter.

at July 23, 2021 06:08 AM UTC

Recently, there has been exciting progress in understanding the complexity of distributions. Here, the goal is to quantify the resources required to generate (or sample) a distribution. Proving lower bounds in this new setting is more challenging than in the classical setting, and has yielded interesting new techniques and surprising applications. In this work, we initiate a study of the complexity of sampling with limited memory, and prove small-space analogs of several results previously known only for sampling in AC$^0$. 1. We exhibit an explicit boolean function $b:\{0,1\}^n\to\{0,1\}$ that cannot be computed by width $2^{\Omega(n)}$ read-once branching programs (ROBPs), even on average, but such that $(\mathbf{U}_n,b(\mathbf{U}_n))$ can be exactly sampled by ROBPs of width $O(n)$. 2. We exhibit an explicit boolean function $b:\{0,1\}^n\to\{0,1\}$ such that any distribution sampled by an ROBP of width $2^{\Omega(n)}$ has statistical distance $\frac{1}{2}-2^{-\Omega(n)}$ from $(\mathbf{U}_n,b(\mathbf{U}_n))$. We show that any such $b$ witnesses exponentially small correlation bounds against ROBPs, and we extend these results to hold for the unknown-order setting. 3. We show that any distribution sampled by an ROBP of width $2^{\Omega(n)}$ has statistical distance $1-2^{-\Omega(n)}$ from any distribution that is uniform over a good code. More generally, we obtain sampling lower bounds for any list decodable code, which are nearly tight. Using a known connection, we also obtain data structure lower bounds for storing codewords. Along the way, we prove a direct product theorem and several equivalence theorems. These tools offer a generic method to construct distributions with strong sampling lower bounds, and translate these lower bounds into correlation bounds against ROBPs. As an application of our direct product theorem, we show a strong complexity separation between sampling with AC$^0$ circuits and sampling with ROBPs.

at July 23, 2021 05:15 AM UTC

Congratulations on the 2021 Gödel Prize

Composite including Richerby’s talk at STOC 2021

Andrey Bulatov, Martin Dyer and David Richerby, and Jin-Yi Cai and Xi Chen are the authors of three papers awarded the 2021 Gödel Prize. The citations by the ACM and EATCS say clearly that the papers not the people win the prize, but our “blog invariant” is to show photos of people at the top, so there they are.

Today we congratulate the people—not the papers—and tell some of the technical story behind this richly deserved award.

Dick and I have intended to write about this since the prize was announced in early May. Travel and personal events have slowed the blog, plus in my case, June was by far the most pressing month of the “chess cheating pandemic” since chess moved online, and that carried into July. The award to our friend Jin-Yi—who has been a faculty colleague of both of us—is the “something else about UW Madison on our mind” that was mentioned in the June 20 post, during which chess cases arrived in real time. I am thankful that this month, to judge by the weekly roundups of games by ChessBase and The Week in Chess, most major tournaments are back to being played in-person—where the yearly volume of cases coming to my attention has been on either side of ten, rather than in middle triple digits.

That the papers win the award calls something else to mind: We can no longer point to an “original manuscript” with papers being electronic. Perhaps the continued rise of non-fungible tokens (NFTs) will restore this distinction. Tim Berners-Lee recently created and auctioned off an NFT of his original code for the World Wide Web. This may be done to prize-winning papers in the future. Then the prize money could create an inherent base value, such as we discussed for “semi-fungible tokens” at the end of this post, so that the prize really would go to the papers.

The Beginning of Dichotomy

The term “dichotomy” in complexity theory first became prominent with Thomas Schaefer’s 1978 paper about {\mathrm{SAT}}-like problems. For a large and natural class {\mathcal{S}} of such problems, he proved that either the problem is {\mathsf{NP}}-complete like {\mathrm{SAT}} itself or it is in {\mathsf{P}}—never intermediate between them. The famous 1975 theorem of Richard Ladner showed that if {\mathsf{NP \neq P}} then plenty of intermediate languages exist, but no problem in {\mathcal{S}} can be among them. That is the dichotomy: everything in {\mathcal{S}} is either easy or maximally hard. A similar project was taken up for {\mathsf{\#P}} by Nadia Creignou and Nicolas (Miki) Hermann in a 1996 paper showing that every {\mathrm{\#SAT}}-like counting problem is either in {\mathsf{P}} or is {\mathsf{\#P}}-complete.

What does “{\mathrm{SAT}}-like” mean? Consider the clause {(\bar{x}_i \vee \bar{x}_j \vee x_k)}. It is satisfied if and only if the values of the variables belong to the relation

\displaystyle  R = \{(0,0,0),(0,0,1),(0,1,0),(0,1,1),(1,0,0),(1,0,1),(1,1,1)\},

which excludes only the non-satisfying assignment {(1,1,0)}. Note that if we had the clause {(\bar{x}_i \vee x_j \vee \bar{x}_k)} instead, we could re-use {R} by switching {x_j} and {x_k} in the given sequence of variables. Putting negated literals first, if we instead want to require exactly one literal to be true we could apply one of the following relations to each clause, depending on how many negated variables it has:

\displaystyle  \begin{array}{rcl}  R^0 &=& \{(1,0,0),(0,1,0),(0,0,1)\}\\ R^1 &=& \{(0,0,0),(1,1,0),(1,0,1)\}\\ R^2 &=& \{(0,1,0),(1,0,0),(1,1,1)\}\\ R^3 &=& \{(0,1,1),(1,0,1),(1,1,0)\} \end{array}

Then we can say that both the “Exactly One 3SAT” decision problem in {\mathsf{NP}} and the associated counting problem in {\mathsf{\#P}} are defined by the collection {\Gamma = \{R^0,R^1,R^2,R^3\}} of relations. The following shows how they are subcases of the general constraint satisfaction problem (CSP):

  • Parameter: A set {\Gamma} of relations over some domain {D}.

  • Instance: A set {\Phi} of tuples {C_1,\dots,C_m} over variables {X = \{x_1,\dots,x_n\}}, with each {C_j} associated to some relation {R_j} in {\Gamma}.

  • Question: Is there an assignment {\alpha: X \rightarrow D} that makes {\alpha(C_j) \in R_j} for each {j}? For the counting version, called {\mathsf{\#CSP}}, how many such assignments are there?

Here {\alpha(C_j)} means the ordered tuple of values {\alpha(x_i)} of the variables {x_i} in {C_j}. If {\Gamma} and {D} are considered part of the instance—where they must be finite—the problem is uniform, but if they are fixed (and possibly infinite), it is nonuniform. For a simple analogy: for every fixed context-free grammar {G}, its language {L(G)} belongs to {\mathsf{P}}, but a much stronger fact is that the uniform acceptance problem {A_{CFG} = \{\langle G,x\rangle: x \in L(G)\}} also belongs to {\mathsf{P}}. Another relaxation is to allow different domains {D_i} for different variables {x_i}.

In the Boolean case, {D} is fixed to be {\{0,1\}}. We can now say what Schaefer and Creignou-Hermann proved. Schaefer showed that every nonuniform Boolean CSP with finite {\Gamma} is either in {\mathsf{P}} or {\mathsf{NP}}-complete. He showed the former holds only when {\Gamma} falls into one of six explicit cases, where which case is also decidable in polynomial time given {\Gamma}. Creignou and Hermann gave a similar result for Boolean #CSP but with a smaller list of cases—because there are cases where the decision problem is in {\mathsf{P}} but the counting problem remains {\mathsf{\#P}}-complete. A simple such case is where {\Gamma} has only the relation {R = \{(1,0),(0,1),(1,1)\}}, which defines monotone {\mathrm{2SAT}}.

Extending the Reach

Also in the 1990s, Tomás Feder and Moshe Vardi wrote two papers on extending Schaefer’s dichotomy to a wider class of problems. To quote the first sentence of the latter’s abstract:

This paper starts with the project of finding a large subclass of {\mathsf{NP}} which exhibits a dichotomy.

The first step was to go to larger domains {D}. For example, consider {D = \{1,2,3\}}. Take {R = \{(1,2),(2,3),(1,3)\}}, which is simply the relation {\neq} on {D}. Given an undirected graph {G}, its edges become the tuples. The resulting CSP is satisfiable if and only if {G} is 3-colorable. Of course, this is {\mathsf{NP}}-complete. Feder and Vardi arrived at this by going down from large subclasses of {\mathsf{NP}} defined by logic in which Ladner-type constructions can still be expressed, so that dichotomy does not hold. When they imposed all of their conditions, they were left with nonuniform CSPs, and conjectured dichotomy for their decision problems.

The corresponding question for counting problems was solved first, by Bulatov. Dyer and Richerby followed by simplifying Bulatov’s proof in a way that also allows deciding whether a given finite {(\Gamma,D)} falls into one of the polynomial-time cases (assuming {\mathsf{\#P \neq P}} to begin with). The latter task runs in polynomial time with an oracle for testing what they call strong balance, which in turn polynomial-time mapping-reduces to Graph Isomorphism—so by Laci Babai’s celebrated theorem they can decide in time quasi-polynomial in {|\Gamma|} whether it is a hard or easy case.

Finally—and not awarded any of this prize money—a 2017 paper by Bulatov proved dichotomy for the decision case of nonuniform CSPs. Independently, so did a paper by Dmitriy Zhuk. We will explore just a few of the underlying ideas in the next section.

\displaystyle  {\S}

Here we continue with the further extension made by Jin-Yi and Xi Chen. This starts by writing the counting problem as

\displaystyle  \#\Phi = \sum_{\alpha: X \rightarrow D} \prod_{j=1}^m w(\alpha(C_j),R_j), \ \ \ \ \ (1)

where for simple #CSP, {w(\alpha(C_j),R_j) = 1} if the values assigned by {\alpha} to the variables in {C_j} satisfy the associated relation {R_j}, and {w(\alpha(C_j),R_j) = 0} otherwise. However, one need not score an unsatisfied constraint as zero. Giving a nonzero partial credit {\lambda} for a wrong answer {\alpha(C_j)} to {R_j} might make you more magnetic as a teacher. It would literally be magnetism: the simple case where {\Gamma} consists of just the binary equality relation on {\{0,1\}}, so that for all binary clauses (edges) {(u,v)} we have

\displaystyle  w(0,0) = w(1,1) = 1, \qquad w(0,1) = w(1,0) = \lambda,

yields the simplest case of the Ising model of magnetism. But we can also take a quantum leap and make you complex: we can allow the weights {w(\alpha(C_j),R_j)} to have complex values. In particular, we can have functions of the form

\displaystyle  \sum_{\alpha: X \rightarrow D} \prod_{j=1}^m \omega^{g(\alpha(C_j),R_j)},

where {\omega} is a complex root of unity. Writing the product of powers of {\omega} as just {\omega} raised to a sum of terms ranging over all the clauses {C_j} in the instance {\Phi}, and inserting a normalizing constant {r}, we obtain functions of the form

\displaystyle  Z(f_\Phi) = r\sum_{\alpha: X \rightarrow D} \omega^{f_\Phi(\alpha)},

which are generally called partition functions. Indeed, we long ago discussed various efficient ways to convert quantum circuits {C} into polynomials {f_C(x_1,\dots,x_n)} so that, with {D = \{0,1\}} and {r} depending only on {C}, {Z(f_C)} gives the acceptance amplitude of {C}. The acceptance probability can be represented in like manner. See also this paper for another view of the bridge from the Ising model to quantum circuits.

Thus, the dichotomy theorems that Jin-Yi and Xi have obtained for complex weighted CSPs take matters into the quantum realm. What this may say about the class {\mathsf{BQP}} (bounded-error quantum polynomial time) we leave to the end.

Some of the Ideas

One key idea can be approached as generalizing a familiar feature of linear algebra. Suppose we take some number {\ell} of vectors of length {k} that belong to a linear subspace {S}. We can arrange them into an {\ell \times k} matrix {M}. Now let us multiply {M} on the left by a row vector {a = (a_1,\dots,a_\ell)}. Then {aM} is another vector of length {k}. This vector belongs to {S} because it is just a linear combination of our {\ell}-many vectors in {S}.

The insight for abstraction is that we can regard {a} as having been applied individually to each of the {k} columns of {M}. We got one new {k}-tuple from this application, but it still belonged to {S}. So now, let us consider “multiplying” {M} on the left by an arbitrary {\ell}-ary function {h(x_1,\dots,x_\ell)}, so that {hM} is always a {k}-tuple. And let us relax {S} to be any (finite) set {R \subseteq D^k} where {D} is our (finite) domain.

Definition 1 {R} has {h} as a polymorphism if for any {M} made from {\ell} members of {R}, allowing repeats, {hM} belongs to {R}. A set {\Gamma} of relations {R} has {h} as a polymorphism if every {R \in \Gamma} has {h} as a polymorphism.

With {\ell = 2}, the binary and function is a polymorphism of the relation {R} above that represented 3-clauses with two negated literals. This is because the excluded triple {(1,1,0)} cannot be written as the and of the other triples. This extends to the relation {R'} for satisfiability with three negated literals, which excludes {(1,1,1)}, so and is a polymorphism of {\Gamma = \{R,R'\}}, and also for {k > 3} to pertain to Horn clauses in general. Dually, or is a polymorphism for satisfiability of dual-Horn clauses.

The case of {\mathrm{2SAT}} is trickier. Its {\Gamma} has three constituent relations, depending on the number of negated literals:

\displaystyle  \begin{array}{rcl}  R^0 &=& \{(1,0),(0,1),(1,1)\}\\ R^1 &=& \{(0,0),(0,1),(1,1)\}\\ R^2 &=& \{(0,0),(0,1),(1,0)\}. \end{array}

Is there an {h} that is simultaneously a polymorphism of each one? Neither and nor or works: the former fails on {R^0} because the component-wise and of {(1,0)} and {(0,1)} is the excluded {(0,0)}, and or similarly fails on {R^2}. In fact, none of the sixteen binary Boolean functions is a homomorphism of all three. But when we go up to {\ell = 3}, however, it turns out that the majority-of-3 function is a polymorphism: any duplicated tuple rules, so only the three instances of three separate tuples need to be checked.

Going back to the linear dependence idea, the ternary function {h(x_1,x_2,x_3) = x_1 + x_2 + x_3} is a polymorphism of affine relations. For {D = \{0,1\}} where {+} is the same as exclusive-or, it has the property that {h(x,y,y) = h(y,y,x) = x}. This property is named for Anatoly Maltsev, who used it long before any of this complexity work began. It turned out to be a key to demarcating the easy cases for the dichotomy in higher dimensions. Affine relations were already the lone easy case in Creignou-Hermann for Boolean #CSP, while we can now crisply state Schaefer’s original dichotomy for decision CSP: affine, majority-of-3, and, or, plus two trivial cases where the all-zero or all-one assignment always works, are the polymorphisms that characterize the SAT-like problems in {\mathsf{P}}.

A second basic idea is to re-cast the counting problems as ones of counting homomorphisms between relational structures. For instance, let {G} be any graph and {K_r} the complete graph on {r} nodes. A homomorphism {h} has the property that when {(u,v)} is an edge of {G}, {(h(u),h(v)} is an edge of {K_r}—in particular, {h(u) \neq h(v)}. This is possible if and only if {G} is {r}-colorable. This approach was greatly enhanced in a 1998 paper by Peter Jeavons and its follow-ups with other authors. It was particularly useful to the decidability part of Dyer and Richerby’s paper.

A third idea used by Cai and Chen originated in a 2010 paper of theirs with Pinyan Lu for the case of non-negative weights. It is to stratify the sum in (1) over all {t \leq n} by mapping

\displaystyle  f(a_1,\dots,a_t) = \sum_{\alpha: \alpha(x_1)=a_1,\dots,\alpha(x_t)=a_t} \;\;\; \prod_{j=1}^m w(\alpha(C_j),R_j).

With {t = n} this is just a single evaluation; with {t=0} this is the original function. They create an oracle that either evaluates the whole sum or allows progressing from {t-1} to {t}. They create a matrix {M} with {d^{t-1}} rows for the possible fixed parts {\vec{a} = (a_1,\dots,a_{t-1})} and {d} columns for the possible choices of {a_t = \alpha(x_t)}, where {d = |D|}. The entries are {M[\vec{a},c] = f(a_1,\dots,a_{t-1},c)}, and {M[\vec{a},\cdot]} stands for the row vector over all {c \in D}. The oracle, given {\vec{a}}, returns the multiple of {M[\vec{a},\cdot]} that normalizes the first non-zero entry to {1}, or the all-zero vector if {M[\vec{a},\cdot]} is all zero.

A big extra challenge in the complex-weights case is how to handle cancellations, as begun in this paper by Cai, Chen, and Lu. Implementing the oracle efficiently requires the Maltsev property, a kind of “proto-unitarity” condition on {M} (under which {M} has at most {d} pairwise linearly independent rows and every two such rows are orthogonal), and a third property they call the “Type Partition” condition to avert blowup in the number of queries. Whether the classification they get is decidable remains open.

Here is where we say to go to the papers for details, but we hope we have expressed ideas and their motivation as a guide. We have a little more to say about interpretation and making further progress on the dichotomy.

Taking Dichotomy Further

Progress on the counting dichotomy increases the “internal evidence” for {\mathsf{\#P \neq P}}, which of course is implied by {\mathsf{NP \neq P}}. To paraphrase how our blogging friends Scott and Lance and Bill have often put it:

Why would all this finely filigreed structure with sharp demarcation between hard and easy cases exist if {\mathsf{P \neq NP}} were a false illusion?

At the very least, this puts the onus on a believer in {\mathsf{P = NP}} to give an alternative explanation of how such structure could arise.

It strikes us, however, that this also puts a squeeze on {\mathsf{BQP}} under the conventional wisdom that it is intermediate between {\mathsf{P}} and {\mathsf{\#P}}, without being {\mathsf{NP}}-hard in particular. This is not in technical conflict with the Cai-Chen dichotomy because exactly computing the acceptance probabilities of a class of quantum circuits (on inputs built into the circuits) can be {\mathsf{\#P}}-complete while the circuits use only a gap in probabilities to solve problems that are easy or intermediate. But we suspect that this still limits possible ways to capture {\mathsf{BQP}} “naturally” by descriptive logic.

The prize paper with Chen is just one part of a long-term project by Jin-Yi and co-workers on classifying counting problems and showing where dichotomy applies. A 2010 paper of Jin-Yi with Lu and Sangxia Huang bridged from #CSP and a subcase of counting homomorphisms between graphs to a wider kind of sum-over-products formula originally called a holant here. Recent progress on dichotomy for holants is represented by this FOCS 2020 paper with Shuai Shao. This is also both directly relevant to quantum computation and maps out progress that is still out there to make. The paper has a one-page prologue that summons characters from legendary fantasy (and Star Wars) to conquer quantum issues.

Open Problems

Again, we congratulate the winners—the human winners. What will the full implications of this work become in the years ahead?

[some minor word changes]

by KWRegan at July 23, 2021 01:32 AM UTC

Recently, Forbes, Kumar and Saptharishi [CCC, 2016] proved that there exists an explicit $d^{O(1)}$-variate and degree $d$ polynomial $P_{d} \in VNP$ such that if any depth four circuit $C$ of bounded formal degree $d$ which computes a polynomial of bounded individual degree $O(1)$, that is functionally equivalent to $P_d$, then $C$ must have size $2^{\Omega(\sqrt{d}\log{d})}$. The motivation for their work comes from Boolean Circuit Complexity. Based on a characterization for $ACC^0$ circuits by Yao [FOCS, 1985] and Beigel and Tarui [CC, 1994], Forbes, Kumar and Saptharishi [CCC, 2016] observed that functions in $ACC^0$ can also be computed by algebraic $\Sigma\mathord{\wedge}\Sigma\Pi$ circuits (i.e., circuits of the form -- sums of powers of polynomials) of $2^{\log^{O(1)}n}$ size. Thus they argued that a $2^{\omega(\log^{O(1)}{n})}$ "functional" lower bound for an explicit polynomial $Q$ against $\Sigma\mathord{\wedge}\Sigma\Pi$ circuits would imply a lower bound for the "corresponding Boolean function" of $Q$ against non-uniform $ACC^0$. In their work, they ask if their lower bound be extended to $\Sigma\mathord{\wedge}\Sigma\Pi$ circuits. In this paper, for large integers $n$ and $d$ such that $\Omega(\log^2{n})\leq d\leq n^{0.01}$, we show that any $\Sigma\mathord{\wedge}\Sigma\Pi$ circuit of bounded individual degree at most $O(\frac{d}{k^2})$ that functionally computes Iterated Matrix Multiplication polynomial $IMM_{n,d}$ ($\in VP$) over $\{0,1\}^{n^2d}$ must have size $n^{\Omega(k)}$. Since Iterated Matrix Multiplication $IMM_{n,d}$ over $\{0,1\}^{n^2d}$ is functionally in $GapL$, improvement of the afore mentioned lower bound to hold for quasipolynomially large values of individual degree would imply a fine-grained separation of $ACC^0$ from $GapL$. For the sake of completeness, we also show a syntactic size lower bound against any $\Sigma\mathord{\wedge}\Sigma\Pi$ circuit computing $IMM_{n,d}$ (for the same regime of $d$) which is tight over large fields. Like Forbes, Kumar and Saptharishi [CCC, 2016], we too prove lower bounds against circuits of bounded formal degree which functionally compute $IMM_{n,d}$, for a slightly larger range of individual degree.

at July 22, 2021 01:09 PM UTC

Authors: Pranjal Awasthi, Alex Tang, Aravindan Vijayaraghavan
Download: PDF
Abstract: We present polynomial time and sample efficient algorithms for learning an unknown depth-2 feedforward neural network with general ReLU activations, under mild non-degeneracy assumptions. In particular, we consider learning an unknown network of the form $f(x) = {a}^{\mathsf{T}}\sigma({W}^\mathsf{T}x+b)$, where $x$ is drawn from the Gaussian distribution, and $\sigma(t) := \max(t,0)$ is the ReLU activation. Prior works for learning networks with ReLU activations assume that the bias $b$ is zero. In order to deal with the presence of the bias terms, our proposed algorithm consists of robustly decomposing multiple higher order tensors arising from the Hermite expansion of the function $f(x)$. Using these ideas we also establish identifiability of the network parameters under minimal assumptions.

at July 22, 2021 11:02 PM UTC

Authors: Raja Selvarajan, Vivek Dixit, Xingshan Cui, Travis S. Humble, Sabre Kais
Download: PDF
Abstract: The road to computing on quantum devices has been accelerated by the promises that come from using Shor's algorithm to reduce the complexity of prime factorization. However, this promise hast not yet been realized due to noisy qubits and lack of robust error correction schemes. Here we explore a promising, alternative method for prime factorization that uses well-established techniques from variational imaginary time evolution. We create a Hamiltonian whose ground state encodes the solution to the problem and use variational techniques to evolve a state iteratively towards these prime factors. We show that the number of circuits evaluated in each iteration scales as O(n^{5}d), where n is the bit-length of the number to be factorized and $d$ is the depth of the circuit. We use a single layer of entangling gates to factorize several numbers represented using 7, 8, and 9-qubit Hamiltonians. We also verify the method's performance by implementing it on the IBMQ Lima hardware.

at July 22, 2021 10:44 PM UTC

Authors: Jayadev Acharya, Clément L. Canonne, Aditya Vikram Singh, Himanshu Tyagi
Download: PDF
Abstract: We consider density estimation for Besov spaces when each sample is quantized to only a limited number of bits. We provide a noninteractive adaptive estimator that exploits the sparsity of wavelet bases, along with a simulate-and-infer technique from parametric estimation under communication constraints. We show that our estimator is nearly rate-optimal by deriving minimax lower bounds that hold even when interactive protocols are allowed. Interestingly, while our wavelet-based estimator is almost rate-optimal for Sobolev spaces as well, it is unclear whether the standard Fourier basis, which arise naturally for those spaces, can be used to achieve the same performance.

at July 22, 2021 10:39 PM UTC

Authors: Chao Wang, Gustavo Petri, Yi Lv, Teng Long, Zhiming Liu
Download: PDF
Abstract: An important property of concurrent objects is whether they support progress -a special case of liveness-guarantees, which ensure the termination of individual method calls under system fairness assumptions. Liveness properties have been proposed for concurrent objects. Typical liveness properties includelock-freedom,wait-freedom,deadlock-freedom,starvation-freedom and obstruction-freedom. It is known that the five liveness properties above are decidable on the Sequential Consistency (SC) memory model for a bounded number of processes. However, the problem of decidability of liveness for finite state concurrent programs running on relaxed memory models remains open. In this paper we address this problem for the Total Store Order (TSO) memory model,as found in the x86 architecture. We prove that lock-freedom, wait-freedom,deadlock-freedom and starvation-freedom are undecidable on TSO for a bounded number of processes, while obstruction-freedom is decidable.

at July 22, 2021 10:47 PM UTC

Authors: Naoto Ohsaka
Download: PDF
Abstract: We study the problem of deciding reconfigurability of target sets of a graph. Given a graph $G$ with vertex thresholds $\tau$, consider a dynamic process in which vertex $v$ becomes activated once at least $\tau(v)$ of its neighbors are activated. A vertex set $S$ is called a target set if all vertices of $G$ would be activated when initially activating vertices of $S$. In the Target Set Reconfiguration problem, given two target sets $X$ and $Y$ of the same size, we are required to determine whether $X$ can be transformed into $Y$ by repeatedly swapping one vertex in the current set with another vertex not in the current set preserving every intermediate set as a target set. In this paper, we investigate the complexity of Target Set Reconfiguration in restricted cases. On the hardness side, we prove that Target Set Reconfiguration is PSPACE-complete on bipartite planar graphs of degree $3$ or $4$ and of threshold $2$, bipartite $3$-regular graphs of threshold $1$ or $2$, and split graphs, which is in contrast to the fact that a special case called Vertex Cover Reconfiguration is in P for the last graph class. On the positive side, we present a polynomial-time algorithm for Target Set Reconfiguration on graphs of maximum degree $2$ and trees. The latter result can be thought of as a generalization of that for Vertex Cover Reconfiguration.

at July 22, 2021 10:44 PM UTC

Authors: Bruno Benedetti, Crystal Lai, Davide Lofano, Frank H. Lutz
Download: PDF
Abstract: We implement an algorithm RSHT (Random Simple-Homotopy) to study the simple-homotopy types of simplicial complexes, with a particular focus on contractible spaces and on finding substructures in higher-dimensional complexes. The algorithm combines elementary simplicial collapses with pure elementary expansions. For triangulated d-manifolds with d < 7, we show that RSHT reduces to (random) bistellar flips.

Among the many examples on which we test RSHT, we describe an explicit 15-vertex triangulation of the Abalone, and more generally, (14k+1)-vertex triangulations of Bing's houses with k rooms, which all can be deformed to a point using only six pure elementary expansions.

at July 22, 2021 11:03 PM UTC

Authors: Navid Assadian, Sima Hajiaghaei Shanjani, Alireza Zarei
Download: PDF
Abstract: In this paper we study the following problem: Given $k$ disjoint sets of points, $P_1, \ldots, P_k$ on the plane, find a minimum cardinality set $\mathcal{T}$ of arbitrary rectangles such that each rectangle contains points of just one set $P_i$ but not the others. We prove the NP-hardness of this problem.

at July 22, 2021 11:04 PM UTC

Authors: Michela Meister, Jon Kleinberg
Download: PDF
Abstract: Contact tracing is a key tool for managing epidemic diseases like HIV, tuberculosis, and COVID-19. Manual investigations by human contact tracers remain a dominant way in which this is carried out. This process is limited by the number of contact tracers available, who are often overburdened during an outbreak or epidemic. As a result, a crucial decision in any contact tracing strategy is, given a set of contacts, which person should a tracer trace next? In this work, we develop a formal model that articulates these questions and provides a framework for comparing contact tracing strategies. Through analyzing our model, we give provably optimal prioritization policies via a clean connection to a tool from operations research called a "branching bandit". Examining these policies gives qualitative insight into trade-offs in contact tracing applications.

at July 22, 2021 10:43 PM UTC

Authors: Farzam Ebrahimnejad, James R. Lee
Download: PDF
Abstract: Benjamini and Papasoglou (2011) showed that planar graphs with uniform polynomial volume growth admit $1$-dimensional annular separators: The vertices at graph distance $R$ from any vertex can be separated from those at distance $2R$ by removing at most $O(R)$ vertices. They asked whether geometric $d$-dimensional graphs with uniform polynomial volume growth similarly admit $(d-1)$-dimensional annular separators when $d > 2$. We show that this fails in a strong sense: For any $d \geq 3$ and every $s \geq 1$, there is a collection of interior-disjoint spheres in $\mathbb{R}^d$ whose tangency graph $G$ has uniform polynomial growth, but such that all annular separators in $G$ have cardinality at least $R^s$.

at July 22, 2021 10:38 PM UTC

Authors: Michael Dinitz, Sungjin Im, Thomas Lavastida, Benjamin Moseley, Sergei Vassilvitskii
Download: PDF
Abstract: A recent line of research investigates how algorithms can be augmented with machine-learned predictions to overcome worst case lower bounds. This area has revealed interesting algorithmic insights into problems, with particular success in the design of competitive online algorithms. However, the question of improving algorithm running times with predictions has largely been unexplored.

We take a first step in this direction by combining the idea of machine-learned predictions with the idea of "warm-starting" primal-dual algorithms. We consider one of the most important primitives in combinatorial optimization: weighted bipartite matching and its generalization to $b$-matching. We identify three key challenges when using learned dual variables in a primal-dual algorithm. First, predicted duals may be infeasible, so we give an algorithm that efficiently maps predicted infeasible duals to nearby feasible solutions. Second, once the duals are feasible, they may not be optimal, so we show that they can be used to quickly find an optimal solution. Finally, such predictions are useful only if they can be learned, so we show that the problem of learning duals for matching has low sample complexity. We validate our theoretical findings through experiments on both real and synthetic data. As a result we give a rigorous, practical, and empirically effective method to compute bipartite matchings.

at July 22, 2021 10:41 PM UTC

Authors: Maxwell Allman, Venus Lo, S. Thomas McCormick
Download: PDF
Abstract: There are many applications of max flow with capacities that depend on one or more parameters. Many of these applications fall into the "Source-Sink Monotone" framework, a special case of Topkis's monotonic optimization framework, which implies that the parametric min cuts are nested. When there is a single parameter, this property implies that the number of distinct min cuts is linear in the number of nodes, which is quite useful for constructing algorithms to identify all possible min cuts.

When there are multiple Source-Sink Monotone parameters and the vector of parameters are ordered in the usual vector sense, the resulting min cuts are still nested. However, the number of distinct min cuts was an open question. We show that even with only two parameters, the number of distinct min cuts can be exponential in the number of nodes.

at July 22, 2021 10:45 PM UTC

Authors: Itzik Mizrahi, Shai Avidan
Download: PDF
Abstract: Deep Neural Networks require large amounts of labeled data for their training. Collecting this data at scale inevitably causes label noise.Hence,the need to develop learning algorithms that are robust to label noise. In recent years, k Nearest Neighbors (kNN) emerged as a viable solution to this problem. Despite its success, kNN is not without its problems. Mainly, it requires a huge memory footprint to store all the training samples and it needs an advanced data structure to allow for fast retrieval of the relevant examples, given a query sample. We propose a neural network, termed kNet, that learns to perform kNN. Once trained, we no longer need to store the training data, and processing a query sample is a simple matter of inference. To use kNet, we first train a preliminary network on the data set, and then train kNet on the penultimate layer of the preliminary network.We find that kNet gives a smooth approximation of kNN,and cannot handle the sharp label changes between samples that kNN can exhibit. This indicates that currently kNet is best suited to approximate kNN with a fairly large k. Experiments on two data sets show that this is the regime in which kNN works best,and can therefore be replaced by kNet.In practice, kNet consistently improve the results of all preliminary networks, in all label noise regimes, by up to 3%.

at July 22, 2021 10:38 PM UTC

Authors: Suryajith Chillara
Download: PDF
Abstract: Recently, Forbes, Kumar and Saptharishi [CCC, 2016] proved that there exists an explicit $d^{O(1)}$-variate and degree $d$ polynomial $P_{d}\in VNP$ such that if any depth four circuit $C$ of bounded formal degree $d$ which computes a polynomial of bounded individual degree $O(1)$, that is functionally equivalent to $P_d$, then $C$ must have size $2^{\Omega(\sqrt{d}\log{d})}$.

The motivation for their work comes from Boolean Circuit Complexity. Based on a characterization for $ACC^0$ circuits by Yao [FOCS, 1985] and Beigel and Tarui [CC, 1994], Forbes, Kumar and Saptharishi [CCC, 2016] observed that functions in $ACC^0$ can also be computed by algebraic $\Sigma\mathord{\wedge}\Sigma\Pi$ circuits (i.e., circuits of the form -- sums of powers of polynomials) of $2^{\log^{O(1)}n}$ size. Thus they argued that a $2^{\omega(\log^{O(1)}{n})}$ "functional" lower bound for an explicit polynomial $Q$ against $\Sigma\mathord{\wedge}\Sigma\Pi$ circuits would imply a lower bound for the "corresponding Boolean function" of $Q$ against non-uniform $ACC^0$. In their work, they ask if their lower bound be extended to $\Sigma\mathord{\wedge}\Sigma\Pi$ circuits.

In this paper, for large integers $n$ and $d$ such that $\omega(\log^2n)\leq d\leq n^{0.01}$, we show that any $\Sigma\mathord{\wedge}\Sigma\Pi$ circuit of bounded individual degree at most $O\left(\frac{d}{k^2}\right)$ that functionally computes Iterated Matrix Multiplication polynomial $IMM_{n,d}$ ($\in VP$) over $\{0,1\}^{n^2d}$ must have size $n^{\Omega(k)}$. Since Iterated Matrix Multiplication $IMM_{n,d}$ over $\{0,1\}^{n^2d}$ is functionally in $GapL$, improvement of the afore mentioned lower bound to hold for quasipolynomially large values of individual degree would imply a fine-grained separation of $ACC^0$ from $GapL$.

at July 22, 2021 10:37 PM UTC

Alright everyone:

  1. Victor Galitski has an impassioned rant against out-of-control quantum computing hype, which I enjoyed and enthusiastically recommend, although I wished Galitski had engaged a bit more with the strongest arguments for optimism (e.g., the recent sampling-based supremacy experiments, the extrapolations that show gate fidelities crossing the fault-tolerance threshold within the next decade). Even if I’ve been saying similar things on this blog for 15 years, I clearly haven’t been doing so in a style that works for everyone. Quantum information needs as many people as possible who will tell the truth as best they see it, unencumbered by any competing interests, and has nothing legitimate to fear from that. The modern intersection of quantum theory and computer science has raised profound scientific questions that will be with us for decades to come. It’s a lily that need not be gilded with hype.
  2. Last month Limaye, Srinivasan, and Tavenas posted an exciting preprint to ECCC, which apparently proves the first (slightly) superpolynomial lower bound on the size of constant-depth arithmetic circuits, over fields of characteristic 0. Assuming it’s correct, this is another small victory in the generations-long war against the P vs. NP problem.
  3. I’m grateful to the Texas Democratic legislators who fled the state to prevent the legislature, a couple miles from my house, having a quorum to enact new voting restrictions, and who thereby drew national attention to the enormity of what’s at stake. It should go without saying that, if a minority gets to rule indefinitely by forcing through laws to suppress the votes of a majority that would otherwise unseat it, thereby giving itself the power to force through more such laws, etc., then we no longer live in a democracy but in a banana republic. And there’s no symmetry to the situation: no matter how terrified you (or I) might feel about wokeists and their denunciation campaigns, the Democrats have no comparable effort to suppress Republican votes. Alas, I don’t know of any solutions beyond the obvious one, of trying to deal the conspiracy-addled grievance party crushing defeats in 2022 and 2024.

by Scott at July 21, 2021 08:57 PM UTC

Welcome to ALT Highlights, a series of blog posts spotlighting various happenings at the recent conference ALT 2021, including plenary talks, tutorials, trends in learning theory, and more! To reach a broad audience, the series is disseminated as guest posts on different blogs in machine learning and theoretical computer science. This initiative is organized by the Learning Theory Alliance and is overseen by Gautam Kamath. All posts in ALT Highlights are indexed on the official Learning Theory Alliance blog.

This is the sixth and final post in the series, on trends in machine learning theory, written by Margalit GlasgowMichal Moshkovitz, and Cyrus Rashtchian.

Throughout the last few decades, we have witnessed unprecedented growth of machine learning. Originally a topic formalized by a small group of computer scientists, machine learning now impacts many areas: the physical sciences, medicine, commerce, finance, urban planning, and more. The rapid growth of machine learning can be partially attributed to the availability of large amounts of data and the development of powerful computing devices. Another important factor is that machine learning has foundations in many other fields, such as theoretical computer science, algorithms, applied mathematics, statistics, and optimization. 

If machine learning is already mathematically rooted in many existing research areas, why do we need a field solely dedicated to learning theory? According to Daniel Hsu, “Learning theory serves (at least) two purposes: to help make sense of machine learning, and also to explore the capabilities and limitations of learning algorithms.” Besides finding innovative applications for existing tools, learning theorists also provide answers to long-standing problems and ask new fundamental questions. 

Modern learning theory goes beyond classical statistical and computer science paradigms by: 

  • developing insights about specific computational models (e.g., neural networks) 
  • analyzing popular learning algorithms (e.g., stochastic gradient descent)
  • taking into account data distributions (e.g., margin bounds or manifold assumptions)
  • adding auxiliary goals (e.g., robustness or privacy), and 
  • rethinking how algorithms interact with and access data (e.g., online or reinforcement learning).

By digging deep into the basic questions, researchers generate new concepts and models that change the way we solve problems and help us understand emerging phenomena.

This article provides a brief overview of three key areas in machine learning theory: new learning paradigms, trustworthy machine learning, and reinforcement learning. We describe the main thrust of each of these areas, as well as point to a few papers from ALT 2021 (the 32nd International Conference on Algorithmic Learning Theory) that touch each of these topics. To share a broader view, we also asked experts in the areas to comment on the field and on their recent papers. Needless to say, this article only scratches the surface. At the end, we point to places to learn more about learning theory.

New Machine Learning Paradigms
The traditional learning theory framework, probably approximately correct (PAC) learning, defines what it means to learn a ground-truth classifier from a candidate class of possible classifiers. Alongside PAC learning is Vapnik-Chervonenkis (VC) theory, which characterizes the number of samples needed and sufficient to learn a classifier from a given class. The generalization analysis from VC theory is restricted to guarantees that hold independently of the data distribution — that is, even for worst-case distributions. Additionally, the VC/PAC learning paradigm suggests that whenever learning is possible, it can be accomplished by choosing the classifier that minimizes loss on the training data, called the empirical risk minimizer (ERM). 

This classical framework unfortunately fails to explain the empirical success of machine learning (ML). “The distribution-free setting, while it comes with the elegant VC theory, turned out to be unsatisfactory,” says Csaba Szepesvári. “Due to the oversimplified setting, the theory could not contribute meaningfully to understanding all kinds of learning methods such as learning with trees, boosting, neural networks, SVMs, or using any other nonparametric methods.” Researchers posit that stronger guarantees should be possible if we leverage natural assumptions about the data distribution, though identifying the right “natural assumptions” is a challenging task. Similarly, understanding which of many possible ERM solutions a learning algorithm chooses may yield better generalization results than those yielded by VC theory. 

Methods that provide distribution-specific guarantees aren’t new to learning theory. A canonical example is known as a margin bound, where the test error of a classifier is analyzed in terms of the margin that separates the different prediction categories. In one of the ALT best papers, Steve Hanneke and Aryeh Kontorovich prove generalization guarantees in terms of the size of the margin for two popular classification algorithms: support vector machines (SVMs) and the perceptron algorithm. The authors answer a core open question, showing that SVMs achieve the optimal margin bound!

Further work at ALT uses an assumption that data lies on a low-dimensional manifold to prove guarantees for generative models. Generative models synthesize original samples, such as images or text, that resemble training data, but without copying the data directly. While so-called generative adversarial networks work well in practice, few guarantees exist because it is challenging to statistically formulate the requirement of generating original samples. Nicolas Schreuder, Victor-Emmanuel Brunel, and Arnak S. Dalalyan consider a new framework in their paper, in which they guarantee originality by outputting a continuous distribution, which ensures that it is very unlikely to output training examples. If the training data is generated from a low-dimensional manifold, they show that it is possible to learn a good generator, which outputs a smooth transformation of a random point.

In another paper, Christopher Tosh, Akshay Krishnamurthy, and Daniel Hsu use distributional assumptions to show when unsupervised methods can use unlabeled data to learn useful representations of data. A linear function trained on these representations and some labeled data can then be used for downstream prediction of the labels. The key idea is to look at when data has multiview redundancy (MVR), which arises, for instance, when data is augmented: Under MVR, each data point can be viewed as a pair (X, Z), and the label Y can be predicted almost as well from X or Z as from the full pair. For instance, each pair might be two halves of an article, or two rotations of an image. The authors show how a theoretical approach called landmark embedding can produce a representation that enables low-error linear classification. Additionally, they analyze when the representations are learned implicitly while training a model to predict whether two views X and Z correspond to the same example, which is close to what is done in practice. 

Another new paradigm considers how the specific training algorithm affects which of many candidate ERM solutions are chosen. This is called the implicit bias of a training algorithm: If there are multiple equally good solutions, then why does an algorithm choose one over the other? This is particularly relevant when studying neural networks, which are typically overparameterized and can be trained to find many solutions with zero empirical risk. In one paper, Ziwei Ji and Matus Telgarsky characterize the implicit bias of using gradient descent to train a linear classifier with a general loss function. They show that the solution relates to the optimizer of a particular smoothed margin function. While this paper does not yield generalization guarantees, this type of implicit bias analysis can sometimes lead to generalization guarantees via margin bounds.

The goal of this area is to go beyond traditional learning theory paradigms by leveraging distributional and algorithmic properties. But a major open question is designing a more general mathematical theory that exploits such properties. Shay Moran offers a standard for new theory: “I hope that in the next 10 years we will develop more realistic models of learning, but I will insist that they still be mathematically clean.”

Trustworthy ML
Machine learning has inspired many new areas and technologies: personalized health care, drug discovery, advertising, résumé screening, credit loans, and more. However, these critical and user-centered applications require a higher standard of testing and verification because mistakes may deeply affect many people. Addressing these challenges has inspired a new field of research centered on making machine learning more trustworthy and reliable, which is the motivation for many of the ALT papers this year as well. An expert in the area, Kamalika Chaudhuri, says, “For my field, which is trustworthy ML, the theoretical goal and challenge remains modeling and frameworks.” She elaborates: “Coming up with new conceptual frameworks for learning has always been one of the core challenges in learning theory since its early days, and it is doubly important now.” Researchers have been exploring this direction in many areas, including privacy, data deletion, robustness, fairness, interpretability, and causality. 

Several ALT 2021 papers cover questions in privacy. The general goal is to understand how to modify existing learning methods to take into account privacy constraints. One of the papers, by Di Wang, Huanyu Zhang, Marco Gaboardi, and Jinhui Xu, considers generalized linear models in a differential privacy model. A central motivation is to understand the role of public, unlabeled data in improving the learnability of these problems in a private setting. Summarizing another direction, Gautam Kamath comments on his paper with co-authors Ishaq Aden-Ali and Hassan Ashtiani, “This paper focuses on a very simple but surprisingly challenging question: Can we learn a general multivariate Gaussian under the constraint of differential privacy? Prior works focused on restricted settings — for example, with bounded parameters or known covariance. We gave the first finite sample complexity bound for this problem, which evidence suggests is near optimal. The next question is to design a computationally efficient algorithm for this problem.” Another paper on privacy, by Mark Cesar and Ryan Rogers, studies the composition of various privacy mechanisms in the context of real-world data analytics pipelines.

Another aspect of respecting user privacy is allowing people to choose to stop sharing their data. Concretely, this means removing their data from data sets and ensuring that existing and future models do not make use of their data in any way. One name for this process is machine unlearning, and the main challenge is removing the data efficiently without retraining all models from scratch. One paper, by Seth Neel, Aaron Roth, and Saeed Sharifi-Malvajerdi, addresses this challenge. They propose ways to strategically update the model by using modified gradient descent methods. They also analyze this approach, and prove new upper and lower bounds for updating models after data deletion with their new optimization algorithm. 

Robust methods for machine learning and statistics aim to provide rigorous guarantees in the presence of outliers or adversarially modified data points. The field of robustness has been steadily growing as researchers uncover more and more models where deviations in the data can lead to unexpected and dramatic changes in model behavior. In ALT 2021, a paper by Jie Shen and Chicheng Zhang covers learning half-spaces nearly optimally even in the presence of malicious noise.

As a final and thought-provoking direction in trustworthy ML, Omer Reingold points out that we need to better understand “the meaning of individual probabilities/risk scores,” which are common ways that ML systems summarize or justify decisions. Ideally, the output of a model should be something that people can interpret directly and use to potentially modify their future actions. He elaborates that it is important to think about “the individual quantities (which imply important decisions) that ML is trying to approximate” and to answer “what does fitting the parameters of a model on the entire population imply for individuals and subcommunities?” This question brings to the forefront the fact that ML systems affect both individuals and groups of people, which is an important consideration when formulating rigorous definitions of fairness (e.g., see this book or this one).

Reinforcement Learning
Reinforcement learning (RL) is a framework for interactive learning where an agent interacts with an environment, and the agent’s actions govern the rewards it receives from the environment. Part of the motivation for studying RL is that relevant problems are everywhere. Sometimes the agents are autonomous vehicles. Other times, they are programs playing games like chess or go. People interact more and more with ML models, and hence, living our lives is actually being a part of a multiagent game where we, humans, and the ML models are the agents. “The most exciting direction in learning theory of recent years,” says Elad Hazan, “is adding rigorous theoretical guarantees to reinforcement learning.” 

The RL environment is typically modeled as a Markov decision process (MDP): a set of states, actions, and transition probabilities that determine the next state and reward given the agent’s current state and action. The agent uses a policy to choose its action from each state with the goal of maximizing its cumulative reward over time. A central challenge is balancing exploration (learning about the environment) and exploitation (spending time choosing actions in states where they can collect high rewards).

In the most basic setting, multi-armed bandits, there is a single state, and each action (or “arm”) leads to a stochastic reward. “Here, the theory is quite mature, though interesting problems remain in connection to the limits of how structure can be exploited,” Csaba Szepesvári says. In two works at ALT, by Marc Jourdan, Mojmír Mutný, Johannes Kirschner, and Andreas Krause and by Thibaut Cuvelier, Richard Combes, and Eric Gourdin, the authors show that efficient exploration is possible in a combinatorial semi-bandit setting. Here, the agent can choose an allowed set of arms in each step, and it receives a distinct reward for each chosen arm. While the number of action choices for the agent is larger, this more detailed feedback makes the problem tractable.

Beyond the stateless bandit setting, ML theorists are still figuring out how fast agents can learn how to play optimally in MDPs with finite state spaces and action spaces. Recent progress on this front has given lower bounds on the sample complexity required for agents to learn the best policy. In one ALT paper, the authors give a unified view of lower bounds for three distinct, but related, problems in RL. The hard MDP instances they construct to show lower bounds are based on hard instances for multi-armed bandit problems.

In the more challenging setting where the state space is infinite, a central question is whether the agent can learn from exploring a finite number of states, and generalize to perform well on unknown areas of the environment. For certain MDPs, generalizing is impossible, but some assumptions on the structure of the MDP may enable generalization. “While algorithm independent problem formulations existed and have been studied in the finite case, a quite recent development is to extend these to the case of ‘large’ environments where the use of function approximation techniques becomes crucial for achieving nontrivial results,” explains Csaba Szepesvári. 

Function approximation has to do with the optimal action-value function, which captures the long-term reward of playing a certain action from a given state. This function can sometimes be approximated by some simple class of functions. One of the strongest such assumptions is linear realizability, where the optimal action-value function is a linear function of some representation of the action and state. In one of the papers receiving a best paper award in ALT, Gellert Weisz, Philip Amortila, and Csaba Szepesvári show that even under this strong assumption of linear realizability, the agent needs a number of samples exponential in the length of the episode or the dimension of the representation in order to generalize. Looking forward, the goal is to follow the lead of these papers and better understand the landscape of sample complexity: When can we learn models with a polynomial number of samples, and when is an exponential number necessary?

Nearly every offline learning problem can be studied in an interactive setting, where inputs arrive in an online fashion and need to be processed immediately, which is common in many real-world settings. Models for interactive machine learning provide a framework for studying problems and algorithms in this more challenging setting. Beyond the MDP setting, interactive learning spans online learning (e.g., no-substitution clustering), nonstochastic control theory (e.g., robust controllers for dynamical systems), online convex optimization, and many more domains. 

We hope this provides a fairly broad view on some of the topics that people are researching right now in learning theory. Of course, there are many more areas that we don’t have space to describe: theory of deep neural networks, quantum algorithms for machine learning problems, human-centered considerations, learning with strategic agents and multiplayer games, convex/nonconvex optimization, federated and distributed learning algorithms, and many more. In general, as Gautam Kamath observes, “A lot of important questions in learning theory arise through interplay between the theoretical and applied machine learning communities.” To have a greater impact, it is important to collaborate with people doing empirical research, and to learn from the front lines about the most interesting phenomena to explain, or the challenges that do not seem surmountable by combining existing tools.

To learn more and to get more involved, we have listed a variety of resources (blogs, workshops, videos, etc.) that can help you get started in this area. As a final motivation for writing this article, we remark that people in the area are keenly aware that we need more young talent to help uncover truth and contribute groundbreaking ideas. As Gautam Kamath puts it, “There are far more interesting questions in learning theory than there are researchers to solve them.”

Places to learn more

Blogs: UCSD ML blog, Off the Convex Path, Windows On Theory, I’m a bandit, Francis Bach’s blog, Differential Privacy blog, Distill, The Gradient


Podcasts: TWIML, Gradient Dissent, Joy of x, The Robot Brains, Talking Machines, TalkRL, Underrated ML

Videos: Simons Institute, IAS deep learning workshop, One World ML, Trustworthy ML, Foundations of Data Science, RL Theory Virtual Seminars, iMSi: The Multifaceted Complexity of Machine Learning, Control Meets Learning 

Acknowledgments: We thank Kamalika Chaudhuri, Elad Hazan, Daniel Hsu, Gautam Kamath, Shay Moran, Omer Reingold, Csaba Szepesvári, and Claire Vernade for helpful comments and thoughtful quotes. We thank Kush Bhatia, Lee Cohen, Neha Gupta, Nika Haghtalab, Max Hopkins, Gautam Kamath, Gaurav Mahajan, and Uri Sherman for helpful feedback on initial drafts.

Margalit Glasgow is a PhD student in Stanford’s Computer Science Department, advised by Mary Wootters. Her research focuses on theoretical machine learning and random matrices.

Michal Moshkovitz is a postdoc at the Qualcomm Institute at UC San Diego. She received her PhD and MSc in computational neuroscience from the Hebrew University of Jerusalem, and her MSc in computer science from Tel Aviv University. Her research focuses on the foundations of AI, exploring how different constraints affect learning. She has worked on bounded-memory learning, explainable machine learning, and online decision-making in unsupervised learning. 

Cyrus Rashtchian is a postdoc at UC San Diego in computer science and engineering, and he received his PhD from the University of Washington. His research focuses on trustworthy machine learning, algorithms for big data, statistical reconstruction, and DNA data storage.

by 1737780 at July 19, 2021 09:20 PM UTC

 1) Trump and other reps have said the following about the Jan 6 event at various times:

a) The Jan 6 event was freedom fighters who were fighting the noble fight to overturn a fraudulent election. Rah Rah!

b) The Jan 6 event was a peaceful protest to overturn a fraudulent election. 

c) The Jan 6 event was democrats trying to make republicans look bad.

d) The Jan 6 event was Antifa. (See here)

e) Ashli Babbitt (a protestor who was shot by police at the Jan 6 event) should have been honored by flying the flag at half-mast (see here

If we take the intersection of d and e we find that Trump wants to honor a member of Antifa who was shot by police. 

To be fair, Trump is entitled to change his mind. But I wonder- did he ever really think it was Antifa or was that a talking point? If he really thought so then  when did he change his mind? I ask non rhetorically---  NOT a `gotcha question'

(NOTE: I wrote this post a while back. Since then Trump and some other republicans are tending towards the Freedom Fighters narrative.) 

2) Vaccines:

a) Some people think that the vaccines are bad to take. I suspect they would give some (incorrect) health reasons, while the real reason may be political. (One reason is that the vaccine make you magnetic. That sounds awesome! Others think that there is a microchip in the vaccine so that Bill Gates can track our movements. Gee, Mark Zuckerberg can already do that. Some thing it will rewrite our DNA. A bio major I know  tells me that such people are confusing messenger RNA with DNA. Great- we can now have a nice conversation and point out where they are wrong.) 

b) Some people think we should NOT give vaccines to poor countries that need them, or to people in prison,  since Americans should have priority. (NOTE- from a purely health-viewpoint this is not correct since a pandemic does not respect boundaries- if there is an outbreak in a diff country or in a prison it will affect people not in those countries and not in prison.) 

Are there people who believe a and b? I ask non-rhetorically. 

I am not surprised when people hold contradictory thoughts in their heads, but these two cases just struck me as particularly strange. Not sure why.


by gasarch ( at July 18, 2021 06:58 PM UTC

We show that the QRAT simulation algorithm of $\forall$Exp+Res from [B. Kiesl and M. Seidl, 2019] cannot be lifted to IR-calc.

at July 18, 2021 04:15 PM UTC

Over finite fields $F_q$ containing a root of unity of smooth order $n$ (smoothness means $n$ is the product of small primes), the Fast Fourier Transform (FFT) leads to the fastest known algebraic algorithms for many basic polynomial operations, such as multiplication, division, interpolation and multi-point evaluation. These operations can be computed by constant fan-in arithmetic circuits over $F_q$ of quasi-linear size; specifically, $O(n \log n)$ for multiplication and division, and $O(n \log^2 n)$ for interpolation and evaluation. However, the same operations over fields with no smooth order root of unity suffer from an asymptotic slowdown, typically due to the need to introduce “synthetic” roots of unity to enable the FFT. The classical algorithm of Schönhage and Strassen incurred a multiplicative slowdown factor of $\log \log n$ on top of the smooth case. Recent remarkable results of Harvey, van der Hoeven and Lecerf dramatically reduced this multiplicative overhead to $\exp(\log^* (n))$. We introduce a new approach to fast algorithms for polynomial operations over all large finite fields. The key idea is to replace the group of roots of unity with a set of points $L \subset F_q$ suitably related to a well-chosen elliptic curve group over $F_q$ (the set L itself is not a group). The key advantage of this approach is that elliptic curve groups can be of any size in the Hasse–Weil interval $[q + 1 \pm 2\sqrt{q}]$ and thus can have subgroups of large, smooth order, which an FFT-like divide and conquer algorithm can exploit. Compare this with multiplicative subgroups over $F_q$ whose order must divide $q-1$. By analogy, our method extends the standard, multiplicative FFT in a similar way to how Lenstra’s elliptic curve method extended Pollard’s $p-1$ algorithm for factoring integers. For polynomials represented by their evaluation over subsets of $L$, we show that multiplication, division, degree-computation, interpolation, evaluation and Reed–Solomon encoding (also known as low-degree extension) with fixed evaluation points can all be computed with arithmetic circuits of size similar to what is achievable with the classical FFTs when the field size $q$ is special. For several problems, this yields the asymptotically smallest known arithmetic circuits even in the standard monomial representation of polynomials. The efficiency of the classical FFT follows from using the 2-to-1 squaring map to reduce the evaluation set of roots of unity of order $2^k$ to similar groups of size $2^{k?i}$, $i > 0$. Our algorithms operate similarly, using isogenies of elliptic curves with kernel size 2 as 2-to-1 maps to reduce $L$ of size $2^k$ to sets of size $2^{k?i}$ that are, like $L$, suitably related to elliptic curves, albeit different ones.

at July 18, 2021 02:07 PM UTC

January 10-11, 2022 Alexandria, Virginia, U.S. Submission deadline: August 9, 2021 Symposium on Simplicity in Algorithms is a conference in theoretical computer science dedicated to advancing algorithms research by promoting simplicity and elegance in the design and analysis of algorithms. The benefits of simplicity are manifold: simpler algorithms manifest a better understanding of the … Continue reading 5th SIAM Symposium on Simplicity in Algorithms

by shacharlovett at July 18, 2021 01:15 PM UTC

Raft is a consensus algorithm for deciding a sequence of commands to execute on a replicated state machine. Raft is famed for its understandability (relative to other consensus algorithms such as Paxos) yet some aspects of the protocol still require careful treatment. For instance, determining when it is safe for...

at July 17, 2021 03:25 PM UTC

The researcher will join a rapidly growing department, with strong research sections in the areas of Algorithms and Complexity, Machine Learning, Natural Language Processing, Human-Centered Computing, Software Engineering and Data Management, Programming Languages, and Image Analysis.


by shacharlovett at July 16, 2021 08:41 AM UTC

There are two reasons for the large number of Good Articles in this set. First, I had previously been trying to keep my nominations and reviews in balance, but there were too few nominations to review on topics of interest to me, and the inability to find things to review was preventing me from nominating other articles when they were ready. So I started nominating more often. And second, the Wikipedia Good Articles editors are having a drive this month to clean out old nominations, as they tend to do a couple of times per year.

by David Eppstein at July 15, 2021 06:02 PM UTC

We must avoid becoming one

Cropped from USA Today source

José Altuve hit a game-winning home run in the bottom of the ninth against the Yankees on Sunday. He thereby reproduced the conditions and the outcome of baseball’s most dramatic cheating accusation of 2019.

Today, at baseball’s All-Star break, we review this and other social experiments that have quite a bit more data.

Altuve won the 2019 American League Championship series with a pinch-hit homer in the bottom of the ninth against the Yankees’ closer, Aroldis Chapman. As he approached home plate, he was seen telling his teammates waiting to mob him at the plate not to rip off his jersey in celebration. He subsequently scooted into the corridor behind the dugout, then re-emerged into the on-field celebration. This fed accusations that he had been wired with a buzzer to know what kind of pitch was coming from Chapman, in line with sign-stealing by other means from the Astros’ 2017 championship season and 2018 that was proven and punished by Major League Baseball.

Almost the same scenario was reproduced Sunday: Houston down 7-5 against the Yankees with two out and two on base in the ninth, Altuve up against the Yankees’ closer (Chad Green recently supplanting Chapman). Altuve socked a homer to the same part of the ballpark to complete a shocking six-run comeback. Immediately upon touching home, he had his shirt ripped off to reveal nothing but the top half of his birthday suit. This was the most direct way possible to witness that he could have hit the other homer without illegal information.

Examples and Non-Examples

I have dealt with chess-cheating cases in which electronic buzzing has been specifically alleged, including the two most prominent cases of 2013. I will not take this post further in this direction, however, but rather to pose this question:

What is considered a “social proof” of an assertion—especially when there are elements of scientific control and reproduction?

A simple example is a police lineup. This tries to control for whether a witness has previously seen the accused by including the accused among usually four or five similarly represented people. Picking the right person is considered to prove the previous encounter. Statistically, however, this is a {p}-value of only 0.20 or 0.167, which are not considered significant at even the weakest level of “statistical proof.” Allowing null lineups does not change the statistics much.

Baseball gives a non-example that surprises me. One of the bad performances that cost Chapman his closer role was losing an 8-4 lead against the Los Angeles Angels on June 30. As a fantasy-baseball player, I’ve regularly observed poor pitching (by the closers on my “fantasy team”) when the lead is too large to earn credit for a coveted save. Does the data reproduce a phenomenon of closers bearing down less when way ahead, with no “save” to gain? A study after the 2013 season, which cleverly represented performances by the same {z}-scores I use in chess, found none. “Meltdowns” like Chapman’s are offset by cases where closers pitched better. The {z}-scores in the study are all in the range -1.25 to +1.50 anyway, which count as statistically random.

This study used a reasonably large data set, one that is well-defined and admits controlling factors such as normalizing for game circumstances and the quality of the opposing hitters. At least it is more than the two instances of Altuve. In-between would be an attempt to determine whether certain national soccer teams are consistently worse at penalty-kick tiebreaks. England’s and Italy’s teams brought their long tortured histories together in the tiebreak of Sunday’s European Cup final. The Italians missed two of five kicks, a score that often spells doom, but the English missed three.

Larger Scale

Dick and I are really interested in “experiments” that have spilled into society, with minimal controls but large data. One sphere of this is cybersecurity.

It seems to us that only in the past decade have security experts begun formalizing their research as experimental science with repeatability and reproducibility as explicit criteria. The NSA devoted a special 2012 issue of their Next Wave series to what they titled as “Developing a blueprint for a science of cybersecurity.” Among the contents are:

  • An introductory essay by Carl Landwehr titled, “Cybersecurity: From engineering to science.”

  • A linchpin paper by Fred Schneider titled, “Blueprint for a science of cybersecurity.”

  • A paper by Roy Maxion titled, “Making experiments dependable,” which came from a 2011 Springer LNCS Festschrift.

Maxion’s main example is keystroke biometrics. This covers inferences made from typing style on a computer keyboard or mouse or similar handheld input device. This can be used to verify identity or screen for malfeasant activities. Online chess playing platforms collect data of this nature—okay we could not resist adding chess example.

Another area is experiments designed to simulate attacks and test defenses against them. Schneider’s paper begins with a contrast between predictive modeling versus reactive handling of them. About the latter, he draws an analogy with health care:

“Some health problems are best handled in a reactive manner. We know what to do when somebody breaks a finger, and each year we create a new influenza vaccine in anticipation of the flu season to come. But only after making significant investments in basic medical sciences are we starting to understand the mechanisms by which cancers grow, and a cure seems to require that kind of deep understanding.”

He goes on to outline the kind of scientific foundation that could hopefully underlie a ‘cure’ for intrusion and malware and the like.

What we have seen happen especially in the past months, however—in both health and security—is uncontrolled experiments with society as the domain. Large-scale ransomware attacks are becoming as frequent as hurricanes and heat waves. And of course, the pandemic. These share with Altuve the property of being one-off instances, but have large data on the receiving end.

Summer Pandemic Update

The following chart updates our June 20 post on the state of the pandemic and its projection for the summer—for Florida and the United Kingdom in particular:

The vertical line shows about where the charts were on June 20. The past few days are the first where we can point to a significant rise in Florida, though Missouri had a similar rise last week and it is showing up in some other states. The charts are taken from the Worldometer coronavirus pages.

The UK rise looks ghastly. It was a subtext of our previous post to worry that allowing the large dense soccer crowds at London’s Wembley Stadium for the semis and final—and anything similar in baseball—would stoke the rise in our respective countries even more. However, the rate of hospitalizations in the UK has remained largely flat. This Fortune article last week is one of several attesting that the new cases are mostly in children or in vaccinated people with enough immunity to contain the “breakthrough” positive. The UK is going ahead with large-scale re-openings later this month, with the portion of those 18 and older who have had one dose approaching 90% and those fully vaccinated coming past 70%. The latter number in relation to the whole population is about 52%.

The US looks like becoming an experiment in how the local vaccination rate affects the numbers. The rates of those fully vaccinated by state are currently eerily similar to Joe Biden’s vote percentage in the state. One aspect of scientific reproducibility is the size of the simplest classifier of the results. For a presidential vote to have simpler explaining power than any factors of biology or other life circumstances would make a strange experiment indeed.

Open Problems

Dick and I tried to come up with other examples—from computer security in particular—to sustain what has been occupying our thoughts about standards of proof for policy. We would welcome some examples from you, our readers.

And of course, we have been concerned about the present course of the pandemic amid re-openings since the referenced post last month. In the meantime, if it is your taste, please enjoy the All-Star Game, which Altuve is, ironically, skipping.

[some word fixes and changes]

by KWRegan at July 13, 2021 07:47 PM UTC

We give tight bounds on the degree $\ell$ homogenous parts $f_\ell$ of a bounded function $f$ on the cube. We show that if $f: \{\pm 1\}^n \rightarrow [-1,1]$ has degree $d$, then $\| f_\ell \|_\infty$ is bounded by $d^\ell/\ell!$, and $\| \hat{f}_\ell \|_1$ is bounded by $d^\ell e^{{\ell+1 \choose 2}} n^{\frac{\ell-1}{2}}$. We describe applications to pseudorandomness and learning theory. We use similar methods to generalize the classical Pisier's inequality from convex analysis. Our analysis involves properties of real-rooted polynomials that may be useful elsewhere.

at July 13, 2021 05:55 PM UTC

We give a new proof of the fact that the parallel repetition of the (3-player) GHZ game reduces the value of the game to zero polynomially quickly. That is, we show that the value of the $n$-fold GHZ game is at most $n^{-\Omega(1)}$. This was first established by Holmgren and Raz [HR20]. We present a new proof of this theorem that we believe to be simpler and more direct. Unlike most previous works on parallel repetition, our proof makes no use of information theory, and relies on the use of Fourier analysis. The GHZ game [GHZ89] has played a foundational role in the understanding of quantum information theory, due in part to the fact that quantum strategies can win the GHZ game with probability $1$. It is possible that improved parallel repetition bounds may find applications in this setting. Recently, Dinur, Harsha, Venkat, and Yuen [DHVY17] highlighted the GHZ game as a simple three-player game, which is in some sense maximally far from the class of multi-player games whose behavior under parallel repetition is well understood. Dinur et al. conjectured that parallel repetition decreases the value of the GHZ game exponentially quickly, and speculated that progress on proving this would shed light on parallel repetition for general multi-player (multi-prover) games.

at July 13, 2021 02:48 PM UTC

The link is here. Due to the pandemic it is going to be virtual.

Quoting Bob Dylan, this blog is not dead, it is just asleep.

by Sariel at July 13, 2021 03:38 AM UTC

Piazza Duomo, in Milan, on December 26, 2020

Piazza Duomo, in Milan, on July 11, 2021

by luca at July 12, 2021 05:04 PM UTC

A basic and frequent task in data analysis is selection – given a set of options \(\mathcal{Y}\), output the (approximately) best one, where “best” is defined by some loss function \(\ell : \mathcal{Y} \times \mathcal{X}^n \to \mathbb{R}\) and a dataset \(x \in \mathcal{X}^n\). That is, we want to output some \(y \in \mathcal{Y}\) that approximately minimizes \(\ell(y,x)\). Naturally, we are interested in private selection – i.e., the output should be differentially private in terms of the dataset \(x\). This post discusses algorithms for private selection – in particular, we give an improved privacy analysis of the popular exponential mechanism.

The Exponential Mechanism

The most well-known algorithm for private selection is the exponential mechanism [MT07]. The exponential mechanism \(M : \mathcal{X}^n \to \mathcal{Y} \) is a randomized algorithm given by \[\forall x \in \mathcal{X}^n ~ \forall y \in \mathcal{Y} ~~~~~ \mathbb{P}[M(x) = y] = \frac{\exp(-\frac{\varepsilon}{2\Delta} \ell(y,x))}{\sum_{y’ \in \mathcal{Y}} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x)) }, \tag{1}\] where \(\Delta\) is the sensitivity of the loss function \(\ell\) given by \[\Delta = \sup_{x,x’ \in \mathcal{X}^n : d(x,x’) \le 1} \max_{y\in\mathcal{Y}} |\ell(y,x) - \ell(y,x’)|,\tag{2}\] where the supremum is taken over all datasets \(x\) and \(x’\) differing on the data of a single individual (which we denote by \(d(x,x’)\le 1\)).

In terms of utility, we can easily show that [BNSSSU16] \[\mathbb{E}[\ell(M(x),x)] \le \min_{y \in \mathcal{Y}} \ell(y,x) + \frac{2\Delta}{\varepsilon} \log |\mathcal{Y}|\] for all \(x \in \mathcal{X}^n\) (and we can also give high probability bounds).

It is easy to show that the exponential mechanism satisfies \(\varepsilon\)-differential privacy. But there is more to this story! We’re going to look at a more refined privacy analysis.

Bounded Range

The privacy guarantee of the exponential mechanism is more precisely characterized by bounded range. This was observed and defined by David Durfee and Ryan Rogers [DR19] and further analyzed later [DDR20].

Definition 1 (Bounded Range).1 A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\eta\)-bounded range if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual, there exists some \(t \in \mathbb{R}\) such that \[\forall y \in \mathcal{Y} ~~~~~ \log\left(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]}\right) \in [t, t+\eta].\] Here \(t\) may depend on the pair of input datasets \(x,x’\), but not on the output \(y\).

To interpret this definition, we recall the definition of the privacy loss random variable: Define \(f : \mathcal{Y} \to \mathbb{R}\) by \[f(y) = \log\left(\frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]}\right).\] Then the privacy loss random variable \(Z \gets \mathsf{PrivLoss}(M(x)\|M(x’))\) is given by \(Z = f(M(x))\).

Pure \(\varepsilon\)-differential privacy is equivalent to demanding that the privacy loss is bounded by \(\varepsilon\) – i.e., \(\mathbb{P}[|Z|\le\varepsilon]=1\). Approximate \((\varepsilon,\delta)\)-differential privacy is, roughly, equivalent to demanding that \(\mathbb{P}[Z\le\varepsilon]\ge1-\delta\).2

Now \(\eta\)-bounded range is simply demanding that the privacy loss \(Z\) is supported on some interval of length \(\eta\). This interval \([t,t+\eta]\) may depend on the pair \(x,x’\).

Bounded range and pure differential privacy are equivalent up to a factor of 2 in the parameters:

Lemma 2 (Bounded Range versus Pure Differential Privacy).

  • \(\varepsilon\)-differential privacy implies \(\eta\)-bounded range with \(\eta \le 2\varepsilon\).
  • \(\eta\)-bounded range implies \(\varepsilon\)-differential privacy with \(\varepsilon \le \eta\).

Proof. The first part of the equivalence follows from the fact that pure \(\varepsilon\)-differential privacy implies the privacy loss is supported on the interval \([-\varepsilon,\varepsilon]\). Thus, if we set \(t=-\varepsilon\) and \(\eta=2\varepsilon\), then \([t,t+\eta] = [-\varepsilon,\varepsilon]\). The second part follows from the fact that the support of the privacy loss \([t,t+\eta]\) must straddle \(0\). That is, the privacy loss cannot be always positive nor always negative, so \(0 \in [t,t+\eta]\) and, hence, \([t,t+\eta] \subseteq [-\eta,\eta]\). Otherwise \(\forall y ~ f(y)>0\) or \(\forall y ~ f(y)<0\) would imply \(\forall y ~ \mathbb{P}[M(x)=y]>\mathbb{P}[M(x’)=y]\) or \(\forall y ~ \mathbb{P}[M(x)=y]<\mathbb{P}[M(x’)=y]\), contradicting the fact that \(\sum_{y \in \mathcal{Y}} \mathbb{P}[M(x)=y] = 1\) and \(\sum_{y \in \mathcal{Y}} \mathbb{P}[M(x’)=y] = 1\). ∎

OK, back to the exponential mechanism:

Lemma 3 (The Exponential Mechanism is Bounded Range). The exponential mechanism (given in Equation 1 above) satisfies \(\varepsilon\)-bounded range .3

Proof. We have \[e^{f(y)} = \frac{\mathbb{P}[M(x)=y]}{\mathbb{P}[M(x’)=y]} = \frac{\exp(-\frac{\varepsilon}{2\Delta}\ell(y,x))}{\exp(-\frac{\varepsilon}{2\Delta}\ell(y,x’))} \cdot \frac{\sum_{y’} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x’))}{\sum_{y’} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x))}.\] Setting \(t = \log\left(\frac{\sum_{y’} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x’))}{\sum_{y’} \exp(-\frac{\varepsilon}{2\Delta} \ell(y’,x))}\right) - \frac{\varepsilon}{2}\), we have \[ f(y) = \frac{\varepsilon}{2\Delta} (\ell(y,x’)-\ell(y,x)+\Delta) + t.\] By the definition of sensitivity (given in Equation 2), we have \( 0 \le \ell(y,x’)-\ell(y,x)+\Delta \le 2\Delta\), whence \(t \le f(y) \le t + \varepsilon\). ∎

Bounded range is not really a useful privacy definition on its own. Thus we’re going to relate it to a relaxed version of differential privacy next.

Concentrated Differential Privacy

Concentrated differential privacy [BS16] and its variants [DR16] [M17] are relaxations of pure differential privacy with many nice properties. In particular, it composes very cleanly.

Definition 4 (Concentrated Differential Privacy). A randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\rho\)-concentrated differential privacy if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual, \[\forall \lambda > 0 ~~~~~ \mathbb{E}[\exp( \lambda Z)] \le \exp(\lambda(\lambda+1)\rho),\tag{3}\] where \(Z \gets \mathsf{PrivLoss}(M(x)\|M(x’))\) is the privacy loss random variable.4

Intuitively, concentrated differential privacy requires that the privacy loss is subgaussian. Specifically, the bound on the moment generating function of \(\rho\)-concentrated differential privacy is tight if the privacy loss \(Z\) follows the distribution \(\mathcal{N}(\rho,2\rho)\). Indeed, the privacy loss random variable of the Gaussian mechanism has such a distribution.5

OK, back to the exponential mechanism: We know that \(\varepsilon\)-differential privacy implies \(\frac12 \varepsilon^2\)-concentrated differential privacy [BS16]. This, of course, applies to the exponential mechaism. A cool fact – that we want to draw more attention to – is that we can do better! Specifically, \(\eta\)-bounded range implies \(\frac18 \eta^2\)-concentrated differential privacy [CR21]. What follows is a proof of this fact following that of Mark Cesar and Ryan Rogers, but with some simplification.

Theorem 5 (Bounded Range implies Concentrated Differential Privacy). If \(M\) is \(\eta\)-bounded range, then it is \(\frac18\eta^2\)-concentrated differentially private.

Proof. Fix datasets \(x,x’ \in \mathcal{X}^n\) differing on a single individual’s data. Let \(Z \gets \mathsf{PrivLoss}(M(x)\|M(x’))\) be the privacy loss random variable of the mechanism \(M\) on this pair of datasets. By the definition of bounded range (Definition 1), there exists some \(t \in \mathbb{R}\) such that \(Z \in [t, t+\eta]\) with probability 1. Now we employ Hoeffding’s Lemma [H63]:

Lemma 6 (Hoeffding’s Lemma). Let \(X\) be a random variable supported on the interval \([a,b]\). Then, for all \(\lambda \in \mathbb{R}\), we have \[\mathbb{E}[\exp(\lambda X)] \le \exp \left( \mathbb{E}[X] \cdot \lambda + \frac{(b-a)^2}{8} \cdot \lambda^2 \right).\]

Applying the lemma to the privacy loss gives \[\forall \lambda \in \mathbb{R} ~~~~~ \mathbb{E}[\exp(\lambda Z)] \le \exp \left( \mathbb{E}[Z] \cdot \lambda + \frac{\eta^2}{8} \cdot \lambda^2 \right).\] The only remaining thing we need is to show is that \(\mathbb{E}[Z] \le \frac18 \eta^2\).6

If we set \(\lambda = -1 \), then we get \( \mathbb{E}[\exp( - Z)] \le \exp \left( -\mathbb{E}[Z] + \frac{\eta^2}{8} \right)\), which rearranges to \(\mathbb{E}[Z] \le \frac18 \eta^2 - \log \mathbb{E}[\exp( - Z)]\). Now we have \[ \mathbb{E}[\exp( - Z)] \!=\! \sum_y \mathbb{P}[M(x)\!=\!y] \exp(-f(y)) \!=\! \sum_y \mathbb{P}[M(x)\!=\!y] \!\cdot\! \frac{\mathbb{P}[M(x’)\!=\!y]}{\mathbb{P}[M(x)\!=\!y]} \!=\! 1.\] ∎

This brings us to the TL;DR of this post:

Corollary 7. The exponential mechanism (given by Equation 1) is \(\frac18 \varepsilon^2\)-concentrated differentially private.

This is great news. The standard analysis only gives \(\frac12 \varepsilon^2\)-concentrated differential privacy. Constants matter when applying differential privacy, and we save a factor of 4 in the concentrated differential privacy analysis of the exponential mechanism for free with this improved analysis.

Combining Lemma 2 with Theorem 5 also gives a simpler proof of the conversion from pure differential privacy to concentrated differential privacy [BS16]:

Corollary 8. \(\varepsilon\)-differential privacy implies \(\frac12 \varepsilon^2\)-concentrated differential privacy.

Beyond the Exponential Mechanism

The exponential mechanism is not the only algorithm for private selection. A closely-related algorithm is report noisy max/min:7 Draw independent noise \(\xi_y\) from some distribution for each \(y \in \mathcal{Y}\) then output \[M(x) = \underset{y \in \mathcal{Y}}{\mathrm{argmin}} ~ \ell(y,x) - \xi_y.\]

If the noise distribution is an appropriate Gumbel distribution, then report noisy max is exactly the exponential mechanism. (This equivalence is known as the “Gumbel max trick.”)

We can also use the Laplace distribution or the exponential distribution. Report noisy max with the exponential distribution is equivalent to the permute and flip algorithm [MS20] [DKSSWXZ21]. However, these algorithms don’t enjoy the same improved bounded range and concentrated differential privacy guarantees as the exponential mechanism.

There are also other variants of the selection problem. For example, in some cases we can assume that only a few options have low loss and the rest of the options have high loss – i.e., there is a gap between the minimum loss and the second-lowest loss (or, more generally, the \(k\)-th lowest loss). In this case there are algorithms that attain better accuracy than the exponential mechanism under relaxed privacy definitions [CHS14] [BDRS18] [BKSW19].

There are a lot of interesting aspects of private selection, including questions for further research! We hope to have further posts about some of these topics.

  1. For simplicity, we restrict our discussion here to finite sets of outputs, although the definitions, algorithms, and results can be extended to infinite sets. 

  2. To be more precise, \((\varepsilon,\delta)\)-differential privacy is equivalent to demanding that \(\mathbb{E}[\max\{0,1-\exp(\varepsilon-Z)\}]\le\delta\) [CKS20]. (To be completely precise, we must appropriately deal with the \(Z=\infty\) case, which we ignore in this discussion for simplicity.) 

  3. This proof actually gives a slightly stronger result: We can replace the sensitivity \(\Delta\) (defined in Equation 2) by half the range \[\hat\Delta = \frac12 \sup_{x,x’ \in \mathcal{X}^n : d(x,x’) \le 1} \left( \max_{\overline{y}\in\mathcal{Y}} \ell(\overline{y},x) - \ell(\overline{y},x’) - \min_{\underline{y}\in\mathcal{Y}} \ell(\underline{y},x) - \ell(\underline{y},x’) \right).\] We always have \(\hat\Delta \le \Delta\) but it is possible that \(\hat\Delta < \Delta\) and the privacy analysis of the exponential mechanism still works if we replace \(\Delta\) by \(\hat\Delta\). 

  4. Equivalently, a randomized algorithm \(M : \mathcal{X}^n \to \mathcal{Y}\) satisfies \(\rho\)-concentrated differential privacy if, for all pairs of inputs \(x, x’ \in \mathcal{X}^n\) differing only on the data of a single individual, \[\forall \lambda > 0 ~~~~~ \mathrm{D}_{\lambda+1}(M(x)\|M(x’)) \le \lambda(\lambda+1)\rho,\] where \(\mathrm{D}_{\lambda+1}(M(x)\|M(x’)))\) is the order \(\lambda+1\) Rényi divergence of \(M(x)\) from \(M(x’)\). 

  5. To be precise, if \(M(x) = q(x) + \mathcal{N}(0,\sigma^2I)\), then \(M : \mathcal{X}^n \to \mathbb{R}^d\) satisfies \(\frac{\Delta_2^2}{2\sigma^2}\)-concentrated differential privacy, where \(\Delta_2 = \sup_{x,x’\in\mathcal{X}^n : d(x,x’)\le1} \|q(x)-q(x’)\|_2\) is the 2-norm sensitivity of \(q:\mathcal{X}^n \to \mathbb{R}^d\). Furthermore, the privacy loss of the Gaussian mechanism is itself a Gaussian and it makes the inequality defining concentrated differential privacy (Equation 3) an equality for all \(\lambda\) 

  6. Note that the expectation of the privacy loss is simply the KL divergence: \(\mathbb{E}[Z] = \mathrm{D}_1( M(x) \| M(x’) )\). 

  7. We have presented selection here in terms of minimization, but most of the literature is in terms of maximization. 

by Thomas Steinke at July 12, 2021 05:00 PM UTC

We present a Karchmer-Wigderson game to study the complexity of hazard-free formulas. This new game is both a generalization of the monotone Karchmer-Wigderson game and an analog of the classical Boolean Karchmer-Wigderson game. Therefore, it acts as a bridge between the existing monotone and general games. Using this game, we prove hazard-free formula size and depth lower bounds that are provably stronger than those possible by the standard technique of transferring results from monotone complexity in a black-box fashion. For the multiplexer function we give (1) a hazard-free formula of optimal size and (2) an improved low-depth hazard-free formula of almost optimal size and (3) a hazard-free formula with alternation depth $2$ that has optimal depth. We then use our optimal constructions to obtain an improved universal worst-case hazard-free formula size upper bound. We see our results as a significant step towards establishing hazard-free computation as an independent missing link between Boolean complexity and monotone complexity.

at July 12, 2021 07:35 AM UTC

 Recall from my last post (here)

I offer you the following bet: 

I will flip a coin.

If  HEADS you get 1 dollar and we end there.

If TAILS I flip again

If  HEADS you get 2 dollars and we end there.

If  TAILS I flip again

If HEADS you get 4 dollars and we end there.

If TAILS I flip again

The expected value is infinity.

Would you pay $1000 to play this game?

Everyone who responded said NO. Most gave reasons similar to what I have below. 

This is called The St Petersburg Paradox. Not sure it's a paradox, but it is odd. The concrete question of would you pay $1000 to play might be a paradox since most people would say NO even though the expected value is infinity.  See here for more background.

Shapley (see here) gives a good reason why you would not  pay $1000 to play the game, and also how much you should pay to play the game (spoiler alert: not much). I will summarize his argument and then add to it. 

1) Shapley's argument: Lets say the game goes for 40 rounds. Then you are owed 2^{40} dollars. 

The amount of money in the world is, according to this article around 1.2 quadrillion dollars  which is roughly 2^{40} dollars. 

So the expected value calculation has to be capped at (say) 40 rounds. This means you expect to get 20 dollars! So pay 19 to play. 

2) My angle which is very similar: at what point is more money not going to change your life at all? For me it is way less than 2^{40} dollars. Hence I would not pay 1000. Or even 20. 

Exercise: If you think the game will go at most R rounds and you only wand D dollars, how much should you pay to play? You can also juggle more parameters - the bias of the coin, how much they pay out when you win. 

Does Shapley's discussions  resolve the paradox? It depends on what you consider paradoxical. If the paradox is that people would NOT pay 1000 even though the expected value is infinity, then Shapley  resolves the paradox  by contrasting the real world to the math world. 

by gasarch ( at July 12, 2021 12:53 AM UTC

High-dimensional expanders generalize the notion of expander graphs to higher-dimensional simplicial complexes. In contrast to expander graphs, only a handful of high-dimensional expander constructions have been proposed, and no elementary combinatorial construction with near-optimal expansion is known. In this paper, we introduce an improved combinatorial high-dimensional expander construction, by modifying a previous construction of Liu, Mohanty, and Yang (ITCS 2020), which is based on a high-dimensional variant of a tensor product. Our construction achieves a spectral gap of $\Omega(\frac{1}{k^2})$ for random walks on the $k$-dimensional faces, which is only quadratically worse than the optimal bound of $\Theta(\frac{1}{k})$. Previous combinatorial constructions, including that of Liu, Mohanty, and Yang, only achieved a spectral gap that is exponentially small in $k$. We also present reasoning that suggests our construction is optimal among similar product-based constructions.

at July 11, 2021 07:39 AM UTC

Nisan and Szegedy (CC 1994) showed that any Boolean function $f:\{0,1\}^n\to\{0,1\}$ that depends on all its input variables, when represented as a real-valued multivariate polynomial $P(x_1,\ldots,x_n)$, has degree at least $\log n - O(\log \log n)$. This was improved to a tight $(\log n - O(1))$ bound by Chiarelli, Hatami and Saks (Combinatorica 2020). Similar statements are also known for other Boolean function complexity measures such as Sensitivity (Simon (FCT 1983)), Quantum query complexity, and Approximate degree (Ambainis and de Wolf (CC 2014)). In this paper, we address this question for \emph{Probabilistic degree}. The function $f$ has probabilistic degree at most $d$ if there is a random real-valued polynomial of degree at most $d$ that agrees with $f$ at each input with high probability. Our understanding of this complexity measure is significantly weaker than those above: for instance, we do not even know the probabilistic degree of the OR function, the best-known bounds put it between $(\log n)^{1/2-o(1)}$ and $O(\log n)$ (Beigel, Reingold, Spielman (STOC 1991); Tarui (TCS 1993); Harsha, Srinivasan (RSA 2019)). Here we can give a near-optimal understanding of the probabilistic degree of $n$-variate functions $f$, \emph{modulo} our lack of understanding of the probabilistic degree of OR. We show that if the probabilistic degree of OR is $(\log n)^c$, then the minimum possible probabilistic degree of such an $f$ is at least $(\log n)^{c/(c+1)-o(1)}$, and we show this is tight up to $(\log n)^{o(1)}$ factors.

at July 11, 2021 04:50 AM UTC

Piecewise-circular curves or, if you like, arc-polygons are a very old topic in mathematics. Archimedes and Pappus studied the arbelos, a curved triangle formed from three semicircles, and Hippocrates of Chios found that the lune of Hippocrates, a two-sided figure bounded by a semicircle and a quarter-circle, has the same area as an isosceles right triangle stretched between the same two points. The history of the Reuleaux triangle, bounded by three sixths of circles, stretches back well past Reuleaux to the shapes of of Gothic church windows and its use by Leonardo da Vinci for fortress floor plans and world map projections. But despite their long history and frequent use (for instance in the design of machined parts), there are some basic properties of arc-polygons that seem to have been unexplored so far.

I looked at one of these properties, the ability of arc-triangles to tile the plane, in an earlier post. Another of these properties involves the feasible combinations of angles of these shapes. As is well known, in a straight-sided triangle in the plane, the three interior angles always sum to exactly \(\pi\), and any three positive angles summing to \(\pi\) are possible. Let \(T\) be the set of triples of angles \((\theta_1,\theta_2,\theta_3)\) from triangles, and reinterpret these triples as coordinates of points in Euclidean space. Then \(T\) is itself an equilateral triangle, with corners at the three points \((\pi,0,0)\), \((0,\pi,0)\), and \((0,0\pi)\). (More precisely, it’s the relative interior of this triangle.)

What about arc-triangles? Are their angles similarly constrained? What shape do their triples of angles make? First of all, their angles don’t have a fixed sum (except for the tilers, for which this sum is again \(\pi\)). The arbelos has three interior angles that are all zero, summing to zero. The Reuleux triangle has three angles of \(2\pi/3\), summing to \(2\pi\). Boscovitch’s cardioid, below, uses three semicircles like the arbelos, but with one interior angle of \(2\pi\) and two others equal to \(\pi\), summing to \(4\pi\). The trefoil, a common architectural motif, bulges outward from its three corners, forming interior angles that are much larger, up to \(2\pi\) each for a trefoil made from three \(5/6\)-circle arcs, for a total interior angle of \(6\pi\).

Boscovich's cardioid

Nevertheless, for a non-self-crossing arc-triangle, not all combinations of angles are possible. For instance, it’s not possible to have one angle that is zero and two that are \(2\pi\). My new preprint, “Angles of arc-polygons and lombardi drawings of cacti” (arXiv:2107.03615 , with UCI students Daniel Frishberg and Martha Osegueda, to appear at CCCG) proves a precise characterization: beyond the obvious requirement that each angle \(\theta_i\) be in the range \(0\le\theta_i\le 2\pi\), we have only the additional inequalities

\[-\pi < \frac{\pi - \theta_i + \theta_{i+1} - \theta_{i+2}}{2} < \pi\]

where the index arithmetic is done modulo three. The formula in the middle of each of these inequalities is itself an angle, the angle of incidence between one of the circular arcs of the arc-triangle and the circle through its three corners. Where the straight triangles had an equilateral-triangle feasible region, the arc-triangles have a more complicated shape. The obvious constraints \(0\le\theta_i\le 2\pi\) would produce a cubical feasible region \([0,2\pi]^3\), but the additional inequalities above cut off six corners of the cube, leaving a feasible region looking like this:

The feasible region for triples of angles of arc-triangles

The motivating application for all of this is graph drawing, and more specifically Lombardi drawing, in which edges are circular arcs meeting at equal angles at each vertex. Using our new understanding of arc-polygons, we prove that every cactus graph has a planar Lombardi drawing for its natural embedding (the one in which each cycle of the cactus forms a face) but might not for some other embeddings, including the one below.

An embedded cactus that has no planar Lombardi drawing

But beyond graph drawing, I think that the long history and many applications of arc-polygons justifies more study of their general properties. For instance, what about arc-polygons with more than three sides? What can their angles be? Our paper has a partial answer, enough to answer the questions we asked in our Lombardi drawing application, but a complete characterization for arc-polygons of more than three sides is still open.

(Discuss on Mastodon)

by David Eppstein at July 10, 2021 11:06 AM UTC