Theory of Computing Blog Aggregator

Authors: Ioannis Z. Emiris, Christina Katsamaki
Download: PDF
Abstract: Voronoi diagrams are a fundamental geometric data structure for obtaining proximity relations. We consider collections of axis-aligned orthogonal polyhedra in two- and three-dimensional space under the max-norm, a particularly useful scenario in certain application domains. We construct the exact Voronoi diagram inside an orthogonal polyhedron with holes defined by such polyhedra. Our approach avoids creating full-dimensional elements on the Voronoi diagram and yields a skeletal representation of the input object. We introduce a complete algorithm in 2D and 3D that follows the subdivision paradigm and relies on a bounding-volume hierarchy; this is an original approach to the problem. The complexity is adaptive and comparable to that of previous methods: linear in the number of sites, i.e., edges in 2D or facets in 3D. We also provide a numerically stable, open-source implementation in Julia, illustrating the practical nature of our algorithm.

at May 22, 2019 01:28 AM UTC

Authors: Simon Bruggmann, Rico Zenklusen
Download: PDF
Abstract: Relaxation and rounding approaches have become a standard and extremely versatile tool for constrained submodular function maximization. One of the most common rounding techniques in this context is the use of contention resolution schemes. Such schemes round a fractional point by first rounding each coordinate independently, and then dropping some elements to reach a feasible set. The second step, in which elements are dropped, is typically also randomized. This leads to an additional source of randomization within the procedure, which can complicate the analysis. We suggest a different, polyhedral viewpoint for designing contention resolution schemes, which avoids dealing explicitly with the randomization in the second step. This is achieved by focusing on the marginals of a dropping procedure. Apart from avoiding one source of randomization, our viewpoint allows for employing polyhedral techniques. Both can significantly simplify the construction and analysis of contention resolution schemes. We show how, through our framework, one can obtain an optimal monotone contention resolution scheme for bipartite matchings. So far, only very few results are known about the optimality of monotone contention resolution schemes. Our contention resolution scheme for the bipartite case also improves the lower bound on the correlation gap for bipartite matchings. Furthermore, we derive a monotone contention resolution scheme for matchings that significantly improves over the previously best one. At the same time, our scheme implies that the currently best lower bound on the correlation gap for matchings is not tight. Our results lead to improved approximation factors for various constrained submodular function maximization problems over a combination of matching constraints with further constraints.

at May 22, 2019 01:24 AM UTC

Authors: Valentin Bakoev
Download: PDF
Abstract: Here we consider an approach for fast computation of the algebraic degree of Boolean functions. It combines fast computation of the ANF (known as the ANF transform) with a subsequent computation of the algebraic degree using the weight-lexicographic order (WLO) of the vectors of the $n$-dimensional Boolean cube. Byte-wise and bitwise versions of a search based on the WLO and their implementations are discussed. They are compared with the usual exhaustive search applied in computing the algebraic degree. For Boolean functions of $n$ variables, the bitwise implementation of the search by WLO has total time complexity $O(n\cdot 2^n)$. When such a function is given by its truth table vector and its algebraic degree is computed by the bitwise versions of the algorithms discussed, the total time complexity is $\Theta((9n-2)\cdot 2^{n-7})=\Theta(n\cdot 2^n)$. All algorithms discussed have time complexities of the same type, but with big differences in the constants hidden in the $\Theta$-notation. The experimental results after numerous tests confirm the theoretical results: the running times of the bitwise implementation are dozens of times better than those of the byte-wise algorithms.
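For concreteness, here is a minimal Python sketch of the ANF transform the abstract refers to (my own illustration, not the authors' code; the degree is found here by a plain scan over the ANF coefficients rather than the WLO-based search the paper develops):

```python
def anf_transform(truth_table):
    # Butterfly (Mobius) transform over GF(2): truth table -> ANF coefficients.
    # truth_table is a list of 0/1 values of length 2^n.
    f = list(truth_table)
    step = 1
    while step < len(f):
        for block in range(0, len(f), 2 * step):
            for j in range(block, block + step):
                f[j + step] ^= f[j]  # XOR the lower half into the upper half
        step *= 2
    return f

def algebraic_degree(truth_table):
    # Degree = max Hamming weight of an index with a nonzero ANF coefficient.
    anf = anf_transform(truth_table)
    return max((bin(i).count("1") for i, c in enumerate(anf) if c), default=-1)
```

For example, the AND of two variables (truth table [0, 0, 0, 1]) has degree 2, while XOR (truth table [0, 1, 1, 0]) has degree 1.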

at May 22, 2019 01:24 AM UTC

Authors: Herman Haverkort, David Kübel, Elmar Langetepe
Download: PDF
Abstract: Various applications of graphs, in particular applications related to finding shortest paths, naturally get inputs with real weights on the edges. However, for algorithmic or visualization reasons, inputs with integer weights would often be preferable or even required. This raises the following question: given an undirected graph with non-negative real weights on the edges and an error threshold $\varepsilon$, how efficiently can we decide whether we can round all weights such that shortest paths are maintained, and the change of weight of each shortest path is less than $\varepsilon$? So far, a polynomial-time algorithm was known only for path-shaped graphs. In this paper we prove, by reduction from 3-SAT, that, in general, the problem is NP-hard. However, if the graph is a tree with $n$ vertices, the problem can be solved in $O(n^2)$ time.

at May 22, 2019 12:00 AM UTC

Authors: Marin Bougeret, Klaus Jansen, Michael Poss, Lars Rohwedder
Download: PDF
Abstract: We study approximation algorithms for the problem of minimizing the makespan on a set of machines with uncertainty in the processing times of jobs. In the model we consider, which goes back to~\cite{BertsimasS03}, once the schedule is defined, an adversary can pick a scenario where deviation is added to some of the jobs' processing times. Given only the maximal cardinality of these jobs and the magnitude of the potential deviation for each job, the goal is to optimize the worst-case scenario. We consider both the case of identical machines and that of unrelated machines. Our main result is an EPTAS for the case of identical machines. We also provide a $3$-approximation algorithm and an inapproximability ratio of $2-\epsilon$ for the case of unrelated machines.

at May 22, 2019 01:21 AM UTC

Authors: Lélia Blin, Swan Dubois, Laurent Feuilloley
Download: PDF
Abstract: In network distributed computing, minimum spanning tree (MST) is one of the key problems, and silent self-stabilization one of the most demanding fault-tolerance properties. For this problem and this model, a polynomial-time algorithm with $O(\log^2\!n)$ memory is known for the state model. This is memory optimal for weights in the classic $[1,\text{poly}(n)]$ range (where $n$ is the size of the network). In this paper, we go below this $O(\log^2\!n)$ memory, using approximation and parametrized complexity.

More specifically, our contributions are two-fold. We introduce a second parameter~$s$, which is the space needed to encode a weight, and we design a silent polynomial-time self-stabilizing algorithm, with space $O(\log n \cdot s)$. In turn, this allows us to get an approximation algorithm for the problem, with a trade-off between the approximation ratio of the solution and the space used. For polynomial weights, this trade-off goes smoothly from memory $O(\log n)$ for an $n$-approximation, to memory $O(\log^2\!n)$ for exact solutions, with for example memory $O(\log n\log\log n)$ for a 2-approximation.

at May 22, 2019 01:25 AM UTC

Authors: Lélia Blin, Laurent Feuilloley, Gabriel Le Bouder
Download: PDF
Abstract: In the context of self-stabilization, a \emph{silent} algorithm guarantees that the register of every node does not change once the algorithm has stabilized. At the end of the 90's, Dolev et al. [Acta Inf. '99] showed that, for finding the centers of a graph, for electing a leader, or for constructing a spanning tree, every silent algorithm must use a memory of $\Omega(\log n)$ bits per register in $n$-node networks. Similarly, Korman et al. [Dist. Comp. '07] proved, using the notion of proof-labeling schemes, that, for constructing a minimum-weight spanning tree (MST), every silent algorithm must use a memory of $\Omega(\log^2n)$ bits per register. It follows that requiring the algorithm to be silent has a cost in terms of memory space, while, in the context of self-stabilization, where every node constantly checks the states of its neighbors, the silence property can be of limited practical interest. In fact, it is known that relaxing this requirement results in algorithms with smaller space-complexity.

In this paper, we are aiming at measuring how much gain in terms of memory can be expected by using arbitrary self-stabilizing algorithms, not necessarily silent. To our knowledge, the only known lower bound on the memory requirement for general algorithms, also established at the end of the 90's, is due to Beauquier et al.~[PODC '99] who proved that registers of constant size are not sufficient for leader election algorithms. We improve this result by establishing a tight lower bound of $\Theta(\log \Delta+\log \log n)$ bits per register for self-stabilizing algorithms solving $(\Delta+1)$-coloring or constructing a spanning tree in networks of maximum degree~$\Delta$. The lower bound $\Omega(\log \log n)$ bits per register also holds for leader election.

at May 22, 2019 01:26 AM UTC

Authors: Moses Charikar, Kirankumar Shiragur, Aaron Sidford
Download: PDF
Abstract: Estimating symmetric properties of a distribution, e.g. support size, coverage, entropy, distance to uniformity, is among the most fundamental problems in algorithmic statistics. While each of these properties has been studied extensively and a separate optimal estimator is known for each, in striking recent work, Acharya et al. 2016 showed that there is a single estimator that is competitive for all symmetric properties. This work proved that computing the distribution that approximately maximizes the \emph{profile likelihood (PML)}, i.e. the probability of the observed frequency of frequencies, and returning the value of the property on this distribution is sample-competitive with respect to a broad class of estimators of symmetric properties. Further, they showed that even computing an approximation of the PML suffices to achieve such a universal plug-in estimator. Unfortunately, prior to this work there was no known polynomial-time algorithm to compute an approximate PML, and it was open whether a polynomial-time universal plug-in estimator could be obtained through the use of approximate PML. In this paper we provide an algorithm that, given $n$ samples from a distribution, computes an approximate PML distribution up to a multiplicative error of $\exp(n^{2/3} \mathrm{poly} \log(n))$ in time nearly linear in $n$. Generalizing work of Acharya et al. 2016 on the utility of approximate PML, we show that our algorithm provides a nearly linear time universal plug-in estimator for all symmetric functions up to accuracy $\epsilon = \Omega(n^{-0.166})$. Further, we show how to extend our work to provide efficient polynomial-time algorithms for computing a $d$-dimensional generalization of PML (for constant $d$) that allows for universal plug-in estimation of symmetric relationships between distributions.

at May 22, 2019 01:26 AM UTC

Authors: Jayadev Acharya, Clément L. Canonne, Himanshu Tyagi
Download: PDF
Abstract: A central server needs to perform statistical inference based on samples that are distributed over multiple users who can each send a message of limited length to the center. We study problems of distribution learning and identity testing in this distributed inference setting and examine the role of shared randomness as a resource. We propose a general-purpose simulate-and-infer strategy that uses only private-coin communication protocols and is sample-optimal for distribution learning. This general strategy turns out to be sample-optimal even for distribution testing among private-coin protocols. Interestingly, we propose a public-coin protocol that outperforms simulate-and-infer for distribution testing and is, in fact, sample-optimal. Underlying our public-coin protocol is a random hash that when applied to the samples minimally contracts the chi-squared distance of their distribution to the uniform distribution.

at May 22, 2019 01:25 AM UTC

Today I participated in the successful dissertation defense of Bálint Tillman, a student of Athina Markopoulou in the UCI Graduate Program in Networked Systems.

Bálint has been investigating problems connected with the Erdős–Gallai theorem, which states that it is possible to test whether a sequence of numbers is the degree distribution of a graph (the sequence of numbers of vertices of each possible degree) and if so, find a graph with that degree distribution, in polynomial time. The degree distribution can be extended to a matrix called the joint degree distribution, which specifies the number of pairs of adjacent vertices with each combination of degrees, and to higher-order tensors specifying the degrees of subgraphs with more than two vertices.
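As a concrete illustration of the object involved (my own sketch in Python, not code from the dissertation), the joint degree matrix counts, for each pair of degrees, the edges whose endpoints have those degrees:

```python
from collections import Counter

def joint_degree_matrix(edges):
    # Degree of each vertex in the (simple, undirected) graph.
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # For each unordered pair of degrees (j, k), count the edges joining
    # a degree-j vertex to a degree-k vertex.
    jdm = Counter()
    for u, v in edges:
        j, k = sorted((deg[u], deg[v]))
        jdm[(j, k)] += 1
    return dict(jdm)
```

For instance, a path on three vertices has joint degree matrix {(1, 2): 2}, and a triangle has {(2, 2): 3}.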

In his INFOCOM 2015 paper “Construction of simple graphs with a target joint degree matrix and beyond”, Bálint showed that one can recognize the joint degree distributions of simple graphs, and reconstruct a graph with that distribution, in polynomial time. The algorithm works equally well when the vertices are distinguished in other ways than by degree and the input matrix specifies the target number of edges with each pair of degrees between each class of vertices. Later, in KDD 2017, he extended these results to directed graphs and at NetSci 2018 he showed how to find a realization with as few connected components as possible.

On the other hand, if one adds only a little bit of extra information to the joint degree distribution, such as the total number of triangles in the graph, it becomes NP-complete to recognize whether the input describes a valid graph and NP-hard to reconstruct a graph that realizes a given description. This comes from a poster by Bálint with Will Devanny and me at NetSci 2016, where we found the reduction from graph 3-coloring depicted below.

NP-completeness reduction from 3-coloring to realizability of joint degree matrices with numbers of triangles

To perform an NP-hardness reduction, one should start with a graph for which it’s hard to test 3-colorability, and translate it into an instance of whatever other problem you want to prove hard. But instead let’s pretend for now that we start with a little more information: a graph that’s known to be 3-colorable, and a specific 3-coloring of it. To turn this into a hard problem for realizability of joint distribution plus number of triangles, we add a triangle to the graph, representing the three colors, and connect each original graph vertex to the new triangle vertex for its color. Then we add enough hair (in the form of degree-one vertices) to the augmented graph to make all vertices have distinct degrees, except within the triangle of new vertices where all three degrees should be the same. Now take as the result of the reduction the pair of the joint degree distribution and number of triangles in the augmented graph.

But now the trick is that this pair can be computed directly from your starting graph, without knowing its coloring or even whether it is colorable. Because the degrees are distinct, and the joint degree distribution tells you the number of edges for each combination of degrees, any realization of the pair must contain a copy of your starting graph augmented by a triangle. The graph and the triangle might be connected to each other differently than they were before, but their connection pattern must still correspond to a valid 3-coloring (possibly a different one), because otherwise you would form some extra triangles in the graph (one for each invalidly-colored edge) and not correctly match the target number of triangles. So the pair is realizable if and only if your starting graph has a 3-coloring. This reduction proves that testing realizability of such pairs is NP-complete.

There’s even more material along these lines in Bálint’s dissertation, but some of it is not yet published. I think the plan is to get all of that submitted over the summer before Bálint starts a new position at Google. Congratulations, Bálint!

(Discuss on Mastodon)

by David Eppstein at May 21, 2019 06:20 PM UTC

It’s harder to make up tests than to take them

[ Recent photo ]

Ken Regan has been busy these last few days working on making a final exam, giving the exam, and now grading the exam.

Today Ken and I want to talk about tests.

I also have a test for you. You can jump right to our test of knowledge. Do not, please, use any search tools, especially Google.

Test Theory

Ken recently made up a final exam. We both have had to make countless tests over the years. I was never trained in how to make a good test. Nor how to make a test at all. I am still puzzled about how to do it.

Avi Wigderson once told me that Michael Rabin only asked questions on his exams that he had stated already in lectures. Is there a theory of what makes a proper test? I do not know any.

Suppose that before the exam we lectured and the students learned {T} and {F}: Here {T} are true statements and {F} are false statements. A rote type question might be:

Is the statement {S} true or false?

Here {S} would be in either {T} or {F}. This type of question is purely a memory problem.

A more difficult test would have questions like:

Is {S} true or false?

Here {S} would be equivalent to some {S'} that is in {T} or in {F}. The equivalence between {S} and {S'} would require only the application of a few simple logical rules. This is much harder for students. In the limit we could have {S} and {S'} far apart; it could even be an open problem whether {S} and {S'} are equivalent.

Our Test

No looking at Google please.

Question 1: We all know that Dick Karp created the P=NP question. What is Dick’s middle name?

  1. Mark

  2. Manning

  3. Mathew

  4. Richard

Question 2: This year is the fifty-first anniversary of STOC. Where was the first one held?

  1. Marina del Rey, CA

  2. Massapequa, NY

  3. Boston, MA

  4. Chicago, IL

Question 3: Which of these did not happen in 1969?

  1. The first automatic teller machine in the United States is installed.

  2. The $500 bills are officially removed from circulation.

  3. The first The Limited store opens, in San Francisco.

  4. The New York Mets win the World Series.

Question 4: The first STOC conference program committee included:

  1. No women.

  2. A person named Mike.

  3. A person named Pat.

  4. All the above.

Question 5: How do you tell if you are a “theoretical computer scientist”?

  1. You wear flip-flops in the winter.

  2. You regularly attend STOC.

  3. You wear glasses.

  4. You cannot program a computer.

Question 6: “Cooking” a chess problem means:

  1. Showing it is in a family of NP-complete problems on {n \times n} boards.

  2. Showing it has two or more solutions (or no solutions).

  3. Showing it cannot be solved by Steve Cook.

  4. Showing that it cannot be solved by the best-move strategy.

Question 7: The other theory conference is called FOCS. Which of these is true about this conference:

  1. The name was selected by a person named Edward.

  2. It has never had parallel sessions.

  3. It was originally called Symposium on Switching Circuit Theory and Logical Design.

  4. The artwork for the proceedings cover is by an artist named Smith, who never published in the conference.

Question 8: What do the STOC conferences have in common with last night’s final episode of Game of Thrones?

  1. Both had flying horses and whistling pigs.

  2. No dragons were harmed during either.

  3. Both have left many big questions unanswered.

  4. Both are explained by the “Prisoner’s Dilemma” game solution.

Question 9: STOC has been held on each of these islands except:

  1. Long Island, NY.

  2. Puerto Rico.

  3. Crete.

  4. Vancouver Island.

Question 10: What term appears in the titles of three award-winning STOC/FOCS papers since 2016?

  1. Quantum.

  2. Quadratic/subquadratic.

  3. Quadtree.

  4. Quasi/quasipolynomial.

Open Problems

Answers: Note that 1a means question 1, answer 1, and so on. This is a WordPress issue.

by rjlipton at May 21, 2019 12:48 PM UTC

In this post we return to the generic form of the FTRL online optimization algorithm. If the cost functions are linear, as they will be in all the applications that I plan to talk about, the algorithm is:

\displaystyle   x_t := \arg\min_{x\in K} \ R(x) + \sum_{k=1}^{t-1} \langle \ell_k, x \rangle \ \ \ \ \ (1)

where {K\subseteq {\mathbb R}^n} is the convex set of feasible solutions that the algorithm is allowed to produce, {x \rightarrow \langle \ell_k , x \rangle} is the linear loss function at time {k}, and {R: K \rightarrow {\mathbb R}} is the strictly convex regularizer.

If we have an unconstrained problem, that is, if {K= {\mathbb R}^n}, then the optimization problem (1) has a unique solution: the {x_t} such that

\displaystyle  \nabla R(x_t) = - \sum_{k=1}^{t-1} \ell_k

and we can usually both compute {x_t} efficiently in an algorithm and reason about {x_t} effectively in an analysis.

Unfortunately, we are almost always interested in constrained settings, and then it becomes difficult both to compute {x_t} and to reason about it.

A very nice special case happens when the regularizer {R} acts as a barrier function for {K}, that is, the (norm of the) gradient of {R} goes to infinity when one approaches the boundary of {K}. In such a case, it is impossible for the minimum of (1) to occur at the boundary and the solution will be again the unique {x_t} in {K} such that

\displaystyle  \nabla R(x_t) = - \sum_{k=1}^{t-1} \ell_k

We swept this point under the rug when we studied FTRL with negative-entropy regularizer in the settings of experts, in which {K = \Delta} is the set of probability distributions. When we proceeded to solve (1) using Lagrange multipliers, we ignored the non-negativity constraints. The reason why it was ok to do so was that the negative-entropy is a barrier function for the non-negative orthant {({\mathbb R}_{\geq 0})^n}.

Another important special case occurs when the regularizer {R(x) = c || x||^2} is a multiple of length-squared. In this case, we saw that we could “decouple” the optimization problem by first solving the unconstrained optimization problem, and then projecting the solution of the unconstrained problem to {K}:

\displaystyle y_{t} = \arg\min_{y\in {\mathbb R}^n} \ c || y||^2 + \sum_{k=1}^{t-1} \langle \ell_k, y \rangle

\displaystyle  x_t = \Pi_K (y_t) = \arg\min _{x\in K} || x - y_t ||

Then we have the closed-form solution {y_t = - \frac 1{2c} \sum_{k=1}^{t-1} \ell _k} and, depending on the set {K}, the projection might also have a nice closed-form, as in the case {K= [0,1]^n} that comes up in results related to regularity lemmas.
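As a concrete sketch of this special case (my own Python, not code from the post), FTRL with regularizer {c||x||^2} over {K = [0,1]^n} amounts to accumulating the negated losses and clipping coordinate-wise:

```python
import numpy as np

def ftrl_l2_box(losses, c=1.0):
    # FTRL with R(x) = c * ||x||^2 over K = [0,1]^n, via the decoupling:
    #   y_t = -(1/(2c)) * (sum of past losses)   (unconstrained optimum)
    #   x_t = Euclidean projection of y_t onto the box = coordinate-wise clip.
    plays, acc = [], np.zeros(losses.shape[1])
    for ell in losses:
        y = -acc / (2 * c)
        plays.append(np.clip(y, 0.0, 1.0))
        acc += ell
    return np.array(plays)
```

With losses [[2, -1], [-4, 1]] and c = 1, the algorithm plays (0, 0) and then (0, 0.5): the first coordinate of the unconstrained optimum is negative and gets clipped at the boundary of the box.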

As we will see today, this approach of solving the unconstrained problem and then projecting on {K} works for every regularizer, for an appropriate notion of projection called the Bregman projection (the projection will depend on the regularizer).

To define the Bregman projection, we will first define the Bregman divergence with respect to the regularizer {R}, which is a non-negative “distance” {D(x,y)} defined on {{\mathbb R}^n} (or possibly a subset of {{\mathbb R}^n} for which the regularizer {R} is a barrier function). Then, the Bregman projection of {y} on {K} is defined as {\arg\min_{x\in K} \ D(x,y)}.

Unfortunately, it is not so easy to reason about Bregman projections either, but the notion of Bregman divergence offers a way to reinterpret the FTRL algorithm from another point of view, called mirror descent. Via this reinterpretation, we will prove the regret bound

\displaystyle  {\rm Regret}_T(x) \leq D(x,x_1) + \sum_{t=1}^T D(x_t,y_{t+1})

which carries the intuition that the regret comes from a combination of the “distance” of our initial solution from the offline optimum and of the “stability” of the algorithm, that is, the “distance” between consecutive solutions. Nicely, the above bound measures both quantities using the same “distance” function.

1. Bregman Divergence and Bregman Projection

For a strictly convex function {R: {\mathbb R}^n \rightarrow {\mathbb R}}, we define the Bregman divergence associated to {R} as

\displaystyle  D_R(x,y) := R(x) - R(y) - \langle \nabla R(y), x-y \rangle

that is, the difference between the value of {R} at {x} and the value of the linear approximation of {R} at {x} (centered at {y}). By the strict convexity of {R} we have {D_R(x,y) \geq 0} and {D_R(x,y) = 0} iff {x=y}. These properties suggest that we may think of {D_R(x,y)} as a kind of “distance” between {x} and {y}, which is a useful intuition although it is important to keep in mind that the divergence need not be symmetric and need not satisfy the triangle inequality.
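In code the definition is one line; the following sketch (my own, with an arbitrarily chosen strictly convex {R}) checks these properties numerically, including the lack of symmetry:

```python
import numpy as np

def bregman(R, gradR, x, y):
    # D_R(x, y) = R(x) - R(y) - <grad R(y), x - y>
    return R(x) - R(y) - np.dot(gradR(y), x - y)

# A strictly convex R chosen for illustration: R(v) = sum(exp(v)).
R, gradR = (lambda v: np.sum(np.exp(v))), (lambda v: np.exp(v))
x, y = np.array([0.0, 1.0]), np.array([1.0, -1.0])
# bregman(R, gradR, x, y) > 0, bregman(R, gradR, x, x) = 0,
# and in general bregman(R, gradR, x, y) != bregman(R, gradR, y, x).
```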

Now we show that, assuming that {R(\cdot)} is well defined and strictly convex on all {{\mathbb R}^n}, and that the losses are linear, the constrained optimization problem (1) can be solved by first solving the unconstrained problem and then “projecting” the solution on {K} by finding the point in {K} of smallest Bregman divergence from the unconstrained optimum:

\displaystyle  y_t = \arg\min _{y\in {\mathbb R}^n} \ R(y) + \sum_{k=1}^{t-1} \langle \ell_k , y \rangle

\displaystyle  x_t = \arg\min_{x \in K} \ D_R(x,y_t)

The proof is very simple. The optimum of the unconstrained optimization problem is the unique {y_t} such that

\displaystyle  \nabla R(y_t) + \sum_{k=1}^{t-1} \ell_k = 0

that is, the unique {y_t} such that

\displaystyle  \nabla R(y_t) = - \sum_{k=1}^{t-1} \ell_k

On the other hand, {x_t} is defined as

\displaystyle  x_t = \arg\min_{x \in K} \ D_R(x,y_t) = \arg\min_{x \in K} R(x) - R(y_t) - \langle \nabla R(y_t) , x - y_t \rangle

that is,

\displaystyle  x_t = \arg\min_{x \in K} R(x) - R(y_t) + \langle \sum_{k=1}^{t-1} \ell_k , x - y_t \rangle = \arg\min_{x \in K} R(x) + \langle \sum_{k=1}^{t-1} \ell_k , x \rangle

where the second equality above follows from the fact that two functions that differ by a constant have the same optimal solutions.

Indeed we see that the above “decoupled” characterization of the FTRL algorithm would have worked for any definition of a function of the form

\displaystyle  D(x,y) = R(x) - \langle \nabla R(y), x \rangle + (\mbox{stuff that depends only on } y)

and that our particular choice of what “stuff dependent only on {y}” to add makes {D(x,x) = 0} which is reasonable for something that we want to think of as a “distance function.”

Note that, in all of the above, we can replace {{\mathbb R}^n} with a convex set {K \subseteq S \subseteq {\mathbb R}^n} provided that {R} is a barrier function for {S}. In that case

\displaystyle  y_t = \arg\min_{y\in S} \ R(y) + \sum_k \langle \ell_k , y \rangle

is the unique {y_t} such that

\displaystyle  \nabla R(y_t) = - \sum _k \ell_k

and everything else follows analogously.

2. Examples

2.1. Bregman Divergence of Length-Squared

If {R(x) = ||x ||^2}, then

\displaystyle  D(x,y) = ||x||^2 - ||y||^2 - \langle 2 y , x- y \rangle = ||x - y||^2 \ ,

so Bregman divergence is distance-squared, and Bregman projection is just (Euclidean) projection.
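A quick numerical check of this identity (my own sketch):

```python
import numpy as np

R = lambda v: np.dot(v, v)      # R(x) = ||x||^2
gradR = lambda v: 2 * v

def D(x, y):                    # Bregman divergence of R
    return R(x) - R(y) - np.dot(gradR(y), x - y)

x, y = np.array([3.0, 0.0]), np.array([0.0, 1.0])
# D(x, y) equals ||x - y||^2 = 9 + 1 = 10
```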

2.2. Bregman Divergence of Negative Entropy

If, for {x\in ({\mathbb R}_{\geq 0})^n}, we define

\displaystyle  R(x) = \sum_{i=1}^n x_i \ln x_i

then the associated Bregman divergence is the generalized KL divergence.

\displaystyle  D(x,y) = \sum_{i=1}^n x_i \ln {x_i} \ - \sum_i y_i \ln y_i \ - \langle \nabla R(y), x-y \rangle

where {(\nabla R(y))_i = 1 + \ln y_i} so that

\displaystyle D(x,y) = \sum_{i=1}^n x_i \ln x_i \ - \sum_i y_i \ln y_i \ - \sum_i x_i \ln y_i \ + \sum_i y_i \ln y_i \ - \sum_i x_i + \sum_i y_i

\displaystyle  = \sum_{i=1}^n x_i \ln \frac{x_i} {y_i} \ - \sum_i x_i + \sum_i y_i

Note that, if {x} and {y} are probability distributions, then the final two terms above cancel out, leaving just the KL divergence {\sum_i x_i \ln \frac {x_i}{y_i}}.
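Again a short numerical sanity check (my own sketch) that the Bregman divergence of negative entropy matches the generalized KL divergence, and that the two extra terms vanish on probability distributions:

```python
import numpy as np

R = lambda v: np.sum(v * np.log(v))      # negative entropy
gradR = lambda v: 1.0 + np.log(v)

def bregman(x, y):
    return R(x) - R(y) - np.dot(gradR(y), x - y)

def gen_kl(x, y):                        # generalized KL divergence
    return np.sum(x * np.log(x / y)) - np.sum(x) + np.sum(y)

x, y = np.array([0.2, 0.5]), np.array([0.4, 0.1])   # positive, not normalized
p, q = np.array([0.3, 0.7]), np.array([0.6, 0.4])   # probability distributions
# bregman(x, y) == gen_kl(x, y); for p, q on the simplex the last two
# terms of gen_kl cancel, leaving the ordinary KL divergence.
```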

3. Mirror Descent

We now introduce a new perspective on FTRL.

In the unconstrained setting, if {R} is a strictly convex function and {D} is the associated Bregman divergence, the mirror descent algorithm for online optimization has the update rule

\displaystyle  x_t = \arg\min_{x\in {\mathbb R}^n} D(x,x_{t-1}) + \langle \ell_{t-1}, x \rangle

The idea is that we want to find a solution that is good for the past loss functions, but that does not “overfit” too much. If, in past steps, {x_{t-1}} had been chosen to be such a solution for the loss functions {\ell_1,\ldots,\ell_{t-2}}, then, in choosing {x_t}, we want to balance staying close to {x_{t-1}} but also doing well with respect to {\ell_{t-1}}, hence the above definition.

Theorem 1 Initialized with {x_1 = \arg\min_{x\in {\mathbb R}^n} R(x)}, the unconstrained mirror descent algorithm is identical to FTRL with regularizer {R}.

Proof: We will proceed by induction on {t}. At {t=1}, the definition of {x_1} is the same. For larger {t}, we know that FTRL will choose the unique {x_t} such that {\nabla R(x_t) = - \sum_{k=1}^{t-1} \ell_k}, so we will assume that this is true for the mirror descent algorithm for {x_{t-1}} and prove it for {x_t}.

First, we note that the function {x \rightarrow D(x,x_{t-1}) + \langle \ell_{t-1} , x \rangle} is strictly convex, because it equals

\displaystyle  R(x) - R(x_{t-1}) - \langle \nabla R(x_{t-1}), x - x_{t-1} \rangle + \langle \ell_{t-1} , x \rangle

and so it is a sum of a strictly convex function {R(x)}, linear functions in {x}, and constants independent of {x}. This means that {x_t} is the unique point at which the gradient of the above function is zero, that is,

\displaystyle  \nabla R(x_t) - \nabla R(x_{t-1}) + \ell_{t-1} = 0

and so

\displaystyle  \nabla R(x_t) = \nabla R(x_{t-1}) - \ell_{t-1}

and, using the inductive hypothesis, we have

\displaystyle  \nabla R(x_t) = - \sum_{k=1}^{t-1} \ell_k

as desired. \Box
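The equivalence can be watched numerically. Taking the negative-entropy regularizer on the positive orthant (for which it is a barrier), FTRL has the closed form {x_t = \exp(-1 - \sum_{k<t} \ell_k)} coordinate-wise, while mirror descent multiplies by {e^{-\ell_{t-1}}} at each step. A sketch of my own:

```python
import numpy as np

losses = [np.array([0.3, -0.2]), np.array([1.0, 0.5]), np.array([-0.4, 0.1])]

# FTRL closed form: grad R(x_t) = 1 + ln x_t = -(sum of past losses).
def ftrl(t):
    return np.exp(-1.0 - sum(losses[:t], np.zeros(2)))

# Mirror descent recurrence: grad R(x_t) = grad R(x_{t-1}) - loss_{t-1},
# i.e. x_t = x_{t-1} * exp(-loss_{t-1}).
x = np.exp(-1.0) * np.ones(2)    # x_1 = argmin of the negative entropy
md = [x]
for ell in losses:
    x = x * np.exp(-ell)
    md.append(x)
# md[t] coincides with ftrl(t) for every t.
```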

In the constrained case, there are two variants of mirror descent. Using the terminology from Elad Hazan’s survey, agile mirror descent is the natural generalization of the unconstrained algorithm:

\displaystyle  x_t = \arg\min_{x\in K} \ D(x,x_{t-1}) + \langle \ell_{t-1}, x \rangle

Following the same steps as the proof in the previous section, it is possible to show that agile mirror descent is equivalent to solving, at each iteration, the “decoupled” optimization problems

\displaystyle  y_t = \arg\min_{y\in {\mathbb R}^n} D(y,x_{t-1}) + \langle \ell_{t-1}, y \rangle

\displaystyle  x_t = \arg\min_{x\in K} D(x,y_t)

That is, we can first solve the unconstrained problem and then project on {K}. (Again, we can always replace {{\mathbb R}^n} by a set {S} for which {R} is a barrier function and such that {K \subseteq S}.)

The lazy mirror descent algorithm has the update rule

\displaystyle  y_t = \arg\min_{y\in {\mathbb R}^n} D(y,y_{t-1}) + \langle \ell_{t-1}, y \rangle

\displaystyle  x_t = \arg\min_{x\in K} D(x,y_t)

The initialization is

\displaystyle  y_1 = \arg\min_{y\in {\mathbb R}^n} \ R(y) \ \ \ x_1 = \arg\min_{x\in K} R(x)

Fact 2 Lazy mirror descent is equivalent to FTRL.

Proof: The solutions {y_t} are the unconstrained optimum of FTRL, and {x_t} is the Bregman projection of {y_t} on {K}. We proved in the previous section that this characterizes constrained FTRL. \Box

What about agile mirror descent? In certain special cases it is equivalent to lazy mirror descent, and hence to FTRL, but it usually leads to a different set of solutions.
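A two-step example where the variants part ways (my own sketch, with {R(x) = x^2} on {K = [0,1]}): after losses 2 and -1, lazy mirror descent is still pinned at 0 because the accumulated loss is positive, while agile mirror descent, which restarts each step from the projected point, moves to 0.5.

```python
import numpy as np

def lazy_md(losses):
    # Lazy: project the running unconstrained FTRL optimum y_t = -(1/2)*sum.
    plays, acc = [], 0.0
    for t in range(len(losses) + 1):
        plays.append(float(np.clip(-acc / 2.0, 0.0, 1.0)))
        if t < len(losses):
            acc += losses[t]
    return plays

def agile_md(losses):
    # Agile: gradient step from the previous *projected* point, then project.
    x, plays = 0.0, [0.0]            # x_1 = argmin over [0,1] of x^2 = 0
    for ell in losses:
        x = float(np.clip(x - ell / 2.0, 0.0, 1.0))
        plays.append(x)
    return plays
```

Here lazy_md([2.0, -1.0]) returns [0.0, 0.0, 0.0] while agile_md([2.0, -1.0]) returns [0.0, 0.0, 0.5], so the two sequences of solutions differ.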

We will provide an analysis of lazy mirror descent, but first we will give an analysis of the regret of unconstrained FTRL in terms of Bregman divergence, which will be the model on which we will build the proof for the constrained case.

4. A Regret Bound for FTRL in Terms of Bregman Divergence

In this section we prove the following regret bound.

Theorem 3 Unconstrained FTRL with regularizer {R} satisfies the regret bound

\displaystyle  {\rm Regret}_T(x) \leq D(x,x_1) + \sum_{t=1}^T D(x_t,x_{t+1})

where {D} is the Bregman divergence associated with {R}.

We will take the mirror descent view of unconstrained FTRL, so that

\displaystyle  x_1 = \arg\min_{x\in {\mathbb R}^n} \ R(x)

\displaystyle  x_{t+1} = \arg\min_{x\in {\mathbb R}^n} \ D(x,x_t) + \langle \ell_t , x\rangle

We proved that

\displaystyle  \nabla R(x_{t+1} ) = \nabla R(x_t) - \ell_t

This means that we can rewrite the regret suffered at step {t} with respect to {x} as

\displaystyle  \langle \ell_t , x_t - x \rangle = \langle \nabla R(x_t) - \nabla R(x_{t+1}), x_t - x \rangle

\displaystyle  = D(x,x_t) - D(x,x_{t+1} ) +D(x_t, x_{t+1})

and the theorem follows by adding up the above expression for {t=1,\ldots,T} and recalling that {D(x,x_{T+1}) \geq 0}.

Unfortunately I have no geometric intuition about the above identity, although, as you can check yourself, the algebra works neatly.
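For the record, here is the algebra: expanding each of the three terms via the definition {D(a,b) = R(a) - R(b) - \langle \nabla R(b), a - b \rangle}, all the {R(\cdot)} values cancel, and what remains is

\displaystyle  D(x,x_t) - D(x,x_{t+1} ) + D(x_t, x_{t+1})

\displaystyle  = - \langle \nabla R(x_t), x - x_t \rangle + \langle \nabla R(x_{t+1}), x - x_{t+1} \rangle - \langle \nabla R(x_{t+1}), x_t - x_{t+1} \rangle

\displaystyle  = \langle \nabla R(x_t), x_t - x \rangle + \langle \nabla R(x_{t+1}), x - x_t \rangle = \langle \nabla R(x_t) - \nabla R(x_{t+1}), x_t - x \rangle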

5. A Regret Bound for Agile Mirror Descent

In this section we prove the following generalization of the regret bound from the previous section.

Theorem 4 Agile mirror descent satisfies the regret bound

\displaystyle  {\rm Regret}_T(x) \leq D(x,x_1) + \sum_{t=1}^T D(x_t,y_{t+1})

The first part of the update rule of agile mirror descent is

\displaystyle  y_{t+1} = \arg\min_{y\in {\mathbb R}^n} D(y,x_{t}) + \langle \ell_{t}, y \rangle

and, following steps that we have already carried out before, {y_{t+1}} satisfies

\displaystyle  \nabla R(y_{t+1} ) = \nabla R(x_t) - \ell_t

This means that we can rewrite the regret suffered at step {t} with respect to {x} as

\displaystyle  \langle \ell_t , x_t - x \rangle = \langle \nabla R(x_t) - \nabla R(y_{t+1}), x_t - x \rangle

\displaystyle  = D(x,x_t) - D(x,y_{t+1} ) +D(x_t, y_{t+1})

where the same mystery cancellations as before make the above identity true.

Now I will wield another piece of magic, and I will state without proof the following fact about Bregman projections:

Lemma 5 If {x\in K} and {z} is the Bregman projection on {K} of a point {y}, then

\displaystyle  D(x,y) \geq D(x,z) + D(z,y)

That is, if we think of {D} as a “distance,” the distance from {x} to the projection {z} plus the distance from {z} to {y} is at most the distance from {x} to {y}. Note that this goes in the opposite direction of the triangle inequality (which is fine, because {D} typically does not satisfy the triangle inequality).
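A quick numerical sanity check of the lemma (my own, for the Euclidean special case {R(x) = \frac 12 \|x\|^2}, where {D(x,y) = \frac 12 \|x-y\|^2} and the Bregman projection is the ordinary Euclidean projection; the box {K = [0,1]^2} is an illustrative choice):

```python
def D(a, b):                        # Bregman divergence of R(x) = ||x||^2 / 2
    return 0.5 * sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def proj_box(y):                    # Euclidean projection on K = [0, 1]^2
    return tuple(min(1.0, max(0.0, yi)) for yi in y)

y = (2.0, -0.5)                     # a point outside K
z = proj_box(y)                     # its (Bregman) projection on K
for x in [(0.0, 0.0), (1.0, 1.0), (0.3, 0.7)]:   # points of K
    assert D(x, y) >= D(x, z) + D(z, y) - 1e-12   # Lemma 5
```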

In particular, the above lemma gives us

\displaystyle  D(x,y_{t+1}) \geq D(x,x_{t+1})

and so

\displaystyle  \langle \ell_t , x_t - x \rangle \leq D(x,x_t) - D(x,x_{t+1} ) +D(x_t, y_{t+1})

Now summing over {t} and recalling that {D(x,x_{T+1}) \geq 0} we have our theorem.

by luca at May 21, 2019 01:42 AM UTC

Authors: Mahdi Boroujeni, Sina Dehghani, Soheil Ehsani, MohammadTaghi HajiAghayi, Saeed Seddighin
Download: PDF
Abstract: Despite persistent efforts, there is no known technique for obtaining unconditional super-linear lower bounds for the computational complexity of the problems in P. Vassilevska Williams and Williams introduce a fruitful approach to advance a better understanding of the computational complexity of the problems in P. In particular, they consider All Pairs Shortest Paths (APSP) and other fundamental problems such as checking whether a matrix defines a metric, verifying the correctness of a matrix product, and detecting a negative triangle in a graph. Abboud, Grandoni, and Vassilevska Williams study well-known graph centrality problems such as Radius, Median, etc., and make a connection between their computational complexity and that of two fundamental problems, namely APSP and Diameter. They show any algorithm with subcubic running time for these centrality problems, implies a subcubic algorithm for either APSP or Diameter. In this paper, we define vertex versions for these centrality problems and based on that we introduce new complementary problems. The main open problem of Abboud et al. is whether or not APSP and Diameter are equivalent under subcubic reduction. One of the results of this paper is that APSP and CoDiameter, which is the complementary version of Diameter, are equivalent. Moreover, for some of the problems in this set, we show that they are equivalent to their complementary versions. Considering the slight difference between a problem and its complementary version, these equivalences give us the impression that every problem has such a property, and thus APSP and Diameter are equivalent. This paper is a step forward in showing a subcubic equivalence between APSP and Diameter, and we hope that the approach introduced in our paper can be helpful to make this breakthrough happen.

at May 21, 2019 12:00 AM UTC

Authors: Roman Galay
Download: PDF
Abstract: As it follows from G\"odel's incompleteness theorems, any consistent formal system of axioms and rules of inference should imply a true unprovable statement. Actually this fundamental principle can be efficiently applicable in Computational Mathematics and Complexity Theory concerning the computational complexity of problems from the class NP, particularly and especially the NP-complete ones. While there is a wide set of algorithms for these problems that we call heuristic, the correctness or/and complexity of each concrete algorithm (or the probability of its correct and polynomial-time work) on a class of instances is often too difficult to determine, although we may also assume the existence of a variety of algorithms for NP-complete problems that are both correct and polynomial-time on all the instances from a given class (where the given problem remains NP-complete), but whose correctness or/and polynomial-time complexity on the class is impossible to prove as an example for G\"odel's theorems. However, supposedly such algorithms should possess a certain complicatedness of processing the input data and treat it in a certain algebraically "entangled" manner. The same algorithmic analysis in fact concerns all the other significant problems and subclasses of NP, such as the graph isomorphism problem and its associated complexity class GI.

The following short article offers a couple of algebraically entangled polynomial-time algorithms for the graph isomorphism and clique problems whose correctness is yet to be determined either empirically or through attempting to find proofs.

at May 21, 2019 11:36 PM UTC

Authors: Sebastian Berndt, Valentin Dreismann, Kilian Grage, Klaus Jansen, Ingmar Knof
Download: PDF
Abstract: Online algorithms that allow a small amount of migration or recourse have been intensively studied in the last years. They are essential in the design of competitive algorithms for dynamic problems, where objects can also depart from the instance. In this work, we give a general framework to obtain so called robust online algorithms for these dynamic problems: these online algorithm achieve an asymptotic competitive ratio of $\gamma+\epsilon$ with migration $O(1/\epsilon)$, where $\gamma$ is the best known offline asymptotic approximation ratio. In order to use our framework, one only needs to construct a suitable online algorithm for the static online case, where items never depart. To show the usefulness of our approach, we improve upon the best known robust algorithms for the dynamic versions of generalizations of Strip Packing and Bin Packing, including the first robust algorithm for general $d$-dimensional Bin Packing and Vector Packing.

at May 21, 2019 11:37 PM UTC

Authors: Lijie Chen, Ofer Grossman
Download: PDF
Abstract: We develop techniques to prove lower bounds for the BCAST(log n) Broadcast Congested Clique model (a distributed message passing model where in each round, each processor can broadcast an O(log n)-sized message to all other processors). Our techniques are built to prove bounds for natural input distributions. So far, all lower bounds for problems in the model relied on constructing specifically tailored graph families for the specific problem at hand, resulting in lower bounds for artificially constructed inputs, instead of natural input distributions.

One of our results is a lower bound for the directed planted clique problem. In this problem, an input graph is either a random directed graph (each directed edge is included with probability 1/2), or a random graph with a planted clique of size k. That is, k randomly chosen vertices have all of the edges between them included, and all other edges in the graph appear with probability 1/2. The goal is to determine whether a clique exists. We show that when k = n^(1/4 - eps), this problem requires a number of rounds polynomial in n.

Additionally, we construct a pseudo-random generator which fools the Broadcast Congested Clique. This allows us to show that every k round randomized algorithm in which each processor uses up to n random bits can be efficiently transformed into an O(k)-round randomized algorithm in which each processor uses only up to O(k log n) random bits, while maintaining a high success probability. The pseudo-random generator is simple to describe, computationally very cheap, and its seed size is optimal up to constant factors. However, the analysis is quite involved, and is based on the new technique for proving lower bounds in the model.

The technique also allows us to prove the first average case lower bound for the Broadcast Congested Clique, as well as an average-case time hierarchy.

at May 21, 2019 11:22 PM UTC

Authors: Yair Bartal, Nova Fandina, Ofer Neiman
Download: PDF
Abstract: A {\em tree cover} of a metric space $(X,d)$ is a collection of trees, so that every pair $x,y\in X$ has a low distortion path in one of the trees. If it has the stronger property that every point $x\in X$ has a single tree with low distortion paths to all other points, we call this a {\em Ramsey} tree cover. Tree covers and Ramsey tree covers have been studied by \cite{BLMN03,GKR04,CGMZ05,GHR06,MN07}, and have found several important algorithmic applications, e.g. routing and distance oracles. The union of trees in a tree cover also serves as a special type of spanner, that can be decomposed into a few trees with low distortion paths contained in a single tree; such spanners for Euclidean pointsets were presented by \cite{ADMSS95}.

In this paper we devise efficient algorithms to construct tree covers and Ramsey tree covers for general, planar and doubling metrics. We pay particular attention to the desirable case of distortion close to 1, and study what can be achieved when the number of trees is small. In particular, our work shows a large separation between what can be achieved by tree covers vs. Ramsey tree covers.

at May 21, 2019 11:36 PM UTC

Authors: Soheil Behnezhad, Laxman Dhulipala, Hossein Esfandiari, Jakub Łącki, Warren Schudy, Vahab Mirrokni
Download: PDF
Abstract: We introduce the Adaptive Massively Parallel Computation (AMPC) model, which is an extension of the Massively Parallel Computation (MPC) model. At a high level, the AMPC model strengthens the MPC model by storing all messages sent within a round in a distributed data store. In the following round, all machines are provided with random read access to the data store, subject to the same constraints on the total amount of communication as in the MPC model. Our model is inspired by the previous empirical studies of distributed graph algorithms using MapReduce and a distributed hash table service.

This extension allows us to give new graph algorithms with much lower round complexities compared to the best known solutions in the MPC model. In particular, in the AMPC model we show how to solve maximal independent set in $O(1)$ rounds and connectivity/minimum spanning tree in $O(\log\log_{m/n} n)$ rounds both using $O(n^\delta)$ space per machine for constant $\delta < 1$. In the same memory regime for MPC, the best known algorithms for these problems require polylog $n$ rounds. Our results imply that the 2-Cycle conjecture, which is widely believed to hold in the MPC model, does not hold in the AMPC model.

at May 21, 2019 11:23 PM UTC

Authors: Noga Alon, Shiri Chechik, Sarel Cohen
Download: PDF
Abstract: In this work we derandomize two central results in graph algorithms, replacement paths and distance sensitivity oracles (DSOs) matching in both cases the running time of the randomized algorithms.

For the replacement paths problem, let G = (V,E) be a directed unweighted graph with n vertices and m edges and let P be a shortest path from s to t in G. The {\sl replacement paths} problem is to find for every edge e \in P the shortest path from s to t avoiding e. Roditty and Zwick [ICALP 2005] obtained a randomized algorithm with running time of ~O(m \sqrt{n}). Here we provide the first deterministic algorithm for this problem, with the same ~O(m \sqrt{n}) time.

For the problem of distance sensitivity oracles, let G = (V,E) be a directed graph with real-edge weights. An f-Sensitivity Distance Oracle (f-DSO) gets as input the graph G=(V,E) and a parameter f, preprocesses it into a data-structure, such that given a query (s,t,F) with s,t \in V and F \subseteq E \cup V, |F| \le f being a set of at most f edges or vertices (failures), the query algorithm efficiently computes the distance from s to t in the graph G \setminus F ({\sl i.e.}, the distance from s to t in the graph G after removing from it the failing edges and vertices F). For weighted graphs with real edge weights, Weimann and Yuster [FOCS 2010] presented a combinatorial randomized f-DSO with ~O(mn^{4-\alpha}) preprocessing time and subquadratic ~O(n^{2-2(1-\alpha)/f}) query time for every value of 0 < \alpha < 1. We derandomize this result and present a combinatorial deterministic f-DSO with the same asymptotic preprocessing and query time.

at May 21, 2019 11:24 PM UTC

Authors: Michael Dinitz, Yasamin Nazari, Zeyu Zhang
Download: PDF
Abstract: There has been significant recent progress on algorithms for approximating graph spanners, i.e., algorithms which approximate the best spanner for a given input graph. Essentially all of these algorithms use the same basic LP relaxation, so a variety of papers have studied the limitations of this approach and proved integrality gaps for this LP in a variety of settings. We extend these results by showing that even the strongest lift-and-project methods cannot help significantly, by proving polynomial integrality gaps even for $n^{\Omega(\epsilon)}$ levels of the Lasserre hierarchy, for both the directed and undirected spanner problems. We also extend these integrality gaps to related problems, notably Directed Steiner Network and Shallow-Light Steiner Network.

at May 21, 2019 12:00 AM UTC

Authors: Ahmed Shokry
Download: PDF
Abstract: Utilizing graph algorithms is a common activity in computer science. Algorithms that perform computations on large graphs are not always efficient. This work investigates the Single-Source Shortest Path (SSSP) problem, which is considered to be one of the most important and most studied graph problems. This thesis contains a review of the SSSP problem in both theory and practice. In addition, it discusses a new single-source shortest-path algorithm that achieves the same $O(n \cdot m)$ time bound as the traditional Bellman-Ford-Moore algorithm but outperforms it and other state-of-the-art algorithms in practice.

The work is comprised of three parts. The first discusses some basic shortest-path and negative-cycle-detection algorithms in literature from the theoretical and practical point of view. The second contains a discussion of a new algorithm for the single-source shortest-path problem that outperforms most state-of-the-art algorithms for several well-known families of graphs. The main idea behind the proposed algorithm is to select the fewest most-effective vertices to scan. We also propose a discussion of correctness, termination, and the proof of the worst-case time bound of the proposed algorithm. This section also suggests two different implementations for the proposed algorithm; the first runs faster while the second performs fewer operations. Finally, an extensive computational study of the different shortest paths algorithms is conducted. The results are presented using a new evaluation metric for shortest-path algorithms. A discussion of outcomes, strengths, and weaknesses of the various shortest path algorithms is also included in this work.

at May 21, 2019 12:00 AM UTC

Authors: Osman Asif Malik, Stephen Becker
Download: PDF
Abstract: We present a method for randomizing a formula for bilinear computation of matrix products. We consider the implications of such randomization when the formula itself is approximate, and when the formula is exact but its computation is plagued by numerical error due to finite precision arithmetic. Our theoretical results and numerical experiments indicate that our method can improve performance in both settings for a negligible increase in computational complexity.

at May 21, 2019 11:35 PM UTC

The Department of Computer Science at Royal Holloway, University of London, invites applications for a Lecturer position (a full-time and permanent (tenured) post). We are recruiting for an academic member of staff who can strengthen our research, which falls broadly within Algorithms and Complexity, Artificial Intelligence, Distributed and Global Computing, and Software Language Engineering.


by shacharlovett at May 20, 2019 06:34 PM UTC

I noticed a while back that even on the nth proofread of a document there are still corrections. So I decided to keep track of how many corrections there are in a paper I was working on. I chose a non-technical paper so that errors-in-the-math would not be the issue.  I chose

                                           Guest Column: The Third P =?NP Poll (see here)

that appeared in Lane Hemaspaandra's SIGACT News Complexity Column.

I kept track of the following:

1) Number of corrections. Anything that I changed. Could be style, a new thought, need not be (though could be) an error.

2) Errors. These are things that really need to be corrected, like having `think' instead of  `thing' .

Corrections vs Errors, an Example:

If I refer to Lane Hemaspaandra as Notorious L.A.N, that is a correction and an error, as he is Notorious L.A.H.

If I refer to Lane Hemaspaandra as Notorious L.A.H and decide to change it to LAH, that is a correction that is not an error.

I didn't keep track of serious errors vs typos, but after the first 3 proofreads there were no more serious errors---sort of---you'll see. Most serious was a fonts-gone-wild thing where half the paper was in boldface.

Here is a history of the number of corrections

1) Lane proofread the first draft. κ corrections where κ is some cardinal between the cardinality of N and the cardinality of 2^N. Its value depends on which model of set theory you are in. (My spellchecker thinks that cardinality is not a word. I checked and I am spelling it correctly but perhaps it's one of those things where I stare at it too much and keep misreading it.)

Henceforth I omit the word proofread as it is understood.

2) Bill G:  81 corrections, 29 of which were errors.

3) Clyde: 64 corrections, of which 17 were errors.

4) Bill G: 40 corrections, of which 21 were errors (I had added a new section causing more errors)

5) Clyde: 30 corrections of which 10 were errors.

6) Bill G: 24 corrections of which 6 were errors.

7) Clyde: 18 corrections of which 8 were errors.

8) David Sekora (A CS grad student at Maryland who at one time wanted to be an English Major): 15 corrections of which 15 were errors. Really! Typos dagnabbit! (Spell check thinks that dagnabbit is spelled wrong. Um---in that case what is the correct spelling?)

9) Nathan Grammel (A CS grad student at Maryland): 6 corrections of which 3 were errors.

10) Bill G, proofreading backwards, a paragraph at a time: 29 corrections of which 5 were errors.

11) Justin Hontz, an ugrad who TAs for me: 10 corrections of which 7 were errors.

12) Karthik Abinav, a grad student in theory at Maryland: 2 corrections, both of which were errors. Was this the end or are there still issues?

13) Josh Twitty, an ugrad who TAs for me: 0 corrections. YEAH!

14) Dan Smolyak, an ugrad CS and Eco major: 4 corrections, all 4 errors. Error sounds too strong. For example, one of them was to replace `?.' with `?'. Yes, it's an error, but not that important. It DOES point to his carefulness as a proofreader.

15) Clyde Kruskal: 20 corrections, 10 of which were errors. To call them errors seems wrong when he corrects `Group theory' to `Group Theory'. None of these corrections were caused by prior comments. I think all of the errors were in the paper early on, undetected until now!

16) Backwards Bill G again: 28 corrections, 14 of which were errors. Again, the errors were minor. Most of the errors were relatively recent. As an example, if I list out topics in math like:

a) Group Theory, Set Theory, and  Ramsey Theory

then I am supposed to use capital letters, but if I say in prose

Lance Fortnow thinks that the techniques used will be group theory, set theory, and Ramsey theory

then only the R in Ramsey theory is in caps. Makes me glad I'm in math.

17) Lane got the penultimate proofread. Lane found 75 (yes 75 WOW) of which 66 (yes 66 WOW) were errors. Many of these were spacing and latex things that I would never have noticed (indeed- I didn't notice) and most readers would not have noticed (hmmm- how do I know that?) but only an editor could catch (hmmm- when I've edited the book review column, and now the open problems column, I never found more than 10 errors). So when all is said and done: KUDOS to Lane! My point was that you can never get all the errors out. On that I am correct. I wonder if there are still errors? Yeah, but at most 10. However, I said that BEFORE giving it to Lane.

18) Stephen Fenner, the editor of SIGACT News, got the FINAL proofread. He found that I spelled his name wrong. How many errors are left? I would bet at most 10. I would bet that I would lose that bet.

Why after multiple proofreadings are there still errors? (My spell check thinks proofreadings is not a word. Maybe my spell check is worried that if people get documents proofread a lot then they won't be needed anymore. This blog post refutes that thought.)

1) An error can occur from a correction. This caused a massive problem with another paper. Lane's next column will be by me and co-authors on The Muffin Problem. We had all kinds of problems with the colors and sizes---Massive Magenta Muffins or Miniature Magenta Muffins? Sizes gone wild! Again, kudos to my proofreaders and to Lane for catching this rather important error.

2) If some passage is added late in the process it will surely have errors.

3) An error correction may clear away the brush so you can see other errors.

4) With LaTeX (or Word for some) we have the ability to get things perfect. So there is no cost to keeping on perfecting things. This leads to many corrections that are not errors.

5) I know of an adviser who would first say change A to B, and later change B back to A. (None of that happened with the paper discussed above).

Are errors inevitable? Clyde Kruskal tells me that his father Martin Kruskal, as a teenager, read Courant and Robbins's book What is Mathematics? and found some errors in it. Martin's mother didn't believe him and marched him over to Courant's house:

MARTIN'S MOTHER: Martin claims to have found errors in your book.

COURANT:  (laughs) There are errors in every book.

Courant was so impressed that ten (or so) years later Courant became Martin's PhD adviser.

by GASARCH ( at May 20, 2019 02:14 PM UTC


I would like to congratulate my Taiwanese readers for being in the first Asian country to introduce same-sex marriage.

by luca at May 18, 2019 08:41 PM UTC

That is “app” as in an on-line application

[ Leo Stein ]

Leo Stein is an assistant professor in the department of Physics and Astronomy at the University of Mississippi. His research interests include general relativity from an astrophysical standpoint.

Today I want to share an unusual proof of his.

Mathematics and complexity theory are all about proving theorems. Most of the time, so far, we prove them the old way: we write out a humanly readable proof. At least we hope the proof is readable. Some of the time, we use a computer to check or even create the proof. Sometimes we do extensive numerical computations, but these are not proofs.

Solving Quadratic Equations

I have known forever, as I am sure you have, that a quadratic equation can be solved in closed form. That is,

\displaystyle  x^{2} + bx + c = 0,

has the two solutions

\displaystyle  -b/2 + 1/2\sqrt{b^{2}-4c} \text{ and } -b/2 - 1/2\sqrt{b^{2}-4c}.

I have discussed this before here and its relationship to the World’s Fair in Flushing Meadows.

A natural question is: Are square roots needed in any formula for quadratic equations? The answer is “Yes”.

Theorem 1 There does not exist any continuous function from the space of quadratic polynomials to complex numbers which associates to any quadratic polynomial a root of that polynomial.

Corollary 2 There is no quadratic formula built out of a finite combination of field operations and the functions {\sin, \cos, \exp}, and the coefficients of the polynomial.

The corollary uses the basic fact that {\sin, \cos, \exp} are continuous functions. Note that each has a single branch on the complex plane, whereas radicals and the logarithm function do not. So how do we prove the theorem?

An App Based Proof

Here is a novel, I think, proof that uses an app. Stein has written the app and it is here. He explains how to use it. I strongly suggest that you try this yourself.

To get a feel for all this, drag the {a_{0}} coefficient to {-1} and the {a_{1}} coefficient to {1/2}. You should have two real roots in root space (one at {\approx -1.28}, the other at {\approx 0.78}). Let’s call {r_{1}} the negative root, and {r_{2}} the positive root. Now move the coefficient {a_{0}} around in a small loop (i.e. move it around a little bit, and then return it to {-1} where it started). Note that the roots move continuously, and then return to their original positions. Next, move {a_{0}} in a big loop (big enough that it orbits around {r_{2}}). Something funny happens: the roots {r_{1}} and {r_{2}} switch places.
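The same experiment can be reproduced numerically without the app. The sketch below (mine, not Stein's code) tracks one root of {z^{2} + a_{1}z + a_{0}} by continuation, always selecting the root nearest to the previous one, while {a_{0}} traverses a closed loop that starts and ends at {-1}; the roots swap exactly when the loop encircles the branch point {a_{0} = a_{1}^{2}/4}, where the discriminant vanishes.

```python
import cmath

A1 = 0.5                            # fixed coefficient a_1

def roots(a0):
    """The two roots of z^2 + A1*z + a0, via the quadratic formula."""
    s = cmath.sqrt(A1 * A1 - 4 * a0)
    return ((-A1 + s) / 2, (-A1 - s) / 2)

def track(radius, steps=2000):
    """Follow one root by continuation while a0 moves along the circle
    a0(t) = (-1 + radius) - radius*exp(i t), which starts and ends at -1
    and encircles the branch point a0 = A1**2/4 iff radius > 0.53125."""
    r = roots(-1.0)[0]              # start tracking the positive root (~0.78)
    for k in range(1, steps + 1):
        t = 2 * cmath.pi * k / steps
        a0 = (-1 + radius) - radius * cmath.exp(1j * t)
        r = min(roots(a0), key=lambda c: abs(c - r))  # nearest-root matching
    return r

r_pos, r_neg = roots(-1.0)          # ~0.78 and ~-1.28, as in the app
small = track(0.5)                  # loop misses the branch point: root returns
big = track(2.0)                    # loop encircles it: the roots swap places
```

After the small loop the tracked root returns to {\approx 0.78}; after the big loop it ends at {\approx -1.28}. No continuous single-valued choice of root can survive this, which is exactly Theorem 3.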

Leo Goldmakher says here:

Pause and think about this for a second. This is really, really weird.

Here is one immediate consequence of this observation:

Theorem 3 There does not exist any continuous function from the space of quadratic polynomials to complex numbers which associates to any quadratic polynomial a root of that polynomial.

And so the corollary follows.

A Standard Proof

Goldmakher writes out a more conventional proof in his paper titled Arnold’s Elementary Proof Of The Insolvability Of The Quintic. He also shows the following theorem:

Theorem 4 Fix a positive integer {N}. Any quintic formula built out of the field operations, continuous functions, and radicals must have nesting of level more than {N}.

This says that there can be no fixed formula for fifth degree, quintic, polynomials. Of course, this follows from Galois theory, but his proof uses just calculus. The Arnold is Vladimir Arnold.

Open Problems

Do you know other cases of an app with animation conveying the essence of a mathematical proof? This means more than “proofs in pictures” or “proofs without words”—the animation and interactivity are crucial.

by rjlipton at May 18, 2019 08:03 PM UTC

We study the complexity of computing symmetric and threshold functions by constant-depth circuits with Parity gates, also known as AC$^0[\oplus]$ circuits. Razborov (1987) and Smolensky (1987, 1993) showed that Majority requires depth-$d$ AC$^0[\oplus]$ circuits of size $2^{\Omega(n^{1/(2(d-1))})}$. By using a divide-and-conquer approach, it is easy to show that Majority can be computed with depth-$d$ AC$^0[\oplus]$ circuits of size $2^{\widetilde{O}(n^{1/(d-1)})}$. This gap between upper and lower bounds has stood for nearly three decades. Somewhat surprisingly, we show that neither the upper bound nor the lower bound above is tight for large $d$. We show for $d \geq 5$ that any symmetric function can be computed with depth-$d$ AC$^0[\oplus]$ circuits of size $\exp({\widetilde{O} (n^{\frac{2}{3} \cdot \frac{1}{(d - 4)}} )})$. Our upper bound extends to threshold functions (with a constant additive loss in the denominator of the double exponent). We improve the Razborov-Smolensky lower bound to show that for $d \geq 3$ Majority requires depth-$d$ AC$^0[\oplus]$ circuits of size $2^{\Omega(n^{1/(2d-4)})}$. For depths $d \leq 4$, we are able to refine our techniques to get almost-optimal bounds: the depth-$3$ AC$^0[\oplus]$ circuit size of Majority is $2^{\widetilde{\Theta}(n^{1/2})}$, while its depth-$4$ AC$^0[\oplus]$ circuit size is $2^{\widetilde{\Theta}(n^{1/4})}$.

at May 17, 2019 09:48 PM UTC

Consider the multiparty communication complexity model where there are n processors, each receiving as input a row of an n by n matrix M with entries in {0, 1}, and in each round each party can broadcast a single bit to all other parties (this is known as the BCAST(1) model). There are many lower bounds known for the number of rounds necessary for certain problems in this model, but they are all worst case lower bounds which apply only for very specifically constructed input distributions. We develop a framework for showing lower bounds in this setting for more natural input distributions, and apply the framework to show: A lower bound for finding planted cliques in random inputs (i.e., each entry of the matrix is random, except there is a random subset a_1,..., a_k in [n] where M_{a_i,a_j} = 1 for all i and j). Specifically, we show that if k = n^(1/4 - eps), this problem requires a number of rounds polynomial in n. A pseudo-random generator which fools the BCAST(1) model. That is, we show a distribution which is efficiently samplable using few random bits, and which is indistinguishable from uniform by a low-round BCAST(1) protocol. This allows us to show that every t = Omega(log n) round randomized algorithm in which each processor uses up to n random bits can be efficiently transformed into an O(t)-round randomized algorithm in which each processor uses only up to O(t) random bits, while maintaining a high success probability. As a corollary of the pseudo-random generator, we also prove the first average case lower bound for the model (specifically, for the problem of determining whether the input matrix is full rank), as well as an average-case time hierarchy.

at May 17, 2019 07:38 PM UTC

We now discuss how to view proofs of certain regularity lemmas as applications of the FTRL methodology.

The Szemeredi Regularity Lemma states (in modern language) that every dense graph is well approximated by a graph with a very simple structure, made of the (edge-disjoint) union of a constant number of weighted complete bipartite subgraphs. The notion of approximation is a bit complicated to describe, but it enables the proof of counting lemmas, which show that, for example, the number of triangles in the original graph is well approximated by the (appropriately weighted) number of triangles in the approximating graph.

Analogous regularity lemmas, in which an arbitrary object is approximated by a low-complexity object, have been proved for hypergraphs, for subsets of abelian groups (for applications to additive combinatorics), in an analytic setting (for applications to graph limits) and so on.

The weak regularity lemma of Frieze and Kannan provides, as the name suggests, a weaker kind of approximation than the one promised by Szemeredi's lemma, but one that is achievable with a graph that has a much smaller number of pieces. If {\epsilon} is the “approximation error” that one is willing to tolerate, Szemeredi's lemma constructs a graph that is the union of {2^{2^{\vdots}}} weighted complete bipartite subgraphs, where the height of the tower of exponentials is polynomial in {1/\epsilon}. In the Frieze-Kannan construction, that number is cut down to a single exponential {2^{O(1/\epsilon^2)}}. This result too can be generalized to graph limits, subsets of groups, and so on.

With Tulsiani and Vadhan, we proved an abstract version of the Frieze-Kannan lemma (which can be applied to graphs, functions, distributions, etc.) in which the “complexity” of the approximation is {O(1/\epsilon^2)}. In the graph case, the approximating graph is still the union of {2^{O(1/\epsilon^2)}} complete bipartite subgraphs, but it has a more compact representation. One consequence of this result is that for every high-min-entropy distribution {\cal D}, there is an efficiently samplable distribution with the same min-entropy as {\cal D}, that is indistinguishable from {\cal D}. Such a result could be taken to be a proof that what GANs attempt to achieve is possible in principle, except that our result requires an unrealistically high entropy (and we achieve “efficient samplability” and “indistinguishability” only in a weak sense).

All these results are proved with a similar strategy: one starts from a trivial approximator, for example the empty graph, and then repeats the following iteration: if the current approximator achieves the required approximation, then we are done; otherwise take a counterexample, and modify the approximator using the counterexample. Then one shows that:

  • The number of iterations is bounded, by keeping track of an appropriate potential function;
  • The “complexity” of the approximator does not increase too much from iteration to iteration.

Typically, the number of iterations is {O(1/\epsilon^2)}, and the difference between the various results is given by whether at each iteration the “complexity” increases exponentially, or by a multiplicative factor, or by an additive term.

Like in the post on pseudorandom constructions, one can view such constructions as an online game between a “builder” and an “inspector,” except that now the online optimization algorithm will play the role of the builder, and the inspector is the one acting as an adversary. The {O(1/\epsilon^2)} bound on the number of rounds comes from the fact that the online optimization algorithms that we have seen so far achieve amortized error per round {O(1/\sqrt T)} after {T} rounds, so it takes {O(1/\epsilon^2)} rounds for the error bound to go below {\epsilon}.

We will see that the abstract weak regularity lemma of my paper with Tulsiani and Vadhan (and hence the graph weak regularity lemma of Frieze and Kannan) can be immediately deduced from the theory developed in the previous post.

When I was preparing these notes, I was asked by several people if the same can be done for Szemeredi’s lemma. I don’t see a natural way of doing that. For such results, one should maybe use the online optimization techniques as a guide rather than as a black box. In general, iterative arguments (in which one constructs an object through a series of improvements) require the choice of a potential function, and an argument about how much the potential function changes at every step. The power of the FTRL method is that it creates the potential function and a big part of the analysis automatically and, even where it does not work directly, it can serve as an inspiration.

One could imagine a counterfactual history in which people first proved the weak regularity lemma using online optimization out of the box, as we do in this post, and then decided to try and use an L2 potential function and an iterative method to get the Szemeredi lemma, subsequently trying to see what happens if the potential function is entropy, thus discovering Jacob Fox’s major improvement on the “triangle removal lemma,” which involves the construction of an approximator that just approximates the number of triangles.

1. A “vanilla” weak regularity lemma

Frieze and Kannan proved the following basic result about graph approximations, which has a number of algorithmic applications. If {V} is a set of vertices which is understood from the context, and {A,B \subseteq V} are disjoint subsets of vertices, then let {K_{A,B} = {\bf 1}_A \cdot {\bf 1}_B^T}, that is, the boolean matrix such that {K_{A,B} (i,j) = 1} iff {i\in A} and {j\in B}.

The cut norm of a matrix {M} is

\displaystyle  || M ||_{\square} := \max_{A,B} | \langle M, K_{A,B} \rangle |

In the following we will identify a graph with its adjacency matrix.

Theorem 1 Let {G=(V,E)} be a graph on {n} vertices and {\epsilon >0} be an approximation parameter.

Then there are sets {A_1,B_1,\ldots,A_T,B_T} and scalars {\alpha_1,\ldots,\alpha_T}, where {T \leq O(1/\epsilon^2)}, such that if we define

\displaystyle  H:= \sum_{i=1}^T \alpha_i K_{A_i,B_i}

we have

\displaystyle  || G - H ||_{\square} \leq \epsilon n^2

We will prove the following more general version.

Theorem 2 Let {X} be a set, {g: X \rightarrow [0,1]} be a bounded function, {{\cal F}} be a family of functions mapping {X} to {[0,1]} and {\epsilon} be an approximation parameter. Then there are functions {f_1,\ldots,f_T} in {{\cal F}} and scalars {\alpha_1,\ldots,\alpha_T}, with {T = O(1/\epsilon^2)}, such that if we define

\displaystyle  h := \sum_{i=1}^T \alpha_i f_i

we have

\displaystyle  \forall f\in {\cal F}: \ \ | \langle f, g- h \rangle | \leq \epsilon |X|

We could also, with the same proof, argue about a possibly infinite set {X} with a measure {\mu} such that {\mu(X)} is finite, and, after defining the inner product

\displaystyle  \langle f, g \rangle := \int_X f\cdot g\ d \mu \ ,

we could prove the same conclusion of the theorem, with {\epsilon \cdot \mu(X)} instead of {\epsilon |X|} as an error bound.

Here is the proof: run the FTRL algorithm with L2-squared regularizer in the setup in which the space of solutions {K} is the set of all functions { X \rightarrow {\mathbb R}} and the loss functions are linear. Every time the algorithm proposes a solution {h_t}, if there is a function {f_t \in {\cal F}} such that either { \langle h_t - g , f_t \rangle > \epsilon|X|} or { \langle h_t - g , f_t \rangle < - \epsilon|X|}, the adversary will pick, respectively, {f_t} or {-f_t} as a loss function {\ell_t}. When the adversary has no such choice, we stop and the function {h_t} is our desired approximation.

First of all, let us analyze the number of rounds. Here the maximum norm of the functions in {{\cal F}} is {\sqrt {|X|}}, so after {T} rounds we have the regret bound

\displaystyle  \forall h : \ \ \sum_{t=1}^T \langle \ell_t, h_t - h \rangle \leq \sqrt{|X|} \cdot || h || \cdot \sqrt{2T}

Now let us consider {g} to be our offline solution: we have

\displaystyle  \epsilon T |X| < \sum_{t=1}^T \langle \ell_t, h_t - g \rangle \leq |X| \cdot \sqrt{2T}

which implies

\displaystyle  T < \frac 2{\epsilon^2}

Finally, recall that

\displaystyle  h_T = \sum_{t=1}^{T-1} - \frac 1 {2c} \ell_t

where {c} is the scaling constant in the definition of the regularizer ({1/2c} is of order of {\epsilon} when {T} is order of {1/\epsilon^2}), and so our final approximator {h_T} computed at the last round is a weighted sum of functions from {{\cal F}}.
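As a toy illustration of the proof above (a sketch added here, not code from any of the papers), one can play the builder/inspector game on a small graph in Python. The inspector brute-forces over all disjoint pairs {A,B}, which is feasible only for tiny {n}, and the fixed step size eta = epsilon stands in for the FTRL scaling {1/2c}; both are simplifying assumptions.

```python
import itertools
import random

def weak_regularity(g, n, eps, eta):
    """Builder/inspector game: grow h as a +/- eta weighted sum of cut
    matrices K_{A,B} until no cut has correlation > eps*n^2 with g - h."""
    h = [[0.0] * n for _ in range(n)]
    rounds = 0
    while True:
        # Inspector: brute-force search for the most violated cut.
        # Each vertex is assigned to A (0), B (1), or neither (2).
        worst, best = 0.0, None
        for assign in itertools.product((0, 1, 2), repeat=n):
            A = [i for i in range(n) if assign[i] == 0]
            B = [i for i in range(n) if assign[i] == 1]
            corr = sum(g[i][j] - h[i][j] for i in A for j in B)
            if abs(corr) > worst:
                worst, best = abs(corr), (A, B, 1.0 if corr > 0 else -1.0)
        if worst <= eps * n * n:
            return h, rounds
        # Builder: gradient step h <- h + eta * sign * K_{A,B}.
        A, B, sign = best
        for i in A:
            for j in B:
                h[i][j] += eta * sign
        rounds += 1

# Toy instance: a random 8-vertex graph, approximation error eps = 0.15.
random.seed(0)
n, eps = 8, 0.15
g = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        g[i][j] = g[j][i] = random.randint(0, 1)
h, T = weak_regularity(g, n, eps, eta=eps)
```

On such an instance the game stops after a handful of rounds: the squared L2 distance from {h_t} to {g} drops by roughly {\epsilon^2 n^2} per round, which is exactly the potential-function view of the {O(1/\epsilon^2)} bound derived above.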

2. The weak regularity lemma

Frieze and Kannan’s weak regularity lemma has the following form.

Theorem 3 Let {G=(V,E)} be a graph on {n} vertices and {\epsilon >0} be an approximation parameter.

Then there is a partition of {V} into {k = 2^{O(1/\epsilon^2)}} sets {S_1,\ldots,S_k}, and there are bounded weights {0\leq p_{i,j} \leq 1} for {i,j \in \{1,\ldots, k\}} such that if we define the weighted graph {H} where the weight of the edge {(u,v)} in {H} is {p_{i,j}}, where {u\in S_i} and {v\in S_j}, then we have

\displaystyle  || G - H ||_{\square} \leq \epsilon n^2

Notice that if we did not require the weights to be between 0 and 1 then the result of the previous section can also be cast in the above language, because we can take the partition {S_1,\ldots,S_k} to be the “Sigma-algebra generated by” the sets {A_1,B_1,\ldots,A_T,B_T}.

For a scalar {z\in {\mathbb R}}, let {\tau(z)} be defined as

\displaystyle  \tau(z) = \left\{ \begin{array}{rl} 0 & \mbox{ if } z <0\\ z & \mbox{ if } 0\leq z \leq 1\\ 1 & \mbox{ if } z > 1 \end{array} \right.

where {\tau} stands for truncation. Note that {\tau(z)} is the L2 projection of {z} on {[0,1]}.

Theorem 3 is a special case of the following result, proved in our paper with Tulsiani and Vadhan.

Theorem 4 Let {X} be a set, {g: X \rightarrow [0,1]} be a bounded function, {{\cal F}} be a family of functions mapping {X} to {[0,1]} and {\epsilon} be an approximation parameter. Then there are functions {f_1,\ldots,f_T} in {{\cal F}} and scalars {\alpha_1,\ldots,\alpha_T}, with {T = O(1/\epsilon^2)}, such that if we define

\displaystyle  h := \tau\left ( \sum_{i=1}^T \alpha_i f_i \right)

we have

\displaystyle  \forall f\in {\cal F}: \ \ | \langle f, g- h \rangle | \leq \epsilon |X|

To prove Theorem 4 we play the same online game as in the previous section: the online algorithm proposes a solution {h_t}; if {\forall f\in {\cal F}: \ \ | \langle f, g- h_t \rangle | \leq \epsilon |X| } then we stop and output {h=h_t}, otherwise we let the loss function be a function {\ell_t} such that either {\ell_t} or {-\ell_t} is in {{\cal F}} and

\displaystyle  \langle \ell_t , g- h_t \rangle \geq \epsilon |X|

The only difference is that we use the FTRL algorithm with L2 regularizer that has the set feasible solutions {K} defined to be the set of all functions {h : X \rightarrow [0,1]} rather than the set of all functions {h: X \rightarrow {\mathbb R}}. Then each function {h_T} is the projection to {K} of {\sum_{t=1}^{T-1} - \frac 1 {2c}\ell_t}, and the projection to {K} is just composition with {\tau}. The bound on the number of steps is the same as the one in the previous section.

Looking at the case in which {X} is the set of edges of a clique on {V}, {{\cal F}} is the set of graphs of the form {K_{A,B}}, and considering the Sigma-algebra generated by {A_1,B_1,\ldots,A_T,B_T} gives Theorem 3 from Theorem 4.
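In code, very little changes relative to the unconstrained game (this is again an illustrative sketch with a brute-force inspector and a fixed step size eta = epsilon standing in for {1/2c}, not code from the paper): the proposed solution is now the pointwise truncation {\tau} of the raw running sum of losses, i.e., lazy projection onto {K = [0,1]^X}.

```python
import itertools
import random

def tau(z):
    """Truncation: the L2 projection of a scalar onto [0, 1]."""
    return min(1.0, max(0.0, z))

def find_violated_cut(g, h, n, eps):
    """Brute-force inspector: return a cut K_{A,B} (with a sign) such that
    |<K_{A,B}, g - h>| > eps * n^2, or None if no such cut exists."""
    for assign in itertools.product((0, 1, 2), repeat=n):  # 0->A, 1->B, 2->neither
        A = [i for i in range(n) if assign[i] == 0]
        B = [i for i in range(n) if assign[i] == 1]
        corr = sum(g[i][j] - h[i][j] for i in A for j in B)
        if abs(corr) > eps * n * n:
            return A, B, (1.0 if corr > 0 else -1.0)
    return None

def truncated_weak_regularity(g, n, eps, eta):
    raw = [[0.0] * n for _ in range(n)]  # the un-projected running sum
    for rounds in itertools.count():
        # h_t is the coordinatewise truncation of the raw sum: h_t = tau(raw).
        h = [[tau(raw[i][j]) for j in range(n)] for i in range(n)]
        cut = find_violated_cut(g, h, n, eps)
        if cut is None:
            return h, rounds
        A, B, sign = cut
        for i in A:
            for j in B:
                raw[i][j] += eta * sign  # FTRL step acts on the raw sum only

random.seed(1)
n, eps = 8, 0.15
g = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        g[i][j] = g[j][i] = random.randint(0, 1)
h, T = truncated_weak_regularity(g, n, eps, eta=eps)
```

The returned approximator is guaranteed to lie in {[0,1]^X}, and the number of rounds remains {O(1/\epsilon^2)} since projecting onto a box is exactly composition with {\tau}.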

3. Sampling High-Entropy Distributions

Finally we discuss the application to sampling high-entropy distributions.

Suppose that {D} is a distribution over {\{ 0,1 \}^n} of min-entropy {\geq n-d}, meaning that for every {x\in \{ 0,1 \}^n} we have

\displaystyle  D(x) \leq 2^{-(n-d)}

where we think of the entropy deficiency {d} as being small, such as a constant or {O(\log n)}.

Let {{\cal F}} be a class of functions {\{ 0,1 \}^n \rightarrow \{ 0,1\}} that we think of as being “efficient.” For example, {{\cal F}} could be the set of all functions computable by circuits of size {\leq S(n)} for some size bound {S(\cdot)}, such as, for example {S(n) = 10 n^2}. We will assume that {f(x) \equiv 1} is in {{\cal F}}. Define

\displaystyle  g(x) = 2^{n-d} \cdot D(x)

to be a bounded function {g: \{ 0,1 \}^n \rightarrow [0,1]}. Fix an approximation parameter {\epsilon >0}.

Then from Theorem 4 we have that there are {T = O(1/\epsilon^2)} functions {f_1,\ldots,f_T}, and scalars {\alpha_1,\ldots,\alpha_T}, all equal to {\pm 1/2c} for a certain parameter {c}, such that if we define

\displaystyle  h(x) = \tau \left( \sum_{t=1}^T \alpha_t f_t(x) \right)

we have

\displaystyle   \forall f \in {\cal F} : \ \ \ \left | \sum_{x\in \{ 0,1 \}^n} \left( g(x) f(x) - h(x) f(x) \right ) \right | \leq \epsilon 2^n \ \ \ \ \ (1)

and so, multiplying by {2^{-(n-d)}}

\displaystyle   \forall f \in {\cal F} : \ \ \ \left | \sum_{x\in \{ 0,1 \}^n} \left( D(x) f(x) - h(x)2^{-(n-d)} f(x) \right ) \right | \leq \epsilon 2^d \ \ \ \ \ (2)

Now define the probability distribution

\displaystyle  H(x) = \frac {h(x)}{\sum_{y\in \{ 0,1 \}^n} h(y) }

Applying (1) to the case {f(x) \equiv 1}, we have

\displaystyle  \left | \sum_{x\in \{ 0,1 \}^n} \left( g(x) - h(x) \right ) \right | \leq \epsilon 2^n

and we know that {\sum_x g(x) = 2^{n-d} \sum_x D(x) = 2^{n-d}}, so

\displaystyle  \left | 2^{n-d} - \sum_{x\in \{ 0,1 \}^n} h(x) \right | \leq \epsilon 2^n

and we can rewrite (2) as

\displaystyle  \forall f \in {\cal F} : \ \ \ \left | \sum_{x\in \{ 0,1 \}^n} \left( D(x) f(x) - H(x) \cdot (\sum_y h(y)) 2^{-(n-d)} f(x) \right ) \right | \leq \epsilon 2^d

and, finally

\displaystyle  \forall f \in {\cal F} : \ \ \ \left | \sum_{x\in \{ 0,1 \}^n} \left( D(x) f(x) - H(x) f(x) \right ) \right | \leq 2 \epsilon 2^d

that is

\displaystyle  \forall f \in {\cal F} : \ \ \ \left | \Pr_{x\sim D} [f(x) = 1] - \Pr_{x\sim H} [f(x) =1 ] \right | \leq 2 \epsilon 2^d

which says that {H} and {D} are {\epsilon 2^{d+1}}-indistinguishable by functions in {{\cal F}}. If we chose {{\cal F}}, for example, to be the class of functions computable by circuits of size {\leq S(n)}, then {H} and {D} are {\epsilon 2^{d+1}}-indistinguishable by circuits of size {\leq S(n)}.

But {H} is also samplable in a relatively efficient way using rejection sampling: pick a random {x}, then output {x} with probability {h(x)} and fail with probability {1-h(x)}. Repeat the above until the procedure does not fail. At each step, the probability of success is {\mathop{\mathbb E}_{x\in \{ 0,1 \}^n} h(x) \geq 2^{-d} - \epsilon}, so, assuming (because otherwise all of the above makes no sense) that, say, {\epsilon < 2^{-d-1}}, the procedure succeeds on average in at most {2^{d+1}} attempts. And if each {f_t} is computable by a circuit of size {S(n)}, then {h} is computable by a circuit of size {O(S(n)/\epsilon^2)}.
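The rejection sampler is easy to express in code. The sketch below is purely illustrative: the function h here is a hypothetical stand-in (proportional to the number of 1 bits of a 4-bit string), not an actual truncated sum of circuits, and the tiny domain is chosen so the distribution can be checked by counting.

```python
import random

def make_sampler(h, domain, rng):
    """Rejection sampling from H(x) = h(x) / sum_y h(y), given only
    oracle access to the density-like function h : domain -> [0, 1]."""
    def sample():
        while True:
            x = rng.choice(domain)   # pick a uniformly random x
            if rng.random() < h(x):  # accept with probability h(x),
                return x             # otherwise fail and retry
    return sample

n_bits = 4
domain = list(range(2 ** n_bits))

def h(x):
    # Hypothetical stand-in for tau(sum_t alpha_t f_t(x)):
    # proportional to the popcount of x, so h(0) = 0 and h(15) = 1.
    return bin(x).count("1") / n_bits

rng = random.Random(2019)
sample = make_sampler(h, domain, rng)
draws = [sample() for _ in range(20000)]
```

Here {\sum_y h(y) = 8}, so the sampler should output {x = 15} with probability {1/8} and never output {x=0}; the empirical frequencies match.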

The undesirable features of this result are that the complexity of sampling and the quality of indistinguishability depend exponentially on the randomness deficiency, and that the sampling circuit is non-uniform, with no clear way to construct it without advice. Impagliazzo’s recent results address both these issues.

by luca at May 16, 2019 07:37 PM UTC

My daughter showed up at her doctor's office the other day and found it closed. She complained that she had received all these alerts, texts, emails, voicemails, reminding her of the appointment and then they weren't there when she showed up. She had stopped reading the alerts, the last of which said the appointment needed to be rescheduled.

We all get too many alerts. I just assume many of the emails I send to faculty never get read or remembered. I get many requests to mention conferences, workshops, and other theory happenings on this blog because nobody knows how to get the word out through the clutter. If we followed through on all these requests, this blog would become clutter itself.

Back in 2009, Nikhil Devanur and I wrote a paper trying to capture this phenomenon using complexity. Building on Levin's notion of universal search, we defined the unawareness of a string x in an environment E as the amount of time needed to generate x from a context c given an oracle for E. Levin showed that one can have a universal algorithm, only a constant factor slower to generate x than any other algorithm. One should think of E as the sum total of all our knowledge including search engines on the Internet. Technically we should have used the term "attention" instead of "awareness".

One example is using a calendar as part of your environment, E, that reminds you of an event, x, on that date, c. We use calendars, contact lists, emails searches and many other ways to keep strings we need to remember. Advertisers try to alter your E to get the unawareness of x down.

It was one of those papers that didn't go far because we didn't have great applications for the definition. But it follows a good philosophy: when life seems complex, model it.

by Lance Fortnow ( at May 16, 2019 12:46 PM UTC

by David Eppstein at May 15, 2019 10:00 PM UTC

We invite applications for a postdoc position hosted by Hsin-Hao Su at Boston College. Areas of specific interest include, but are not limited to, distributed graph algorithms, local algorithms, dynamic graph algorithms, gossip algorithms, and massive parallel computation algorithms. The starting date is flexible between Fall 2019 and Spring 2020. The position is for a period of up to two years.


by shacharlovett at May 15, 2019 07:23 PM UTC

Ever since I “upgraded” this website to use SSL, it’s become completely inaccessible once every three months, because the SSL certificate expires. Several years in, I’ve been unable to find any way to prevent this from happening, and Bluehost technical support was unable to suggest any solution. The fundamental problem is that, as long as the site remains up, the Bluehost control panel tells me that there’s nothing to do, since there is a current certificate. Meanwhile, though, I start getting menacing emails saying that my SSL certificate is about to expire and “you must take action to secure the site”—never, of course, specifying what action to take. The only thing to do seems to be to wait for the whole site to go down, then frantically take random certificate-related actions until somehow the site goes back up. Those actions vary each time and are not repeatable.

Does anyone know a simple solution to this ridiculous problem?

(The deeper problem, of course, is that a PhD in theoretical computer science left me utterly unqualified for the job of webmaster. And webmasters, as it turns out, need to do a lot just to prevent anything from changing. And since childhood, I’ve been accustomed to countless tasks that are trivial for most people being difficult for me—if that ever stopped being the case, I’d no longer feel like myself.)

by Scott at May 15, 2019 02:56 AM UTC

A line of recent works showed that for a large class of learning problems, any learning algorithm requires either super-linear memory size or a super-polynomial number of samples [Raz16,KRT17,Raz17,MM18,BOGY18,GRT18]. For example, any algorithm for learning parities of size $n$ requires either a memory of size $\Omega(n^{2})$ or an exponential number of samples [Raz16]. All these works modeled the learner as a one-pass branching program, allowing only one pass over the stream of samples. In this work, we prove the first memory-samples lower bounds (with a super-linear lower bound on the memory size and super-polynomial lower bound on the number of samples) when the learner is allowed two passes over the stream of samples. For example, we prove that any two-pass algorithm for learning parities of size $n$ requires either a memory of size $\Omega(n^{1.5})$ or at least $2^{\Omega(\sqrt{n})}$ samples. More generally, a matrix $M: A \times X \rightarrow \{-1,1\}$ corresponds to the following learning problem: An unknown element $x \in X$ is chosen uniformly at random. A learner tries to learn $x$ from a stream of samples, $(a_1, b_1), (a_2, b_2) \ldots$, where for every $i$, $a_i \in A$ is chosen uniformly at random and $b_i = M(a_i,x)$. Assume that $k,l, r$ are such that any submatrix of $M$ of at least $2^{-k} \cdot |A|$ rows and at least $2^{-l} \cdot |X|$ columns, has a bias of at most $2^{-r}$. We show that any two-pass learning algorithm for the learning problem corresponding to $M$ requires either a memory of size at least $\Omega\left(k \cdot \min\{k,\sqrt{l}\} \right)$, or at least $2^{\Omega(\min\{k,\sqrt{l},r\})}$ samples.

at May 14, 2019 04:35 PM UTC

An inappropriate comment on the ABC conjecture

Joseph Oesterlé and David Masser are famous for their independent discovery of the ABC conjecture.

Today I want to point out an unfair comment about their discovery.

Anonymity on the Internet was captured by a famous 1993 cartoon in the New Yorker magazine titled, “On the Internet, nobody knows you’re a dog.” Amazing to think that was more than a quarter-century ago and remains true. But people can tell if what you’ve written is something inappropriate.

The Comment

The comment is:

SAYS WHO??? I have some trouble with this item.

Masser is a Fellow of the Royal Society, who was elected in 2005. He is

{\dots} also responsible, following an earlier insight of Joseph Oesterlé, for formulating the abc conjecture; this is a simple statement about integers which seems to hold the key to much of the future direction of number theory.

See this link for his full citation and the comment. Click on the show more bibliography button there. The comment is apparently anonymous, although the author is probably known to some. I thank Joël Ouaknine for pointing out this strange comment.

Update: Ken speculates that it’s a misplaced comment by an editor of the Royal Society website itself. Perhaps they compose HTML from MS Word or Acrobat or other software that provides comment bubbles—but this one escaped the bubble and wasn’t noticed. Editors of Wikipedia have automatic tools for flagging assertions that are unsupported or at least need citation.

What the comment undoubtedly shows is vigorous debate behind the walls of Britain’s august institution. So let’s say a little more on what the comment is about.

The ABC Conjecture

The biggest mysteries about numbers often concern the interaction between addition and multiplication. For example:

  • The twin prime conjecture: There are an infinite number of primes {p} such that {p+2} is also prime. This is due to Alphonse de Polignac.

  • The Goldbach conjecture: Every even number greater than four is the sum of two odd primes. This is due to Christian Goldbach.

  • The Brocard conjecture: There are only a finite number of solutions to {n! + 1 = m^{2}} in natural numbers. This is due to Henri Brocard.

Suppose that {A + B = C} where {A,B,C} are positive and co-prime natural numbers. Let {D} be the product of all the distinct prime divisors of {ABC}. Then the ABC conjecture says that

\displaystyle  C \le O(D^{2}).

Note, this inequality does indeed connect adding with multiplying. The usual conjecture is stronger, see this for details.
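For a hands-on feel of the inequality, here is a small illustrative script (added here, not from the post) that computes {D} by trial division. The well-known triple 2 + 3^10·109 = 23^5 is a case where {D} is unusually small compared to {C}, yet {C \le D^2} still holds comfortably.

```python
from math import gcd

def prime_factors(n):
    """Set of distinct prime divisors of n, by trial division."""
    fs, p = set(), 2
    while p * p <= n:
        if n % p == 0:
            fs.add(p)
            while n % p == 0:
                n //= p
        p += 1
    if n > 1:
        fs.add(n)
    return fs

def abc_data(a, b):
    """For coprime a, b, return (C, D): C = a + b and D = rad(ABC),
    the product of the distinct primes dividing a * b * (a + b)."""
    assert gcd(a, b) == 1
    c = a + b
    D = 1
    for p in prime_factors(a) | prime_factors(b) | prime_factors(c):
        D *= p
    return c, D

# 1 + 8 = 9: D = rad(1 * 8 * 9) = 2 * 3 = 6, and indeed C = 9 <= D^2 = 36.
# 2 + 3^10 * 109 = 23^5: D = 2 * 3 * 109 * 23 = 15042, C = 6436343 <= D^2.
```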

The ABC conjecture appears to be open, even though Shinichi Mochizuki has claimed a proof for years. See this for a discussion about the status of the conjecture.

Despite multiple conferences dedicated to explicating Mochizuki’s proof, number theorists have struggled to come to grips with its underlying ideas. His series of papers, which total more than 500 pages, are written in an impenetrable style, and refer back to a further 500 pages or so of previous work by Mochizuki, creating what one mathematician, Brian Conrad of Stanford University, has called “a sense of infinite regress.”

The Comment II

The comment on Masser’s work is wrong, strange, inappropriate. Oesterlé and Masser deserve more credit, not less, for their brilliant discovery of the ABC conjecture. There are now many—perhaps hundreds—of applications of the ABC conjecture. For example consider generalizations of Fermat’s Last Theorem. Suppose that

\displaystyle  x^{p} + y^{q} = z^{r} \text{  (*)}

where {p,q,r} are odd primes and {r > 6}. Provided {x,y,z} are positive and co-prime, it follows by the ABC conjecture that {z^{r}} is bounded by {O(z^{2})}. This is impossible for {z} large enough since {r \ge 3}. Therefore, (*) can only have a finite number of solutions. Pretty neat.

Open Problems

Do you know of any other inappropriate comments of this kind?

[added remark by Ken, linked rather than embed dog cartoon]
[Added prime r must be 7 or larger. Thanks to comment by MadHatter.]

by rjlipton at May 14, 2019 02:49 PM UTC

Non-signaling strategies are a generalization of quantum strategies that have been studied in physics for decades, and have recently found applications in theoretical computer science. These applications motivate the study of local-to-global phenomena for non-signaling functions. We present general results about the local testability of linear codes in the non-signaling setting. Our contributions include formulating natural definitions that capture the condition that a non-signaling function "belongs" to a given code, and characterizing the sets of local constraints that imply membership in the code. We prove these results by relating the Fourier spectrum of non-signaling functions to Cayley hypergraphs induced by local constraints. We apply the above results to show a separation between locally testable codes in the classical and non-signaling setting by proving that bivariate low-degree testing fails spectacularly in the non-signaling setting. Specifically, we show that there exist non-signaling functions that pass bivariate low-degree tests with probability 1, and yet are maximally far from low-degree.

at May 14, 2019 05:26 AM UTC

Graham's number (see here) was at one time the largest number to appear in a math proof.

a) GN was an upper bound on a problem in Ramsey theory. There are now better upper bounds; see here. These better upper bounds are still large (Hales-Jewett large), but that's already much smaller than the original GN.

b) Other proofs now have numbers even larger than GN, for example Harvey Friedman's work on the finite versions of Kruskal's Tree Theorem. (There may be other cases; if you know some, then let me know in the comments.)

Since my dept recently moved buildings I found old papers that I had not looked at in years. One of them was

Old and New Problems and Results in Combinatorial Number Theory

by Erdos and Graham

(see here)

So I began reading it and came across a result of Graham from 1964 that used large numbers. Nowhere near as large as GN, but I found it interesting that Graham was involved with large numbers way back then.

Here is the problem:

A Lucas Sequence is a sequence that obeys

a(n) = a(n-1) + a(n-2).

Clearly such a sequence is determined by a(0) and a(1).

QUESTION: Do there exist relatively prime a(0) and a(1) such that the sequence has only composite numbers?

By ingenuity and some computing power Graham found the answer is YES. How they got the numbers is interesting; see here. The numbers are of course in the paper, but I present them here anyway. Hope I don't make a typo:

a(0) = 1786772701928802632268715130455793

a(1) = 1059683225053915111058164141686995
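The claim is easy to check empirically for the first few terms. Below is an illustrative sketch (mine, not from the paper) using a standard Miller-Rabin probabilistic primality test; of course, compositeness of every term is Graham's theorem, not something a finite check proves.

```python
from math import gcd
from random import Random

_RNG = Random(2019)

def is_probable_prime(n, rounds=25):
    """Miller-Rabin test: composite => False; prime => True with
    overwhelming probability (error at most 4^-rounds per call)."""
    small = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    if n < 2:
        return False
    for p in small:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = _RNG.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

# Graham's pair, and the first 20 terms of the resulting Lucas sequence.
a0 = 1786772701928802632268715130455793
a1 = 1059683225053915111058164141686995
seq = [a0, a1]
for _ in range(18):
    seq.append(seq[-1] + seq[-2])
```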

The paper Old and New... says it's open whether there is a smaller pair of numbers; I do not know if that is still open. If you know, let us all know in the comments!

These numbers seem small today since we have modern computers that can store and manipulate them easily. Were they considered large numbers in 1964? They were never called Graham Numbers, which is probably just as well, since that honor lay ahead.

by GASARCH ( at May 13, 2019 04:39 AM UTC

This post also appears on the ICE-TCS blog.

On Friday, 3 May, ICE-TCS hosted its 15th annual Theory Day. The event consisted of two 45-minute presentations by Ravi Boppana (Department of Mathematics, MIT) and Exequiel Rivas (Inria Paris - Rocquencourt, France), and three ten-minute presentations by ICE-TCS researchers highlighting some of the recent research directions pursued by members of the centre.

Ravi Boppana kicked off the Theory Day with a wonderfully paced talk on his work with Ron Holzman on Tomaszewski’s problem on randomly signed sums. The problem is as follows. Let v1, v2, ..., vn be real numbers whose squares add up to 1. Consider the 2^n signed sums of the form S = ±v1 ± v2 ± ... ± vn. Can there be more signed sums with |S| > 1 than with |S| ≤ 1? Holzman and Kleitman (1992) proved that at least 3/8 of these sums satisfy |S| ≤ 1. In his talk, Ravi showed us the main ideas he and Holzman used to improve the bound to 13/32.
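For small n, the question can be explored by brute force; here is an illustrative sketch (mine, not from the talk). The pair (3/5, 4/5) shows that exactly half of the signed sums can have |S| ≤ 1, so one cannot hope for a fraction better than 1/2.

```python
from itertools import product

def fraction_at_most_one(v, tol=1e-9):
    """Fraction of the 2^n signed sums S = ±v_1 ± ... ± v_n with |S| <= 1.
    A small tolerance guards against floating-point round-off."""
    n = len(v)
    good = sum(
        1
        for signs in product((-1, 1), repeat=n)
        if abs(sum(e * x for e, x in zip(signs, v))) <= 1 + tol
    )
    return good / 2 ** n

# (3/5, 4/5): the sums are ±7/5 and ±1/5, so exactly half have |S| <= 1.
# (1/2, 1/2, 1/2, 1/2): 14 of the 16 sums have |S| <= 1.
```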

Computational effects model the interaction of computer programs with their environment. In his talk, Exequiel Rivas taught us how monads can be used to capture computational effects (a research programme that started with Moggi's award-winning work), and then discussed some attempts to incorporate merging operations in the monadic picture.

Two of the short talks were given by Henning A. Úlfarsson and Elli Anastasiadi. Henning described the work of his group on a tool, called the CombSpecSearcher, that automates the methods used by combinatorialists to prove some of their theorems. The tool is able to prove results featured in dozens of research papers. Watch this space for updates on its development and for its successes!

Elli Anastasiadi, a PhD student who is already playing an important role for the centre, gave a clear seven-minute introduction to fine-grained complexity and to the notion of fine-grained reduction.

The 2019 Theory Day was well attended, at least by the standards of a TCS event in Iceland. If all goes well, we'll be back next year.

by Luca Aceto ( at May 12, 2019 10:37 PM UTC

We prove that there is a constant $K$ such that \emph{Tseitin} formulas for an undirected graph $G$ require proofs of size $2^{\mathrm{tw}(G)^{\Omega(1/d)}}$ in depth-$d$ Frege systems for $d<\frac{K \log n}{\log \log n}$, where $\mathrm{tw}(G)$ is the treewidth of $G$. This extends H{\aa}stad's recent lower bound for the grid graph to any graph. Furthermore, we prove tightness of our bound up to a multiplicative constant in the top exponent. Namely, we show that if a Tseitin formula for a graph $G$ has size $s$, then for all large enough $d$, it has a depth-$d$ Frege proof of size $2^{\mathrm{tw}(G)^{O(1/d)}} \mathrm{poly}(s)$. Through this result we settle the question posed by M. Alekhnovich and A. Razborov of showing that the class of Tseitin formulas is quasi-automatizable for resolution.

at May 12, 2019 09:21 AM UTC