at August 09, 2020 04:45 PM UTC
Theory of Computing Blog Aggregator
 Good news, everyone! Following years of requests, this blog finally supports rich HTML and basic TeX in comments. Also, the German spam that used to plague the blog (when JavaScript was disabled) is gone. For all this, I owe deep gratitude to a reader and volunteer named Filip Dimitrovski.
 Filip refused to accept any payment for fixing this blog. Instead, he asked only one favor: namely, that I use my platform to raise public awareness about the plight of the MAOI antidepressant Nardil. Filip tells me that, while tens of thousands of people desperately need Nardil—no other antidepressant ever worked for them—it’s become increasingly unavailable because the pharma companies can no longer make money on it. He points me to a SlateStarCodex post from 2015 that explains the problem in more detail (anyone else miss SlateStarCodex?). More recent links about the worsening crisis here, here, and here.
 Here’s a fantastic interview of Bill Gates by Steven Levy, about the coronavirus debacle in the US. Gates, who’s always been notoriously and strategically nonpartisan, is more explicit than I’ve ever seen him before in explaining how the Trump administration’s worldhistoric incompetence led to where we are.
 Speaking of which, here’s another excellent article, this one in The American Interest, about the results of “wargames” trying to simulate what happens in the extremely likely event that Trump contests a loss of the November election. Notably, the article sets out six steps that could be taken over the next few months to decrease the chance of a crisis next to which all the previous crises of 2020 will pale.
 A reader asked me to share a link to an algorithm competition, related to cryptographic “proofs of time,” that ends on August 31. Apparently, my having shared a link to a predecessor of this competition—at the request of friendoftheblog Bram Cohen—played a big role in attracting good entries.
 Huge congratulations to my former PhD student Shalev BenDavid, as well as Eric Blais, for cowinning the FOCS’2020 Best Paper Award—along with two other papers—for highly unconventional work about a new minimax theorem for randomized algorithms. (BenDavid and Blais also have a second FOCS paper, which applies their award paper to get the first tight composition theorem for randomized query complexity. Here’s the full list of FOCS papers—lots of great stuff, for a conference that of course won’t physically convene!) Anyway, a central idea in BenDavid and Blais’s new work is to use proper scoring rules to measure the performance of randomized algorithms—algorithms that now make statements like “I’m 90% sure that this is a yesinput,” rather than just outputting a 1bit guess. Notably, Shalev tells me that he learned about proper scoring rules by reading rationalist blogs. So next time you lament your untold hours sacrificed to that pastime, remind yourself of where it once led!
 What have I been up to lately? Besides Busy Beaver, hanging out with my kids, and just trying to survive? Mostly giving a lot of Zoom lectures! For those interested, here’s a Q&A that I recently did on the past and present of quantum computing, hosted by Andris Ambainis in Latvia. It did feel a bit surreal when my “interviewer” asked me to explain how I got into quantum computing research, and my answer was basically: “well, as you know, Andris, a lot of it started when I got hold of your seminal paper back in 1999…”
by Scott at August 09, 2020 08:36 AM UTC
Authors: Shay Kutten, William K. Moses Jr., Gopal Pandurangan, David Peleg
Download: PDF
Abstract: This paper concerns designing distributed algorithms that are singularly
optimal, i.e., algorithms that are simultaneously time and message optimal, for
the fundamental leader election problem in networks. Our main result is a
randomized distributed leader election algorithm for asynchronous complete
networks that is essentially (up to a polylogarithmic factor) singularly
optimal. Our algorithm uses $O(n)$ messages with high probability and runs in
$O(\log^2 n)$ time (with high probability) to elect a unique leader. The $O(n)$
message complexity should be contrasted with the $\Omega(n \log n)$ lower
bounds for the deterministic message complexity of leader election algorithms
(regardless of time), proven by Korach, Moran, and Zaks (TCS, 1989) for
asynchronous algorithms and by Afek and Gafni (SIAM J. Comput., 1991) for
synchronous networks. Hence, our result also separates the message complexities
of randomized and deterministic leader election. More importantly, our
(randomized) time complexity of $O(\log^2 n)$ for obtaining the optimal $O(n)$
message complexity is significantly smaller than the longstanding
$\tilde{\Theta}(n)$ time complexity obtained by Afek and Gafni and by Singh
(SIAM J. Comput., 1997) for message optimal (deterministic) election in
asynchronous networks.
In synchronous complete networks, Afek and Gafni showed an essentially singularly optimal deterministic algorithm with $O(\log n)$ time and $O(n \log n)$ messages. Ramanathan et al. (Distrib. Comput. 2007) used randomization to improve the message complexity, and showed a randomized algorithm with $O(n)$ messages and $O(\log n)$ time (with failure probability $O(1 / \log^{\Omega(1)}n)$). Our second result is a tightly singularly optimal randomized algorithm, with $O(1)$ time and $O(n)$ messages, for this setting, whose time bound holds with certainty and message bound holds with high probability.
at August 09, 2020 11:22 PM UTC
Authors: Philipp Schepper
Download: PDF
Abstract: The currently fastest algorithm for regular expression pattern matching and
membership improves the classical O(nm) time algorithm by a factor of about
log^{3/2}n. Instead of focussing on general patterns we analyse homogeneous
patterns of bounded depth in this work. For them a classification splitting the
types in easy (strongly subquadratic) and hard (essentially quadratic time
under SETH) is known. We take a very finegrained look at the hard pattern
types from this classification and show a dichotomy: few types allow
superpolylogarithmic improvements while the algorithms for the other pattern
types can only be improved by a constant number of logfactors, assuming the
FormulaSAT Hypothesis.
at August 09, 2020 12:00 AM UTC
Authors: Bhaskar Ray Chaudhury, Jugal Garg, Peter McGlaughlin, Ruta Mehta
Download: PDF
Abstract: We study the fair division problem of allocating a mixed manna under
additively separable piecewise linear concave (SPLC) utilities. A mixed manna
contains goods that everyone likes and bads that everyone dislikes, as well as
items that some like and others dislike. The seminal work of Bogomolnaia et al.
[Econometrica'17] argue why allocating a mixed manna is genuinely more
complicated than a good or a bad manna, and why competitive equilibrium is the
best mechanism. They also provide the existence of equilibrium and establish
its peculiar properties (e.g., nonconvex and disconnected set of equilibria
even under linear utilities), but leave the problem of computing an equilibrium
open. This problem remained unresolved even for only bad manna under linear
utilities.
Our main result is a simplexlike algorithm based on Lemke's scheme for computing a competitive allocation of a mixed manna under SPLC utilities, a strict generalization of linear. Experimental results on randomly generated instances suggest that our algorithm will be fast in practice. The problem is known to be PPADhard for the case of good manna, and we also show a similar result for the case of bad manna. Given these PPADhardness results, designing such an algorithm is the only nonbruteforce (nonenumerative) option known, e.g., the classic LemkeHowson algorithm (1964) for computing a Nash equilibrium in a 2player game is still one of the most widely used algorithms in practice.
Our algorithm also yields several new structural properties as simple corollaries. We obtain a (constructive) proof of existence for a far more general setting, membership of the problem in PPAD, rationalvalued solution, and odd number of solutions property. The last property also settles the conjecture of Bogomolnaia et al. in the affirmative.
at August 09, 2020 12:00 AM UTC
Authors: R. Karasev, A. Skopenkov
Download: PDF
Abstract: A lowdimensional version of our main result is the following `converse' of
the ConwayGordonSachs Theorem on intrinsic linking of the graph $K_6$ in
3space:
For any integer $z$ there are 6 points $1,2,3,4,5,6$ in 3space, of which every two $i,j$ are joint by a polygonal line $ij$, the interior of one polygonal line is disjoint with any other polygonal line, the linking coefficient of any pair disjoint 3cycles except for $\{123,456\}$ is zero, and for the exceptional pair $\{123,456\}$ is $2z+1$.
We prove a higherdimensional analogue, which is a `converse' of a lemma by SegalSpie\.z.
at August 09, 2020 12:00 AM UTC
Authors: Henrique Botelho, Francisco Zampirolli, Valério Ramos Batista
Download: PDF
Abstract: We present an algorithm to find nearoptimal weighted Steiner minimal trees
in the plane. The algorithm is implemented in Evolver programming language,
which already contains many builtin energy minimisation routines. Some are
invoked in the program, which enable it to consist of only 183 lines of source
code. Our algorithm reproduces the physical experiment of a soap film detaching
from connected pins towards a stable configuration. In the nonweighted case
comparisons with GeoSteiner are drawn for terminals that form a pattern.
at August 09, 2020 12:00 AM UTC
Authors: Arnab Bhattacharyya, Sutanu Gayen, Saravanan Kandasamy, Ashwin Maran, N. V. Vinodchandran
Download: PDF
Abstract: We study the problem of efficiently estimating the effect of an intervention
on a single variable (atomic interventions) using observational samples in a
causal Bayesian network. Our goal is to give algorithms that are efficient in
both time and sample complexity in a nonparametric setting.
Tian and Pearl (AAAI `02) have exactly characterized the class of causal graphs for which causal effects of atomic interventions can be identified from observational data. We make their result quantitative. Suppose P is a causal model on a set $\vec{V}$ of n observable variables with respect to a given causal graph G with observable distribution $P$. Let $P_x$ denote the interventional distribution over the observables with respect to an intervention of a designated variable X with x. Assuming that $G$ has bounded indegree, bounded ccomponents ($k$), and that the observational distribution is identifiable and satisfies certain strong positivity condition, we give an algorithm that takes $m=\tilde{O}(n\epsilon^{2})$ samples from $P$ and $O(mn)$ time, and outputs with high probability a description of a distribution $\hat{P}$ such that $d_{\mathrm{TV}}(P_x, \hat{P}) \leq \epsilon$, and:
1. [Evaluation] the description can return in $O(n)$ time the probability $\hat{P}(\vec{v})$ for any assignment $\vec{v}$ to $\vec{V}$
2. [Generation] the description can return an iid sample from $\hat{P}$ in $O(n)$ time.
We also show lower bounds for the sample complexity showing that our sample complexity has an optimal dependence on the parameters $n$ and $\epsilon$, as well as if $k=1$ on the strong positivity parameter.
at August 09, 2020 11:21 PM UTC
We lost a great computer scientist.
Frances Allen was one of the leaders who helped create the field of compilers research. Fran was an elite researcher at IBM, and won a Turing Award for this pioneering work. Allen also collected other awards.
Perhaps the coolest award is one that Fran could never win: The Frances E. Allen award created this year by the IEEE:
For innovative work in computing leading to lasting impact on other fields of engineering, technology, or science.
Today we will talk about Fran, who sadly just passed away.
I consider Fran a friend, although we never worked together—our areas of interest were different. One fond memory of mine is being on panel a while ago with Fran. What a delightful presence.
Fran always seemed to be smiling, always with that smile. The following images come in large part from interviews and lectures and award eventsthe fact that it is so easy to find them is a testament to her public engagement as well as scientific contributions.
Compliers were Key
There was a time when compliers were the most important program available on any new computer. Perhaps on any computer. Here is proof:
 Computers are there to solve problems.
 We must write programs to solve problems.
 Details of the computer and instructions, are complicated, which make writing programs hard.
 Thus, automating the writing of programs is important.
 QED: we must have compliers.
Okay this is not really a “proof”, but there is some truth to the argument. Fran was at IBM and worked on some of the early compliers, including FORTRAN and related languages. IBM wanted to sell computers, well actually in the early days rent them. One potential roadblock, IBM realized, was that new computers could be hard to program. Thus to ensure that companies rented new machines as fast as IBM could manufacture them was important. This created the need for compliers and even more for optimizing compilers.
In order to ship more machines, the code that a complier created had to be efficient. Hence, a stress on Allen was to figure out how compliers could generate high quality code. This led Fran and others like John Cocke to discover many complier techniques that are still used today. A short list of the ideas is:
 Branch prediction
 Register allocation
 Control flow graphs
 Program dependence graphs
 And many more.
What is so important is that Allen’s work was not just applicable to this machine or that language. Rather the work was able to be used for almost any machine and for almost any language. This universal nature of the work on compliers reminds me of what we try to do in theory. Allen’s research was so important because it could be used for future hardware as well as future languages.
Register Allocation
Guy Steele interviewed Allen for the ACM here. During the interview Fran talked about register allocation:
I have a story about register allocation. FORTRAN back in the 1950’s had the beginnings of a theory of register allocation, even though there were only three registers on the target machine. Quite a bit later, John Backus became interested in applying graph coloring to allocating registers; he worked for about 10 years on that problem and just couldn’t solve it. I considered it the biggest outstanding problem in optimizing compilers for a long time. Optimizing transformations would produce code with symbolic registers; the issue was then to map symbolic registers to real machine registers, of which there was a limited set. For highperformance computing, register allocation often conflicts with instruction scheduling. There wasn’t a good algorithm until the Chaitin algorithm. Chaitin was working on the PL compiler for the 801 system. Ashok Chandra, another student of Knuth’s, joined the department and told about how he had worked on the graph coloring problem, which Knuth had given out in class, and had solved it not by solving the coloring problem directly, but in terms of what is the minimal number of colors needed to color the graph. Greg immediately recognized that he could apply this solution to the register allocator issue. It was a wonderful kind of serendipity.
Compliers Create Theory
The early goal of creating compliers lead directly to some wonderful theory problems. One whole area that dominated early theory research was language theory. In particular understanding questions that arise in defining programming languages. Syntax came first—later semantics was formalized.
Noam Chomsky created contextfree grammars to help understand natural languages in the 1950s. His ideas were used by John Backus, also a Turing award winner from IBM, to describe the then new programming language IAL. This is known today as ALGOL 58, which became ALGOL 60. Peter Naur on the ALGOL 60 committee called Backus’s notation for ALGOL’s syntax: Backus normal form, but is now called BNFBackusNaur form.
Theorists worked on, I confess I did, many questions about such languages. Existence problems, decidability problems, efficient algorithms, and closure properties were just some of the examples. Not clear how much of this theory effected compiler design, but I would like to think that some was useful. Theorist should thank the complier researchers. I do.
For instance the 1970 STOC program had many papers on language related topics—here are some:
 A Result on the Relationship between Simple Precedence Languages and Reducing Transition Languages. Gary Lindstrom
 The Design of Parsers for Incremental Language Processors: Ronald Book, Sheila Greibach, Ben Wegbreit
 Tape and TimeBounded Turing Acceptors and AFLs. David Lewis
 Closure of Families of Languages under Substitution Operators. William Rounds
 TreeOriented Proofs of Some Theorems on ContextFree and Indexed Languages. Barry Rosen
 On SyntaxDirected Transduction and Tree Transducers. Alfred Aho, Jeffrey Ullman
 The Analysis of TwoDimensional Patterns using Picture Processing Grammars. JeanFrancois Perrot
 On Some Families of Languages Related to the Dyck Language. Joseph Ullian
 Three Theorems on Abstract Families of Languages. Joseph Ullian
By the way Abstract Families of Languages or AFLs was created by Seymour Ginsburg and Sheila Greibach in 1967 as a way to generalize context free languages.
Open Problems
Fran was asked by Steele in that interview: Any advice for the future?
Yes, I do have one thing. Students aren’t joining our field, computer science, and I don’t know why. It’s just such an amazing field, and it’s changed the world, and we’re just at the beginning of the change. We have to find a way to get our excitement out to be more publicly visible. It is exciting, in the 50 years that I’ve been involved, the change has been astounding.
Thanks Fran. Much of that change is due to pioneers like you. Thanks for everything.
by rjlipton at August 08, 2020 02:40 PM UTC
I spent the last few days participating in the Canadian Conference in Computational Geometry, originally planned for Saskatoon but organized virtually instead.
The way the conference was organized was that (after the usual submission reviewing process) the accepted authors provided both a proceedings paper and a 1015 minute talk video to the conference organizers. Participants were required to register, but with no registration fee, and were provided with links to the papers and talks (which are all still live on the conference program). Then, during the conference itself, live online Zoom sessions ran for only 23 hours daily, scheduled for 10AM1PM Saskatoon time: very convenient for anywhere in North America or Europe, not so much for the participants in Iran, India, China, Japan, and Korea (all of which did have participants). The sessions included a daily 1hour live invited talk, and questionandanswer sessions for the contributed works, in which we were shown a oneminute teaser for each video and then invited to ask questions of authors, at least one of whom was required to be present.
I think it all worked very well; so well, in fact, that during the business meeting there were calls for having at least some of the content similarly online so that people could participate remotely again. The ability to ask and answer questions either by live video on Zoom or through Zoom chat was useful, and used. Attendance was far above previous levels: 162 registrants, and over 120 unique participants at the most heavily attended of the two daily parallel sessions, compared to roughly 60 attendees each at the last two physical conferences. Despite the free registration, the conference organization was not without cost: they spent roughly $1500 (Canadian) in video production costs and video conferencing fees, but this was more than made up for by institutional support for the conference, so they ended up running a surplus which may (if it can be kept) end up providing some float for future conference organizers.
There are two changes I would suggest for future events of this type:

The contributed sessions were for a short enough time that holding them in parallel seemed unnecessary, and made it impossible to participate in all discussions. So this format may work better with less parallelism.

The oneminute teaser videos were cut together by the conference organizers from the longer videos provided by the authors, but in some cases the pacing of the longer videos from which they were cut meant that these teaser videos could not clearly state the results of their papers. I think it would have been better to ask authors to provide these alongside the longer talk videos.
Some of the highlights of the event:

Wednesday’s invited talk by Erik Demaine was a moving tribute to Godfried Toussaint and a survey of some of both Toussaint’s research, and Erik’s research on problems started by Toussaint, including convexifying polygons by flips, sona curves (the subject of one of my own contributions), the geometry of musical rhythms, and perhaps most importantly “supercollaboration”, the model of shared research and shared authorship developed by Toussaint at the Barbados research workshops and also used by Erik within many of his MIT classes. Erik’s talk was recorded and is now on YouTube; I hope the same will be true of the other two invited talks.

In Wednesday’s contributed session on unfolding (in which I had two papers) I particularly liked Satyan Devadoss’s talk, “Nets of higherdimensional cubes”. The main result is that if you unfold a hypercube, then no path of facets of the unfolded shape can contain a uturn: if the path takes a step in one any coordinate direction, it cannot step in the opposite direction. This implies that all dual spanning trees unfold flat without selfintersection. The same property was known for all the Platonic solids, and Satyan can prove it for regular simplices in any dimension, but that still leaves several regular polytopes for which it is still open whether all unfoldings work: the crosspolytopes in all dimensions, and the three exceptional regular polytopes in four dimensions.

Thursday’s invited talk was by Jeff Erickson, “Chasing puppies”. It was an entertaining presentation of an elegant topological proof of the following result: if you and a puppy can both move around a simple closed curve in the plane, with the puppy always moving along the curve to a local minimum of distance to you, then you can always find a path to follow that will bring you and the puppy to the same point.

There wasn’t an official prize for best contributed presentation, but in the data structures session on Thursday, several comments nominated Don Sheehy as the unofficial winner, for a video artfully mixing live action with computer animations. His paper, “Onehop greedy permutations” concerned heuristics for improving the farthestfirst traversal of a set of points by looking near each point as the sequence is constructed for a better point to use instead.

There was an official best student paper award, and it went to Alejandro Flores Velazco for his paper “Social distancing is good for points too!” It concerns the problem of reducing the size of a data set while preserving the quality of nearestneighbor classification using the reduced set. It proves that FCNN, which it calls the most popular heuristic for this problem, can produce significantly lessreduced outputs than some other proven heuristics, and shows how to modify FCNN to get a heuristic with guaranteed output quality.

The third of the invited sessions was by Yusu Wang, newly moved from Ohio State to UC San Diego. She gave a nice introduction to combinatorial methods for reconstructing road networks (or other networks embedded into higherdimensional geometry) from noisy samples of points on the network by combining discrete Morse theory to find the ridge lines of sample density with persistent homology to clean some of the noise from the data.

The final technical component of the conference was an open problem session, also recorded and presumably to be uploaded at some point. Satyan posed his question on regular polytope unfolding there. Mike Paterson asked whether one can construct “Plato’s torus”, an embedded torus with six equilateraltriangle faces meeting at each vertex; in a blog post I made on this problem in 2009 I traced its history to Nick Halloway in 1997 but Mike says he discussed it already with Christopher Zeeman in the 1970s. Another problem that caught my attention asked for an algorithmic version of the polyhedral theorem of the three geodesics, the existence of a path across the surface of a convex polyhedron that stays straight across each edge or face of the polyhedron, and has at most \(\pi\) surface angle on each side of it when it passes through a vertex. Again there’s some history here: Joe O’Rourke says he once mentioned the problem to Gromov, who said it was easy but unfortunately didn’t elaborate.
CCCG 2021 is planned for Halifax, colocated with WADS. One somewhat controversial issue is that the current plan is to have both conferences overlap for two days, with one overlapfree day for each conference at each end of the overlap period. But if both conferences are doublesession, this means that participants can only choose one of four overlapping talks. At this point everyone is still hoping that events allow for a physical conference by then but that remains to be seen.
by David Eppstein at August 07, 2020 05:46 PM UTC
by gasarch (noreply@blogger.com) at August 06, 2020 06:59 PM UTC
The 4th Workshop on Local Algorithms (WoLA 2020) recently concluded: aimed at “fostering dialogue and crosspollination of ideas between the various communities” related to local algorithms, broadly construed, it featured invited and contributed talks on a variety of topics, many (if not most) very relevant to the sublinear algorithms and property testing community.
Because of the online format of the workshop (imposed by the current circumstances), all talks were recorded and posted online. As such, all videos all available on the workshop’s website and YouTube channel: a good list of resources to peruse!
by Clement Canonne at August 06, 2020 12:33 AM UTC
by Adam Sheffer at August 05, 2020 10:51 PM UTC
Years ago when I learned about Google PageRank algorithm, my first reaction was this is not the way it should be done! There should be some proof. This probably just shows that my CS education was too theoretical ;). Years later I have learned that indeed there are some nice tools to argue about the running time of PageRank algorithm. And very recently we were able to give some new parallel (in MPC model) algorithms for computing vanilla PageRank. We improved the number of rounds needed from O(log n) to O(log^2 log n) time. You can hear Solbodan talking out it here: https://www.youtube.com/watch?v=xoodhmjJ9Xs .
by sank at August 05, 2020 06:55 PM UTC
Background: Suppose we are interested in computing the distance between two vertices in a graph. Under edge or node differential privacy, this problem is not promising because the removal of a single edge can make distances change from 1 to \(n − 1\) or can even disconnect the graph. However, a different setting that makes sense to consider is that of a weighted graph \((G, w)\) whose topology \(G = (V, E)\) is publicly known but edge weight function \(w : E \to \mathbb{R}^+\) must be kept private. (For instance, consider transit times on a road network. The topology of the road network may be publicly available as a map, but the edge weights corresponding to transit times may be based on private GPS locations of individual cars.)
Suppose that two weight functions \(w\) and \(w’\) of the same graph \(G\) are considered to be neighbors if they differ by at most 1 in \(\ell^1\) norm. Then the length of any fixed path is sensitivityone, so the distance between any pair of vertices is also sensitivityone and can be released privately via the Laplace mechanism. But what if we want to release all \(\Theta(n^2)\) distances between all pairs of vertices in \(G\)? We can do this with accuracy roughly \(O(n \log n )/\varepsilon\) by adding noise to each edge, or roughly \(O(n \sqrt{\log(1/\delta)}/\varepsilon)\) using composition theorems. Both of these are roughly \(n/\varepsilon\). But is this linear dependence on \(n\) inherent, or is it possible to release allpairs distances with error sublinear in \(n\)?
This setting and question were considered in [S16].
Problem 1: Let \(G\) be an arbitrary public graph, and \(w : E \to \mathbb{R}^+\) be an edge weight function. Can we release approximate allpairs distances in \((G, w)\) with accuracy sublinear in \(n\) while preserving the privacy of the edge weight function, where two weight functions \(w, w’\) are neighbors if \(\w − w’\_1 \le 1\)? Or can we show that any private algorithm must have error \(\Omega(n)\)? A weaker (but nontrivial) lower bound would also be nice.
Reward: A bar of chocolate.
Other related work: [S16] provided algorithms with better error for two special cases, trees and graphs of a priori bounded weight. For trees, it is possible to release allpairs distances with error roughly \(O(\log^{1.5} n)/\varepsilon\), while for arbitrary graphs with edge weights restricted to the interval \([0,M]\), it is possible to release allpairs distances with error roughly \(O( \sqrt{nM\varepsilon^{1}\log(1/\delta)})\)
Submitted by Adam Sealfon on April 9, 2019.
by Audra McMillan at August 05, 2020 06:00 PM UTC
Should Paper X cite Paper Y as previous or concurrent/independent work? This is sometimes tricky: maybe Paper Y circulated privately before Paper X, maybe the authors of Paper X knew about Paper Y maybe not — nobody can know for sure. One can say that the authors of Paper Y should have posted Paper Y online earlier to prevent this issue, but that is not standard practice and might lead to other problems, including Paper Y never getting published!
I propose the following guiding principle:
“If a different accept/reject outcome would have forced paper X to cite paper Y as previous work, then paper X should cite paper Y as previous work.”
The reasons behind my principle seem to me especially valid in the fastmoving theoretical computer science community, where papers are typically sent to conferences and thus seen by the entire program committee plus around 3 external referees, who are typically experts — only to be rejected. Moreover, the progress is extremely fast, with the next conference cycle making obsolete a number of papers in just the previous cycle.
by Manu at August 05, 2020 02:50 PM UTC
at August 04, 2020 09:41 PM UTC
Integration by parts is a highlight of any calculus class. It leads to multiple classical applications for integration of logarithms, exponentials, etc., and it is the source of an infinite number of exercises and applications to special functions. In this post, I will look at a classical discrete extension that is useful in machine learning and optimization, namely Abel transformation, with applications to convergence proofs for the (stochastic) subgradient method. Next month, extensions to higher dimensions will be considered, with applications to score functions [2, 3] and randomized smoothing [4, 5].
Abel transformation: from continuous to discrete
The most classical version of integration by parts goes as follows. Given two continuously differentiable functions from \(\mathbb{R}\) to \(\mathbb{R}\), we have: $$ \int_a^b \!\!\!\!f(x)g'(x) dx = \Big[ f(x) g(x) \Big]_a^b \!\! \int_a^b\!\!\! \!f'(x) g(x) dx = f(b) g(b)\, – f(a)g(a)\! \int_a^b\! \!\!\! f'(x) g(x) dx.$$ This is valid for less regular functions, but this is not the main concern here. The proof follows naturally from the derivative of a product, but there is a nice “proof without words” (see, e.g., [1, p. 42] or here).
There is a discrete analogue referred to as Abel transformation or summation by parts, where derivatives are replaced by increments: given two realvalued sequences \((a_n)_{n \geq 0}\) and \((b_n)_{n \geq 0}\) (the second sequence could also be taken vectorvalued), we can expand $$ \sum_{k=1}^n a_k ( b_k\, – b_{k1}) =\sum_{k=1}^n a_k b_k \ – \sum_{k=1}^n a_k b_{k1} = \sum_{k=1}^n a_k b_k \ – \sum_{k=0}^{n1} a_{k+1} b_{k},$$ using a simple index increment in the second sum. Rearranging terms, this leads to $$ \sum_{k=1}^n a_k ( b_k\, – b_{k1}) = a_n b_n \ – a_0 b_0\ – \sum_{k=0}^{n1} ( a_{k+1} – a_{k } ) b_k.$$ In other words, we can transfer the firstorder difference from the sequence \((b_k)_{k \geq 0}\) to the sequence \((a_k)_{k \geq 0}\). A few remarks:
 Warning! It is very easy/common to make mistakes with indices and signs.
 I gave the direct proof but a proof through explicit integration by part is also possible, by introducing the piecewiseconstant function \(f\) equal to \(a_k\) on \([k,k+1)\), and \(g\) continuous piecewise affine equal to \(b_{k} + (tk) ( b_{k+1}b_{k})\) for \(t \in [k,k+1]\), and integrating between \(0+\) and \(n+\).
There are classical applications for the convergence of series (see here), but in this post, I will show how it can lead to an elegant result for stochastic gradient descent for nonsmooth functions and decaying stepsizes.
Decaying stepsizes in stochastic gradient descent
The Abel summation formula is quite useful when analyzing optimization algorithms, and we give a simple example below. We consider a sequence of random potentially nonsmooth convex functions \((f_k)_{k \geq 0}\) which are independent and identically distributed functions from \(\mathbb{R}^d \) to \(\mathbb{R}\), with expectation \(F\). The goal is to find a minimizer \(x_\ast\) of \(F\) over a some convex bounded set \(\mathcal{C}\), only being given access to some stochastic gradients of \(f_k\) at wellchosen points. The most classical example is supervised machine learning, where \(f_k(\theta)\) is the loss of a random observation for the predictor parameterized by \(\theta\). The difficulty here is the potential nonsmoothness of the function \(f_k\) (e.g., for the hinge loss and the support vector machine).
We consider the projected stochastic subgradient descent method. The deterministic version of this method dates back to Naum Shor [6] in 1962 (see nice history here). The method goes as follows: starting from some \(\theta_0 \in \mathbb{R}^d\), we perform the iteration $$ \theta_{k} = \Pi_{ \mathcal{C} } \big( \theta_{k1} – \gamma_k \nabla f_k(\theta_{k1}) \big),$$ where \(\Pi_{ \mathcal{C}}: \mathbb{R}^d \to \mathbb{R}^d\) is the orthogonal projection onto the set \(\mathcal{C}\), and \(\nabla f_k(\theta_{k1})\) is any subgradient of \(f_k\) at \(\theta_{k1}\).
We make the following standard assumptions: (a) the set \(\mathcal{C}\) is convex and compact with diameter \(\Delta\) (with respect to the \(\ell_2\)norm), (b) the functions \(f_k\) are almost surely convex and \(B\)Lipschitzcontinuous (or equivalently with gradients bounded in \(\ell_2\)norm by \(B\)). We denote by \(\theta_\ast\) a minimizer of \(f\) on \(\mathcal{C}\) (there can be multiple ones).
For nonsmooth problems, choosing a constant stepsize does not lead to an algorithm converging to a global minimizer: decaying stepsizes are then needed.
Convergence proof through Lyapunov functions
Since the functions \(f_k\) are nonsmooth, we cannot use Taylor expansions, and we rely on a now classical proof technique dating back from the 1960’s (see, e.g., a paper by Boris Polyak [7] in Russian), that has led to several extensions in particular for online learning [8]. The proof relies on the concept of “Lyapunov functions“, often also referred to as “potential functions”. This is a nonnegative function \(V(\theta_k)\) of the iterates \(\theta_k\), that is supposed to go down along iterations (at least in expectation). In optimization, standard Lyapunov functions are \(V(\theta) = F(\theta)\, – F(\theta_\ast)\) or \(V(\theta) = \ \theta \ – \theta_\ast\_2^2\).
For the subgradient method, we will not be able to show that the Lyapunov function is decreasing, but this will lead through a manipulation which is standard in linear dynamical system analysis to a convergence proof for the averaged iterate: that is, if \(V(\theta_k) \leqslant V(\theta_{k1})\ – W(\theta_{k1}) + \varepsilon_k\), for a certain function \(W\) and extra positive terms \(\varepsilon_k\), then, using telescoping sums, $$ \frac{1}{n} \sum_{k=1}^n W(\theta_{k1}) \leqslant \frac{1}{n} \big( V(\theta_0)\ – V(\theta_n) \big) + \frac{1}{n} \sum_{k=1}^n \varepsilon_k.$$ We can then either use Jensen’s inequality to get a bound on \(W \big( \frac{1}{n} \sum_{k=1}^n \theta_{k1} \big)\), or directly get a bound on \(\min_{k \in \{1,\dots,n\}} W(\theta_{k1})\). The first solution gives a performance guarantee for a welldefined iterate, while the second solution only shows that among the first \(n1\) iterates, one of them has a performance guarantee; in the stochastic setup where latex \(W\) is an expectation, it is not easily possible to know which one, so we will consider only averaging below.
Standard inequality. We have, by contractivity of orthogonal projections: $$ \\theta_k \ – \theta_\ast\_2^2 = \big\ \Pi_{ \mathcal{C} } \big( \theta_{k1} – \gamma_k \nabla f_k(\theta_{k1}) \big) – \Pi_{ \mathcal{C} } (\theta_\ast) \big\_2^2 \leqslant \big\ \theta_{k1} – \gamma_k \nabla f_k(\theta_{k1}) \ \theta_\ast \big\_2^2.$$ We can then expand the squared Euclidean norm to get: $$ \\theta_k – \theta_\ast\_2^2 \leqslant \\theta_{k1} – \theta_\ast\_2^2 \ – 2\gamma_k (\theta_{k1} – \theta_\ast)^\top \nabla f_k (\theta_{k1}) + \gamma_k^2 \ \nabla f_k(\theta_{k1})\_2^2.$$ The last term is upperbounded by \(\gamma_k^2 B^2\) because of the regularity assumption on \(f_k\). For the middle term, we use the convexity of \(f_k\), that is, the function \(f_k\) is greater than its tangent at \(\theta_{k1}\). See figure below.
We then obtain $$ f_k(\theta_\ast) \geqslant f_k(\theta_{k1}) + \nabla f_k(\theta_{k1})^\top ( \theta_{\ast} – \theta_{k1}).$$
Putting everything together, this leads to $$ \\theta_k \ – \theta_\ast\_2^2 \leqslant \\theta_{k1}\ – \theta_\ast\_2^2 \ – 2\gamma_k \big[ f_k(\theta_{k1}) \ – f_k(\theta_\ast) \big] + \gamma_k^2 B^2.$$ At this point, except the last term, all terms are random. We can now take expectations, with a particular focus on the term \(\mathbb{E} \big[ f_k(\theta_{k1}) \big]\), for which we can use the fact that the random function \(f_k\) is independent from the past, so that $$ \mathbb{E} \big[ f_k(\theta_{k1}) \big] = \mathbb{E} \Big[ \mathbb{E} \big[ f_k(\theta_{k1}) \big f_{1},\dots,f_{k1} \big] \Big] =\mathbb{E} \big[ F(\theta_{k1}) \big] . $$ We thus get $$ \mathbb{E} \big[ \\theta_k – \theta_\ast\_2^2\big] \leqslant \mathbb{E} \big[ \\theta_{k1} – \theta_\ast\_2^2\big] – 2\gamma_k \big( \mathbb{E} \big[ F(\theta_{k1}) \big] – F(\theta_\ast) \big) + \gamma_k^2 B^2.$$ As above, we can now isolate the excess in function values as: $$ \mathbb{E} \big[ F(\theta_{k1}) \big] – F(\theta_\ast) \leqslant \frac{1}{2 \gamma_k} \Big( \mathbb{E} \big[ \\theta_{k1} – \theta_\ast\_2^2\big] – \mathbb{E} \big[ \\theta_{k} – \theta_\ast\_2^2\big] \Big) + \frac{\gamma_k}{2} B^2.$$ At this point, the “optimization part” of the proof is done. Only algebraic manipulations are needed to obtain a convergence rate. This is where Abel transformation will come in.
From fixed horizon to anytime algorithms
The lazy way. At this point, many authors (including me sometimes) will take a constant stepsize \(\gamma_k = \gamma\) so as to obtain a telescopic sum, leading to $$ \frac{1}{n} \sum_{k=1}^n \mathbb{E} \big[ F(\theta_{k1}) \big] – F(\theta_\ast) \leqslant \frac{1}{2n\gamma} \Big( \mathbb{E} \big[ \\theta_{0} \ – \theta_\ast\_2^2\big] – \mathbb{E} \big[ \\theta_{n}\ – \theta_\ast\_2^2\big] \Big) + \frac{\gamma}{2} B^2,$$ which is less than \(\displaystyle \frac{\Delta^2}{2n \gamma} + \frac{\gamma}{2} B^2\), and minimized for \(\displaystyle \gamma = \frac{ \Delta}{B \sqrt{n}}\), leading to a convergence rate less than \(\displaystyle \frac{ B \Delta}{\sqrt{n}}\). Using Jensen’s inequality, we then get for \(\bar{\theta}_n = \frac{1}{n} \sum_{k=1}^n \theta_{k1}\): $$\mathbb{E} \big[ F(\bar{\theta}_{n}) \big] – F(\theta_\ast) \leqslant \frac{ B \Delta}{\sqrt{n}} .$$ This result leads to the desired rate but can be improved in at least one way: the stepsize currently has to depend on the “horizon” \(n\) (which has to be known in advance), and the algorithm is not “anytime”, which is not desirable in practice (where one often launches an algorithm and stops it when it the performance gains have plateaued or when the user gets bored waiting).
Nonuniform averaging. Another way [9] is to consider the nonuniform average $$ \eta_{k} = \frac{\sum_{k=1}^n \gamma_{k} \theta_{k1}}{\sum_{k=1}^n \gamma_{k}}, $$ for which telescoping sums apply as before, to get $$ \mathbb{E} \big[ F(\eta_k) \big] – F(\theta_\ast) \leqslant \frac{1}{2} \frac{\Delta^2 + B^2 \sum_{k=1}^n \gamma_k^2}{\sum_{k=1}^n \gamma_{k}}.$$ Then, by selecting a decaying stepsize \(\displaystyle \gamma_k = \frac{ \Delta}{B \sqrt{k}}\), that depends on the iteration number, we get a rate proportional to \(\displaystyle \frac{ B \Delta}{\sqrt{n}} ( 1 + \log n)\). We now have an anytime algorithm, but we have lost a logarithmic term, which is not the end of the world, but still disappointing. In [9], “tailaveraging” (only averaging iterates between a constant times \(n\) and \(n\)) is proposed, that removes the logarithmic term but requires to store iterates (moreover, the nonuniform averaging puts too much weight on the first iterates, slowing down convergence).
Using Abel transformation. If we start to sum inequalities from \(k=1\) to \(k=n\), we get, with \(\delta_k = \mathbb{E} \big[ \\theta_{k} – \theta_\ast\_2^2\big]\) (which is always between \(0\) and \(\Delta^2\)): $$ \frac{1}{n} \sum_{k=1}^n \mathbb{E} \big[ F(\theta_{k1}) \big] – F(\theta_\ast) \leqslant \frac{1}{n} \sum_{k=1}^n \bigg( \frac{1}{2 \gamma_k} \Big( \delta_{k1} – \delta_k \Big)\bigg) + \frac{1}{n} \sum_{k=1}^n \frac{\gamma_k}{2} B^2,$$ which can be transformed through Abel transformation into $$ \frac{1}{n} \sum_{k=1}^n \mathbb{E} \big[ F(\theta_{k1}) \big] – F(\theta_\ast) \leqslant \frac{1}{n} \sum_{k=1}^{n1} {\delta_k} \bigg(\frac{1}{ 2\gamma_{k+1}} \frac{1}{ 2\gamma_{k}} \bigg) + \frac{\delta_0}{2 \gamma_1} \frac{\delta_t}{2 \gamma_t}+ \frac{1}{n} \sum_{k=1}^n \frac{\gamma_k}{2} B^2.$$ For decreasing stepsize sequences, this leads to $$ \frac{1}{n} \sum_{k=1}^n \mathbb{E} \big[ F(\theta_{k1}) \big] – F(\theta_\ast) \leqslant \frac{1}{n} \sum_{k=1}^{n1} {\Delta^2} \bigg(\frac{1}{ 2\gamma_{k+1}} \frac{1}{ 2\gamma_{k}} \bigg) + \frac{\Delta^2}{2 \gamma_1}+ \frac{1}{n} \sum_{k=1}^n \frac{\gamma_k}{2} B^2,$$ and thus $$ \frac{1}{n} \sum_{k=1}^n \mathbb{E} \big[ F(\theta_{k1}) \big] – F(\theta_\ast) \leqslant \frac{\Delta^2 }{2 n \gamma_n} + \frac{1}{n} \sum_{k=1}^n \frac{\gamma_k}{2} B^2.$$ For \(\gamma_k = \frac{ \Delta}{B \sqrt{k}}\), this leads to an upper bound $$\frac{\Delta B }{2 \sqrt{n}} \big( 1+ \frac{1}{\sqrt{n}} \sum_{k=1}^n \frac{1}{\sqrt{k}}\big) \leqslant \frac{3 \Delta B }{2 \sqrt{n}},$$ which is up to a factor \(\frac{3}{2}\) exactly the same bound as with a constant stepsize, but now with an anytime algorithm.
Experiments
To illustrate the behaviors above, let’s consider minimizing \(\mathbb{E}_x \ x – \theta \_1\), with respect to \(\theta\), with \(f_k(\theta) = \ x_k \theta\_1\), where \(x_k\) is sampled independently from a given distribution (here independent lognormal distributions for each coordinate). The global optimum \(\theta_\ast\) is the percoordinate median of the distribution of \(x\)’s.
When applying SGD, the chosen subgradient of \(f_k\) has components in \(\{1,1\}\). Hence in the plots below in two dimensions, the iterates are always on a grid. With a constant stepsize: if the \(\gamma\) is too large (right), there are large oscillations, while if \(\gamma\) is too small (left), optimization is too slow. Note that while the SGD iterate with a constant stepsize is always oscillating, the averaged iterate converges to some point (which is not the global optimum, and is typically at distance \(O(\gamma)\) away from it [11]).
With a decaying stepsize (figure below), the initial conditions are forgotten reasonably fast and the iterates converge to the global optimum (and of course, we get an anytime algorithm!).
We can now compare in terms of function values, showing that a constant stepsize only works well for a specific range of iteration numbers.
Conclusion
Being able to deal with decaying stepsizes and anytime algorithms is arguably not a major improvement, but quite a satisfactory one, at least to me! Discrete integration by parts is the key enabler here.
There is another rewarding aspect which is totally unrelated to integration by parts: when applied to supervised machine learning, we just obtained from elementary principles (convexity) and few calculations a generalization bound on unseen data, which is as good as regular bounds from statistics [10] that use much more complex tools such as Rademacher complexities (but typically no convexity assumptions): here, statistics considered independently from optimization is not only slower (considering the empirical risk and minimizing it using the plain nonstochastic subgradient method would lead to an \(n\) times slower algorithm) but also more difficult to analyze!
References
[1] Roger B. Nelsen, Proofs without Words: Exercises in Visual Thinking, Mathematical Association of America, 1997.
[2] Aapo Hyvärinen, Estimation of nonnormalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr), 695709, 2005.
[3] Thomas M. Stoker, Consistent estimation of scaled coefficients. Econometrica: Journal of the Econometric Society, 54(6):14611481, 1986.
[4] Tamir Hazan, George Papandreou, and Daniel Tarlow. Perturbation, Optimization, and Statistics. MIT Press, 2016.
[5] Quentin Berthet, Matthieu Blondel, Olivier Teboul, Marco Cuturi, JeanPhilippe Vert, Francis Bach, Learning with differentiable perturbed optimizers. Technical report arXiv 2002.08676, 2020.
[6] Naum Z. Shor. An application of the method of gradient descent to the solution of the network transportation problem. Notes of Scientific Seminar on Theory and Applications of Cybernetics and Operations Research, Ukrainian Academy of Sciences, Kiev, 9–17, 1962.
[7] Boris T. Polyak, A general method for solving extremal problems. Doklady Akademii Nauk SSSR, 174(1):33–36, 1967.
[8] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. Proceedings of the international conference on machine learning )(ICML), 2003.
[9] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):15741609, 2009.
[10] Stéphane Boucheron, Olivier Bousquet, Gabor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: probability and statistics, 9, 323375, 2005.
[11] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. Annals of Statistics, 48(3):13481382, 2020.
by Francis Bach at August 04, 2020 03:55 PM UTC
at August 04, 2020 03:11 PM UTC
We hope you’re all staying safe and healthy! To bring you some news (and distraction?) during this… atypical summer,here are the recent papers on property testing and sublinear algorithms we saw appear this month. Graphs, probability distributions, functions… there is a something for everyone.
On Testing Hamiltonicity in the Bounded Degree Graph Model, by Oded Goldreich (ECCC). The title sort of gives it away: this relatively short paper shows that testing whether an unknown boundeddegree graph has a Hamiltonian path (or Hamiltonian cycle) in the boundeddegree model requires a number of queries linear in \(n\), the number of nodes. The results also hold for directed graphs (with respect to directed Hamiltonian path or cycle), and are shown via a local reduction to a promise problem of satisfiability of 3CNF formulae. Also included: a complete proof of the linear lower bound for another problem, Independent Set Size; and an open problem: what is the query complexity of testing graph isomorphism in the boundeddegree model?
Local Access to Sparse Connected Subgraphs Via Edge Sampling, by Rogers Epstein (arXiv). Given access to a connected graph \(G=(V,E)\), can we efficiently provide access to some sparse connected subgraph \(G’=(V,E’)\subseteq G\) with \(E’\ll E\)? This question, wellstudied in particular for the case where \(G\) had bounded degree and the goal is to achieve \(E’\leq (1\varepsilon)V\), is the focus of this paper which provides a tradeoff between the query complexity of the oracle and \(E’\). Specifically, for every parameter \(T\), one can give oracle access to \(G’\) with \(E’=O(VT)\), with a query complexity \(=\tilde{O}(E/T)\).
Switching gears, we move from graphs to probability distributions:
Tolerant Distribution Testing in the Conditional Sampling Model, by Shyam Narayanan (arXiv). In the conditional sampling model for distribution testing, which we have covered a few times on this blog, the algorithm at each step gets to specify a subset \(S\) of the domain, and observe a sample from the distribution conditioned on \(S\). As it turns out, this can speed things up a lot: as Canonne, Ron, and Servedio (2015) showed, even tolerant uniformity testing, which with i.i.d. samples requires a nearlinear (in the domain size \(n\)) number of samples, can be done in a constant number of conditional queries. Well, sort of constant: no dependence on \(n\), but the dependence on the distance parameter \(\varepsilon\) was, in CRS15, quite bad: \(\tilde{O}(1/\varepsilon^{20})\). This work gets rid of this badness, and shows the (nearly) optimal \(\tilde{O}(1/\varepsilon^{2})\) query complexity! Among other results, it also generalizes it to tolerant identity testing (\(\tilde{O}(1/\varepsilon^{4})\)), for which previously no constantquery upper bound was known. Things have become truly sublinear.
Interactive Inference under Information Constraints, by Jayadev Acharya, Clément Canonne, Yuhan Liu, Ziteng Sun, and Himanshu Tyagi (arXiv). Say you want to do uniformity/identity testing (or learn, but let’s focus on testing) on a discrete distribution, but you can’t actually observe the i.i.d. samples: instead, you can only do some sort of limited, “local” measurement on each sample. How hard is the task, compared to what you’d do if you fully had the samples? This setting, which captures things like distributed testing with communication or local privacy constraints, erasure channels, etc., was wellunderstood from previous recent work in the nonadaptive setting. But what if the “measurements” could be made adaptively? This paper shows general lower bounds for identity testing and learning, as a function of the type of local measurement allowed: as a corollary, this gives tight bounds for communication constraints and local privacy, and shows the first separation between adaptive and nonadaptive uniformity testing, for a type of “leaky” membership query measurement.
Efficient Parameter Estimation of Truncated Boolean Product Distributions, by Dimitris Fotakis, Alkis Kalavasis, and Christos Tzamos (arXiv). Suppose there is a fixed and unknown subset \(S\) of the hypercube, a “truncation” set, which you can only accessible via membership query; and you receive i.i.d. samples from an unknown product distribution on the hypercube, truncated on that set \(S\) (for instance, because your polling strategy or experimental measurements have limitations). Can you still learn that distribution efficiently? Can you test it for various properties, as you typically really would like to? (or is it just me?) This paper identifies some natural sufficient condition on \(S\), which they call fatness, under which the answer is a resounding yes. Specifically, if \(S\) satisfies this condition, one can actually generate honesttogoodness i.i.d. samples (nontruncated) from the true distribution, given truncated samples!
Leaving distribution testing, our last paper is on testing functions in the distributionfree model:
Downsampling for Testing and Learning in Product Distributions, by Nathaniel Harms and Yuichi Yoshida (arXiv). Suppose you want to test (or learn) a class of Boolean functions \(\mathcal{C}\) over some domain \(\Omega^n\), with respect to some (unknown) product distribution (i.e., in the distributionfree testing model, or PAClearning model). This paper develops a general technique, downsampling, which allows one to reduce such distributionfree testing of \(\mathcal{C}\) under a product distribution to testing \(\mathcal{C}\) over \([r]^d\) under the uniform distribution, for a suitable parameter \(r=r(d,\varepsilon,\mathcal{C})\). This allows the authors, among many other things and learning results, to easily reestablish (and, in the second case, improve upon) recent results on testing of monotonicity over \([n]^d\) (uniform distribution) and over \(\mathbb{R}^d\) (distributionfree).
by Clement Canonne at August 04, 2020 02:59 AM UTC
A breakthrough on the separating words problem
Zachary Chase is a graduate student of Ben Green at Oxford. Chase has already solved a number of interesting problems–check his site for more details. His advisor is famous for his brilliant work—especially in additive combinatorics. One example is his joint work with Terence Tao proving this amazing statement:
Theorem 1 The prime numbers contain arbitrarily long arithmetic progressions.
Today we wish to report Chase’s new paper on a problem we have twice discussed before.
But first Ken wants to say something about Oxford where he got his degree long before Green arrived.
Oxford Making Waves
Green moved to Oxford in 2013. He holds a professorship associated to Magdalen College. I (Ken) did not know him when I started at Oxford in 1981. It would have been hard, as Green was only 4 years old at the time. But I did know the preteen Ruth Lawrence when she started there and even once played a departmental croquet match including her in which Bryan Birch made some epic long shots. Lawrence had joined St. Hugh’s College in 1983 at the age of twelve.
Oxford has been Dick’s and my mind more in the past six years than before. Both of us were guests of Joël Ouaknine in 2012–2015 when he was there. Oxford has developed a frontline group in quantum computation, which fits as David Deutsch’s role as an originator began from there—note my story in the middle of this recent post.
Recently Oxford has been in the news for developing a promising Covid19 vaccine. ChAdOx1 heads Wikipedia’s list of candidate vaccines and has gone to final trials, though there is still a long evaluation process before approval for general use.
Before that, a modeling study from Oxford in March raised the question of whether many more people have had Covid19 without symptoms or any knowledge. This kind of possibility has since been likened to a “dark matter” hypothesis, not just now regarding Covid19 but a decade ago and before.
A main supporting argument is that a wide class of mathematical models can be fitted with higher relative likelihood if the hypothesis is true. I have wanted to take time to evaluate this argument amid the wider backdrop of controversy over inference methods in physics, but online chess with unfortunately rampedup frequency of cheating has filled up all disposable time and more.
The Problem
Back to Chase’s new results on the following problem:
Given two distinct binary strings of length there is always a finite state deterministic automaton (FSA) that accepts one and rejects the other. How few states can such a machine have?
This is called the separating words problem (SWP). Here we consider it for binary strings only.
John Robson proved states are enough—we suppress any log factors. Some like to write this as . Chase improves this to :
Theorem 2 For any distinct , there is a finite state deterministic automaton with states that accepts but not .
We previously discussed this twice at GLL. We discussed the background and early results here. The original problem is due to Pavel Goralcik and Vaclav Koubek. They proved an upper bound that was . Then we went over Robson’s bound here. The best upper bound was Robson’s result until Chase came along.
The Approach
All the approaches to SWP seem to have a common thread. They find some family of “hash” functions so that:
 Any in can be computed by a FSA with few states.
 For any binary strings of length , there is an so that .
The challenge is to find clever families that can do do both. Be easy to compute and also be able to tell strings apart. Actually this is only a coarse outline—Chase’s situation is a bit more complicated.
The Proof
We have taken the statement of Theorem 2 verbatim from the paper. It has a common pecadillo of beginning a sentence for a specific but writing later. However, this is how we think intuitively: in terms of how the pieces of the formula behave. Chase declares right away his intent to ignore the power of . How he gets the power of is the real point. We can convey the intuition in brief.
A length binary string can be identified with its set of positions where the string has a . Chase begins by showing how a power of on is obtainable by considering sets of the form
where is prime and . Suppose we know a bound such that for all distinct (that is, all distinct binary strings of legnth ) there is a prime and such that
Then by the Chinese Remainder Theorem, there is a prime of magnitude about such that
Now we can make a finite automaton with states that always increments modulo and increments modulo each time it reads a when . Then has orderof states. The finisher is that suffices. Again we ignore the pecadillo but we add some redundant words to the statement in the paper between dashes:
Lemma 3 For any distinct —of size at most —there is a prime such that for some ,
The power is of course weaker than Robson’s , but this statement conceals two “levers” that enable leapfrogging to get . The first is that we don’t have to limit attention to sets that come from places where the corresponding strings have a . Consider any string and take to be the set of index positions in which has the substring beginning at place . Define likewise for . Then we can try to prove results of the following form given :
Proposition 4 For all distinct there is such that and
A finite automaton using this extension needs states to store in its finite control. The second lever is to try to prove results of this form, where now the words “of size at most ” are not redundant:
Lemma 5 (?) For any distinct of size at most there is a prime such that for some , [Update: see note below.]
Now we need to balance the levers using the proposition and the lemma together. Since will add order states to the automaton, we balance it against from the previous argument. So take . Then . Lemma 5 then gives the bound
on the magnitude of the needed primes . This yields the breakthrough on SWP.
Here a famous New Yorker cartoon with the caption “If only it were so simple” comes to mind. But there is a catch. Chase is not quite able to prove lemma 5. However, the lever comes with extra flexibility that enables finding that make and also give those sets an extra regularity property . Using , he is able to show the existence of good hash functions of a certain type. The modified lemma is enough to prove his new bound. The proof still uses intricate analysis including integrals.
This is classic highpower mathematics. When some idea is blocked, try to weaken the requirements. Sometimes it is possible to still proceed. It is a lesson that we sometimes forget, but a valuable one nevertheless.
Open Problems
We like the SWP and think Chase’s contribution is impressive. Note that it adds a third element to and in the automaton. Can the argument be pushed further by finding more levers to add more elements? Is Lemma 5 true as stated, and with what (other) tradeoffs of and between it and Proposition 4? [Update: not for extreme tradeoffs—see this comment—but plausibly for polynomially related.]
We feel there could also be interesting applications for his theorem as it stands. Is the ability to tell two strings apart with a simple device—a FSA with not many states—useful? Could it solve some open problem? It does seem like a basic insight, yet we have no candidate application. Perhaps you have an idea.
[added Q on Lemma 5 to “Open Problems”, “lower” bound –> “upper” bound in third section, update in Open Problems.]
by RJLipton+KWRegan at August 03, 2020 02:50 PM UTC
The 12th Innovations in Theoretical Computer Science (ITCS) conference will be held online from January 68, 2021. The submission deadline is September 7, 2020.
The program committee encourages you to send your papers our way! See the call for papers for information about submitting to the conference.
ITCS seeks to promote research that carries a strong conceptual message (e.g., introducing a new concept, model or understanding, opening a new line of inquiry within traditional or interdisciplinary areas, introducing new mathematical techniques and methodologies, or new applications of known techniques). ITCS welcomes both conceptual and technical contributions whose contents will advance and inspire the greater theory community.
Important dates
 Submission deadline: September 7, 2020 (05:59PM PDT)
 Notification to authors: November 1, 2020
 Conference dates: January 68, 2021
by James at August 03, 2020 02:06 AM UTC
When teaching discrete math a while back I told the following story which some had already heard in High School:
When Gauss was in 1st grade the class was being bad. So the teacher made them sit down and add up the numbers from 1 to 100. Gauss did it in 2 minutes by noting that if S was the answer then
2S = (100+1) +(99+2) + ... + (1 + 100) = 100*101
So S = 50*101. Then he went to Google and typed in 50*101 for the answer.
The class laughed because of course the last part about Google was false. But I then told them that the entire story was false and showed them the following slides: here Take a look at them (there are only 4 of them) before reading on.
So I told them the Gauss Story was false (I am right about this) and then told them a lie that the story's progression over time was orderly. I then told them that that was false (hmmm actually I might not of, oh well).
One of my students emailed me this semester
Dr Gasarch one of my Math professors is telling the Gauss story as if its true! You should make a public service announcement and tell people its false!
OKAY what do you do if you have a nice story that has some good MATH in it but its not true?
Options:
by Unknown (noreply@blogger.com) at August 03, 2020 02:02 AM UTC
The last of my CCCG 2020 papers is now on the arXiv: “New Results in Sona Drawing: Hardness and TSP Separation”, arXiv:2007.15784, with Chiu, Demaine, Diomidov, Hearn, Hesterberg, Korman, Parada, and Rudoy. (As you might infer from the long list of coauthors, it’s a Barbados workshop paper.) The paper studies a mathematical formalization of the lusona drawings of southwest Africa; in this formalization, a sona curve for a given set of points is a curve that can be drawn in a single motion, intersecting itself only at simple crossings, and surrounding each given point in a separate region of the plane, with no empty regions. The paper proves that it’s hard to find the shortest one, hard even to find whether one exists when restricted to grid edges, and gives tighter bounds for the widest possible ratio between sona curve length and TSP tour length; see the preprint or the video I already posted for more information.
To save this post from being contentfree, here’s a research question that we didn’t even state in the paper, let alone make any progress on solving: just how many of these curves can a given set of points have? A sona curve can be described as a 4regular plane multigraph (satisfying certain extra conditions) together with an assignment of the given points to its bounded faces, so there are finitely many of these things up to some sort of topological equivalence. And because this is topological it shouldn’t matter where the points are placed in the plane: the number of curves should be a function only of the number of points. I tried handenumerating the curves for up to three points but it was already messy and I’m not certain I got them all. (In an earlier version of this post I definitely didn’t get them all — I had to update the figure below after finding more.) Here are the ones I found:
If this hand enumeration is correct, then the numbers of sona curves for \(n\) labeled points form an integer sequence beginning \(1, 3, 24,\dots\) and the numbers for unlabeled points form a sequence beginning \(1, 2, 5,\dots\) but I don’t really know anything more than that for this problem.
Another research direction I don’t know much about yet: given a topological equivalence class of sona drawings, how can we find a good layout for it as an explicit drawing? There’s lots of research on drawing plane graphs nicely but it’s not clear how much of it carries over to making nice sona curves.
by David Eppstein at August 02, 2020 11:48 PM UTC
at August 02, 2020 06:10 AM UTC
Today we talk about the paper, Proof of Tomaszewski’s Conjecture on Randomly Signed Sums, by Nathan Keller and Ohad Klein.
Consider a unit vector That is latex \sum_{i=1}^n a_i^2=1$. Consider all () signed sums where each is either 1 or 1.
Theorem (Keller and Klein (2020) asked by Boguslav Tomaszewski (1986)): For at least signed sums
Another way to state the theorem is that the probability of a signed sum to be in the interval [1, 1] is at least 1/2.
To see that this is best possible consider the case that and let be non zero. For the sum in question to exceed one we need both summands to have the same sign which happens half of the times. There is another example of importance, the vector . Here 3/8 of the absolute values of signed sums (6 out of 16) are below 1 (in fact, equal to zero), 1/2 equal to 1 and 1/8 exceed 1. Holzman and Kleitman proved in 1992 that the fraction of absolute values of signed sums below 1 is always at least 3/8.
Congratulations to Nathan and Ohad. I will say a little more about the problem below but before that, a few more things.
A few more things
Luca Trevisan posted on his blog In Theory a post “Silver linings” about two cheerful pieces of news. The first one is “Karlin, Klein, and Oveis Gharan have just posted a paper in which, at long last, they improve over the 1.5 approximation ratio for metric TSP which was achieved, in 1974, by Christofides.”
The second one is about breaking the logarithmic barrier for Roth’s theorem that we wrote about here. This was also discussed by Bill Gasarch on Computational Complexity. In the comment section of my post there is an interesting discussion regarding timetable for future achievements and how surprising they would be.
The third is about Ron Graham, a friend and a mathematical giant who passed away a few days ago. Here is a moving post by Fan Chung, a web page for Ron set by Fan, and a blog post by Dick and Ken on GLL.
The fourth is that there is a nice collection of open problems on Boolean functions that is cited in the paper of Nathan and Ohad: Y. Filmus, H. Hatami, S. Heilman, E. Mossel, R. O’Donnell, S. Sachdeva, A. Wan, and K. Wimmer, Real analysis in computer science: A collection of open problems.
The fifth is that both our (HUJI) combinatorics seminar and basic notions seminar are running and are recorded. Here are the links. (Hmm, the links are not yet available, I will update.)
Back to the result of Keller and Klein
Daniel Kleitman and Ron Holzman
A quick orientation
If the s are all the same, or small, or random, then to compute the probability that the weighted sum is between 1 and 1, we can use some Gaussian approximation and then we will find ourselves in a clash of constants that goes our way. The probability will be close to a constant well above 1/2. So what we need to understand is the case where some s are large.
Early papers on the problem
The problem first appeared in the American Math Monthly. Richard Guy collected several problems and challenged the readers Any Answers Anent These Analytical Enigmas? (I don’t know what the fate of the other questions is.) Holzman and Kleitman proved in 1992 that the fraction of absolute values of signed sums below 1 is always at least 3/8, and this is tight. For many years, 3/8 was the record for the original problem, until the 2017 paper by Ravi Boppana and Ron Holzman: Tomaszewski’s problem on randomly signed sums: Breaking the 3/8 barrier, where a lower bound of 0.406, was proved. The current record 0f 0.46 was proved in the paper Improved Bound for Tomaszewski’s Problem by Vojtěch Dvořák, Peter van Hintum, and Marius Tiba. The new definite result by Nathan and Ohad used some ideas of these early papers.
What is the crux of matters
Let me quote what the authors kindly wrote me:
“The crux of the matter is how to deal with the case of very large coefficients (). We gave a short semiinductive argument covering this case (this is Section 5 of the paper). The argument is only semiinductive, as it requires the full assertion of Tomaszewski for any n'<n, and gives only the case () for . But this means that if we can handle all other cases by other methods then we will be done.
The semiinductive argument takes only 3 pages. Handling the other cases takes 72 more pages and requires several new tools, but is closer to things that were done in previous works. (Actually, after we found the 3page argument, we were quite sure we will be able to finalize the proof; this indeed happened, but took a year).”
Most of the paper deals with the case of small coefficients. This requires several ideas and new tools.
Rademacher sums: Improved BerryEsseen and local tail inequalities
If all coefficients are “sufficiently small”, then we can
approximate X by a Gaussian and the inequality should follow. However, using the standard BerryEsseen bound, this holds only if all coefficients are less than 0.16.
Nathan and Ohad showed that for Rademacher sums, namely random variables of the form , as discussed in the conjecture, a stronger BerryEsseen
bound can be obtained, and this bound shows immediately that Tomaszewski’s assertion holds whenever all coefficients are less than 0.31. The stronger bound stems
from a method of Prawitz, presented in the 1972 paper. H. Prawitz, Limits for a distribution, if the characteristic function is given in a finite domain, which appeared in the Scandinavian Actuarial journal.
The second tool is local tail inequalities for Rademacher sums, of the form where a,b,c,d satisfy certain conditions. Inequalities of this kind were obtained before by Devroye and Lugosi in the 2008 paper: Local tail bounds for functions of independent random variables.
These local tail inequalities already have some other applications, e.g., to analysis of Boolean functions. They were developed and applied in an earlier paper paper of Keller and Klein: Biased halfspaces, noise sensitivity, and relative Chernoff inequalities. Let me mention my related MO question A variance tail description for continuous probability distributions.
A couple more ingredients
A stopping time argument. Variants of Tomaszewski’s problem appeared in various fields. The problem was stated independently in a 2002 paper by BenTal, Nemirovski, Roos, Robust solutions of uncertain quadratic and conicquadratic problems. A stopping time argument introduced there (for proving a lower bound of 1/3) played a crucial role in subsequent works and the critical semiinductive argument by Nathan and Ohad.
Refinements of the famous Chebyshev’s inequality. (Did you know Chebyshev’s full name? Ans: Pafnuty Lvovich Chebyshev.)
Questions and connections that come to mind
Q1: What can be said about families of signs that can serve as those signs for which for some vector .
Q2: What can be said about the complex version or even more generally about high dimensions?
Q3: Are there any relations to LittlewoodOfford type problems?
Q4: Is there any relation to the Komlos Conjecture?
See also this MO question by Luca Trevisan and this one by George Lowther.
Is there a simpler proof?
We can ask about simpler or just different proofs for almost every result we discuss here. But here the statement is so simple…
by Gil Kalai at August 01, 2020 06:32 PM UTC
at August 01, 2020 05:13 AM UTC

Gil Kalai on recent developments in Roth’s theorem (\(\mathbb{M}\), see also). Salem and Spencer and later Behrend proved in the 1940s that subsets of \([1,n]\) with no triple in arithmetic progression can have nearly linear size, and Klaus Roth proved in 1953 that they must be sublinear. The upper bounds have slowly come down, to \(n/\log^{1+c} n\) in this new result, but they’re still far from Behrend’s \(n/e^{O(\sqrt{\log n})}\) lower bound.

Peter Cameron describes Peter Sarnak’s Hardy Lecture (\(\mathbb{M}\)). It’s on the spectral theory of graphs. If you know about this you probably already know that regular graphs with a big gap between the largest eigenvalue (degree) and the second largest are very good expander graphs. It turns out that 3regular graphs with gaps elsewhere in their spectrum are also important in the theories of waveguides and fullerenes, and some tight bounds on where those gaps can be are now known.

Optimal angle bounds for quadrilateral meshes (\(\mathbb{M}\)). Christopher J. Bishop meshes any simple polygon (why simple?) with max angle 120° and min angle max(60°, min of the polygon). Nice techniques involving conformal mapping, hyperbolic tessellation, and thick/thin decompositions of hyperbolic convex hulls of ideal sets. Also amusing to see him have to disambiguate my name from David B. A. Epstein’s within a single paragraph.

Onedimensional diagonal cellular automata generate Sierpinski carpets and intricate branching structures (\(\mathbb{M}\), see also). Via the June 27 update to mathpuzzle.com which also has plenty of other neat stuff involving tilings, drawings of symmetric graphs, graceful labeling, rectangle dissection into similar rectangles, etc.

Terry Tao on mathematical notation (\(\mathbb{M}\)), in response to a MathOverflow question about why there’s more than one way to write inner products.

Carmesin’s 3d version of Whitney’s planarity criterion (\(\mathbb{M}\)): a simplyconnected 2dimensional simplicial complex (meeting a technical condition, “locality”) can be topologically embedded into Euclidean space if and only if a certain ternary matroid on its faces has a graphic dual. The proof relies on Perelman’s proof of the Poincaré conjecture! Simplyconnected complexes are pretty restrictive but they include e.g. the cone over a graph, which embeds if and only if the graph is planar.

Fastgrowing functions revisited (\(\mathbb{M}\)). News of recent developments relating the busy beaver function with Graham’s number, and proofs of some older claims.

Wikipedia, the free online medical encyclopedia anyone can plagiarize: Time to address wiki‑plagiarism (\(\mathbb{M}\), via). In this editorial in Publishing Research Quarterly, Michaël R. Laurent identifies five PubMedindexed papers that copied content from Wikipedia without crediting it (noting that this is much more prevalent in predatory book and journal publishing), and argues that doing this should be treated as a form of academic misconduct.

Facebook temporarily blocks posts of links to dreamwidth (\(\mathbb{M}\), via). Maybe it was just a mistake? And I guess the decentralization of Mastodon would make doing this to Mastodon posts somewhat harder. But this continued wallingoff of the open web is not a good thing.

How much we don’t know about Fibonacci (\(\mathbb{M}\)). Entry F in Joseph Nebus’s 2020 mathematics AtoZ.

Rational dodecahedron inscribed in unit sphere (\(\mathbb{M}\)). It’s easy to inscribe a dodecahedron in the unit sphere: just use a regular one of the appropriate size. And it’s not hard to construct a dodecahedron combinatorially equivalent to the regular dodecahedron but with integer coordinates. Now Adam Goucher shows how to do both at once, in answer to an old MathOverflow question.

Descartes on Polyhedra (\(\mathbb{M}\), see also). This book is mainly on whether Descartes (circa 1630) knew Euler’s formula \(EV+F=2\) (before Euler in 1752, but after Maurolico in 1537). It also covers Descartes’ invention of polyhedral figurate numbers beyond the cubes and pyramidal ones known to the Greeks. Descartes’ manuscript has an interesting history: found after his death in a desk, sunk in the Seine, copied by Leibniz, both copies lost, and Leibniz’s copy finally rediscovered in 1860.

Spherical geometry is stranger than hyperbolic (in how it looks from an inuniverse viewpoint) (\(\mathbb{M}\), via).
by David Eppstein at July 31, 2020 09:03 PM UTC
My newest arXiv preprint is “Acutely triangulated, stacked, and very ununfoldable polyhedra” with Erik and Martin Demaine (arXiv:2007.14525). It’s about polyhedra with acutetriangle faces that cannot be unfolded without cutting their surface into many separate polygons. I already posted a video for the paper so see that for more information.
Instead, I thought I’d go into a little more detail about a throwaway remark in the video and the paper (one that I already got an email query about). It says that ideal hyperbolic polyhedra can always be unfolded (into the hyperbolic plane). These polyhedra are the hyperbolic convex hulls of finitely many limit points of the hyperbolic space; their faces are ideal polygons, glued together along entire hyperbolic lines. More strongly, if you cut an ideal polyhedron along any spanning tree of its vertices and edges, the result always unfolds into a convex ideal hyperbolic polygon. Here, for instance, is a net for an ideal cube:
I don’t know of a previous reference for this result, and the paper and video state it without proof (because it’s an introductory remark and not the topic of the paper), but it’s easy to prove a stronger statement by induction: any collection of ideal hyperbolic polygons (like the faces of an ideal polyhedron), when connected edgetoedge in a complex with the connectivity of a tree (like the faces of any convex polyhedron when you cut it along a spanning tree), unfolds to an ideal convex polygon. As a base case, when you have one polygon in your collection, it unfolds to itself. When you have more than one, find a leaf polygon of the tree structure, remove it, and unfold the rest into a convex ideal polygon. Now add back the leaf. It needs to be connected to the rest of the complex along a hyperbolic line, which (by the induction hypothesis that the rest unfolds convexly) has the rest of the complex on one side and an empty hyperbolic halfplane on the other side. Any convex ideal polygon can be placed within this halfplane so that the side on which it should be glued matches up with the boundary line of the halfplane, with enough freedom to match up the points along this line that should be matched up.
This caused me to wonder: which Euclidean convex polyhedra have the same property, that cutting them along any spanning tree leads to a convex unfolding? The answer is: not very many. By Descartes’ theorem on total angular defect, the angular defects at the vertices of a convex polyhedron add up to \(4\pi\). If a polyhedron is to have all spanning trees produce a (weakly) convex unfolding, then each vertex has to have angular defect at least \(\pi\), because otherwise cutting along a spanning tree that has a leaf at that vertex will make an unfolding that is nonconvex at that vertex. And this is the only thing that can go wrong, because if all angular defects are at least \(\pi\) then the unfolding will be convex at each of its vertices and cannot selfoverlap.
So to answer the question about Euclidean polyhedra with all unfoldings convex, we need only look for ways to partition the total angular defect of \(4\pi\) among some set of vertices so that each one gets at least \(\pi\). If we know the defects of all the vertices and the distances between vertices, then by Alexandrov’s uniqueness theorem the shape of the polyhedron will be determined. Since we’re using Alexandrov, we should also consider a dihedron (two mirrorimage convex faces glued at their edges) to be a special case of a polyhedron. This leaves, as the only cases:

A triangular dihedron based on a right or acute triangle.

A rectangular dihedron.

A tetrahedron with angular defect exactly \(\pi\) at each vertex.
The unfoldings of the dihedra have two copies of their face, mirrored across a joining edge. The tetrahedra with allconvex unfoldings are exactly the disphenoids, the tetrahedra whose four faces are congruent. They unfold either to a copy of the same face shape, expanded by a factor of two in each dimension and creased into four copies along its medial triangle, or a parallelogram, creased to form a strip of four congruent triangles. Their unfoldings were discussed by Jin Akiyama in his paper “Tilemakers and semitilemakers” (American Mathematical Monthly 2007, doi:10.1080/00029890.2007.11920450, jstor:27642275), as part of a broader investigation of polyhedra whose every unfolding tiles the plane.
by David Eppstein at July 29, 2020 10:18 PM UTC
Junior and Advanced Fellows of the ETH Institute for Theoretical Studies are independent postdocs of exceptional talent and promise, having achieved significant results in mathematics, theoretical computer science or the theoretical natural sciences. Junior Fellows stay at the Institute for up to three years, Advanced Fellows up to five years.
Website: https://ethits.ethz.ch/fellows/nominationofjuniorfellows1.html
Email: nominations@ethits.ethz.ch
by shacharlovett at July 29, 2020 02:52 PM UTC
And Razborov’s brilliant proof method
Stasys Jukna is the author of the book Extremal Combinatorics With Applications in Computer Science.
Today we talk about Jukna’s book on extremal combinatorics.
The structure of his book is great. The material is useful and well presented. Rather than add more general comments about his book, we thought we might highlight one tiny part—the part on monotone circuit lower bounds. Here goes. All below is based directly on his discussion. Any errors or misguided comments are ours.
Monotone Boolean Functions
Fix an input size and consider some property of subsets of . Let exactly when has the property. We can think of as a Boolean function. You believe that this property is hard to compute—how do you go about proving that?
In general we have no tools, but if the property is monotone, then there are some powerful methods. Recall monotone means that if then any set so that still has the property. For example, could be that includes at least half of the elements of . It cannot be that has an even number of elements. Another example is when is given in disjunctive normal form (DNF),
where each term is a conjunction of variables. Each can be regarded as a subset of . Then if and only if for some . Every monotone function also has a conjunctive normal form (CNF)
where each clause is a disjunction of variables. Then if and only if for all The problem is that the numbers of terms and of clauses involved may be huge. The clauses may have different sizes. Given a CNF of maximum clause size , we write for the conjunction of clauses of size exactly and for the rest. We similarly write and for DNFs .
The lower bound methods are on the size of a monotone circuit for . That is the circuit can only use gates and , but no other types of gates, especially not gates. Of course, if has no small monotone circuits, then it has no small DNF or CNF formulas either.
The neat fact on which the lowerbound technique builds is that if does have small monotone circuits, then we can “wrap” it between a CNF and a DNF in various customizable ways:
Theorem 1 (informal) For every with small monotone circuits and we can find a CNF of maximum clause size and a DNF of maximum term size such that
Moreover, and are small.
We have said “wrap” not “sandwich” because although is the “upper slice,” the part of with smaller terms—but there could be many of them—wraps around to be under the corresponding part of . This fact will enable us to throw away the smaller clauses and terms. How small is “small”? We will say later. We are trying to solve problems of exposition by keeping a highlevel view at the start.
Exposition Problems
Tim Gowers has written an article about the lower method for monotone functions. The method is due to Alexander Razborov in his seminal 1985 paper and extended by Noga Alon and Ravi Boppana in their paper right afterward, and by Benjamin Rossman in his 2009 paper, to name a few.
Gowers says right away that the original papers on this method are clear and well written. But he believes that there is need for more exposition. The method is so important that it must be made easy for all to understand. He says his article is an attempt to solve an open exposition problem. The notion of an exposition problem is due to Timothy Chow who wrote:
All mathematicians are familiar with the concept of an open research problem. I propose the less familiar concept of an open exposition problem.
Chow raised this issue with respect to the forcing method in set theory due to Paul Cohen. A modest suggestion: Read Chow on forcing, a great exposition; read Gowers on the monotone lower bound method, another great one. Both are much better than anything we can do. But we will put our own spin on the lower bound method. And hope to add to the quest to solve the exposition problem.
The Method—High Level
Suppose that is a monotone boolean circuit that has inputs and computes at the last gate. The method is called the approximation method because the idea is that it builds two other boolean functions and : for all in :
This follows a tradition in math that we often replace a complex function, , with simpler upper and lower bounds. Standard stuff.
Usually the point is that the approximators are not only easier to understand but also simpler in some objective sense. For example, Christophe Chesneau and Yogesh Bagul give a nice short compendium of approximating formulas involving trigonometric functions by formulas without them, including that for all ,
with . If you have to reason about the behavior of , it is nice to have these upper and lower bounds. Note that the upper bound kindof wraps around because it is the same kind of function as the lower bound.
What gives the monotone method a special twist is that and are not necessarily simple in the sense of being small. Rather, they make simple errors—ones that can be corrected with small effort. The correction process yields and Isolating what is small, however, requires us to trade an “AND” of two inequalities for an “OR” of two economical ones. We know that at least one of the latter inequalities must be true. We arrange that either one gives us the kind of lower bound we seek.
Some More Detail
Here is how the trade happens. From Theorem 1 we have:
where: and are small, and while and might be big, we have . The trick is to ask:
Is empty—that is, is it the trivial function?

If yes, then it goes away on the lefthand side. We get:
Since is small, this is something we want. We got a small lower bound on that holds for all arguments .

If no, then it has a nontrivial clause corresponding to a set of size at most . This is where the wraparound comes in. We have:
since we chose at least one clause. Substituting on the righthand side thus gives us:
Now is small, since it is just one clause, and is small. We got a small upper bound rather than lower bound, but the fact that it has a restricted form and holds for all cases we can input to will give us a lower bound on .
Finally we are ready to state the theorem, which quantifies “small.” To follow Jukna, we now need to replace “” by “” and “” by “.” But the essence is the same.
Theorem 2 If has a monotone Boolean circuit of size , then for any such that , we can build a conjunction of at most clauses of size exactly , a disjunction of at most terms of size exactly , and a set of size at most such that either or .
Rather than reprove this, we will continue the discussion with a concrete example. An exposition trick is: give examples before the general case and then abstract. Our example will involve graphs —so the variables have the form , where means there is an edge between vertex and vertex , otherwise. Putting as the number of vertices, the number of possible edges is . We think of as a set of edges, so .
Checking for Triangles
Let hold precisely when has a triangle. This is clearly a monotone property. Our goal is to use the lower and upper bounds to prove that the monotone complexity of is almost of order . A side note is that the general complexity is much less via matrix products.
The first beauty of using the method is that you get to choose the parameters and with a goal in mind. The and must be in . The value of will be a lower bound on the size of any monotone boolean circuit for . The parameters are bounds on the clause and term size of the DNF and the CNF. You can select them any way you wish. But of course choose them wisely.
In this case we know that is a right choice. We will say what is later but we will have . Once you pick them, the CNF and DNF (and small set , a set of edges in this case) are chosen for you. You have no control over the sets that make up the terms of and the sets that correspond to the clauses of . Well you do know something about them. Here is what you do know about how many sets there are and how big the sets are:
 For , each is of size .
 For , each is of size .
The goal in either case is to force to be large. We’ve numbered the righthand case first.

Case . Here we want to consider graphs that do have a triangle—and nothing else. Because includes at most edges, hence touches at most vertices, and , we can focus on triangles among the untouched vertices. There are such triangles, hence graphs to consider.
Since these graphs have no edges in but make , there must be some such that . Since has size , this means has two edges of the triangle. Now the point is:
For each , there is at most one triangle that can be two edges of.
Hence there must be at least as many terms as possible triangles. This means:
Because , we finally get , where the tilde means to ignore factors of .

Case . Here we want to consider graphs such that but is chock full of as many edges as one can have without creating a triangle. Such include complete bipartite graphs. There are such graph inputs, as can be realized from how any binary string except and encodes such a graph—and only its bitcomplement encodes the same labeled graph.
In order to keep we need for all such , so we need (at least) one clause to fail on . This means that all vertices touched by the edges in must be in the same partition. The more vertices touched, the fewer strings have all s (or all s) in the corresponding positions, which means the fewer graphs “covered” by that clause. We want to know how many clauses we need to cover all these graphs, hence we try to minimize the number of vertices touched by each clause. That number is at least . The number of graphs we cover is at most (the excludes the empty graph). Thus the number of clauses we need satisfies
By taking we can make in this case. We can actually get bigger functions with bigger , but this balances against case 1 where was the best we could do, so that is our lower bound.
Open Problems
Does this help in understanding the approximation method? Can you work out the concretely optimum choice of in the triangle example?
Would you prefer not changing and in the statement of Theorem 2? Then we would have worded the triangle example with “” rather than “.” The former is a little more suggestive of the idea of having two edges of a triangle. Doing so, however, could make notation in the proof of Theorem 2 somewhat messier. Another possibility was keeping Jukna’s usage throughout, so that the earlier version 1 of the theorem would say with and being small. We try to solve “exposition problems” in every post but feel a dilemma here. Comments might help us on a followup post.
by RJLipton+KWRegan at July 27, 2020 06:39 AM UTC
1) If a chess program is only trained against expert chess players then it might get confused if its opponent makes a bad move. This is not dangerous. But imagine a nuclear missle system that assumes the opponent is rational. If the opponent is not rational then it might launch and have an accidental nuclear war. So there must be a human component so that this won't happen.
I offer a story and a counter narrative. In the 5th season, 23rd episode of the TV show Castle,
title The Human Factor a character had the following story to tell:
The drone on its own was going to bomb a car. But the human noticed that there were red roses on the car, so it was a wedding couple, not a terrorist. If a human had not been involved the drone may have killed an innocent just married couple!
This scene bothered me. It could EASILY be the other way around: the human wants to bomb and the drone (which has better vision) notices the roses. Or there may be many other ways that a computer could be BETTER than a human. I am not saying that a completely automated system is better, I am saying that its not obvious which way to go. Both in some combination? What combination? Who has the final say? And in the drone scenario there may not be time for a human to consider the options.
2) The Sorcerer's apprentice scenario. In The Sorcerer's Apprentice segment of the (original) movie Fantasia, Mickey mouse tells a broom to get him a glass of water. The broom keeps bringing him water and Mickey almost drowns. Computers may take orders to literally and not stop. I wonder if automated stocktrading and automated auctions may have this problem. Is there a case known where this really did cause a problem?
by GASARCH (noreply@blogger.com) at July 27, 2020 03:18 AM UTC
Doyle derived his LQG counterexample in the time before the ubiquity of numerical computing. This meant that numerical examples did not carry the rhetorical weight of algebraic closed form instances. The need for clean, persuasive formulae also meant that controllers were idealized in continuous time. Continuoustime optimal control often produced policies that couldn’t be implemented because of the limits of physical reality: no system can act instantaneously with arbitrary power. These issues of infeasibility were certainly noted in the literature during the hey day of optimal control, but continuous time models still often made it difficult to pinpoint these issues.
Discretetime models don’t share many of these issues. In discrete time, we explicitly encode the sequential, computational nature of decision and control. Discretetime formulae are unfortunately less elegant than their continuoustime counterparts, but, as I hope to show here, they are often more revealing. Indeed, constructing examples where discretetime optimal control leads to fragile solutions seems to be surprisingly easy.
Here, I’ll highlight a few examples where relatively innocuous problem formulations lead to very fragile control policies. The examples are weirdly simple and almost comical to a point. But anyone who has played with discretetime optimal control may have stumbled into similar control policies and had to step back and think about why.
Let’s revisit the discretetime LQR problem:
We again assume $x_t$ is observed perfectly without noise. While such perfect state information is not realistic, even ideal state feedback ends up being fragile in discrete time. $w_t$ is assumed to be stochastic, but I don’t think much changes if we move to a more adversarial setting. Here, we need the decision variable $u_t$ to be causal. It must be a function of only the values $x_s$ and $u_s$ with $s\leq t$. For stochastic disturbances, the optimal $u$ can always be found by dynamic programming.
Consider the following innocuous dynamics:
This system is a simple, twostate shift register. I’ll write the state out with indexed components $x=[x^{(1)},x^{(2)}]^\top$. New states enter through the control $B$ into the second state. The first state, $x^{(1)}$ is simply whatever was in the second register at the previous time step. The open loop dynamics of this system are as stable as you could imagine. Both eigenvalues of $A$ are zero.
Let’s say our control objective aims to try to keep the two states equal to each other. We can model this with the quadratic cost:
I assume $R=0$ here for simplicity, as the formulae are particularly nice for this case. But, as I will discuss in a moment, the situation is not improved simply by having $R$ be positive. For the disturbance, assume that $w_t$ is zero mean, has bounded second moment, $\Sigma_t = \mathbb{E}[w_t w_t^\top]$, and is uncorrelated with $x_t$ and $u_t$.
The cost is asking to minimize
When $w_t=0$, $x_t^{(1)}+x_t^{(2)} = x_{t1}^{(2)}+u_{t1}$, so it seems like our best bet is to just set $u_{t}=x_t^{(2)}$. This turns out to be the optimal action, and you can prove this directly using standard dynamic programming computations. What this means is that the closed loop dynamics of the system are
This closedloop system is marginally stable, meaning that while signals don’t blow up, some states will persist forever and not converge to $0$. Indeed, the statetransition matrix here has eigenvalues $0$ and $1$. The $1$ corresponds the state where the two components are equal, and such a state can persist forever.
If we learned an incorrect model of the dynamics, how would that influence the closed loop behavior? The simplest scenario is that we identified $B$ from some preliminary experiments. We can immediately see that if the true $B_\star=\alpha B $, then the closed loop dynamics are
This system is unstable for any $\alpha>1$. That is, the system is arbitrarily sensitive to misidentification of the dynamics. Note that this lack of robustness has nothing to do with the noise sequence. The structure of the cost is what drives the system to fragility.
If $R>0$, you will get a slightly different policy. Again, using elementary dynamic programming shows that the optimal control is $u_t=\beta_t(R) x_t^{(2)}$ for some $\beta_t(R) \in (1/2,1)$. The closed loop system will be a bit more stable, but this comes at the price of reduced performance. And, at best, the gain margin of this system approaches $2$ as $R$ goes to infinity. You can also check that if you add $\epsilon$ times the identity to $Q$, you again get a control policy proportional to $x_t^{(2)}$.
This behavior can occur in even simpler systems. Consider the onestate linear system
The open loop system is again as stable as it gets. Now let’s aim to minimize $\Vert xu \Vert$. It doesn’t matter what norm you choose here or whether you treat the noise as stochastic or worst case with respect to $w$, the optimal control is going to be $u_t = x_t/b$. Once again, the closed loop system has a pole at $1$ and is arbitrary fragile to misspecification of $b$.
I could continue to construct nasty examples, but I hope these examples are sufficiently illustrative. They are certainly contrived and pathological, and it’s not at all clear that they reflect any optimal control problem you might have been hoping to solve. However, both examples involve systems that are robust and stable in open loop. It’s only when we close the feedback loop that we end up in a dangerous situation. That simple optimal control problems give some profoundly fragile solutions should be a clear warning: You can’t just optimize and hope to be robust. You have to consider uncertainty as a first class citizen when designing feedback systems.
In some sense, the core contribution of robust control is in raising awareness of fundamental tradeoffs in the design of feedback systems. Optimal control promises that you can roughly identify a system, model uncertainty as noise, solve an optimization problem, and then ship your policy. Hopefully, the examples in the last two posts have shown why this particular approach is fraught with danger.
If failure of a feedback system has any consequences, then a more holistic robust approach is necessary. We have to work with experts at different levels of the engineering pipeline, worry about unmodeled behaviors, and understand hard limits and practical tradeoffs. That is, engineering has to be more concerned with design than with optimization.
There are all sorts of questions that a robust, systems level engineering effort might ask. Where should you put that extra sensor? Which parts of the system are likely to create issues? Is it possible to avoid performance disruptions when updating a single component in a legacy system? These questions are important in all aspects of system engineering, and developing accessible tools for addressing them in machine learning systems remains a daunting but essential challenge.
I am emphatically not saying that the design of feedback systems is hopeless. It’s easy to walk away with the impression “Ben’s examples are pathologies and unlike what I see in practice” or the pessimistic feeling of “shoot, all of this ML stuff is hopeless, I’m going to go work on something tractable like vaccine development.” I’m not saying that engineering robust machine learning systems is hopeless. I’m just saying that our community has to work better to incorporate multiple levels of uncertainty in its thinking. What are the fundamental tradeoffs between performance and robustness in machine learning? What do we even want to be robust to? In the next post I want to describe some of these robustness tradeoffs without using the language of optimization, probing if that provides some possible paths forward.
by Lance Fortnow (noreply@blogger.com) at July 24, 2020 04:21 PM UTC
This is a guest post kindly contributed by Noam Lifshitz. Here is a pdf version. This post is a continuation of the post To cheer you up in difficult times 3: A guest post by Noam Lifshitz on the new hypercontractivity inequality of Peter Keevash, Noam Lifshitz, Eoin Long and Dor Minzer, and it gives the proof of the new hypercontractive inequality. We plan a third post where various applications will be mentioned.
Before we get to the post I want to mention that there are a lot of activities on the web. I may devote a special post to links and discussion (and contributing links in the comment section is very welcome.) but meanwhile a few links: 1) Advances in Boolean Function Analysis Lecture Series (thanks to Avishay Tal and Prasad Raghavendra for letting me know); 2) Online course in Foundations of Algebraic Geometry Given by Ravi Vakil from Stanford. You can take the course at varying levels of involvement. (Thanks to Tami Ziegler for telling me) A very very interesting way of online teaching. 3) A site with online mathematical lectures.
Aline Bonami with Szilard Revesz and me (2006). Aline Bonami first proved the 2point hypercontractive inequality which is very useful in the analysis of Boolean functions. (Leonard Gross proved it independently a few years later and William Beckner found important applications to harmonic analysis.)
Proof of the new hypercontractivity inequality
Our aim is to prove the hypercontractivity theorem for global functions. The proof here is taken from a joint paper with David Ellis and Guy Kindler that’ll soon be out on the Arxiv.
Theorem 1:
Here we use the notations given in the last blog post. Let us first get a feel for our hypercontractivity theorem by proving the case. Here the RHS is
1. Proof of the case
We will prove the following slightly stronger version of Theorem 1 for the case.
Proposition 2:
Let Let Then
Proof: Let us write for Rearranging, we have
The noise operator in the case is by definition equal to where is the expectation over operator, and is the identity operator. Hence,
Now when expanding the 4norm of the function , we obtain
where we used the fact that the expectation of is 0. When looking at the right hand side of the global hypercontractivity theorem, we see most of the above terms except for the one involving the third norm of the Laplacian. Indeed we have
Hence we see that the only term in the left hand side that doesn’t appear with a greater coefficient in the left hand side is the term and by AMGM we have
which allows us to upper bound the only term appearing in the left hand side but not in the right hand side by corresponding terms that do appear in the right hand side.
2. Tensorisation lemma
Next we are going to prove a theorem that doesn’t seem to fit to our setting, but we’re going to fit it in by force. Let be finite sets. Let us write for the linear space of complex valued functions on . The space can be identified with the space where a pair of function is identified with the function
in
Given two operators , the operator is the unique operator sending to . We write for The operator can also be defined more explictly in terms of its values on functions. The operator can be understood more explicitly by noting that it is the composition of the operators and Now the operator is given by where
Lemma 3: Let be measure spaces with finite underlying sets. Let be operators satisfying
for all functions
Then
for all
Here the spaces and are equipped with the product measure, where the measure of an atom is the product of the measures of its coordiates.
Proof: For each let be given by As mentioned Hence by hypothesis, we have
We may now repeat the same process on each of the other coordinates to replace the s by s one by one.
3. The main idea: Fourifying the 2norms.
The strategy of our proof is to take the theorem
which we established in the case for , and to turn it into an essentially equivalent statement about 4norms. We will then get a tensorised statement for general , which we will be able to convert back into our hypercontractivity theorem for global functions. Our idea is to encode our function as a function satisfying
and
The benefit of working with rather than is that in one may move between 4norms and 2norms by appealing to the hypercontractivity theorem there, which gives
at the cost of some noise.
To define we use Fourier analysis of Abelian groups. Let us briefly recall it. For simplicity let us assume that where is a prime. Let be a th root of unity. For any we have a character given by The are an orthonormal basis of and we write , where Note that is the constant function, and so we have
which gives
Our mission will first be to convert the norm of a function to the norm of a different function.
We define an encoding operator by setting
We have
as the are orthonormal and so are the Moreover, by the Fourier formula for Since norms are always smaller than 4norms on probability spaces, we’ve got the following corollary of Proposition 2.
Lemma 4. For all and all we have
We now reach the final little trick. We define a measure space whose underlying set is and where the measure is given by for and for We let be given by on and letting it be on This way Lemma 4 takes the form
4. Tensorised operators
The operator on satisfies where the latter refers to the noise operator on The characters satisfy and so we have the Fourier formula
We also have
and so
This will allow us to conclude that
We will also encounter the operator which by abusing notation we also call encodes
as the function on
Now finally we can get to the understanding of the operator The space is the disjoint union of spaces of the form
By definition of the tensor product, for is the function
5. Finishing the proof
Proof: Lemmas 3 and 4 yield:
for any We now have
The first equality follows from the formula and the fact that commutes with the encoding. The inequality used hypercontractivity on the discrete cube. The last equality follows from the fact that the operator preserves 2norms.
by Gil Kalai at July 24, 2020 03:21 PM UTC