Theory of Computing Blog Aggregator

We study parity decision trees for Boolean functions. The motivation of our study is the log-rank conjecture for XOR functions and its connection to Fourier analysis and parity decision tree complexity. Our contributions are as follows. Let $f : \mathbb{F}_2^n \to \{-1, 1\}$ be a Boolean function with Fourier support $\mathcal{S}$ and Fourier sparsity $k$. 1) We prove via the probabilistic method that there exists a parity decision tree of depth $O(\sqrt{k})$ that computes $f$. This matches the best known upper bound on the parity decision tree complexity of Boolean functions (Tsang, Wong, Xie, and Zhang, FOCS 2013). Moreover, while previous constructions (Tsang et al., FOCS 2013, Shpilka, Tal, and Volk, Comput. Complex. 2017) build the trees by carefully choosing the parities to be queried in each step, our proof shows that a naive sampling of the parities suffices. 2) We generalize the above result by showing that if the Fourier spectra of Boolean functions satisfy a natural "folding property", then the above proof can be adapted to establish existence of a tree of complexity polynomially smaller than $O(\sqrt k)$. More concretely, the folding property we consider is that for most distinct $\gamma, \delta$ in $\mathcal{S}$, there are at least a polynomial (in $k$) number of pairs $(\alpha, \beta)$ of parities in $\mathcal{S}$ such that $\alpha+\beta=\gamma+\delta$. We make a conjecture in this regard which, if true, implies that the communication complexity of an XOR function is bounded above by the fourth root of the rank of its communication matrix, improving upon the previously known upper bound of square root of rank (Tsang et al., FOCS 2013, Lovett, J. ACM. 2016). 3) Motivated by the above, we present some structural results about the Fourier spectra of Boolean functions. It can be shown by elementary techniques that for any Boolean function $f$ and all $(\alpha, \beta)$ in ${\mathcal{S} \choose 2}$, there exists another pair $(\gamma, \delta)$ in ${\mathcal{S} \choose 2}$ such that $\alpha + \beta = \gamma + \delta$. One can view this as a "trivial" folding property that all Boolean functions satisfy. Prior to our work, it was conceivable that for all $(\alpha, \beta) \in {\mathcal{S} \choose 2}$, there exists exactly one other pair $(\gamma, \delta) \in {\mathcal{S} \choose 2}$ with $\alpha + \beta = \gamma + \delta$. We show, among other results, that there must exist several $\gamma \in \mathbb{F}_2^n$ such that there are at least three pairs of parities $(\alpha_1, \alpha_2) \in {\mathcal{S} \choose 2}$ with $\alpha_1+\alpha_2=\gamma$. This, in particular, rules out the possibility stated earlier.

at August 09, 2020 04:45 PM UTC

  1. Good news, everyone! Following years of requests, this blog finally supports rich HTML and basic TeX in comments. Also, the German spam that used to plague the blog (when JavaScript was disabled) is gone. For all this, I owe deep gratitude to a reader and volunteer named Filip Dimitrovski.
  2. Filip refused to accept any payment for fixing this blog. Instead, he asked only one favor: namely, that I use my platform to raise public awareness about the plight of the MAOI antidepressant Nardil. Filip tells me that, while tens of thousands of people desperately need Nardil—no other antidepressant ever worked for them—it’s become increasingly unavailable because the pharma companies can no longer make money on it. He points me to a SlateStarCodex post from 2015 that explains the problem in more detail (anyone else miss SlateStarCodex?). More recent links about the worsening crisis here, here, and here.
  3. Here’s a fantastic interview of Bill Gates by Steven Levy, about the coronavirus debacle in the US. Gates, who’s always been notoriously and strategically nonpartisan, is more explicit than I’ve ever seen him before in explaining how the Trump administration’s world-historic incompetence led to where we are.
  4. Speaking of which, here’s another excellent article, this one in The American Interest, about the results of “wargames” trying to simulate what happens in the extremely likely event that Trump contests a loss of the November election. Notably, the article sets out six steps that could be taken over the next few months to decrease the chance of a crisis next to which all the previous crises of 2020 will pale.
  5. A reader asked me to share a link to an algorithm competition, related to cryptographic “proofs of time,” that ends on August 31. Apparently, my having shared a link to a predecessor of this competition—at the request of friend-of-the-blog Bram Cohen—played a big role in attracting good entries.
  6. Huge congratulations to my former PhD student Shalev Ben-David, as well as Eric Blais, for co-winning the FOCS’2020 Best Paper Award—along with two other papers—for highly unconventional work about a new minimax theorem for randomized algorithms. (Ben-David and Blais also have a second FOCS paper, which applies their award paper to get the first tight composition theorem for randomized query complexity. Here’s the full list of FOCS papers—lots of great stuff, for a conference that of course won’t physically convene!) Anyway, a central idea in Ben-David and Blais’s new work is to use proper scoring rules to measure the performance of randomized algorithms—algorithms that now make statements like “I’m 90% sure that this is a yes-input,” rather than just outputting a 1-bit guess. Notably, Shalev tells me that he learned about proper scoring rules by reading rationalist blogs. So next time you lament your untold hours sacrificed to that pastime, remind yourself of where it once led!
  7. What have I been up to lately? Besides Busy Beaver, hanging out with my kids, and just trying to survive? Mostly giving a lot of Zoom lectures! For those interested, here’s a Q&A that I recently did on the past and present of quantum computing, hosted by Andris Ambainis in Latvia. It did feel a bit surreal when my “interviewer” asked me to explain how I got into quantum computing research, and my answer was basically: “well, as you know, Andris, a lot of it started when I got hold of your seminal paper back in 1999…”

by Scott at August 09, 2020 08:36 AM UTC

Authors: Shay Kutten, William K. Moses Jr., Gopal Pandurangan, David Peleg
Download: PDF
Abstract: This paper concerns designing distributed algorithms that are singularly optimal, i.e., algorithms that are simultaneously time and message optimal, for the fundamental leader election problem in networks. Our main result is a randomized distributed leader election algorithm for asynchronous complete networks that is essentially (up to a polylogarithmic factor) singularly optimal. Our algorithm uses $O(n)$ messages with high probability and runs in $O(\log^2 n)$ time (with high probability) to elect a unique leader. The $O(n)$ message complexity should be contrasted with the $\Omega(n \log n)$ lower bounds for the deterministic message complexity of leader election algorithms (regardless of time), proven by Korach, Moran, and Zaks (TCS, 1989) for asynchronous algorithms and by Afek and Gafni (SIAM J. Comput., 1991) for synchronous networks. Hence, our result also separates the message complexities of randomized and deterministic leader election. More importantly, our (randomized) time complexity of $O(\log^2 n)$ for obtaining the optimal $O(n)$ message complexity is significantly smaller than the long-standing $\tilde{\Theta}(n)$ time complexity obtained by Afek and Gafni and by Singh (SIAM J. Comput., 1997) for message optimal (deterministic) election in asynchronous networks.

In synchronous complete networks, Afek and Gafni showed an essentially singularly optimal deterministic algorithm with $O(\log n)$ time and $O(n \log n)$ messages. Ramanathan et al. (Distrib. Comput. 2007) used randomization to improve the message complexity, and showed a randomized algorithm with $O(n)$ messages and $O(\log n)$ time (with failure probability $O(1 / \log^{\Omega(1)}n)$). Our second result is a tightly singularly optimal randomized algorithm, with $O(1)$ time and $O(n)$ messages, for this setting, whose time bound holds with certainty and message bound holds with high probability.

at August 09, 2020 11:22 PM UTC

Authors: Philipp Schepper
Download: PDF
Abstract: The currently fastest algorithm for regular expression pattern matching and membership improves the classical O(nm) time algorithm by a factor of about log^{3/2}n. Instead of focussing on general patterns we analyse homogeneous patterns of bounded depth in this work. For them a classification splitting the types in easy (strongly sub-quadratic) and hard (essentially quadratic time under SETH) is known. We take a very fine-grained look at the hard pattern types from this classification and show a dichotomy: few types allow super-poly-logarithmic improvements while the algorithms for the other pattern types can only be improved by a constant number of log-factors, assuming the Formula-SAT Hypothesis.

at August 09, 2020 12:00 AM UTC

Authors: Bhaskar Ray Chaudhury, Jugal Garg, Peter McGlaughlin, Ruta Mehta
Download: PDF
Abstract: We study the fair division problem of allocating a mixed manna under additively separable piecewise linear concave (SPLC) utilities. A mixed manna contains goods that everyone likes and bads that everyone dislikes, as well as items that some like and others dislike. The seminal work of Bogomolnaia et al. [Econometrica'17] argue why allocating a mixed manna is genuinely more complicated than a good or a bad manna, and why competitive equilibrium is the best mechanism. They also provide the existence of equilibrium and establish its peculiar properties (e.g., non-convex and disconnected set of equilibria even under linear utilities), but leave the problem of computing an equilibrium open. This problem remained unresolved even for only bad manna under linear utilities.

Our main result is a simplex-like algorithm based on Lemke's scheme for computing a competitive allocation of a mixed manna under SPLC utilities, a strict generalization of linear. Experimental results on randomly generated instances suggest that our algorithm will be fast in practice. The problem is known to be PPAD-hard for the case of good manna, and we also show a similar result for the case of bad manna. Given these PPAD-hardness results, designing such an algorithm is the only non-brute-force (non-enumerative) option known, e.g., the classic Lemke-Howson algorithm (1964) for computing a Nash equilibrium in a 2-player game is still one of the most widely used algorithms in practice.

Our algorithm also yields several new structural properties as simple corollaries. We obtain a (constructive) proof of existence for a far more general setting, membership of the problem in PPAD, rational-valued solution, and odd number of solutions property. The last property also settles the conjecture of Bogomolnaia et al. in the affirmative.

at August 09, 2020 12:00 AM UTC

Authors: R. Karasev, A. Skopenkov
Download: PDF
Abstract: A low-dimensional version of our main result is the following `converse' of the Conway-Gordon-Sachs Theorem on intrinsic linking of the graph $K_6$ in 3-space:

For any integer $z$ there are 6 points $1,2,3,4,5,6$ in 3-space, of which every two $i,j$ are joint by a polygonal line $ij$, the interior of one polygonal line is disjoint with any other polygonal line, the linking coefficient of any pair disjoint 3-cycles except for $\{123,456\}$ is zero, and for the exceptional pair $\{123,456\}$ is $2z+1$.

We prove a higher-dimensional analogue, which is a `converse' of a lemma by Segal-Spie\.z.

at August 09, 2020 12:00 AM UTC

Authors: Henrique Botelho, Francisco Zampirolli, Valério Ramos Batista
Download: PDF
Abstract: We present an algorithm to find near-optimal weighted Steiner minimal trees in the plane. The algorithm is implemented in Evolver programming language, which already contains many built-in energy minimisation routines. Some are invoked in the program, which enable it to consist of only 183 lines of source code. Our algorithm reproduces the physical experiment of a soap film detaching from connected pins towards a stable configuration. In the non-weighted case comparisons with GeoSteiner are drawn for terminals that form a pattern.

at August 09, 2020 12:00 AM UTC

Authors: Arnab Bhattacharyya, Sutanu Gayen, Saravanan Kandasamy, Ashwin Maran, N. V. Vinodchandran
Download: PDF
Abstract: We study the problem of efficiently estimating the effect of an intervention on a single variable (atomic interventions) using observational samples in a causal Bayesian network. Our goal is to give algorithms that are efficient in both time and sample complexity in a non-parametric setting.

Tian and Pearl (AAAI `02) have exactly characterized the class of causal graphs for which causal effects of atomic interventions can be identified from observational data. We make their result quantitative. Suppose P is a causal model on a set $\vec{V}$ of n observable variables with respect to a given causal graph G with observable distribution $P$. Let $P_x$ denote the interventional distribution over the observables with respect to an intervention of a designated variable X with x. Assuming that $G$ has bounded in-degree, bounded c-components ($k$), and that the observational distribution is identifiable and satisfies certain strong positivity condition, we give an algorithm that takes $m=\tilde{O}(n\epsilon^{-2})$ samples from $P$ and $O(mn)$ time, and outputs with high probability a description of a distribution $\hat{P}$ such that $d_{\mathrm{TV}}(P_x, \hat{P}) \leq \epsilon$, and:

1. [Evaluation] the description can return in $O(n)$ time the probability $\hat{P}(\vec{v})$ for any assignment $\vec{v}$ to $\vec{V}$

2. [Generation] the description can return an iid sample from $\hat{P}$ in $O(n)$ time.

We also show lower bounds for the sample complexity showing that our sample complexity has an optimal dependence on the parameters $n$ and $\epsilon$, as well as if $k=1$ on the strong positivity parameter.

at August 09, 2020 11:21 PM UTC

We lost a great computer scientist.

Frances Allen was one of the leaders who helped create the field of compilers research. Fran was an elite researcher at IBM, and won a Turing Award for this pioneering work. Allen also collected other awards.

Perhaps the coolest award is one that Fran could never win: The Frances E. Allen award created this year by the IEEE:

For innovative work in computing leading to lasting impact on other fields of engineering, technology, or science.

Today we will talk about Fran, who sadly just passed away.

I consider Fran a friend, although we never worked together—our areas of interest were different. One fond memory of mine is being on panel a while ago with Fran. What a delightful presence.

Fran always seemed to be smiling, always with that smile. The following images come in large part from interviews and lectures and award events-the fact that it is so easy to find them is a testament to her public engagement as well as scientific contributions.

Compliers were Key

There was a time when compliers were the most important program available on any new computer. Perhaps on any computer. Here is proof:

  1. Computers are there to solve problems.

  2. We must write programs to solve problems.

  3. Details of the computer and instructions, are complicated, which make writing programs hard.

  4. Thus, automating the writing of programs is important.

  5. QED: we must have compliers.

Okay this is not really a “proof”, but there is some truth to the argument. Fran was at IBM and worked on some of the early compliers, including FORTRAN and related languages. IBM wanted to sell computers, well actually in the early days rent them. One potential roadblock, IBM realized, was that new computers could be hard to program. Thus to ensure that companies rented new machines as fast as IBM could manufacture them was important. This created the need for compliers and even more for optimizing compilers.

In order to ship more machines, the code that a complier created had to be efficient. Hence, a stress on Allen was to figure out how compliers could generate high quality code. This led Fran and others like John Cocke to discover many complier techniques that are still used today. A short list of the ideas is:

  • Branch prediction

  • Register allocation

  • Control flow graphs

  • Program dependence graphs

  • And many more.

What is so important is that Allen’s work was not just applicable to this machine or that language. Rather the work was able to be used for almost any machine and for almost any language. This universal nature of the work on compliers reminds me of what we try to do in theory. Allen’s research was so important because it could be used for future hardware as well as future languages.

Register Allocation

Guy Steele interviewed Allen for the ACM here. During the interview Fran talked about register allocation:

I have a story about register allocation. FORTRAN back in the 1950’s had the beginnings of a theory of register allocation, even though there were only three registers on the target machine. Quite a bit later, John Backus became interested in applying graph coloring to allocating registers; he worked for about 10 years on that problem and just couldn’t solve it. I considered it the biggest outstanding problem in optimizing compilers for a long time. Optimizing transformations would produce code with symbolic registers; the issue was then to map symbolic registers to real machine registers, of which there was a limited set. For high-performance computing, register allocation often conflicts with instruction scheduling. There wasn’t a good algorithm until the Chaitin algorithm. Chaitin was working on the PL compiler for the 801 system. Ashok Chandra, another student of Knuth’s, joined the department and told about how he had worked on the graph coloring problem, which Knuth had given out in class, and had solved it not by solving the coloring problem directly, but in terms of what is the minimal number of colors needed to color the graph. Greg immediately recognized that he could apply this solution to the register allocator issue. It was a wonderful kind of serendipity.

Compliers Create Theory

The early goal of creating compliers lead directly to some wonderful theory problems. One whole area that dominated early theory research was language theory. In particular understanding questions that arise in defining programming languages. Syntax came first—later semantics was formalized.

Noam Chomsky created context-free grammars to help understand natural languages in the 1950s. His ideas were used by John Backus, also a Turing award winner from IBM, to describe the then new programming language IAL. This is known today as ALGOL 58, which became ALGOL 60. Peter Naur on the ALGOL 60 committee called Backus’s notation for ALGOL’s syntax: Backus normal form, but is now called BNF-Backus-Naur form.

Theorists worked on, I confess I did, many questions about such languages. Existence problems, decidability problems, efficient algorithms, and closure properties were just some of the examples. Not clear how much of this theory effected compiler design, but I would like to think that some was useful. Theorist should thank the complier researchers. I do.

For instance the 1970 STOC program had many papers on language related topics—here are some:

  • A Result on the Relationship between Simple Precedence Languages and Reducing Transition Languages. Gary Lindstrom

  • The Design of Parsers for Incremental Language Processors: Ronald Book, Sheila Greibach, Ben Wegbreit

  • Tape- and Time-Bounded Turing Acceptors and AFLs. David Lewis

  • Closure of Families of Languages under Substitution Operators. William Rounds

  • Tree-Oriented Proofs of Some Theorems on Context-Free and Indexed Languages. Barry Rosen

  • On Syntax-Directed Transduction and Tree Transducers. Alfred Aho, Jeffrey Ullman

  • The Analysis of Two-Dimensional Patterns using Picture Processing Grammars. Jean-Francois Perrot

  • On Some Families of Languages Related to the Dyck Language. Joseph Ullian

  • Three Theorems on Abstract Families of Languages. Joseph Ullian

By the way Abstract Families of Languages or AFLs was created by Seymour Ginsburg and Sheila Greibach in 1967 as a way to generalize context free languages.

Open Problems

Fran was asked by Steele in that interview: Any advice for the future?

Yes, I do have one thing. Students aren’t joining our field, computer science, and I don’t know why. It’s just such an amazing field, and it’s changed the world, and we’re just at the beginning of the change. We have to find a way to get our excitement out to be more publicly visible. It is exciting, in the 50 years that I’ve been involved, the change has been astounding.

Thanks Fran. Much of that change is due to pioneers like you. Thanks for everything.

by rjlipton at August 08, 2020 02:40 PM UTC

I spent the last few days participating in the Canadian Conference in Computational Geometry, originally planned for Saskatoon but organized virtually instead.

The way the conference was organized was that (after the usual submission reviewing process) the accepted authors provided both a proceedings paper and a 10-15 minute talk video to the conference organizers. Participants were required to register, but with no registration fee, and were provided with links to the papers and talks (which are all still live on the conference program). Then, during the conference itself, live online Zoom sessions ran for only 2-3 hours daily, scheduled for 10AM-1PM Saskatoon time: very convenient for anywhere in North America or Europe, not so much for the participants in Iran, India, China, Japan, and Korea (all of which did have participants). The sessions included a daily 1-hour live invited talk, and question-and-answer sessions for the contributed works, in which we were shown a one-minute teaser for each video and then invited to ask questions of authors, at least one of whom was required to be present.

I think it all worked very well; so well, in fact, that during the business meeting there were calls for having at least some of the content similarly online so that people could participate remotely again. The ability to ask and answer questions either by live video on Zoom or through Zoom chat was useful, and used. Attendance was far above previous levels: 162 registrants, and over 120 unique participants at the most heavily attended of the two daily parallel sessions, compared to roughly 60 attendees each at the last two physical conferences. Despite the free registration, the conference organization was not without cost: they spent roughly $1500 (Canadian) in video production costs and video conferencing fees, but this was more than made up for by institutional support for the conference, so they ended up running a surplus which may (if it can be kept) end up providing some float for future conference organizers.

There are two changes I would suggest for future events of this type:

  • The contributed sessions were for a short enough time that holding them in parallel seemed unnecessary, and made it impossible to participate in all discussions. So this format may work better with less parallelism.

  • The one-minute teaser videos were cut together by the conference organizers from the longer videos provided by the authors, but in some cases the pacing of the longer videos from which they were cut meant that these teaser videos could not clearly state the results of their papers. I think it would have been better to ask authors to provide these alongside the longer talk videos.

Some of the highlights of the event:

  • Wednesday’s invited talk by Erik Demaine was a moving tribute to Godfried Toussaint and a survey of some of both Toussaint’s research, and Erik’s research on problems started by Toussaint, including convexifying polygons by flips, sona curves (the subject of one of my own contributions), the geometry of musical rhythms, and perhaps most importantly “supercollaboration”, the model of shared research and shared authorship developed by Toussaint at the Barbados research workshops and also used by Erik within many of his MIT classes. Erik’s talk was recorded and is now on YouTube; I hope the same will be true of the other two invited talks.

  • In Wednesday’s contributed session on unfolding (in which I had two papers) I particularly liked Satyan Devadoss’s talk, “Nets of higher-dimensional cubes”. The main result is that if you unfold a hypercube, then no path of facets of the unfolded shape can contain a u-turn: if the path takes a step in one any coordinate direction, it cannot step in the opposite direction. This implies that all dual spanning trees unfold flat without self-intersection. The same property was known for all the Platonic solids, and Satyan can prove it for regular simplices in any dimension, but that still leaves several regular polytopes for which it is still open whether all unfoldings work: the cross-polytopes in all dimensions, and the three exceptional regular polytopes in four dimensions.

  • Thursday’s invited talk was by Jeff Erickson, “Chasing puppies”. It was an entertaining presentation of an elegant topological proof of the following result: if you and a puppy can both move around a simple closed curve in the plane, with the puppy always moving along the curve to a local minimum of distance to you, then you can always find a path to follow that will bring you and the puppy to the same point.

  • There wasn’t an official prize for best contributed presentation, but in the data structures session on Thursday, several comments nominated Don Sheehy as the unofficial winner, for a video artfully mixing live action with computer animations. His paper, “One-hop greedy permutations” concerned heuristics for improving the farthest-first traversal of a set of points by looking near each point as the sequence is constructed for a better point to use instead.

  • There was an official best student paper award, and it went to Alejandro Flores Velazco for his paper “Social distancing is good for points too!” It concerns the problem of reducing the size of a data set while preserving the quality of nearest-neighbor classification using the reduced set. It proves that FCNN, which it calls the most popular heuristic for this problem, can produce significantly less-reduced outputs than some other proven heuristics, and shows how to modify FCNN to get a heuristic with guaranteed output quality.

  • The third of the invited sessions was by Yusu Wang, newly moved from Ohio State to UC San Diego. She gave a nice introduction to combinatorial methods for reconstructing road networks (or other networks embedded into higher-dimensional geometry) from noisy samples of points on the network by combining discrete Morse theory to find the ridge lines of sample density with persistent homology to clean some of the noise from the data.

  • The final technical component of the conference was an open problem session, also recorded and presumably to be uploaded at some point. Satyan posed his question on regular polytope unfolding there. Mike Paterson asked whether one can construct “Plato’s torus”, an embedded torus with six equilateral-triangle faces meeting at each vertex; in a blog post I made on this problem in 2009 I traced its history to Nick Halloway in 1997 but Mike says he discussed it already with Christopher Zeeman in the 1970s. Another problem that caught my attention asked for an algorithmic version of the polyhedral theorem of the three geodesics, the existence of a path across the surface of a convex polyhedron that stays straight across each edge or face of the polyhedron, and has at most \(\pi\) surface angle on each side of it when it passes through a vertex. Again there’s some history here: Joe O’Rourke says he once mentioned the problem to Gromov, who said it was easy but unfortunately didn’t elaborate.

CCCG 2021 is planned for Halifax, colocated with WADS. One somewhat controversial issue is that the current plan is to have both conferences overlap for two days, with one overlap-free day for each conference at each end of the overlap period. But if both conferences are double-session, this means that participants can only choose one of four overlapping talks. At this point everyone is still hoping that events allow for a physical conference by then but that remains to be seen.

(Discuss on Mastodon)

by David Eppstein at August 07, 2020 05:46 PM UTC

This is a non-partisan post. Even so:  I plan to vote for Joe Biden.

(Lance says that whenever I write `this is a non-partisan post'  its partisan anyway.)

I've had posts on predicting VPs before:

In this post I looked at a Nate Silver column a few months ago where they were trying to predict the democratic VP nomination before the Prez nominee was known. Some of what they said seems relevant.

In this post I PREDICTED that bets on INTRADE would not do well on picking VPs since VP picks are hard to predict and often are not from anyone's short list. I DID NOT PREDICT that INTRADE would go out of business.

I came across a statistic recently that seems relevant and inspired this post. Actually this post asks IS this statistic relevant?

From Prez candidate  Harry Truman to Prez candidate Hillary Clinton there have been 15 Dem VP nominees (not  counting incumbents) of which 13 have been Senators:

Harry Truman(Prez)-Alben Barkely. Sen from Kentucky
Adelaide Stevenson(Gov-Illinois) -John Sparkman. Sen from Alabama
Adelaide Stevenson(Gov-Illinois)-Estes Kefauver, Sen from Tennessee 
John F Kennedy(Sen-Mass)-Lyndon B Johnson, Sen from Texas
Lyndon B Johnson(Prez)-Hubert Humphrey,Sen from Minnesoda
Hubert Humphrey(VP)-Ed Muskie, Sen from Maine
George McGovern Sen-South Dakota)-Sgt Shriver Amb to France, Office of Econ Activity
Jimmy Carter(Gov Georgia)-Walter Mondale- Senator from Minnesota
Walter Mondale(Senator Minnesota)-Geraldine Ferraro Congressperson- NY
Mike Dukakis (Gov Mass), Lloyd Benson-Sen Texas
Bill Clinton(Gov Arkansas), Al Gore Sen from Tennesee
Al Gore (VP), Joe Lieberman Senator from Conn
John Kerry (Sen-Mass), John Edwards Sen from NC
Barack Obama (Sen-Illinois), Joe Biden Sen from Del
Hillary Clinton (Sen-NY), Tom Kaine- Sen Virginia

Of the 15 Dem VP nominees, 13 are Senators

Of the 13 Rep VP nominees, 4 are Senators  see here

My Speculations: 

1)  When a gov runs he wants someone with fed experience. That explains 5 of the cases above.

2) Senators tend to be better known than other politicians, so perhaps being well known is what you need. Note on the Republican Side Paul Ryan was a House member but was well known.

3) Why do the Reps and Dems differ on this? Is the sample size too small to make this question interesting? 

So here is the question: When trying to predict who the Dem VP will be (this year or any year) should being a Senator give someone a plus? A higher weight in a Neural Net? 

Kamela Harris is the favorite on the betting markets.  Is this because she is a Senator?

She has a national profile and has already been vetted since she rana serious campaign  for Prez in the primaries. So I would say that THATS the reason (and other reasons), not that she is a Senator. But would she have been able to run a serious campaign  in the primaries if she wasn't already a Senator? I ask non rhetorically.

by gasarch ( at August 06, 2020 06:59 PM UTC

The 4th Workshop on Local Algorithms (WoLA 2020) recently concluded: aimed at “fostering dialogue and cross-pollination of ideas between the various communities” related to local algorithms, broadly construed, it featured invited and contributed talks on a variety of topics, many (if not most) very relevant to the sublinear algorithms and property testing community.

Because of the online format of the workshop (imposed by the current circumstances), all talks were recorded and posted online. As such, all videos all available on the workshop’s website and YouTube channel: a good list of resources to peruse!

by Clement Canonne at August 06, 2020 12:33 AM UTC

Peter Winkler is a world expert on mathematical puzzles (he is also an excellent researcher and this year’s resident mathematician of the MoMath). I just learned about two things that he is currently up to. Winkler is running Mind-Benders for the Quarantined. After signing up to this free service, you receive a mathematical puzzle every […]

by Adam Sheffer at August 05, 2020 10:51 PM UTC

Years ago when I learned about Google PageRank algorithm, my first reaction was this is not the way it should be done! There should be some proof. This probably just shows that my CS education was too theoretical ;). Years later I have learned that indeed there are some nice tools to argue about the running time of PageRank algorithm. And very recently we were able to give some new parallel (in MPC model) algorithms for computing vanilla PageRank. We improved the number of rounds needed from O(log n) to O(log^2 log n) time. You can hear Solbodan talking out it here: .

by sank at August 05, 2020 06:55 PM UTC

Background: Suppose we are interested in computing the distance between two vertices in a graph. Under edge or node differential privacy, this problem is not promising because the removal of a single edge can make distances change from 1 to \(n − 1\) or can even disconnect the graph. However, a different setting that makes sense to consider is that of a weighted graph \((G, w)\) whose topology \(G = (V, E)\) is publicly known but edge weight function \(w : E \to \mathbb{R}^+\) must be kept private. (For instance, consider transit times on a road network. The topology of the road network may be publicly available as a map, but the edge weights corresponding to transit times may be based on private GPS locations of individual cars.)

Suppose that two weight functions \(w\) and \(w’\) of the same graph \(G\) are considered to be neighbors if they differ by at most 1 in \(\ell^1\) norm. Then the length of any fixed path is sensitivity-one, so the distance between any pair of vertices is also sensitivity-one and can be released privately via the Laplace mechanism. But what if we want to release all \(\Theta(n^2)\) distances between all pairs of vertices in \(G\)? We can do this with accuracy roughly \(O(n \log n )/\varepsilon\) by adding noise to each edge, or roughly \(O(n \sqrt{\log(1/\delta)}/\varepsilon)\) using composition theorems. Both of these are roughly \(n/\varepsilon\). But is this linear dependence on \(n\) inherent, or is it possible to release all-pairs distances with error sublinear in \(n\)?

This setting and question were considered in [S16].

Problem 1: Let \(G\) be an arbitrary public graph, and \(w : E \to \mathbb{R}^+\) be an edge weight function. Can we release approximate all-pairs distances in \((G, w)\) with accuracy sublinear in \(n\) while preserving the privacy of the edge weight function, where two weight functions \(w, w’\) are neighbors if \(\|w − w’\|_1 \le 1\)? Or can we show that any private algorithm must have error \(\Omega(n)\)? A weaker (but nontrivial) lower bound would also be nice.

Reward: A bar of chocolate.

Other related work: [S16] provided algorithms with better error for two special cases, trees and graphs of a priori bounded weight. For trees, it is possible to release all-pairs distances with error roughly \(O(\log^{1.5} n)/\varepsilon\), while for arbitrary graphs with edge weights restricted to the interval \([0,M]\), it is possible to release all-pairs distances with error roughly \(O( \sqrt{nM\varepsilon^{-1}\log(1/\delta)})\)

Submitted by Adam Sealfon on April 9, 2019.

by Audra McMillan at August 05, 2020 06:00 PM UTC

Should Paper X cite Paper Y as previous or concurrent/independent work? This is sometimes tricky: maybe Paper Y circulated privately before Paper X, maybe the authors of Paper X knew about Paper Y maybe not — nobody can know for sure. One can say that the authors of Paper Y should have posted Paper Y online earlier to prevent this issue, but that is not standard practice and might lead to other problems, including Paper Y never getting published!

I propose the following guiding principle:

“If a different accept/reject outcome would have forced paper X to cite paper Y as previous work, then paper X should cite paper Y as previous work.”

The reasons behind my principle seem to me especially valid in the fast-moving theoretical computer science community, where papers are typically sent to conferences and thus seen by the entire program committee plus around 3 external referees, who are typically experts — only to be rejected. Moreover, the progress is extremely fast, with the next conference cycle making obsolete a number of papers in just the previous cycle.

by Manu at August 05, 2020 02:50 PM UTC

We consider the problem of testing asymmetry in the bounded-degree graph model, where a graph is called asymmetric if the identity permutation is its only automorphism. Seeking to determine the query complexity of this testing problem, we provide partial results. Considering the special case of $n$-vertex graphs with connected components of size at most $s(n)=\Omega(\log n)$, we show that the query complexity of $\epsilon$-testing asymmetry (in this case) is at most $O({\sqrt n}\cdot s(n)/\epsilon)$, whereas the query complexity of $o(1/s(n))$-testing asymmetry (in this case) is at least $\Omega({\sqrt{n/s(n)}})$. In addition, we show that testing asymmetry in the dense graph model is almost trivial.

at August 04, 2020 09:41 PM UTC

Integration by parts is a highlight of any calculus class. It leads to multiple classical applications for integration of logarithms, exponentials, etc., and it is the source of an infinite number of exercises and applications to special functions. In this post, I will look at a classical discrete extension that is useful in machine learning and optimization, namely Abel transformation, with applications to convergence proofs for the (stochastic) subgradient method. Next month, extensions to higher dimensions will be considered, with applications to score functions [2, 3] and randomized smoothing [4, 5].

Abel transformation: from continuous to discrete

The most classical version of integration by parts goes as follows. Given two continuously differentiable functions from \(\mathbb{R}\) to \(\mathbb{R}\), we have: $$ \int_a^b \!\!\!\!f(x)g'(x) dx = \Big[ f(x) g(x) \Big]_a^b \!-\! \int_a^b\!\!\! \!f'(x) g(x) dx = f(b) g(b)\, – f(a)g(a)-\! \int_a^b\! \!\!\! f'(x) g(x) dx.$$ This is valid for less regular functions, but this is not the main concern here. The proof follows naturally from the derivative of a product, but there is a nice “proof without words” (see, e.g., [1, p. 42] or here).

There is a discrete analogue referred to as Abel transformation or summation by parts, where derivatives are replaced by increments: given two real-valued sequences \((a_n)_{n \geq 0}\) and \((b_n)_{n \geq 0}\) (the second sequence could also be taken vector-valued), we can expand $$ \sum_{k=1}^n a_k ( b_k\, – b_{k-1}) =\sum_{k=1}^n a_k b_k \ – \sum_{k=1}^n a_k b_{k-1} = \sum_{k=1}^n a_k b_k \ – \sum_{k=0}^{n-1} a_{k+1} b_{k},$$ using a simple index increment in the second sum. Rearranging terms, this leads to $$ \sum_{k=1}^n a_k ( b_k\, – b_{k-1}) = a_n b_n \ – a_0 b_0\ – \sum_{k=0}^{n-1} ( a_{k+1} – a_{k } ) b_k.$$ In other words, we can transfer the first-order difference from the sequence \((b_k)_{k \geq 0}\) to the sequence \((a_k)_{k \geq 0}\). A few remarks:

  • Warning! It is very easy/common to make mistakes with indices and signs.
  • I gave the direct proof but a proof through explicit integration by part is also possible, by introducing the piecewise-constant function \(f\) equal to \(a_k\) on \([k,k+1)\), and \(g\) continuous piecewise affine equal to \(b_{k} + (t-k) ( b_{k+1}-b_{k})\) for \(t \in [k,k+1]\), and integrating between \(0+\) and \(n+\).

There are classical applications for the convergence of series (see here), but in this post, I will show how it can lead to an elegant result for stochastic gradient descent for non-smooth functions and decaying step-sizes.

Decaying step-sizes in stochastic gradient descent

The Abel summation formula is quite useful when analyzing optimization algorithms, and we give a simple example below. We consider a sequence of random potentially non-smooth convex functions \((f_k)_{k \geq 0}\) which are independent and identically distributed functions from \(\mathbb{R}^d \) to \(\mathbb{R}\), with expectation \(F\). The goal is to find a minimizer \(x_\ast\) of \(F\) over a some convex bounded set \(\mathcal{C}\), only being given access to some stochastic gradients of \(f_k\) at well-chosen points. The most classical example is supervised machine learning, where \(f_k(\theta)\) is the loss of a random observation for the predictor parameterized by \(\theta\). The difficulty here is the potential non-smoothness of the function \(f_k\) (e.g., for the hinge loss and the support vector machine).

We consider the projected stochastic subgradient descent method. The deterministic version of this method dates back to Naum Shor [6] in 1962 (see nice history here). The method goes as follows: starting from some \(\theta_0 \in \mathbb{R}^d\), we perform the iteration $$ \theta_{k} = \Pi_{ \mathcal{C} } \big( \theta_{k-1} – \gamma_k \nabla f_k(\theta_{k-1}) \big),$$ where \(\Pi_{ \mathcal{C}}: \mathbb{R}^d \to \mathbb{R}^d\) is the orthogonal projection onto the set \(\mathcal{C}\), and \(\nabla f_k(\theta_{k-1})\) is any subgradient of \(f_k\) at \(\theta_{k-1}\).

One step of projected (sub)gradient descent: from a vector \(\theta\), we go down the direction of a negative subgradient \(\nabla f(\theta)\) of the function \(f\) (here typically a random function) and an orthogonal projection is performed to obtain the new vector \(\theta_+\).

We make the following standard assumptions: (a) the set \(\mathcal{C}\) is convex and compact with diameter \(\Delta\) (with respect to the \(\ell_2\)-norm), (b) the functions \(f_k\) are almost surely convex and \(B\)-Lipschitz-continuous (or equivalently with gradients bounded in \(\ell_2\)-norm by \(B\)). We denote by \(\theta_\ast\) a minimizer of \(f\) on \(\mathcal{C}\) (there can be multiple ones).

For non-smooth problems, choosing a constant step-size does not lead to an algorithm converging to a global minimizer: decaying step-sizes are then needed.

Convergence proof through Lyapunov functions

Since the functions \(f_k\) are non-smooth, we cannot use Taylor expansions, and we rely on a now classical proof technique dating back from the 1960’s (see, e.g., a paper by Boris Polyak [7] in Russian), that has led to several extensions in particular for online learning [8]. The proof relies on the concept of “Lyapunov functions“, often also referred to as “potential functions”. This is a non-negative function \(V(\theta_k)\) of the iterates \(\theta_k\), that is supposed to go down along iterations (at least in expectation). In optimization, standard Lyapunov functions are \(V(\theta) = F(\theta)\, – F(\theta_\ast)\) or \(V(\theta) = \| \theta \ – \theta_\ast\|_2^2\).

For the subgradient method, we will not be able to show that the Lyapunov function is decreasing, but this will lead through a manipulation which is standard in linear dynamical system analysis to a convergence proof for the averaged iterate: that is, if \(V(\theta_k) \leqslant V(\theta_{k-1})\ – W(\theta_{k-1}) + \varepsilon_k\), for a certain function \(W\) and extra positive terms \(\varepsilon_k\), then, using telescoping sums, $$ \frac{1}{n} \sum_{k=1}^n W(\theta_{k-1}) \leqslant \frac{1}{n} \big( V(\theta_0)\ – V(\theta_n) \big) + \frac{1}{n} \sum_{k=1}^n \varepsilon_k.$$ We can then either use Jensen’s inequality to get a bound on \(W \big( \frac{1}{n} \sum_{k=1}^n \theta_{k-1} \big)\), or directly get a bound on \(\min_{k \in \{1,\dots,n\}} W(\theta_{k-1})\). The first solution gives a performance guarantee for a well-defined iterate, while the second solution only shows that among the first \(n-1\) iterates, one of them has a performance guarantee; in the stochastic set-up where latex \(W\) is an expectation, it is not easily possible to know which one, so we will consider only averaging below.

Standard inequality. We have, by contractivity of orthogonal projections: $$ \|\theta_k \ – \theta_\ast\|_2^2 = \big\| \Pi_{ \mathcal{C} } \big( \theta_{k-1} – \gamma_k \nabla f_k(\theta_{k-1}) \big) – \Pi_{ \mathcal{C} } (\theta_\ast) \big\|_2^2 \leqslant \big\| \theta_{k-1} – \gamma_k \nabla f_k(\theta_{k-1}) -\ \theta_\ast \big\|_2^2.$$ We can then expand the squared Euclidean norm to get: $$ \|\theta_k – \theta_\ast\|_2^2 \leqslant \|\theta_{k-1} – \theta_\ast\|_2^2 \ – 2\gamma_k (\theta_{k-1} – \theta_\ast)^\top \nabla f_k (\theta_{k-1}) + \gamma_k^2 \| \nabla f_k(\theta_{k-1})\|_2^2.$$ The last term is upper-bounded by \(\gamma_k^2 B^2\) because of the regularity assumption on \(f_k\). For the middle term, we use the convexity of \(f_k\), that is, the function \(f_k\) is greater than its tangent at \(\theta_{k-1}\). See figure below.

Convex function above its tangent at \(\theta_{k-1}\), leading to the desired inequality.

We then obtain $$ f_k(\theta_\ast) \geqslant f_k(\theta_{k-1}) + \nabla f_k(\theta_{k-1})^\top ( \theta_{\ast} – \theta_{k-1}).$$

Putting everything together, this leads to $$ \|\theta_k \ – \theta_\ast\|_2^2 \leqslant \|\theta_{k-1}\ – \theta_\ast\|_2^2 \ – 2\gamma_k \big[ f_k(\theta_{k-1}) \ – f_k(\theta_\ast) \big] + \gamma_k^2 B^2.$$ At this point, except the last term, all terms are random. We can now take expectations, with a particular focus on the term \(\mathbb{E} \big[ f_k(\theta_{k-1}) \big]\), for which we can use the fact that the random function \(f_k\) is independent from the past, so that $$ \mathbb{E} \big[ f_k(\theta_{k-1}) \big] = \mathbb{E} \Big[ \mathbb{E} \big[ f_k(\theta_{k-1}) \big| f_{1},\dots,f_{k-1} \big] \Big] =\mathbb{E} \big[ F(\theta_{k-1}) \big] . $$ We thus get $$ \mathbb{E} \big[ \|\theta_k – \theta_\ast\|_2^2\big] \leqslant \mathbb{E} \big[ \|\theta_{k-1} – \theta_\ast\|_2^2\big] – 2\gamma_k \big( \mathbb{E} \big[ F(\theta_{k-1}) \big] – F(\theta_\ast) \big) + \gamma_k^2 B^2.$$ As above, we can now isolate the excess in function values as: $$ \mathbb{E} \big[ F(\theta_{k-1}) \big] – F(\theta_\ast) \leqslant \frac{1}{2 \gamma_k} \Big( \mathbb{E} \big[ \|\theta_{k-1} – \theta_\ast\|_2^2\big] – \mathbb{E} \big[ \|\theta_{k} – \theta_\ast\|_2^2\big] \Big) + \frac{\gamma_k}{2} B^2.$$ At this point, the “optimization part” of the proof is done. Only algebraic manipulations are needed to obtain a convergence rate. This is where Abel transformation will come in.

From fixed horizon to anytime algorithms

The lazy way. At this point, many authors (including me sometimes) will take a constant step-size \(\gamma_k = \gamma\) so as to obtain a telescopic sum, leading to $$ \frac{1}{n} \sum_{k=1}^n \mathbb{E} \big[ F(\theta_{k-1}) \big] – F(\theta_\ast) \leqslant \frac{1}{2n\gamma} \Big( \mathbb{E} \big[ \|\theta_{0} \ – \theta_\ast\|_2^2\big] – \mathbb{E} \big[ \|\theta_{n}\ – \theta_\ast\|_2^2\big] \Big) + \frac{\gamma}{2} B^2,$$ which is less than \(\displaystyle \frac{\Delta^2}{2n \gamma} + \frac{\gamma}{2} B^2\), and minimized for \(\displaystyle \gamma = \frac{ \Delta}{B \sqrt{n}}\), leading to a convergence rate less than \(\displaystyle \frac{ B \Delta}{\sqrt{n}}\). Using Jensen’s inequality, we then get for \(\bar{\theta}_n = \frac{1}{n} \sum_{k=1}^n \theta_{k-1}\): $$\mathbb{E} \big[ F(\bar{\theta}_{n}) \big] – F(\theta_\ast) \leqslant \frac{ B \Delta}{\sqrt{n}} .$$ This result leads to the desired rate but can be improved in at least one way: the step-size currently has to depend on the “horizon” \(n\) (which has to be known in advance), and the algorithm is not “anytime”, which is not desirable in practice (where one often launches an algorithm and stops it when it the performance gains have plateaued or when the user gets bored waiting).

Non-uniform averaging. Another way [9] is to consider the non-uniform average $$ \eta_{k} = \frac{\sum_{k=1}^n \gamma_{k} \theta_{k-1}}{\sum_{k=1}^n \gamma_{k}}, $$ for which telescoping sums apply as before, to get $$ \mathbb{E} \big[ F(\eta_k) \big] – F(\theta_\ast) \leqslant \frac{1}{2} \frac{\Delta^2 + B^2 \sum_{k=1}^n \gamma_k^2}{\sum_{k=1}^n \gamma_{k}}.$$ Then, by selecting a decaying step-size \(\displaystyle \gamma_k = \frac{ \Delta}{B \sqrt{k}}\), that depends on the iteration number, we get a rate proportional to \(\displaystyle \frac{ B \Delta}{\sqrt{n}} ( 1 + \log n)\). We now have an anytime algorithm, but we have lost a logarithmic term, which is not the end of the world, but still disappointing. In [9], “tail-averaging” (only averaging iterates between a constant times \(n\) and \(n\)) is proposed, that removes the logarithmic term but requires to store iterates (moreover, the non-uniform averaging puts too much weight on the first iterates, slowing down convergence).

Using Abel transformation. If we start to sum inequalities from \(k=1\) to \(k=n\), we get, with \(\delta_k = \mathbb{E} \big[ \|\theta_{k} – \theta_\ast\|_2^2\big]\) (which is always between \(0\) and \(\Delta^2\)): $$ \frac{1}{n} \sum_{k=1}^n \mathbb{E} \big[ F(\theta_{k-1}) \big] – F(\theta_\ast) \leqslant \frac{1}{n} \sum_{k=1}^n \bigg( \frac{1}{2 \gamma_k} \Big( \delta_{k-1} – \delta_k \Big)\bigg) + \frac{1}{n} \sum_{k=1}^n \frac{\gamma_k}{2} B^2,$$ which can be transformed through Abel transformation into $$ \frac{1}{n} \sum_{k=1}^n \mathbb{E} \big[ F(\theta_{k-1}) \big] – F(\theta_\ast) \leqslant \frac{1}{n} \sum_{k=1}^{n-1} {\delta_k} \bigg(\frac{1}{ 2\gamma_{k+1}}- \frac{1}{ 2\gamma_{k}} \bigg) + \frac{\delta_0}{2 \gamma_1}- \frac{\delta_t}{2 \gamma_t}+ \frac{1}{n} \sum_{k=1}^n \frac{\gamma_k}{2} B^2.$$ For decreasing step-size sequences, this leads to $$ \frac{1}{n} \sum_{k=1}^n \mathbb{E} \big[ F(\theta_{k-1}) \big] – F(\theta_\ast) \leqslant \frac{1}{n} \sum_{k=1}^{n-1} {\Delta^2} \bigg(\frac{1}{ 2\gamma_{k+1}}- \frac{1}{ 2\gamma_{k}} \bigg) + \frac{\Delta^2}{2 \gamma_1}+ \frac{1}{n} \sum_{k=1}^n \frac{\gamma_k}{2} B^2,$$ and thus $$ \frac{1}{n} \sum_{k=1}^n \mathbb{E} \big[ F(\theta_{k-1}) \big] – F(\theta_\ast) \leqslant \frac{\Delta^2 }{2 n \gamma_n} + \frac{1}{n} \sum_{k=1}^n \frac{\gamma_k}{2} B^2.$$ For \(\gamma_k = \frac{ \Delta}{B \sqrt{k}}\), this leads to an upper bound $$\frac{\Delta B }{2 \sqrt{n}} \big( 1+ \frac{1}{\sqrt{n}} \sum_{k=1}^n \frac{1}{\sqrt{k}}\big) \leqslant \frac{3 \Delta B }{2 \sqrt{n}},$$ which is up to a factor \(\frac{3}{2}\) exactly the same bound as with a constant step-size, but now with an anytime algorithm.


To illustrate the behaviors above, let’s consider minimizing \(\mathbb{E}_x \| x – \theta \|_1\), with respect to \(\theta\), with \(f_k(\theta) = \| x_k- \theta\|_1\), where \(x_k\) is sampled independently from a given distribution (here independent log-normal distributions for each coordinate). The global optimum \(\theta_\ast\) is the per-coordinate median of the distribution of \(x\)’s.

When applying SGD, the chosen subgradient of \(f_k\) has components in \(\{-1,1\}\). Hence in the plots below in two dimensions, the iterates are always on a grid. With a constant step-size: if the \(\gamma\) is too large (right), there are large oscillations, while if \(\gamma\) is too small (left), optimization is too slow. Note that while the SGD iterate with a constant step-size is always oscillating, the averaged iterate converges to some point (which is not the global optimum, and is typically at distance \(O(\gamma)\) away from it [11]).

Stochastic gradient descent (averaged or not), with constant step-size. Left: small step-size. Right: large step-size (8 times larger).

With a decaying step-size (figure below), the initial conditions are forgotten reasonably fast and the iterates converge to the global optimum (and of course, we get an anytime algorithm!).

Stochastic gradient descent (averaged or not), with decreasing step-size.

We can now compare in terms of function values, showing that a constant step-size only works well for a specific range of iteration numbers.

Comparison of expected performance for decaying and constant-step sizes. Several constant step-sizes are tested, with uniform spacings in log-scale (hence the the uniform spacings in performance for large \(n\)).


Being able to deal with decaying step-sizes and anytime algorithms is arguably not a major improvement, but quite a satisfactory one, at least to me! Discrete integration by parts is the key enabler here.

There is another rewarding aspect which is totally unrelated to integration by parts: when applied to supervised machine learning, we just obtained from elementary principles (convexity) and few calculations a generalization bound on unseen data, which is as good as regular bounds from statistics [10] that use much more complex tools such as Rademacher complexities (but typically no convexity assumptions): here, statistics considered independently from optimization is not only slower (considering the empirical risk and minimizing it using the plain non-stochastic subgradient method would lead to an \(n\) times slower algorithm) but also more difficult to analyze!


[1] Roger B. Nelsen, Proofs without Words: Exercises in Visual Thinking, Mathematical Association of America, 1997.
[2] Aapo Hyvärinen, Estimation of non-normalized statistical models by score matchingJournal of Machine Learning Research6(Apr), 695-709, 2005.
[3] Thomas M. Stoker, Consistent estimation of scaled coefficients.  Econometrica: Journal of the Econometric Society, 54(6):1461-1481, 1986.
[4] Tamir Hazan, George Papandreou, and Daniel Tarlow. Perturbation, Optimization, and Statistics. MIT Press, 2016.
[5] Quentin Berthet, Matthieu Blondel, Olivier Teboul, Marco Cuturi, Jean-Philippe Vert, Francis Bach, Learning with differentiable perturbed optimizers. Technical report arXiv 2002.08676, 2020.
[6] Naum Z. Shor. An application of the method of gradient descent to the solution of the network transportation problem. Notes of Scientific Seminar on Theory and Applications of Cybernetics and Operations ResearchUkrainian Academy of Sciences, Kiev, 9–17, 1962.
[7] Boris T. Polyak, A general method for solving extremal problems. Doklady Akademii Nauk SSSR, 174(1):33–36, 1967.
[8] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascentProceedings of the international conference on machine learning )(ICML), 2003.
[9] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, Alexander Shapiro. Robust stochastic approximation approach to stochastic programmingSIAM Journal on optimization, 19(4):1574-1609, 2009.
[10] Stéphane Boucheron, Olivier Bousquet, Gabor Lugosi. Theory of classification: A survey of some recent advancesESAIM: probability and statistics9, 323-375, 2005.
[11] Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. Annals of Statistics, 48(3):1348-1382, 2020.

by Francis Bach at August 04, 2020 03:55 PM UTC

In this work, we continue the research started in [HIMS18], where the authors suggested to study the half-duplex communication complexity. Unlike the classical model of communication complexity introduced by Yao, in the half-duplex model, Alice and Bob can speak or listen simultaneously, as if they were talking using a walkie-talkie. The motivation for such a communication model comes from the study of the KRW conjecture. Following the open questions formulated in [HIMS18], we prove lower bounds for the disjointness function in all variants of half-duplex models and an upper bound in the half-duplex model with zero, that separates disjointness from the inner product function in this setting. Next, we prove lower and upper bounds on the half-duplex complexity of the Karchmer-Wigderson games for the counting functions and for the recursive majority function, adapting the ideas used in the classical communication complexity. Finally, we define the non-deterministic half-duplex complexity and establish bounds connecting it with non-deterministic complexity in the classical model.

at August 04, 2020 03:11 PM UTC

We hope you’re all staying safe and healthy! To bring you some news (and distraction?) during this… atypical summer,here are the recent papers on property testing and sublinear algorithms we saw appear this month. Graphs, probability distributions, functions… there is a something for everyone.

On Testing Hamiltonicity in the Bounded Degree Graph Model, by Oded Goldreich (ECCC). The title sort of gives it away: this relatively short paper shows that testing whether an unknown bounded-degree graph has a Hamiltonian path (or Hamiltonian cycle) in the bounded-degree model requires a number of queries linear in \(n\), the number of nodes. The results also hold for directed graphs (with respect to directed Hamiltonian path or cycle), and are shown via a local reduction to a promise problem of satisfiability of 3CNF formulae. Also included: a complete proof of the linear lower bound for another problem, Independent Set Size; and an open problem: what is the query complexity of testing graph isomorphism in the bounded-degree model?

Local Access to Sparse Connected Subgraphs Via Edge Sampling, by Rogers Epstein (arXiv). Given access to a connected graph \(G=(V,E)\), can we efficiently provide access to some sparse connected subgraph \(G’=(V,E’)\subseteq G\) with \(|E’|\ll |E|\)? This question, well-studied in particular for the case where \(G\) had bounded degree and the goal is to achieve \(|E’|\leq (1-\varepsilon)|V|\), is the focus of this paper which provides a trade-off between the query complexity of the oracle and \(|E’|\). Specifically, for every parameter \(T\), one can give oracle access to \(G’\) with \(|E’|=O(|V|T)\), with a query complexity \(=\tilde{O}(|E|/T)\).

Switching gears, we move from graphs to probability distributions:

Tolerant Distribution Testing in the Conditional Sampling Model, by Shyam Narayanan (arXiv). In the conditional sampling model for distribution testing, which we have covered a few times on this blog, the algorithm at each step gets to specify a subset \(S\) of the domain, and observe a sample from the distribution conditioned on \(S\). As it turns out, this can speed things up a lot: as Canonne, Ron, and Servedio (2015) showed, even tolerant uniformity testing, which with i.i.d. samples requires a near-linear (in the domain size \(n\)) number of samples, can be done in a constant number of conditional queries. Well, sort of constant: no dependence on \(n\), but the dependence on the distance parameter \(\varepsilon\) was, in CRS15, quite bad: \(\tilde{O}(1/\varepsilon^{20})\). This work gets rid of this badness, and shows the (nearly) optimal \(\tilde{O}(1/\varepsilon^{2})\) query complexity! Among other results, it also generalizes it to tolerant identity testing (\(\tilde{O}(1/\varepsilon^{4})\)), for which previously no constant-query upper bound was known. Things have become truly sublinear.

Interactive Inference under Information Constraints, by Jayadev Acharya, Clément Canonne, Yuhan Liu, Ziteng Sun, and Himanshu Tyagi (arXiv). Say you want to do uniformity/identity testing (or learn, but let’s focus on testing) on a discrete distribution, but you can’t actually observe the i.i.d. samples: instead, you can only do some sort of limited, “local” measurement on each sample. How hard is the task, compared to what you’d do if you fully had the samples? This setting, which captures things like distributed testing with communication or local privacy constraints, erasure channels, etc., was well-understood from previous recent work in the non-adaptive setting. But what if the “measurements” could be made adaptively? This paper shows general lower bounds for identity testing and learning, as a function of the type of local measurement allowed: as a corollary, this gives tight bounds for communication constraints and local privacy, and shows the first separation between adaptive and non-adaptive uniformity testing, for a type of “leaky” membership query measurement.

Efficient Parameter Estimation of Truncated Boolean Product Distributions, by Dimitris Fotakis, Alkis Kalavasis, and Christos Tzamos (arXiv). Suppose there is a fixed and unknown subset \(S\) of the hypercube, a “truncation” set, which you can only accessible via membership query; and you receive i.i.d. samples from an unknown product distribution on the hypercube, truncated on that set \(S\) (for instance, because your polling strategy or experimental measurements have limitations). Can you still learn that distribution efficiently? Can you test it for various properties, as you typically really would like to? (or is it just me?) This paper identifies some natural sufficient condition on \(S\), which they call fatness, under which the answer is a resounding yes. Specifically, if \(S\) satisfies this condition, one can actually generate honest-to-goodness i.i.d. samples (non-truncated) from the true distribution, given truncated samples!

Leaving distribution testing, our last paper is on testing functions in the distribution-free model:

Downsampling for Testing and Learning in Product Distributions, by Nathaniel Harms and Yuichi Yoshida (arXiv). Suppose you want to test (or learn) a class of Boolean functions \(\mathcal{C}\) over some domain \(\Omega^n\), with respect to some (unknown) product distribution (i.e., in the distribution-free testing model, or PAC-learning model). This paper develops a general technique, downsampling, which allows one to reduce such distribution-free testing of \(\mathcal{C}\) under a product distribution to testing \(\mathcal{C}\) over \([r]^d\) under the uniform distribution, for a suitable parameter \(r=r(d,\varepsilon,\mathcal{C})\). This allows the authors, among many other things and learning results, to easily re-establish (and, in the second case, improve upon) recent results on testing of monotonicity over \([n]^d\) (uniform distribution) and over \(\mathbb{R}^d\) (distribution-free).

by Clement Canonne at August 04, 2020 02:59 AM UTC

A breakthrough on the separating words problem

Zachary Chase is a graduate student of Ben Green at Oxford. Chase has already solved a number of interesting problems–check his site for more details. His advisor is famous for his brilliant work—especially in additive combinatorics. One example is his joint work with Terence Tao proving this amazing statement:

Theorem 1 The prime numbers contain arbitrarily long arithmetic progressions.

Today we wish to report Chase’s new paper on a problem we have twice discussed before.

But first Ken wants to say something about Oxford where he got his degree long before Green arrived.

Oxford Making Waves

Green moved to Oxford in 2013. He holds a professorship associated to Magdalen College. I (Ken) did not know him when I started at Oxford in 1981. It would have been hard, as Green was only 4 years old at the time. But I did know the preteen Ruth Lawrence when she started there and even once played a departmental croquet match including her in which Bryan Birch made some epic long shots. Lawrence had joined St. Hugh’s College in 1983 at the age of twelve.

Oxford has been Dick’s and my mind more in the past six years than before. Both of us were guests of Joël Ouaknine in 2012–2015 when he was there. Oxford has developed a front-line group in quantum computation, which fits as David Deutsch’s role as an originator began from there—note my story in the middle of this recent post.

Recently Oxford has been in the news for developing a promising Covid-19 vaccine. ChAdOx1 heads Wikipedia’s list of candidate vaccines and has gone to final trials, though there is still a long evaluation process before approval for general use.

Before that, a modeling study from Oxford in March raised the question of whether many more people have had Covid-19 without symptoms or any knowledge. This kind of possibility has since been likened to a “dark matter” hypothesis, not just now regarding Covid-19 but a decade ago and before.

A main supporting argument is that a wide class of mathematical models can be fitted with higher relative likelihood if the hypothesis is true. I have wanted to take time to evaluate this argument amid the wider backdrop of controversy over inference methods in physics, but online chess with unfortunately ramped-up frequency of cheating has filled up all disposable time and more.

The Problem

Back to Chase’s new results on the following problem:

Given two distinct binary strings of length {n} there is always a finite state deterministic automaton (FSA) that accepts one and rejects the other. How few states can such a machine have?

This is called the separating words problem (SWP). Here we consider it for binary strings only.

John Robson proved {O(n^{2/5})} states are enough—we suppress any log factors. Some like to write this as {\tilde{O}(n^{2/5})}. Chase improves this to {\tilde{O}(n^{1/3})}:

Theorem 2 For any distinct {x,y \in \{0,1\}^{n}}, there is a finite state deterministic automaton with {O(n^{1/3} \log^{7} n)} states that accepts {x} but not {y}.

We previously discussed this twice at GLL. We discussed the background and early results here. The original problem is due to Pavel Goralcik and Vaclav Koubek. They proved an upper bound that was {o(n)}. Then we went over Robson’s bound here. The best upper bound was Robson’s result until Chase came along.

The Approach

All the approaches to SWP seem to have a common thread. They find some family of “hash” functions {H} so that:

  1. Any {h} in {H} can be computed by a FSA with few states.

  2. For any {x \neq y} binary strings of length {n}, there is an {h \in H} so that {h(x) \neq h(y)}.

The challenge is to find clever families that can do do both. Be easy to compute and also be able to tell strings apart. Actually this is only a coarse outline—Chase’s situation is a bit more complicated.

The Proof

We have taken the statement of Theorem 2 verbatim from the paper. It has a common pecadillo of beginning a sentence for a specific {n} but writing {O(\cdots n \cdots)} later. However, this is how we think intuitively: in terms of how the pieces of the formula behave. Chase declares right away his intent to ignore the power of {\log n}. How he gets the power {1/3} of {n} is the real point. We can convey the intuition in brief.

A length-{n} binary string can be identified with its set {A \subseteq [n]} of positions where the string has a {1}. Chase begins by showing how a power of {1/2} on {n} is obtainable by considering sets of the form

\displaystyle  A_{i,p} = \{j : j \in A \wedge j \equiv i \pmod{p}\},

where {p} is prime and {i < p}. Suppose we know a bound {k} such that for all distinct {A,B \subseteq n} (that is, all distinct binary strings of legnth {n}) there is a prime {p < k} and {i < p} such that

\displaystyle  |A_{i,p}| \neq |B_{i,p}|.

Then by the Chinese Remainder Theorem, there is a prime {q} of magnitude about {\log n} such that

\displaystyle  |A_{i,p}| \not\equiv |B_{i,p}| \pmod{q}.

Now we can make a finite automaton {M_{A,B}} with states {(j,\ell)} that always increments {j} modulo {p} and increments {\ell} modulo {q} each time it reads a {1} when {j \equiv i \pmod{p}}. Then {M_{A,B}} has order-of {pq \approx k\log n} states. The finisher is that {k = \tilde{O}(n^{1/2})} suffices. Again we ignore the pecadillo but we add some redundant words to the statement in the paper between dashes:

Lemma 3 For any distinct {A,B \subseteq [n]}—of size at most {n}—there is a prime {p = \tilde{O}(n^{1/2})} such that for some {i \in [p]}, {|A_{i,p}| \neq |B_{i,p}|.}

The power {1/2} is of course weaker than Robson’s {2/5}, but this statement conceals two “levers” that enable leap-frogging {2/5} to get {1/3}. The first is that we don’t have to limit attention to sets {A,B} that come from places where the corresponding strings {x,y} have a {1}. Consider any string {w} and take {A_w} to be the set of index positions {j} in which {x} has the substring {w} beginning at place {j}. Define {B_w} likewise for {y}. Then we can try to prove results of the following form given {m < n}:

Proposition 4 For all distinct {x,y \in \{0,1\}^n} there is {w \in \{0,1\}^m} such that {A_w \neq B_w} and

\displaystyle  |A_w|,|B_w| = O(\frac{n}{m}).

A finite automaton using this extension needs {m} states to store {w} in its finite control. The second lever is to try to prove results of this form, where now the words “of size at most {N}” are not redundant:

Lemma 5 (?) For any distinct {A,B \subseteq [n]} of size at most {N} there is a prime {p = \tilde{O}(N^{1/2})} such that for some {i \in [p]}, {|A_{i,p}| \neq |B_{i,p}|.} [Update: see note below.]

Now we need to balance the levers using the proposition and the lemma together. Since {w} will add order-{m} states to the automaton, we balance it against {p} from the previous argument. So take {m = n^{1/3}}. Then {N = \frac{n}{m} \approx n^{2/3}}. Lemma 5 then gives the bound

\displaystyle  k = \tilde{O}(N^{1/2}) = \tilde{O}(n^{1/3})

on the magnitude of the needed primes {p}. This yields the {\tilde{O}(n^{1/3})} breakthrough on SWP.

Here a famous New Yorker cartoon with the caption “If only it were so simple” comes to mind. But there is a catch. Chase is not quite able to prove lemma 5. However, the {w} lever comes with extra flexibility that enables finding {w} that make {A_w \neq B_w} and also give those sets an extra regularity property {X}. Using {X}, he is able to show the existence of good hash functions of a certain type. The modified lemma is enough to prove his new bound. The proof still uses intricate analysis including integrals.

This is classic high-power mathematics. When some idea is blocked, try to weaken the requirements. Sometimes it is possible to still proceed. It is a lesson that we sometimes forget, but a valuable one nevertheless.

Open Problems

We like the SWP and think Chase’s contribution is impressive. Note that it adds a third element {w} to {p} and {q} in the automaton. Can the argument be pushed further by finding more levers to add more elements? Is Lemma 5 true as stated, and with what (other) tradeoffs of {m} and {N} between it and Proposition 4? [Update: not for extreme tradeoffs—see this comment—but plausibly for {m,n,N} polynomially related.]

We feel there could also be interesting applications for his theorem as it stands. Is the ability to tell two strings apart with a simple device—a FSA with not many states—useful? Could it solve some open problem? It does seem like a basic insight, yet we have no candidate application. Perhaps you have an idea.

[added Q on Lemma 5 to “Open Problems”, “lower” bound –> “upper” bound in third section, update in Open Problems.]

by RJLipton+KWRegan at August 03, 2020 02:50 PM UTC

The 12th Innovations in Theoretical Computer Science (ITCS) conference will be held online from January 6-8, 2021. The submission deadline is September 7, 2020.

The program committee encourages you to send your papers our way! See the call for papers for information about submitting to the conference.

ITCS seeks to promote research that carries a strong conceptual message (e.g., introducing a new concept, model or understanding, opening a new line of inquiry within traditional or interdisciplinary areas, introducing new mathematical techniques and methodologies, or new applications of known techniques). ITCS welcomes both conceptual and technical contributions whose contents will advance and inspire the greater theory community.

Important dates

  • Submission deadline: September 7, 2020 (05:59PM PDT)
  • Notification to authors: November 1, 2020
  • Conference dates: January 6-8, 2021

by James at August 03, 2020 02:06 AM UTC

When teaching discrete math a while back I told the following story which some had already heard in High School: 

When Gauss was in 1st grade the class was being bad. So the teacher made them sit down and add up the numbers from 1 to 100. Gauss did it in 2 minutes by noting that if S was the answer then 

2S = (100+1) +(99+2) + ... + (1 + 100) = 100*101

So S = 50*101.  Then he went to Google and typed in 50*101 for the answer.

The class laughed because of course the last part about Google was false. But I then told them that the entire story was false and showed them the following slides:  here  Take a look at them (there are only 4 of them) before reading on.

(ADDED LATER: here is an article by Brian Hayes that documents the history of the story.


So I told them the Gauss Story was false (I am right about this) and then told them a lie- that the story's progression over time was orderly. I then told them that that was false (hmmm- actually I might not of, oh well).

One of my students emailed me this semester

Dr Gasarch- one of my Math professors is telling the Gauss story as if its true! You should make a public service announcement and tell people its false!

I do not think this is needed. I also don't know how one goes about making a public service announcement  I also  suspect the teacher knew it was false but told it anyway.

OKAY- what do you do if you have a nice story that has some good MATH in it but  its not true?


Tell it and let the students think its true.

Tell it and debunk it.

Tell it and debunk it and tell another myth

Tell it and debunk it and tell another myth and then debunk that

Ask your readers what they would do. Which I do now: What do you do? 

by Unknown ( at August 03, 2020 02:02 AM UTC

The last of my CCCG 2020 papers is now on the arXiv: “New Results in Sona Drawing: Hardness and TSP Separation”, arXiv:2007.15784, with Chiu, Demaine, Diomidov, Hearn, Hesterberg, Korman, Parada, and Rudoy. (As you might infer from the long list of coauthors, it’s a Barbados workshop paper.) The paper studies a mathematical formalization of the lusona drawings of southwest Africa; in this formalization, a sona curve for a given set of points is a curve that can be drawn in a single motion, intersecting itself only at simple crossings, and surrounding each given point in a separate region of the plane, with no empty regions. The paper proves that it’s hard to find the shortest one, hard even to find whether one exists when restricted to grid edges, and gives tighter bounds for the widest possible ratio between sona curve length and TSP tour length; see the preprint or the video I already posted for more information.

To save this post from being content-free, here’s a research question that we didn’t even state in the paper, let alone make any progress on solving: just how many of these curves can a given set of points have? A sona curve can be described as a 4-regular plane multigraph (satisfying certain extra conditions) together with an assignment of the given points to its bounded faces, so there are finitely many of these things up to some sort of topological equivalence. And because this is topological it shouldn’t matter where the points are placed in the plane: the number of curves should be a function only of the number of points. I tried hand-enumerating the curves for up to three points but it was already messy and I’m not certain I got them all. (In an earlier version of this post I definitely didn’t get them all — I had to update the figure below after finding more.) Here are the ones I found:

Sona curves for up to three points

If this hand enumeration is correct, then the numbers of sona curves for \(n\) labeled points form an integer sequence beginning \(1, 3, 24,\dots\) and the numbers for unlabeled points form a sequence beginning \(1, 2, 5,\dots\) but I don’t really know anything more than that for this problem.

Another research direction I don’t know much about yet: given a topological equivalence class of sona drawings, how can we find a good layout for it as an explicit drawing? There’s lots of research on drawing plane graphs nicely but it’s not clear how much of it carries over to making nice sona curves.

(Discuss on Mastodon)

by David Eppstein at August 02, 2020 11:48 PM UTC

In this paper, we propose a new conjecture, the XOR-KRW conjecture, which is a relaxation of the Karchmer-Raz-Wigderson conjecture [KRW95]. This relaxation is still strong enough to imply $\mathbf{P} \not\subseteq \mathbf{NC}^1$ if proven. We also present a weaker version of this conjecture that might be used for breaking $n^3$ lower bound for De~Morgan formulas. Our study of this conjecture allows us to partially answer an open question stated in [GMWW17] regarding the composition of the universal relation with a function. To be more precise, we prove that there exists a function $g$ such that the composition of the universal relation with $g$ is significantly harder than just a universal relation. The fact that we can only prove the existence of $g$ is an inherent feature of our approach. The paper's main technical contribution is a method of converting lower bounds for multiplexer-type relations into lower bounds against functions. In order to do this, we develop techniques to lower bound communication complexity using reductions from non-deterministic communication complexity and non-classical models: half-duplex and partially half-duplex communication models.

at August 02, 2020 06:10 AM UTC

Today we talk about the paper, Proof of Tomaszewski’s Conjecture on Randomly Signed Sums, by Nathan Keller and Ohad Klein.

Consider a unit vector a=(a_1,a_2,\dots, a_n). That is latex \sum_{i=1}^n a_i^2=1$. Consider all (2^n) signed sums \displaystyle \epsilon_1a_1+\epsilon_2a_2+\cdots +\epsilon_n a_n, where each \epsilon_k is either 1 or -1.

Theorem (Keller and Klein (2020) asked by Boguslav Tomaszewski (1986)): For at least 2^{n-1} signed sums \displaystyle |\epsilon_1a_1+\epsilon_2a_2+\cdots +\epsilon_n a_n| \le 1 .

Another way to state the theorem is that the probability of a signed sum to be in the interval [-1, 1] is at least 1/2.

To see that this is best possible consider the case that n=2 and let a_1,a_2 be non zero. For the sum in question to exceed one we need both summands to have the same sign which happens half of the times. There is another example of importance, the vector (\frac{1}{2}, \frac{1}{2}, \frac {1}{2}, \frac{1}{2}). Here 3/8 of the absolute values of signed sums (6 out of 16) are below 1 (in fact, equal to zero), 1/2 equal to 1 and 1/8 exceed 1. Holzman and Kleitman proved in 1992 that the fraction of absolute values of signed sums below 1 is always at least 3/8.

Congratulations to Nathan and Ohad. I will say a little more about the problem below but before that, a few more things.

A few more things

Luca Trevisan posted on his blog In Theory a post “Silver linings” about two cheerful pieces of news. The first one is “Karlin, Klein, and Oveis Gharan have just posted a paper in which, at long last, they improve over the 1.5 approximation ratio for metric TSP which was achieved, in 1974, by Christofides.”

The second  one is about breaking the logarithmic barrier for Roth’s theorem that we wrote about here. This was also discussed by Bill Gasarch on Computational Complexity. In the comment section of my post there is an interesting discussion regarding timetable for future achievements and how surprising they would be.

The third is about Ron Graham, a friend and a mathematical giant who passed away a few days ago. Here is a moving post by Fan Chung, a web page for Ron set by Fan, and a blog post by Dick and Ken on GLL.

The fourth is that there is a nice collection of open problems on Boolean functions that is cited in the paper of Nathan and Ohad:  Y. Filmus, H. Hatami, S. Heilman, E. Mossel, R. O’Donnell, S. Sachdeva, A. Wan, and K. Wimmer, Real analysis in computer science: A collection of open problems.

The fifth is that both our (HUJI) combinatorics seminar and basic notions seminar are running and are recorded. Here are the links. (Hmm, the links are not yet available, I will update.)

Back to the result of Keller and Klein

Daniel Kleitman and Ron Holzman

A quick orientation

If the a_is are all the same, or small, or random, then to compute the probability that the weighted sum is between -1 and 1, we can use some Gaussian approximation and then we will find ourselves in a clash of constants that goes our way. The probability will be close to a constant well above 1/2. So what we need to understand is the case where some a_is are large.

Early papers on the problem

The problem first appeared in the American Math Monthly.  Richard Guy collected several problems  and challenged the readers Any Answers Anent These Analytical Enigmas? (I don’t know what the fate of the other questions is.)  Holzman and Kleitman proved in 1992 that the fraction of absolute values of signed sums below 1 is always at least 3/8, and this is tight. For many years, 3/8 was the record for the original problem, until  the 2017 paper by Ravi Boppana and Ron Holzman: Tomaszewski’s problem on randomly signed sums: Breaking the 3/8 barrier, where a lower bound of 0.406, was proved. The current record 0f 0.46 was proved in the paper Improved Bound for Tomaszewski’s Problem by Vojtěch Dvořák, Peter van Hintum, and Marius Tiba. The new definite result by Nathan and Ohad used some ideas of these early papers.

What is the crux of matters

Let me quote what the authors kindly wrote me:

“The crux of the matter is how to deal with the case of very large coefficients (a_1+a_2>1). We gave a short semi-inductive argument covering this case (this is Section 5 of the paper). The argument is only semi-inductive, as it requires the full assertion of Tomaszewski for any n'<n, and gives only the case (a_1+a_2>1) for n. But this means that if we can handle all other cases by other methods then we will be done.

The semi-inductive argument takes only 3 pages. Handling the other cases takes 72 more pages and requires several new tools, but is closer to things that were done in previous works. (Actually, after we found the 3-page argument, we were quite sure we will be able to finalize the proof; this indeed happened, but took a year).”

Most of the paper deals with the case of small coefficients. This requires several ideas and new tools.

Rademacher sums: Improved Berry-Esseen and local tail inequalities

If all coefficients are “sufficiently small”, then we can
approximate X by a Gaussian and the inequality should follow. However, using the standard Berry-Esseen bound, this holds only if all coefficients are less than 0.16.
Nathan and Ohad showed that for Rademacher sums, namely random variables of the form X=\sum a_i x_i, as discussed in the conjecture, a stronger Berry-Esseen
bound can be obtained, and this bound shows immediately that Tomaszewski’s assertion holds whenever all coefficients are less than 0.31. The stronger bound stems
from a method of Prawitz, presented in the 1972 paper. H. Prawitz, Limits for a distribution, if the characteristic function is given in a finite domain, which appeared in the Scandinavian Actuarial journal.

The second tool is local tail inequalities for Rademacher sums, of the form \Pr[a<X<b] \leq \Pr[c<X<d], where a,b,c,d satisfy certain conditions. Inequalities of this kind were obtained before by Devroye and Lugosi in the 2008 paper:  Local tail bounds for functions of independent random variables.

These local tail inequalities already have some other applications, e.g., to analysis of Boolean functions. They were developed and applied in an earlier paper paper of Keller and Klein: Biased halfspaces, noise sensitivity, and relative Chernoff inequalities. Let me mention my related MO question A variance tail description for continuous probability distributions.

A couple more ingredients

A  stopping time argument. Variants of Tomaszewski’s problem appeared in various fields. The problem was stated independently in a 2002 paper by Ben-Tal, Nemirovski, Roos, Robust solutions of uncertain quadratic and conic-quadratic problems.  A stopping time argument introduced there (for proving a lower bound of 1/3) played a crucial role in subsequent works and the critical semi-inductive argument by Nathan and Ohad.

Refinements of the famous Chebyshev’s inequality.  (Did you know Chebyshev’s full name? Ans: Pafnuty Lvovich Chebyshev.)

Questions and connections that come to mind

Q1: What can be said about families \cal F of signs that can serve as those signs for which  |\epsilon_1a_1+\epsilon_2a_2+\cdots +\epsilon_n a_n| \le 1 , for some vector a.

Q2: What can be said about the complex version or even more generally about high dimensions?

Q3: Are there any relations to Littlewood-Offord type problems?

Q4: Is there any relation to the Komlos Conjecture?

See also this MO question by Luca Trevisan and this one by George Lowther.

Is there a simpler proof?

We can ask about simpler or just different proofs for almost every result we discuss here. But here the statement is so simple…

by Gil Kalai at August 01, 2020 06:32 PM UTC

The Busy Beaver function, with its incomprehensibly rapid growth, has captivated generations of computer scientists, mathematicians, and hobbyists. In this survey, I offer a personal view of the BB function 58 years after its introduction, emphasizing lesser-known insights, recent progress, and especially favorite open problems. Examples of such problems include: when does the BB function first exceed the Ackermann function? Is the value of BB(20) independent of set theory? Can we prove that BB(n+1)>2^BB(n) for large enough n? Given BB(n), how many advice bits are needed to compute BB(n+1)? Do all Busy Beavers halt on all inputs, not just the 0 input? Is it decidable, given n, whether BB(n) is even or odd?

at August 01, 2020 05:13 AM UTC

by David Eppstein at July 31, 2020 09:03 PM UTC

My newest arXiv preprint is “Acutely triangulated, stacked, and very ununfoldable polyhedra” with Erik and Martin Demaine (arXiv:2007.14525). It’s about polyhedra with acute-triangle faces that cannot be unfolded without cutting their surface into many separate polygons. I already posted a video for the paper so see that for more information.

Instead, I thought I’d go into a little more detail about a throwaway remark in the video and the paper (one that I already got an email query about). It says that ideal hyperbolic polyhedra can always be unfolded (into the hyperbolic plane). These polyhedra are the hyperbolic convex hulls of finitely many limit points of the hyperbolic space; their faces are ideal polygons, glued together along entire hyperbolic lines. More strongly, if you cut an ideal polyhedron along any spanning tree of its vertices and edges, the result always unfolds into a convex ideal hyperbolic polygon. Here, for instance, is a net for an ideal cube:

Net for an ideal cube

I don’t know of a previous reference for this result, and the paper and video state it without proof (because it’s an introductory remark and not the topic of the paper), but it’s easy to prove a stronger statement by induction: any collection of ideal hyperbolic polygons (like the faces of an ideal polyhedron), when connected edge-to-edge in a complex with the connectivity of a tree (like the faces of any convex polyhedron when you cut it along a spanning tree), unfolds to an ideal convex polygon. As a base case, when you have one polygon in your collection, it unfolds to itself. When you have more than one, find a leaf polygon of the tree structure, remove it, and unfold the rest into a convex ideal polygon. Now add back the leaf. It needs to be connected to the rest of the complex along a hyperbolic line, which (by the induction hypothesis that the rest unfolds convexly) has the rest of the complex on one side and an empty hyperbolic halfplane on the other side. Any convex ideal polygon can be placed within this halfplane so that the side on which it should be glued matches up with the boundary line of the halfplane, with enough freedom to match up the points along this line that should be matched up.

This caused me to wonder: which Euclidean convex polyhedra have the same property, that cutting them along any spanning tree leads to a convex unfolding? The answer is: not very many. By Descartes’ theorem on total angular defect, the angular defects at the vertices of a convex polyhedron add up to \(4\pi\). If a polyhedron is to have all spanning trees produce a (weakly) convex unfolding, then each vertex has to have angular defect at least \(\pi\), because otherwise cutting along a spanning tree that has a leaf at that vertex will make an unfolding that is non-convex at that vertex. And this is the only thing that can go wrong, because if all angular defects are at least \(\pi\) then the unfolding will be convex at each of its vertices and cannot self-overlap.

So to answer the question about Euclidean polyhedra with all unfoldings convex, we need only look for ways to partition the total angular defect of \(4\pi\) among some set of vertices so that each one gets at least \(\pi\). If we know the defects of all the vertices and the distances between vertices, then by Alexandrov’s uniqueness theorem the shape of the polyhedron will be determined. Since we’re using Alexandrov, we should also consider a dihedron (two mirror-image convex faces glued at their edges) to be a special case of a polyhedron. This leaves, as the only cases:

  • A triangular dihedron based on a right or acute triangle.

  • A rectangular dihedron.

  • A tetrahedron with angular defect exactly \(\pi\) at each vertex.

Convex unfoldings of dihedra and a disphenoid

The unfoldings of the dihedra have two copies of their face, mirrored across a joining edge. The tetrahedra with all-convex unfoldings are exactly the disphenoids, the tetrahedra whose four faces are congruent. They unfold either to a copy of the same face shape, expanded by a factor of two in each dimension and creased into four copies along its medial triangle, or a parallelogram, creased to form a strip of four congruent triangles. Their unfoldings were discussed by Jin Akiyama in his paper “Tile-makers and semi-tile-makers” (American Mathematical Monthly 2007, doi:10.1080/00029890.2007.11920450, jstor:27642275), as part of a broader investigation of polyhedra whose every unfolding tiles the plane.

(Discuss on Mastodon)

by David Eppstein at July 29, 2020 10:18 PM UTC

Junior and Advanced Fellows of the ETH Institute for Theoretical Studies are independent postdocs of exceptional talent and promise, having achieved significant results in mathematics, theoretical computer science or the theoretical natural sciences. Junior Fellows stay at the Institute for up to three years, Advanced Fellows up to five years.


by shacharlovett at July 29, 2020 02:52 PM UTC

The disjointness problem - where Alice and Bob are given two subsets of $\{1, \dots, n\}$ and they have to check if their sets intersect - is a central problem in the world of communication complexity. While both deterministic and randomized communication complexities for this problem are known to be $\Theta(n)$, it is also known that if the sets are assumed to be drawn from some restricted set systems then the communication complexity can be much lower. In this work, we explore how communication complexity measures change with respect to the complexity of the underlying set system. The complexity measure for the set system that we use in this work is the Vapnik–Chervonenkis (VC) dimension. More precisely, on any set system with VC dimension bounded by $d$, we analyze how large can the deterministic and randomized communication complexities be, as a function of $d$ and $n$. The $d$-sparse set disjointness problem, where the sets have size at most $d$, is one such set system with VC dimension $d$. The deterministic and the randomized communication complexities of the $d$-sparse set disjointness problem have been well studied and is known to be $\Theta \left( d \log \left({n}/{d}\right)\right)$ and $\Theta(d)$, respectively, in the multi-round communication setting. In this paper, we address the question of whether the randomized communication complexity is always upper bounded by a function of the VC dimension of the set system, and does there always exist a gap between the deterministic and randomized communication complexity for set systems with small VC dimension. In this paper, we construct two natural set systems of VC dimension $d$, motivated from geometry. Using these set systems we show that the deterministic and randomized communication complexity can be $\widetilde{\Theta}\left(d\log \left( n/d \right)\right)$ for set systems of VC dimension $d$ and this matches the deterministic upper bound for all set systems of VC dimension $d$. We also study the deterministic and randomized communication complexities of the set intersection problem when sets belong to a set system of bounded VC dimension. We show that there exists set systems of VC dimension $d$ such that both deterministic and randomized (one-way and multi-round) complexities for the set intersection problem can be as high as $\Theta\left( d\log \left( n/d \right) \right)$, and this is tight among all set systems of VC dimension $d$.

at July 29, 2020 02:15 PM UTC

Locally correctable codes (LCCs) are error correcting codes C : \Sigma^k \to \Sigma^n which admit local algorithms that correct any individual symbol of a corrupted codeword via a minuscule number of queries. This notion is stronger than that of locally decodable codes (LDCs), where the goal is to only recover individual symbols of the message. One of the central problems in algorithmic coding theory is to construct O(1)-query LCCs and LDCs with minimal block length. Alas, state-of-the-art of such codes requires super-polynomial block length to admit O(1)-query algorithms for local correction and decoding, despite much attention during the last two decades. This lack of progress prompted the study of relaxed LCCs and LDCs, which allow the correction algorithm to abort (but not err) on a small fraction of the locations. This relaxation turned out to allow constant-query correcting and decoding algorithms for codes with polynomial block length. Focusing on local correction, Gur, Ramnarayan, and Rothblum (ITCS~2018) showed that there exist O(1)-query relaxed LCCs that achieve nearly-quartic block length n = k^{4+\alpha}, for an arbitrarily small constant \alpha>0. We construct an O(1)-query relaxed LCC with nearly-linear block length n = k^{1+\alpha}, for an arbitrarily small constant \alpha>0. This significantly narrows the gap between the lower bound which states that there are no O(1)-query relaxed LCCs with block length n = k^{1+o(1)}. In particular, our construction matches the parameters achieved by Ben-Sasson et al. (SIAM J. Comput. 2006), who constructed relaxed LDCs with the same parameters. This resolves an open problem raised by Gur, Ramnarayan, and Rothblum (ITCS 2018).

at July 27, 2020 08:27 PM UTC

Dependency quantified Boolean formulas (DQBF) describe an NEXPTIME-complete generalisation of QBF, which in turn generalises SAT. QRAT is a recently proposed proof system for quantified Boolean formulas (QBF), which simulates the full suite of QBF preprocessing techniques and thus forms a uniform proof checking format for solver verification. In this work, we study QRAT in the more general DQBF context, obtaining a sound and complete refutational DQBF proof system that we call DQRAT. We show that DQRAT can simulate the full suite of dedicated DQBF preprocessing techniques, except those relying on defined variables, which we cover with the introduction of a new form of prefix modification. Our work enables generalisations of further QBF preprocessing techniques (e.g. blocked literal elimination) that were not previously considered for DQBF.

at July 27, 2020 07:18 PM UTC

And Razborov’s brilliant proof method

Stasys Jukna is the author of the book Extremal Combinatorics With Applications in Computer Science.

Today we talk about Jukna’s book on extremal combinatorics.

The structure of his book is great. The material is useful and well presented. Rather than add more general comments about his book, we thought we might highlight one tiny part—the part on monotone circuit lower bounds. Here goes. All below is based directly on his discussion. Any errors or misguided comments are ours.

Monotone Boolean Functions

Fix an input size {n} and consider some property of subsets {S} of {[n]}. Let {f(S)=1} exactly when {S} has the property. We can think of {f} as a Boolean function. You believe that this property is hard to compute—how do you go about proving that?

In general we have no tools, but if the property is monotone, then there are some powerful methods. Recall monotone means that if {f(S)=1} then any set {T} so that {S \subset T} still has the property. For example, {f(S)} could be that {S} includes at least half of the elements of {[n]}. It cannot be that {S} has an even number of elements. Another example is when {f} is given in disjunctive normal form (DNF),

\displaystyle  f \equiv T_1 \vee T_2 \vee \cdots T_m,

where each term {T_k} is a conjunction of variables. Each {T_k} can be regarded as a subset of {[n]}. Then {f(S) = 1} if and only if {S \supseteq T_k} for some {k}. Every monotone function also has a conjunctive normal form (CNF)

\displaystyle  f \equiv C_f = C_1 \wedge C_2 \wedge \cdots \wedge C_\ell,

where each clause {C_k} is a disjunction of variables. Then {f(S) = 1} if and only if {S \cap C_k \neq \emptyset} for all {k.} The problem is that the numbers {m} of terms and {\ell} of clauses involved may be huge. The clauses may have different sizes. Given a CNF {C} of maximum clause size {s}, we write {C^s} for the conjunction of clauses of size exactly {s} and {C^{<s}} for the rest. We similarly write {D^r} and {D^{<r}} for DNFs {D}.

The lower bound methods are on the size of a monotone circuit for {f(S)}. That is the circuit can only use gates {AND} and {OR}, but no other types of gates, especially not {NOT} gates. Of course, if {f} has no small monotone circuits, then it has no small DNF or CNF formulas either.

The neat fact on which the lower-bound technique builds is that if {f} does have small monotone circuits, then we can “wrap” it between a CNF and a DNF in various customizable ways:

Theorem 1 (informal) For every {f} with small monotone circuits and {r,s > 0} we can find a CNF {C} of maximum clause size {s} and a DNF {D} of maximum term size {r} such that

\displaystyle  C \leq f \leq D \qquad\text{and also}\qquad D^{<r} \leq C^{<s}.

Moreover, {C^s} and {D^r} are small.

We have said “wrap” not “sandwich” because although {D} is the “upper slice,” the part of {D} with smaller terms—but there could be many of them—wraps around to be under the corresponding part of {C}. This fact will enable us to throw away the smaller clauses and terms. How small is “small”? We will say later. We are trying to solve problems of exposition by keeping a high-level view at the start.

Exposition Problems

Tim Gowers has written an article about the lower method for monotone functions. The method is due to Alexander Razborov in his seminal 1985 paper and extended by Noga Alon and Ravi Boppana in their paper right afterward, and by Benjamin Rossman in his 2009 paper, to name a few.

Gowers says right away that the original papers on this method are clear and well written. But he believes that there is need for more exposition. The method is so important that it must be made easy for all to understand. He says his article is an attempt to solve an open exposition problem. The notion of an exposition problem is due to Timothy Chow who wrote:

All mathematicians are familiar with the concept of an open research problem. I propose the less familiar concept of an open exposition problem.

Chow raised this issue with respect to the forcing method in set theory due to Paul Cohen. A modest suggestion: Read Chow on forcing, a great exposition; read Gowers on the monotone lower bound method, another great one. Both are much better than anything we can do. But we will put our own spin on the lower bound method. And hope to add to the quest to solve the exposition problem.

The Method—High Level

Suppose that {C} is a monotone boolean circuit that has {n} inputs and computes {f(x)} at the last gate. The method is called the approximation method because the idea is that it builds two other boolean functions {\mathsf{lower}} and {\mathsf{upper}}: for all {x} in {\{0,1\}^{n}}:

\displaystyle  \mathsf{lower}(x) \le f(x) \le \mathsf{upper}(x).

This follows a tradition in math that we often replace a complex function, {f(x)}, with simpler upper and lower bounds. Standard stuff.

Usually the point is that the approximators are not only easier to understand but also simpler in some objective sense. For example, Christophe Chesneau and Yogesh Bagul give a nice short compendium of approximating formulas involving trigonometric functions by formulas without them, including that for all {0<x<1},

\displaystyle  \exp(-bx^{2}) < \sin(x)/x < \exp(-x^{2}/6),

with {b \approx 0.172604}. If you have to reason about the behavior of {\sin(x)/x}, it is nice to have these upper and lower bounds. Note that the upper bound kind-of wraps around because it is the same kind of function as the lower bound.

What gives the monotone method a special twist is that {\mathsf{lower}} and {\mathsf{upper}} are not necessarily simple in the sense of being small. Rather, they make simple errors—ones that can be corrected with small effort. The correction process yields {C^s} and {D^r.} Isolating what is small, however, requires us to trade an “AND” of two inequalities for an “OR” of two economical ones. We know that at least one of the latter inequalities must be true. We arrange that either one gives us the kind of lower bound we seek.

Some More Detail

Here is how the trade happens. From Theorem 1 we have:

\displaystyle  C^s \wedge C^{<s} \leq f \leq D^{<r} \vee D^r,

where: {C^s} and {D^r} are small, and while {C^{<s}} and {D^{<r}} might be big, we have {D^{<r} \leq C^{<s}}. The trick is to ask:

Is {C^{<s}} empty—that is, is it the trivial {1} function?

  • If yes, then it goes away on the left-hand side. We get:

    \displaystyle  C^s \leq f.

    Since {C^s} is small, this is something we want. We got a small lower bound on {f} that holds for all arguments {x}.

  • If no, then it has a nontrivial clause corresponding to a set {E} of size at most {s-1}. This is where the wraparound comes in. We have:

    \displaystyle  D^{<r} \leq C^{<s} \leq E,

    since we chose at least one clause. Substituting on the right-hand side thus gives us:

    \displaystyle  f \leq E \vee D^r.

    Now {E} is small, since it is just one clause, and {D^r} is small. We got a small upper bound rather than lower bound, but the fact that it has a restricted form and holds for all cases we can input to {f} will give us a lower bound on {f}.

Finally we are ready to state the theorem, which quantifies “small.” To follow Jukna, we now need to replace “{r}” by “{r+1}” and “{s}” by “{s+1}.” But the essence is the same.

Theorem 2 If {f} has a monotone Boolean circuit of size {t}, then for any {r,s} such that {1 \leq r,s \leq n-1}, we can build a conjunction {C} of at most {t \cdot r^{s+1}} clauses of size exactly {s+1}, a disjunction {D} of at most {t \cdot s^{r+1}} terms of size exactly {r+1}, and a set {E} of size at most {s} such that either {C \leq f} or {f \leq D \cup E}.

Rather than re-prove this, we will continue the discussion with a concrete example. An exposition trick is: give examples before the general case and then abstract. Our example will involve graphs {G}—so the variables have the form {x_{i,j}}, where {x_{i,j} = 1} means there is an edge between vertex {i} and vertex {j}, {x_{i,j} = 0} otherwise. Putting {m} as the number of vertices, the number of possible edges is {n = \binom{m}{2}}. We think of {G} as a set of edges, so {G \subseteq [n]}.

Checking for Triangles

Let {f(G)=1} hold precisely when {G} has a triangle. This is clearly a monotone property. Our goal is to use the lower and upper bounds to prove that the monotone complexity of {f(G)} is almost of order {m^{3}}. A side note is that the general complexity is much less via {m \times m} matrix products.

The first beauty of using the method is that you get to choose the parameters {r} and {s} with a goal {t} in mind. The {r} and {s} must be in {[1,n-1]}. The value of {t} will be a lower bound on the size of any monotone boolean circuit for {f}. The parameters {r,s} are bounds on the clause and term size of the DNF and the CNF. You can select them any way you wish. But of course choose them wisely.

In this case we know that {r=1} is a right choice. We will say what {s} is later but we will have {s=(\log n)^{O(1)}}. Once you pick them, the CNF {C} and DNF {D} (and small set {E}, a set of {O(\log n)} edges in this case) are chosen for you. You have no control over the sets {T_k} that make up the terms of {D} and the sets {C_\ell} that correspond to the clauses of {C}. Well you do know something about them. Here is what you do know about how many sets there are and how big the sets are:

  1. For {k=1,\dots,t \cdot s^{r+1} = ts^2}, each {T_k} is of size {2}.

  2. For {\ell=1,\dots, t \cdot r^{s+1} = t}, each {C_\ell} is of size {s+1}.

The goal in either case is to force {t} to be large. We’ve numbered the right-hand case first.

  1. Case {f \leq D \cup E}. Here we want to consider graphs {G} that do have a triangle—and nothing else. Because {E} includes at most {s} edges, hence touches at most {2s} vertices, and {2s \ll m}, we can focus on triangles among the {m' = m - 2s} untouched vertices. There are {T = \binom{m'}{3} = \Theta(m^3)} such triangles, hence {T} graphs {G} to consider.

    Since these graphs {G} have no edges in {E} but make {f(G) = 1}, there must be some {k} such that {T_k(G) = 1}. Since {T_k} has size {2}, this means {T_k} has two edges of the triangle. Now the point is:

    For each {T_k}, there is at most one triangle that {T_k} can be two edges of.

    Hence there must be at least as many terms as possible triangles. This means:

    \displaystyle  ts^2 \geq \binom{m'}{3}.

    Because {s = (\log n)^{O(1)}}, we finally get {t = \tilde{\Omega}(m^3)}, where the tilde means to ignore factors of {\log n}.

  2. Case {C \leq f}. Here we want to consider graphs {G} such that {f(G) = 0} but {G} is chock full of as many edges as one can have without creating a triangle. Such {G} include complete bipartite graphs. There are {2^{m-1} - 1} such graph inputs, as can be realized from how any binary string {w} except {0^m} and {1^m} encodes such a graph—and only its bit-complement {w'} encodes the same labeled graph.

    In order to keep {C \leq f} we need {C(G) = 0} for all such {G}, so we need (at least) one clause {C_k} to fail on {G}. This means that all vertices touched by the edges in {C_k} must be in the same partition. The more vertices touched, the fewer strings {w} have all {0}s (or all {1}s) in the corresponding positions, which means the fewer graphs {G} “covered” by that clause. We want to know how many clauses we need to cover all these graphs, hence we try to minimize the number of vertices touched by each clause. That number is at least {s' = \lceil \sqrt{2s}\rceil}. The number of graphs we cover is at most {2^{m - s'} - 1} (the {-1} excludes the empty graph). Thus the number {t} of clauses we need satisfies

    \displaystyle  t \geq \frac{2^{m-1} - 1}{2^{m - s'} - 1} \geq 2^{s' - 1}.

    By taking {s' > 4.5\log^2 m} we can make {t \geq m^3} in this case. We can actually get bigger functions with bigger {s}, but this balances against case 1 where {t = \tilde{\Omega}(m^3)} was the best we could do, so that is our lower bound.

Open Problems

Does this help in understanding the approximation method? Can you work out the concretely optimum choice of {s} in the triangle example?

Would you prefer not changing {r} and {s} in the statement of Theorem 2? Then we would have worded the triangle example with “{r = 2}” rather than “{r = 1}.” The former is a little more suggestive of the idea of having two edges of a triangle. Doing so, however, could make notation in the proof of Theorem 2 somewhat messier. Another possibility was keeping Jukna’s usage throughout, so that the earlier version 1 of the theorem would say {D^{\leq r} \leq C^{\leq s}} with {C^{s+1}} and {D^{r+1}} being small. We try to solve “exposition problems” in every post but feel a dilemma here. Comments might help us on a followup post.

by RJLipton+KWRegan at July 27, 2020 06:39 AM UTC

Norbert Weiner wrote a paper Some Moral and Technical Consequences of Automation in 1960. It warns of the dangers of computers in two ways:

1) If a chess program is only trained against expert chess players then it might get confused if its opponent makes a bad move. This is not dangerous. But imagine a nuclear missle system that assumes the opponent is rational. If the opponent is not rational then it might launch and have an accidental nuclear war. So there must be a human component so that this won't happen.

I offer a story and a counter narrative. In the 5th season, 23rd episode of the TV show Castle,
title The Human Factor a character had the following story to tell:

The drone on its own was going to bomb a car. But the human noticed that there were red roses on the car, so it was a wedding couple, not a terrorist. If a human had not been involved the drone may have killed an innocent just married couple!

This scene bothered me. It could EASILY be the other way around: the human wants to bomb and the drone (which has better vision) notices the roses. Or there may be many other ways that a computer could be BETTER than a human. I am not saying that a completely automated system is better, I am saying that its not obvious which way to go.  Both in some combination? What combination? Who has the final say? And in the drone scenario there may not be time for a human to consider the options.

2) The Sorcerer's apprentice scenario. In The Sorcerer's Apprentice segment of the (original) movie Fantasia, Mickey mouse tells a broom to get him a glass of water. The broom keeps bringing him water and Mickey almost drowns. Computers may take orders to literally and not stop. I wonder if  automated stock-trading and automated auctions may have this problem. Is there a case known where this really did cause a problem?

So what do you think?

NOW- do computers (or, more generally technology) make us more safe or less safe?

FUTURE- same question.

by GASARCH ( at July 27, 2020 03:18 AM UTC

Doyle derived his LQG counterexample in the time before the ubiquity of numerical computing. This meant that numerical examples did not carry the rhetorical weight of algebraic closed form instances. The need for clean, persuasive formulae also meant that controllers were idealized in continuous time. Continuous-time optimal control often produced policies that couldn’t be implemented because of the limits of physical reality: no system can act instantaneously with arbitrary power. These issues of infeasibility were certainly noted in the literature during the hey day of optimal control, but continuous time models still often made it difficult to pinpoint these issues.

Discrete-time models don’t share many of these issues. In discrete time, we explicitly encode the sequential, computational nature of decision and control. Discrete-time formulae are unfortunately less elegant than their continuous-time counterparts, but, as I hope to show here, they are often more revealing. Indeed, constructing examples where discrete-time optimal control leads to fragile solutions seems to be surprisingly easy.

Here, I’ll highlight a few examples where relatively innocuous problem formulations lead to very fragile control policies. The examples are weirdly simple and almost comical to a point. But anyone who has played with discrete-time optimal control may have stumbled into similar control policies and had to step back and think about why.

Let’s revisit the discrete-time LQR problem:

We again assume $x_t$ is observed perfectly without noise. While such perfect state information is not realistic, even ideal state feedback ends up being fragile in discrete time. $w_t$ is assumed to be stochastic, but I don’t think much changes if we move to a more adversarial setting. Here, we need the decision variable $u_t$ to be causal. It must be a function of only the values $x_s$ and $u_s$ with $s\leq t$. For stochastic disturbances, the optimal $u$ can always be found by dynamic programming.

Consider the following innocuous dynamics:

This system is a simple, two-state shift register. I’ll write the state out with indexed components $x=[x^{(1)},x^{(2)}]^\top$. New states enter through the control $B$ into the second state. The first state, $x^{(1)}$ is simply whatever was in the second register at the previous time step. The open loop dynamics of this system are as stable as you could imagine. Both eigenvalues of $A$ are zero.

Let’s say our control objective aims to try to keep the two states equal to each other. We can model this with the quadratic cost:

I assume $R=0$ here for simplicity, as the formulae are particularly nice for this case. But, as I will discuss in a moment, the situation is not improved simply by having $R$ be positive. For the disturbance, assume that $w_t$ is zero mean, has bounded second moment, $\Sigma_t = \mathbb{E}[w_t w_t^\top]$, and is uncorrelated with $x_t$ and $u_t$.

The cost is asking to minimize

When $w_t=0$, $x_t^{(1)}+x_t^{(2)} = x_{t-1}^{(2)}+u_{t-1}$, so it seems like our best bet is to just set $u_{t}=x_t^{(2)}$. This turns out to be the optimal action, and you can prove this directly using standard dynamic programming computations. What this means is that the closed loop dynamics of the system are

This closed-loop system is marginally stable, meaning that while signals don’t blow up, some states will persist forever and not converge to $0$. Indeed, the state-transition matrix here has eigenvalues $0$ and $1$. The $1$ corresponds the state where the two components are equal, and such a state can persist forever.

If we learned an incorrect model of the dynamics, how would that influence the closed loop behavior? The simplest scenario is that we identified $B$ from some preliminary experiments. We can immediately see that if the true $B_\star=\alpha B $, then the closed loop dynamics are

This system is unstable for any $\alpha>1$. That is, the system is arbitrarily sensitive to misidentification of the dynamics. Note that this lack of robustness has nothing to do with the noise sequence. The structure of the cost is what drives the system to fragility.

If $R>0$, you will get a slightly different policy. Again, using elementary dynamic programming shows that the optimal control is $u_t=\beta_t(R) x_t^{(2)}$ for some $\beta_t(R) \in (1/2,1)$. The closed loop system will be a bit more stable, but this comes at the price of reduced performance. And, at best, the gain margin of this system approaches $2$ as $R$ goes to infinity. You can also check that if you add $\epsilon$ times the identity to $Q$, you again get a control policy proportional to $x_t^{(2)}$.

This behavior can occur in even simpler systems. Consider the one-state linear system

The open loop system is again as stable as it gets. Now let’s aim to minimize $\Vert x-u \Vert$. It doesn’t matter what norm you choose here or whether you treat the noise as stochastic or worst case with respect to $w$, the optimal control is going to be $u_t = x_t/b$. Once again, the closed loop system has a pole at $1$ and is arbitrary fragile to misspecification of $b$.

I could continue to construct nasty examples, but I hope these examples are sufficiently illustrative. They are certainly contrived and pathological, and it’s not at all clear that they reflect any optimal control problem you might have been hoping to solve. However, both examples involve systems that are robust and stable in open loop. It’s only when we close the feedback loop that we end up in a dangerous situation. That simple optimal control problems give some profoundly fragile solutions should be a clear warning: You can’t just optimize and hope to be robust. You have to consider uncertainty as a first class citizen when designing feedback systems.

In some sense, the core contribution of robust control is in raising awareness of fundamental tradeoffs in the design of feedback systems. Optimal control promises that you can roughly identify a system, model uncertainty as noise, solve an optimization problem, and then ship your policy. Hopefully, the examples in the last two posts have shown why this particular approach is fraught with danger.

If failure of a feedback system has any consequences, then a more holistic robust approach is necessary. We have to work with experts at different levels of the engineering pipeline, worry about unmodeled behaviors, and understand hard limits and practical tradeoffs. That is, engineering has to be more concerned with design than with optimization.

There are all sorts of questions that a robust, systems level engineering effort might ask. Where should you put that extra sensor? Which parts of the system are likely to create issues? Is it possible to avoid performance disruptions when updating a single component in a legacy system? These questions are important in all aspects of system engineering, and developing accessible tools for addressing them in machine learning systems remains a daunting but essential challenge.

I am emphatically not saying that the design of feedback systems is hopeless. It’s easy to walk away with the impression “Ben’s examples are pathologies and unlike what I see in practice” or the pessimistic feeling of “shoot, all of this ML stuff is hopeless, I’m going to go work on something tractable like vaccine development.” I’m not saying that engineering robust machine learning systems is hopeless. I’m just saying that our community has to work better to incorporate multiple levels of uncertainty in its thinking. What are the fundamental tradeoffs between performance and robustness in machine learning? What do we even want to be robust to? In the next post I want to describe some of these robustness tradeoffs without using the language of optimization, probing if that provides some possible paths forward.

at July 27, 2020 12:00 AM UTC

In the first post on Private Set Intersection, I presented the problem of Private Set Intersection, its applications and the simple protocol of [KMRS14], that allows Alice and Bob to learn the intersection of their sets with the aid of an untrusted third party Steve who is assumed to not...

at July 26, 2020 11:00 PM UTC

Query-to-communication lifting theorems translate lower bounds on query complexity to lower bounds for the corresponding communication model. In this paper, we give a simplified proof of deterministic lifting (in both the tree-like and dag-like settings). Whereas previous proofs used sophisticated Fourier analytic techniques, our proof uses elementary counting together with the sunflower lemma.

at July 25, 2020 05:05 AM UTC

The Complexity Complexity Conference, the conference that shares its name and URL with this blog, originally scheduled for Saarbrücken will be held virtually next week. Registration is free for non-authors. Talks are already posted. Looking forward to seeing you at the business meeting and the social.

Award winners have already been announced: The Best Student Paper Award goes to Rahul Ilango for Connecting Perebor Conjectures: Towards a Search to Decision Reduction for Minimizing Formulas and the Best Paper Award goes to Daniel Dadush and Samarth Tiwari for On the Complexity of Branching Proofs.

Virtual conferences give an opportunity for far more people to attend since you don't have the expense and time needed to go to Germany. On the other hand it's hard to dedicate time for a conference when you aren't there. I missed STOC which would have been walking distance from where I live but I did attend parts of the Economics and Computation conference which was supposed to be in Budapest. EC made great use of where you can wander around virtual rooms bumping into and talking to people. I caught up with a few people there. Complexity plans to use gather for its social meeting next week. Looking forward to the virtual beer.

by Lance Fortnow ( at July 24, 2020 04:21 PM UTC

This is a guest post kindly contributed by Noam Lifshitz. Here is a pdf version.  This post is a continuation of the post  To cheer you up in difficult times 3: A guest post by Noam Lifshitz on the new hypercontractivity inequality of Peter Keevash, Noam Lifshitz, Eoin Long and Dor Minzer, and it gives the proof of the new hypercontractive inequality. We plan a third post where various applications will be mentioned.

Before we get to the post I want to mention that there are a lot of activities on the web. I may devote a special post to links and discussion (and contributing links in the comment section is very welcome.) but meanwhile a few links:  1) Advances in Boolean Function Analysis Lecture Series (thanks to Avishay Tal and Prasad Raghavendra for letting me know); 2) Online course in Foundations of Algebraic Geometry Given by Ravi Vakil from Stanford. You can take the course at varying levels of involvement. (Thanks to Tami Ziegler for telling me) A very very interesting way of online teaching. 3) A site with online mathematical lectures.


Aline Bonami with Szilard Revesz and me (2006). Aline Bonami first proved the 2-point hypercontractive inequality which is very useful in the analysis of Boolean functions. (Leonard Gross proved it independently a few years later and William Beckner found important applications to harmonic analysis.)

Proof of the new hypercontractivity inequality

Our aim is to prove the hypercontractivity theorem for global functions. The proof here is taken from a joint paper with David Ellis and Guy Kindler that’ll soon be out on the Arxiv.

Theorem 1:

\displaystyle \|\mathrm{T}_{1/100}f\|_{4}^{4}\le\sum_{S\subseteq\left[n\right]}\mathbb{E}_{x\in\left\{ 1,\ldots,m\right\} ^{S}}\|L_{S}[f]_{S\rightarrow x}\|_{2}^{4},

Here we use the notations given in the last blog post. Let us first get a feel for our hypercontractivity theorem by proving the {n=1} case. Here the RHS is {\|f\|_{2}^{4}+\|f-\mathbb{E}\left[f\right]\|_{4}^{4}.}

1. Proof of the {n=1} case

We will prove the following slightly stronger version of Theorem 1 for the {n=1} case.

Proposition 2:

Let {f\colon\left\{ 1,\ldots,m\right\} \rightarrow\mathbb{C}.} Let {\rho\le\frac{1}{10}.} Then
\displaystyle \|\mathrm{T}_{\rho}f\|_{4}^{4}\le\|f\|_{2}^{4}+\|f-\mathbb{E}\left[f\right]\|_{4}^{4}.

Proof: Let us write {L\left[f\right]} for {L_{1}\left[f\right]=f-\mathbb{E}\left[f\right].} Rearranging, we have
\displaystyle f=\mathbb{E}\left[f\right]+L\left[f\right].
The noise operator in the {n=1} case is by definition equal to {\rho Id+\left(1-\rho\right)\mathbb{E},} where {\mathbb{E}} is the expectation over {\text{\ensuremath{\left\{ 1,\ldots,m\right\} }}} operator, and {Id} is the identity operator. Hence,
\displaystyle \mathrm{T}_{\rho}f=\mathbb{E}\left[f\right]+\rho L[f].

Now when expanding the 4-norm of the function {\|\mathrm{T}_{1/100}f\|_{4}^{4}}, we obtain

\displaystyle \|\mathrm{T}_{\rho}f\|_{4}^{4}  \le\left|\mathbb{E}\left[f\right]\right|^{4}
\displaystyle +6\rho^{2}\left|\mathbb{E}\left[f\right]\right|^{2}\|Lf\|_{2}^{2}+
\displaystyle  +4\rho^{3}\left|\mathbb{E}\left[f\right]\right|\|L\left[f\right]\|_{3}^{3}+
\displaystyle + \rho^{4}\|L\left[f\right]\|_{4}^{4},

where we used the fact that the expectation of {L\left[f\right]} is 0. When looking at the right hand side of the global hypercontractivity theorem, we see most of the above terms except for the one involving the third norm of the Laplacian. Indeed we have

\displaystyle RHS =\|f\|_{2}^{4}+\|L\left[f\right]\|_{4}^{4}
\displaystyle =\left|\mathbb{E}\left[f\right]\right|^{4}
\displaystyle +2\|L\left[f\right]\|_{2}^{2}\left|\mathbb{E}\left[f\right]\right|^{2}
\displaystyle +\|L\left[f\right]\|_{2}^{4}
\displaystyle +\|L\text{\ensuremath{\left[f\right]\|}}_{4}^{4}.

Hence we see that the only term in the left hand side that doesn’t appear with a greater coefficient in the left hand side is the term {\left|\mathbb{E}\left[f\right]\right|\|L\left[f\right]\|_{3}^{3},} and by AM-GM we have

\displaystyle \left|\mathbb{E}\left[f\right]\right|\|L\left[f\right]\|_{3}^{3} \displaystyle =\mathbb{E}\left[\left|\mathbb{E}\left[f\right]\right|\left|L\left[f\right]\right|^{3}\right]
\displaystyle \le\mathbb{E}\left[\frac{\left|\mathbb{E}\left[f\right]\right|^{2}\left|L\left[f\right]\right|^{2}+\left|L\text{\ensuremath{\left[f\right]}}\right|^{4}}{2}\right]
\displaystyle =\frac{1}{2}\left|\mathbb{E}\left[f\right]\right|^{2}\|L\left[f\right]\|_{2}^{2}+\frac{1}{2}\|L\left[f\right]\|_{4}^{4},

which allows us to upper bound the only term appearing in the left hand side but not in the right hand side by corresponding terms that do appear in the right hand side. \Box

2. Tensorisation lemma

Next we are going to prove a theorem that doesn’t seem to fit to our setting, but we’re going to fit it in by force. Let {X,Y} be finite sets. Let us write {\mathcal{F}\left(X\right)} for the linear space of complex valued functions on {X}. The space {\mathcal{F}\left(X\times Y\right)} can be identified with the space {\mathcal{F}\left(X\right)\otimes\mathcal{F}\left(Y\right),} where a pair of function {f\otimes g} is identified with the function
\displaystyle \left(x,y\right)\mapsto f\left(x\right)g\left(y\right)
in {\mathcal{F}\left(X\times Y\right).}

Given two operators {A_{1}\colon\mathcal{F}\left(X_{1}\right)\rightarrow\mathcal{F}\left(Y_{1}\right),A_{2}\colon\mathcal{F}\left(X_{2}\right)\rightarrow\mathcal{F}\left(Y_{2}\right)}, the operator {A_{1}\otimes A_{2}\colon\mathcal{F}\left(X_{1}\times X_{2}\right)\rightarrow\mathcal{F}\left(Y_{1}\times Y_{2}\right)} is the unique operator sending {f\otimes g} to {A_{1}f\otimes A_{2}g}. We write {A^{\otimes n}} for {A\otimes\cdots\otimes A.} The operator {A_{1}\otimes A_{2}} can also be defined more explictly in terms of its values on functions. The operator {A_{1}\otimes A_{2}} can be understood more explicitly by noting that it is the composition of the operators {A_{1}\otimes I} and {I\otimes A_{2}.} Now the operator {A\otimes I} is given by {A\otimes If\left(x,y\right)=Af_{y}\left(x\right),} where {f_{y}\left(x\right)=f\left(x,y\right).}

Lemma 3: Let {X,Y,Z} be measure spaces with finite underlying sets. Let {A\colon\mathcal{F}\left(X\right)\rightarrow\mathcal{F}\left(Y\right),B\colon\mathcal{F}\left(X\right)\rightarrow\mathcal{F}\left(Z\right)} be operators satisfying

\displaystyle \|Af\|_{4}\le\|Bf\|_{4}

for all functions {f\in\mathcal{F}\left(X\right).}

\displaystyle \|A^{\otimes n}f\|_{4}\le\|B^{\otimes n}f\|_{4}

for all {f\in\mathcal{F}\left(X^{n}\right).}

Here the spaces {X^{n},Y^{n},} and {Z^{n}} are equipped with the product measure, where the measure of an atom is the product of the measures of its coordiates.

Proof: For each {y\in X,} let {g_{y}} be given by {g_{y}:=A^{\otimes\left(n-1\right)}f\left(\cdot,y\right).} As mentioned {A^{\otimes n}f\left(x,y\right)=Ag_{y}\left(x\right).} Hence by hypothesis, we have

\displaystyle \mathbb{E}\left[\left|\mathrm{A}^{\otimes n}f\right|^{4}\right] \displaystyle =\mathbb{E}_{y}\mathbb{E}_{x}\left|Ag_{y}\left(x\right)\right|^{4}
\displaystyle  \le\mathbb{E}_{y}\mathbb{E}_{x}\left|Bg_{y}\left(x\right)\right|^{4}
\displaystyle =\|A^{\otimes n-1}\otimes B\|_{4}^{4}. We may now repeat the same process on each of the other coordinates to replace the {A}s by {B}s one by one. \Box

3. The main idea: Fourifying the 2-norms.

The strategy of our proof is to take the theorem

\displaystyle \|\mathrm{T}_{\rho}f\|_{4}\le\|f\|_{2}^{4}+\|f-\mathbb{E}\left[f\right]\|_{4}^{4},

which we established in the {n=1} case for {\rho\le\frac{1}{10}}, and to turn it into an essentially equivalent statement about 4-norms. We will then get a tensorised statement for general {n}, which we will be able to convert back into our hypercontractivity theorem for global functions. Our idea is to encode our function {f} as a function {\mathrm{En}\left(f\right)\colon\left\{ -1,1\right\} ^{n\left(p-1\right)}\rightarrow\mathbb{R}} satisfying

\displaystyle \mathrm{En}\circ T_{\rho}=\mathrm{T}_{\rho}\circ\mathrm{En}


\displaystyle \|\mathrm{En}f\|_{2}=\|f\|_{2}.

The benefit of working with {\mathrm{En}f} rather than {f} is that in {\left\{ 0,1\right\} ^{n\left(p-1\right)}} one may move between 4-norms and 2-norms by appealing to the hypercontractivity theorem there, which gives

\displaystyle \|\mathrm{T}_{\frac{1}{\sqrt{3}}}\circ\mathrm{En}f\|_{4}\le\|\mathrm{E}nf\|_{2}\le\|\mathrm{En}f\|_{4}

at the cost of some noise.

To define {\mathrm{En}} we use Fourier analysis of Abelian groups. Let us briefly recall it. For simplicity let us assume that {f\colon\left(\mathbb{Z}/p\mathbb{Z}\right)^{n}\rightarrow\mathbb{C},} where {p} is a prime. Let {\omega} be a {p}th root of unity. For any {\gamma\in\left(\mathbb{Z}/p\mathbb{Z}\right)^{n}} we have a character {\chi_{\gamma}\colon\left(\mathbb{Z}/p\mathbb{Z}\right)^{n}\rightarrow\mathbb{C}} given by {\chi_{\gamma}\left(x\right)=\omega^{\left\langle \gamma,x\right\rangle }.} The {\chi_{\gamma}} are an orthonormal basis of {\left(\mathbb{Z}/p\right)^{n}} and we write {f=\sum\hat{f}\left(\gamma\right)\chi_{\gamma}}, where {\hat{f}\left(\gamma\right)=\left\langle f,\chi_{\gamma}\right\rangle .} Note that {\chi_{0}} is the constant function, and so we have

\displaystyle \hat{f}\left(0\right)=\left\langle f,\chi_{0}\right\rangle =\mathbb{E}\left[f\right],

which gives

\displaystyle f=\mathbb{E}\left[f\right]+\sum\hat{f}\left(i\right)\chi_{i}.

Our mission will first be to convert the {2}-norm of a function {f\colon\mathbb{Z}/p\rightarrow\mathbb{R}} to the {4-}norm of a different function.

We define an encoding operator {\mathrm{En}\colon\mathbb{Z}/p\mathbb{Z}\rightarrow\left\{ -1,1\right\} ^{p-1}} by setting

\displaystyle f\mapsto\mathbb{E}\left[f\right]+\sum_{i\in\left\{ 1,\ldots,p-1\right\} }\hat{f}\left(i\right)x_{i}.

We have

\displaystyle \|f\|_{2}^{2}=\|\mathrm{En}f\|_{2}^{2},

as the {\chi_{i}} are orthonormal and so are the {x_{i}.} Moreover, {\mathrm{T}_{\rho}\circ\mathrm{En}=\mathrm{En}\circ T_{\rho}} by the Fourier formula for {\mathrm{T}_{\rho}.} Since {2}-norms are always smaller than 4-norms on probability spaces, we’ve got the following corollary of Proposition 2.

Lemma 4. For all {\rho\le\frac{1}{10}} and all {f\colon\mathbb{Z}/p\mathbb{Z}\rightarrow\mathbb{C}} we have

\displaystyle \|\mathrm{T}_{\rho}f\|_{4}^{4}\le\|\mathrm{En}\left(f\right)\|_{4}^{4}+\|f-\mathbb{E}\left[f\right]\|_{4}^{4}.

We now reach the final little trick. We define a measure space {Y} whose underlying set is {\mathbb{Z}/p\mathbb{Z}\sqcup\left\{ 0,1\right\} ^{\left\{ p-1\right\} },} and where the measure is given by {\mu\left(i\right)=\frac{1}{p}} for {i\in\mathbb{Z}/p\mathbb{Z}} and {\mu\left(x\right)=\frac{1}{2}^{p-1}} for {x\in\left\{ -1,1\right\} ^{p-1}.} We let {B\colon\mathcal{F}\left(X\right)\rightarrow\mathcal{F}\left(Y\right)} be given by {Bf=\mathrm{En}f} on {\left\{ -1,1\right\} ^{p-1}} and letting it be {f-\mathbb{E}\left[f\right]} on {\mathbb{Z}/p\mathbb{Z}.} This way Lemma 4 takes the form {\|\mathrm{T}_{\rho}f\|_{4}\le\|Bf\|_{4}.}

4. Tensorised operators

The operator {\mathrm{T}_{\rho}} on {\mathbb{Z}/p\mathbb{Z}^{n}} satisfies {\mathrm{T}_{\rho}=\mathrm{T}_{\rho}^{\otimes n},} where the latter {\mathrm{T}_{\rho}} refers to the noise operator on {\mathbb{Z}/p.} The characters {\chi_{\gamma}} satisfy {\chi_{\gamma}=\bigotimes\chi_{\gamma_{i}},} and so we have the Fourier formula

\displaystyle \mathrm{T}_{\rho}f  =\sum_{\gamma}\rho^{\#\left\{ i:\gamma_{i}\ne0\right\} }\hat{f}\left(\gamma\right)\chi_{\gamma}.

We also have

\displaystyle L_{S}\left[f\right]=\bigotimes_{i\in S}\left(f\mapsto f-\mathbb{E}\left[f\right]\right)\otimes\bigotimes_{i\notin S}Id,

and so

\displaystyle L_{S}\left[f\right]=\sum_{\gamma:\gamma_{i}\ne0\text{ for all }i\in S}\hat{f}\left(\gamma\right)\chi_{\gamma}.

This will allow us to conclude that

\displaystyle L_{S}\left[\mathrm{T}_{\rho}f\right]_{S\rightarrow x}=\rho^{\left|S\right|}T_{\rho}L_{S}[f]_{S\rightarrow x}.

We will also encounter the operator {\mathrm{En}^{\otimes n},} which by abusing notation we also call {\mathrm{En}} encodes

\displaystyle f=\sum_{\gamma}\hat{f}\left(\gamma\right)\chi_{\gamma}

as the function {\sum_{\gamma}\widehat{f}\left(\gamma\right)\prod_{i=1}^{n}x_{pi+\gamma_{i}}} on {\left\{ -1,1\right\} ^{n\left(p-1\right)}.}
Now finally we can get to the understanding of the operator {B^{\otimes n}.} The space {Y^{n}} is the disjoint union of {2^{n}} spaces of the form

\displaystyle \left(\mathbb{Z}/p\mathbb{Z}\right)^{S}\times\left(\left\{ -1,1\right\} ^{p-1}\right)^{\left[n\right]\setminus S}.

By definition of the tensor product, for {x,y\in\left(\mathbb{Z}/p\mathbb{Z}\right)^{S}\times\left(\left\{ -1,1\right\} ^{p-1}\right)^{\left[n\right]\setminus S}} is the function

\displaystyle B^{n}f\left(x,y\right)=\mathrm{En}\left(L_{S}[f]_{S\rightarrow x}\right)\left(y\right).

5. Finishing the proof

Proof: Lemmas 3 and 4 yield:

\displaystyle \|\mathrm{T}_{\rho}f\|_{4}^{4}  \le\|B^{\otimes n}f\|_{4}^{4} \displaystyle =\sum_{S\subseteq\left[n\right]}\mathbb{E}_{x\sim\mathbb{Z}/p\mathbb{Z}^{S}}\|\mathrm{En}\left(L_{S}[f]_{S\rightarrow x}\right)\|_{4}^{4},

for any {\rho\le\frac{1}{10}.} We now have

\displaystyle \|\mathrm{T}_{\frac{\rho}{\sqrt{3}}}f\|_{4}^{4} \displaystyle =\sum_{S\subseteq\left[n\right]}(\frac{1}{\sqrt{3}})^{\left|S\right|}\mathbb{E}_{x\sim\mathbb{Z}/p\mathbb{Z}^{S}}\|\mathrm{T}_{\frac{1}{\sqrt{3}}}\left(\mathrm{En}\left(L_{S}[f]_{S\rightarrow x}\right)\right)\|_{4}^{4},
\displaystyle  \le\sum_{S\subseteq\left[n\right]}\mathbb{E}_{x\sim\mathbb{Z}/p\mathbb{Z}^{S}}\|\mathrm{En}\left(L_{S}[f]_{S\rightarrow x}\right)\|_{2}^{4}
\displaystyle =\sum_{S\subseteq\left[n\right]}\mathbb{E}_{x\sim\mathbb{Z}/p\mathbb{Z}^{S}}\|L_{S}[f]_{S\rightarrow x}\|_{2}^{4}.

The first equality follows from the formula {L_{S}\left[\mathrm{T}_{\rho}f\right]_{S\rightarrow x}=\rho^{\left|S\right|}\mathrm{T}_{\rho}L_{S}\left[f\right]_{S\rightarrow x}} and the fact that {\mathrm{T_{\rho}}} commutes with the encoding. The inequality used hypercontractivity on the discrete cube. The last equality follows from the fact that the {\mathrm{En}} operator preserves 2-norms. \Box






by Gil Kalai at July 24, 2020 03:21 PM UTC