Theory of Computing Blog Aggregator

The stable matching problem takes as input the preferences from two groups of agents (most famously medical students and supervisors of internships), and pairs up agents from each group in a way that encourages everyone to play along: no pair of agents would rather go their own way together than take the pairings they were both given. A solution can always be found by the Gale–Shapley algorithm, but there are generally many solutions, described by the lattice of stable matchings. Some pairs of agents are included in at least one stable matching, while some other pairs are never matched. In this way, each instance of stable matchings gives rise to a graph, the graph of stably matchable pairs. This graph is the subject and title of my latest preprint, arXiv:2010.09230, which asks: Which graphs can arise this way? How hard is it to recognize these graphs, and infer a stable matching instance that might have generated them? How does the graph structure relate to the lattice structure?

For some answers, see the preprint. One detail is connected to my previous post, on polyhedra with no two disjoint faces (even though there are no polyhedra in the new preprint): the (prism,\(K_{3,3}\))-minor-free graphs discussed there come up in proving an equivalence between outerplanar graphs of stably matchable pairs and lattices of closures of oriented trees. Instead of providing any technical details of any the other results in the paper, though, I thought it would be more fun to show a few visual highlights.

The following figure shows a cute mirror-inversion trick (probably already known, although I don’t know where or by whom) for embedding an arbitrary bipartite graph as an induced subgraph of a regular bipartite graph. I use it to show that graphs of stably matchable pairs have no forbidden induced subgraphs:

Embedding a bipartite graph as an induced subgraph of a regular bipartite graph

This next one depicts a combinatorial description of a stable matching instance having a \(6\times 5\) grid as its graph, in terms of the top and bottom matchings in the lattice of matchings, the “rotations” that can be used to move between matchings in this lattice, and a partial order on the rotations. For what I was doing in this paper, these rotation systems were much more convenient to work with than preferences.

Rotation system describing a system of stable matchings having a 6x5 grid as its graph

All the main ideas for a proof of NP-completeness of recognizing these graphs, by reduction from not-all-equal 3-satisfiability, are visible in the next picture. The proof now in the paper is significantly more complicated, though, because the construction in this image produces nonplanar graphs but I wanted a proof that would also apply in the planar case.

NP-completeness reduction from NAE3SAT to recognizing graphs of stably matchable pairs

The last one shows a sparse graph that can be represented as a graph of stably-matching pairs (because it’s outerplanar, bipartite, and biconnected) but has a high-degree vertex. If we tried to test whether it could be realized by doing a brute-force search over preference systems, the time would be factorial in the degree, but my preprint provides faster algorithms that are only singly exponential in the number of edges.

Outerplanar graph of stably matchable pairs with a factorial number of potential preference systems

(Discuss on Mastodon)

by David Eppstein at October 19, 2020 08:29 PM UTC

 Since I was born on Oct 1, 1960 (that's not true---if I posted  my real birthday I might get my  identity stolen), I will do a nature vs nurture post based on my life, which seems less likely to offend then doing it on someone else's life. I'll just rattle off some random points on Nature vs Nurture.

1) Is it plausible that I was born with some math talent? Is plausible that I was born with some talent to understand the polynomial van der Warden theorem? What is the granularity of nature-given or nurture-given abilities?

2) My dad was a HS English teacher and later Vice-Principal. My mom taught English at a Community college. Readers of the blog might think, given my spelling and grammar, that I was switched at birth. My mom says (jokingly?) that I was switched at birth since she thinks I am good at math.

a) I am not THAT good at math. Also see next point.

b) While there are some math families, there are not many. See my post here.

c) I think being raised in an intellectual atmosphere by two EDUCATORS who had the money to send me to college and allowed me the freedom to study what I wanted to  is far more important than the rather incidental matter of what field I studied.

d) Since my parents never went into math or the sciences it is very hard to tell  if they were `good at math' or even what that means.

3) There were early signs I was INTERESTED in math, though not that I was good at it.

a) In fourth grade I wanted to know how many SECONDS were in a century so I spend some time figuring it out on paper. Did I get the right answer?  I forgot about leap years.

b) I was either a beneficiary of, or a victim of, THE NEW MATH. So I learned about comm. and assoc. operations in 8th grade. We were asked to come up with our own operations. I wanted to come up with an operation that was comm. but not assoc. I did! Today I would write it as f(x,y) = |x-y|. This is the earliest I can think of where I made up a nice math problem. Might have been the last time I made up a nice math problem AND solved it without help. 

c) In 10th grade I took some Martin Gardner books out of the library. The first theorem I learned not-in-school was that a graph is Eulerian iff every vertex has even degree. I read the chapter on Soma cubes and bought a set. (Soma cubes are explained here.) 

d) I had a talent (nature?) at Soma Cubes.  I did every puzzle in the book in a week, diagrammed them, and even understood (on some level) the proofs that some could not be done. Oddly I am NOT good at 3-dim geom. Or even 2-dim geom.  For 1-dim I hold my own!

e) Throughout my childhood I noticed odd logic and odd math things that were said: 

``Here at WCOZ (a radio station) we have an AXIOM, that's like a saying man, that weekends should be SEVEN DAYS LONG'' (Unless that axiom resolves CH, I don't think it should be assumed.) 

``To PROVE we have the lowest prices in town we will give you a free camera!'' (how does that prove anything?) 

``This margarine tastes JUST LIKE BUTTER'' (Okay-- so why not just buy butter?)

e) In 9th grade when I learned the quadratic formula I re-derived it once-a-month since I though it was important that one can prove such things.  I heard (not sure from where) that there was no 5th degree equation. At that very moment I told my parents:

I am going to major in math so I can find out why there is no 5th degree equation.

There are worse things for parents to hear from their children. See here for dad's reaction. 

f) When I learned that the earth's orbit around the sun is an ellipse and that the earth was one of the foci, I wondered where the other foci is and if its important. I still wonder about this one. Google has not helped me here, though perhaps I have not phrased the question properly. If you know the answer please leave a comment. 

g) I also thought about The Ketchup problem and other problems, that I won't go into since I already blogged about them  here

4) I was on the math team in high school, but wasn't very good at it. I WAS good at making up math team problems. I am now on the committee that makes up the Univ of MD HS math competition. I am still not good at solving the problems but good at making them up. 

5) From 9th grade on before I would study for an exam by making up what I thought would be a good exam and doing that. Often my exam was a better test of knowledge than the exam given. In college I helped people in Math and Physics by making up exams for them to work on as practice. 

6) I was good at reading, understanding, and explaining papers. 

7) I was never shy about asking for help. My curiosity exeeded by ego... by a lot!

8) Note that items 5,6, and 7 above do not mention SOLVING problems. The papers I have written are of three (overlapping) types:

a) I come up with the problem, make some inroads on it based on knowledge, and then have people cleverer (this is often) or with more knowledge (this is rarer) help me solve the problems.

b) I come up with the problem, and combine two things I know from other papers to solve it. 

c) Someone else asks for my help on something and I have the knowledge needed. I can only recall one time where this lead to a paper. 

NOTE- I do not think I have ever had a clever or new technique. CAVEAT: the diff between combining  known knowledge in new ways and having a clever or new technique is murky. 

8) Over time these strengths and weaknesses have gotten more extreme. It has become a self-fulfilling prophecy where I spend A LOT of time making up problems without asking for help, but when I am trying to solve a problem I early on ask for help. Earlier than I should? Hard to know. 

9) One aspect is `how good am I at math' But a diff angle is that I like to work on things that I KNOW are going to work out, so reading an article is better than trying to create new work. This could be a psychological thing. But is that nature or nurture?  

10) Could I be a better problem solver? Probably. Could I be a MUCH better problem solver? NO. Could I have been a better problem solver  I did more work on that angle when I was younger? Who knows? 

11) Back to the Quintic: I had the following thought in ninth grade, though I could not possibly have expressed it: The question of, given a problem, how hard is it, upper and lower bounds, is a fundamental one that is worth a lifetime of study. As such my interest in complexity theory and recursion theory goes back to ninth grade or even further. My interest in Ramsey Theory for its own sake (and not in the service of complexity theory) is much more recent and does not quite fit into my narrative. But HEY- real life does not have as well defined narratives as fiction does. 

12) Timing and Luck: IF I had been in grad student at a slight diff time I can imagine doing work on  algorithmic  Galois theory. Here  is a talk on Algorithmic  Galois theory. Note that one of the earliest results is by Landau and Miller from 1985---I had a course from Miller on Alg. Group Theory in 1982. This is NOT a wistful `What might have been' thought. Maybe I would have sucked at it, so its just as well I ended up doing recursion theory, then Ramsey theory, then recursion-theoretic Ramsey Theory, then muffins. 

by gasarch ( at October 19, 2020 03:55 PM UTC

Yamini Bansal, Gal Kaplun, and Boaz Barak

(See also paper on arxiv, code on gitlab, upcoming talk by Yamini&Boaz, video of past talk)

A central puzzle of deep learning is the question of generalization. In other words, what can we deduce from the training performance of a neural network about its test performance on fresh unseen examples. An influential paper of Zhang, Bengio, Hardt, Recht, and Vinyals showed that the answer could be “nothing at all.”

Zhang et al. gave examples where modern deep neural networks achieve 100% accuracy on classifying their training data, but their performance on unseen data may be no better than chance. Therefore we cannot give meaningful guarantees for deep learning using traditional “generalization bounds” that bound the difference between test and train performance by some quantity that tends to zero as the number of datapoints n increases. This is why (to quote their title), Zhang et al. claimed that “understanding deep learning requires rethinking generalization”.

But what if the issue isn’t that we’ve been doing generalization bounds wrong, but rather that we’ve been doing deep learning (or more accurately, supervised deep learning) wrong?

Self Supervised + Simple fit (SSS) learning

To explain what we mean, let’s take a small detour to contrast “traditional” or “end-to-end” supervised learning with a different approach to supervised learning, which we’ll call here “Self-Supervised + Simple fit” or “SSS algorithms.” (While the name “SSS algorithms” is new, the approach itself has a long history and has recently been used with great success in practice; our work gives no new methods—only new analysis.)

The classical or “end-to-end” approach for supervised learning can be phrased as “ask and you shall receive”. Given labeled data, you ask (i.e., run an optimizer) for a complex classifier (e.g., a deep neural net) that fits the data (i.e., outputs the given labels on the given data points) and hope that it will be successful on future, unseen, data points as well. End-to-end supervised learning achieves state-of-art results for many classification problems, particularly for computer vision datasets ImageNet and CIFAR-10.

However, end-to-end learning does not directly correspond to the way humans learn to recognize objects (see also this talk of LeCun). A baby may see millions of images in the first year of her life, but most of them do not come with explicit labels. After seeing those images, a baby can make future classifications using very few labeled examples. For example, it might be enough to show her once what is a dog and what is a cat for her to correctly classify future dogs and cats, even if they look quite different from these examples.

End-to-end learning vs SSS algorithms.Figure 1: Cartoon of end-to-end vs SSS learning

In recent years, practitioners have proposed algorithms that are more similar to human learning than supervised learning. Such methods separate the process into two stages. In the first stage, we do representation learning whereby we use unlabeled data to learn a representation: a complex map (e.g., a deep neural net) mapping the inputs into some “representation space.” In the second stage, we fit a simple classifier (e.g., a linear threshold function) to the representation of the datapoints and the given labels. We call such algorithms “Self-Supervision + Simple fit” or SSS algorithms. (Note that, unlike other representation-learning based classifiers, the complex representation is “frozen” and not “fine-tuned” in the second stage, where only a simple classifier is used on top of it.)

While we don’t have a formal definition, a “good representation” should make downstream tasks easier, in the sense of allowing for fewer examples or simpler classifiers. We typically learn a representation via self supervision , whereby one finds a representation minimizing an objective function that intuitively requires some “insight” into the data. Approaches for self-supervision include reconstruction, where the objective involves recovering data points from partial information (e.g., recover missing words or pixels), and contrastive learning, where the objective is to find a representation that make similar points close and dissimilar points far (e.g., in Euclidean space).

SSS algorithms have been traditionally used in natural language processing, where unlabeled data is plentiful, but labeled data for a particular task is often scarce. But recently SSS algorithms were also used with great success even for vision tasks such as ImageNet and CIFAR10 where all data is labeled! While SSS algorithms do not yet beat the state-of-art supervised learning algorithms, they do get pretty close. SSS algorithms also have other practical advantages over “end-to-end supervised learning”: they can make use of unlabeled data, the representation could be useful for non-classification tasks, and may have improved out of distribution performance. There has also been recent theoretical analysis of contrastive and reconstruction learning under certain statistical assumptions (see Arora et al and Lee et al).

The generalization gap of SSS algorithms

In a recent paper, we show that SSS algorithms not only work in practice, but work in theory too.

Specifically, we show that such algorithms have (1) small generalization gap and (2) we can prove (under reasonable assumptions) that their generalization gap tends to zero with the number of samples, with bounds that are meaningful for many modern classifiers on the CIFAR-10 and ImageNet datasets. We consider the setting where all data is labeled, and the same dataset is used for both learning the representation and fitting a simple classifier. The resulting classifier includes the overparameterized representation, and so we cannot simply apply “off the shelf” generalization bounds. Indeed, a priori it’s not at all clear that the generalization gap for SSS algorithms should be small.

To get some intuition for the generalization gap of SSS algorithms, consider the experiment where we inject some label noise into our distribution. That is, we corrupt an \eta fraction of the labels in both the train and test set, replacing them with random labels. Already in the noiseless case (\eta=0), the generalization gap of SSS algorithms is noticeably smaller than that of end-to-end supervised learning. As we increase the noise, the difference becomes starker. End-to-end supervised learning algorithms can always achieve 100% training accuracy, even as the test accuracy deteriorates, since they can “memorize” all the training labels they are given. In contrast, for SSS algorithms, both training and testing accuracy decrease together as we increase the noise, with training accuracy correlating with test performance.

Figure 2: Generalization gap of end-to-end and SSS algorithms on CIFAR 10 as a function of noise (since there are 10 classes, 90% noisy samples corresponds to the Zhang et al experiment). See also interactive version.

Our main theoretical result is a formal proof of the above statement. To do so, we consider training with a small amount of label noise (say \eta=5\%) and define the following quantities:

  • The robustness gap is the amount by which training accuracy degrades between the “clean” (\eta=0) experiment and the noisy one. (In this and all other quantities, the training accuracy is measured with respect to the original uncorrupted labels.)
  • The memorization gap considers the noisy experiment (\eta=5\%) and measures the amount by which performance on the corrupted data samples (where we received the wrong label) is worse than performance on the overall training set. If the algorithm can memorize all given labels, it will be perfectly wrong on the corrupted data samples, leading to a large memorization gap.
  • The rationality gap is the difference between the performance on the corrupted data samples and performance on unseen test examples. For example, if x is an image of a dog, then it measures the difference between the probability that f(x)=\text{"dog"} when (x,\text{"cat"}) is in the training set and the probability that f(x)=\text{"dog"} when x is not in the training set at all. Since intuitively, getting the wrong label should be worse than getting no label at all, we typically expect the rationality gap to be around zero or negative. Formally we define the rationality gap to the maximum between 0 and the difference above, so it is always non-negative. We think of an algorithm with a significant positive rationality gap as “irrational.”

By summing up the quantities above, we get the following inequality, which we call the RRM bound

generalization gap \leq robustness gap + rationality gap + memorization gap

In practice, the robustness and rationality gaps are always small, both for end-to-end supervised algorithms (which have a large generalization gap), and for SSS algorithms (which have a small generalization gap). Thus the main contribution to the generalization gap comes from the memorization gap. Roughly speaking, our main result is the following:

If the complexity of the second-stage classifier of an SSS algorithm is smaller than the number of samples then the generalization gap is small.

See the paper for the precise definition of “complexity,” but it is bounded by the number of bits that it takes to describe the simple classifier (no matter how complex is the representation used in the first stage). Our bound yields non-vacuous results in various practical settings; see the figures below or their interactive version.

Figure 3: Empirical study of the generalization gap of a variety of of SSS algorithms on CIFAR-10. Each vertical line corresponds to one model, sorted by generalization gap. The RRM bound is typically near-tight, and our complexity upper bound is often non vacuous. Use this webpage to interact with figures 3 and 4.
Figure 4: Empirical study of gaps for the ImageNet dataset. Because of limited computational resources, we only evaluated the theoretical bound for two models in this dataset.

What’s next

There are still many open questions. Can we prove rigorous bounds on robustness and rationality? We have some preliminary results in the paper, but there is much room for improvement. Similarly, our complexity-based upper bound is far from tight at the moment, though the RRM bound itself is often surprisingly tight. Our work only applies to SSS algorithms, but people have the intuition that even end-to-end supervised learning algorithms implicitly learn a representation. So perhaps these tools can apply to such algorithms as well. As mentioned, we don’t yet have formal definitions for “good representations,” and the choice of the self-supervision task is still somewhat of a “black art” – can we find a more principled approach?

by Boaz Barak at October 19, 2020 12:30 AM UTC

Authors: Penglong Zhai, Shihua Zhang
Download: PDF
Abstract: Low-rank approximation models of data matrices have become important machine learning and data mining tools in many fields including computer vision, text mining, bioinformatics and many others. They allow for embedding high-dimensional data into low-dimensional spaces, which mitigates the effects of noise and uncovers latent relations. In order to make the learned representations inherit the structures in the original data, graph-regularization terms are often added to the loss function. However, the prior graph construction often fails to reflect the true network connectivity and the intrinsic relationships. In addition, many graph-regularized methods fail to take the dual spaces into account. Probabilistic models are often used to model the distribution of the representations, but most of previous methods often assume that the hidden variables are independent and identically distributed for simplicity. To this end, we propose a learnable graph-regularization model for matrix decomposition (LGMD), which builds a bridge between graph-regularized methods and probabilistic matrix decomposition models. LGMD learns two graphical structures (i.e., two precision matrices) in real-time in an iterative manner via sparse precision matrix estimation and is more robust to noise and missing entries. Extensive numerical results and comparison with competing methods demonstrate its effectiveness.

at October 19, 2020 11:25 PM UTC

Authors: Adrian de Wynter
Download: PDF
Abstract: We consider the problem of finding the set of architectural parameters for a chosen deep neural network which is optimal under three metrics: parameter size, inference speed, and error rate. In this paper we state the problem formally, and present an approximation algorithm that, for a large subset of instances behaves like an FPTAS with an approximation error of $\rho \leq |{1- \epsilon}|$, and that runs in $O(|{\Xi}| + |{W^*_T}|(1 + |{\Theta}||{B}||{\Xi}|/({\epsilon\, s^{3/2})}))$ steps, where $\epsilon$ and $s$ are input parameters; $|{B}|$ is the batch size; $|{W^*_T}|$ denotes the cardinality of the largest weight set assignment; and $|{\Xi}|$ and $|{\Theta}|$ are the cardinalities of the candidate architecture and hyperparameter spaces, respectively.

at October 19, 2020 11:22 PM UTC

Authors: Pavel Dvořák, Michal Koucký
Download: PDF
Abstract: In this paper we study the computational complexity of functions that have efficient card-based protocols. Card-based protocols were proposed by den Boer [EUROCRYPT '89] as a means for secure two-party computation. Our contribution is two-fold: We classify a large class of protocols with respect to the computational complexity of functions they compute, and we propose other encodings of inputs which require fewer cards than the usual 2-card representation.

at October 19, 2020 11:20 PM UTC

Authors: Suhas Thejaswi, Aristides Gionis
Download: PDF
Abstract: We study a family of temporal reachability problems under waiting-time restrictions. In particular, given a temporal graph and a set of source vertices, we find the set of vertices that are reachable from a source via a time-respecting path, and such that the difference in timestamps between consecutive edges is at most a resting time. This kind of problems have several interesting applications in understanding the spread of a disease in a network, tracing contacts in epidemic outbreaks, and finding signaling pathways in the brain network.

We present an algebraic algorithm based on constrained multilinear sieving for solving the restless reachability problems we propose. With an open-source implementation we demonstrate that the algorithm can scale to large temporal graphs with tens of millions of edges, despite the problem being NP-hard. The implementation is efficiently engineered and highly optimized. For instance, we can solve the restless reachability problem by restricting the path length to $9$ in a real-world graph dataset with over 36 million directed edges in less than one hour on a 4-core Haswell desktop.

at October 19, 2020 11:22 PM UTC

Authors: Mathieu Carrière, Frédéric Chazal, Marc Glisse, Yuichi Ike, Hariprasad Kannan
Download: PDF
Abstract: Solving optimization tasks based on functions and losses with a topological flavor is a very active and growing field of research in Topological Data Analysis, with plenty of applications in non-convex optimization, statistics and machine learning. All of these methods rely on the fact that most of the topological constructions are actually stratifiable and differentiable almost everywhere. However, the corresponding gradient and associated code is always anchored to a specific application and/or topological construction, and do not come with theoretical guarantees. In this article, we study the differentiability of a general functional associated with the most common topological construction, that is, the persistence map, and we prove a convergence result of stochastic subgradient descent for such a functional. This result encompasses all the constructions and applications for topological optimization in the literature, and comes with code that is easy to handle and mix with other non-topological constraints, and that can be used to reproduce the experiments described in the literature.

at October 19, 2020 11:30 PM UTC

Authors: Marcin Jurdziński, Rémi Morvan, Pierre Ohlmann, K. S. Thejaswini
Download: PDF
Abstract: Progress-measure lifting algorithms for solving parity games have the best worst-case asymptotic runtime, but are limited by their asymmetric nature, and known from the work of Czerwi\'nski et al. (2018) to be subject to a matching quasi-polynomial lower bound inherited from the combinatorics of universal trees. Parys (2019) has developed an ingenious quasi-polynomial McNaughton- Zielonka-style algorithm, and Lehtinen et al. (2019) have improved its worst-case runtime. Jurdzi\'nski and Morvan (2020) have recently brought forward a generic attractor-based algorithm, formalizing a second class of quasi-polynomial solutions to solving parity games, which have runtime quadratic in the size of universal trees. First, we adapt the framework of iterative lifting algorithms to computing attractor-based strategies. Second, we design a symmetric lifting algorithm in this setting, in which two lifting iterations, one for each player, accelerate each other in a recursive fashion. The symmetric algorithm performs at least as well as progress-measure liftings in the worst-case, whilst bypassing their inherent asymmetric limitation. Thirdly, we argue that the behaviour of the generic attractor-based algorithm of Jurdzinski and Morvan (2020) can be reproduced by a specific deceleration of our symmetric lifting algorithm, in which some of the information collected by the algorithm is repeatedly discarded. This yields a novel interpretation of McNaughton-Zielonka-style algorithms as progress-measure lifting iterations (with deliberate set-backs), further strengthening the ties between all known quasi-polynomial algorithms to date.

at October 19, 2020 11:29 PM UTC

Authors: Biao Zhang, Peter Wonka
Download: PDF
Abstract: We propose a novel 3d shape representation for 3d shape reconstruction from a single image. Rather than predicting a shape directly, we train a network to generate a training set which will be feed into another learning algorithm to define the shape. Training data generating networks establish a link between few-shot learning and 3d shape analysis. We propose a novel meta-learning framework to jointly train the data generating network and other components. We improve upon recent work on standard benchmarks for 3d shape reconstruction, but our novel shape representation has many applications.

at October 19, 2020 11:31 PM UTC

Authors: Marek Adamczyk, Brian Brubach, Fabrizio Grandoni, Karthik A. Sankararaman, Aravind Srinivasan, Pan Xu
Download: PDF
Abstract: We consider the Stochastic Matching problem, which is motivated by applications in kidney exchange and online dating. In this problem, we are given an undirected graph. Each edge is assigned a known, independent probability of existence and a positive weight (or profit). We must probe an edge to discover whether or not it exists. Each node is assigned a positive integer called a timeout (or a patience). On this random graph we are executing a process, which probes the edges one-by-one and gradually constructs a matching. The process is constrained in two ways. First, if a probed edge exists, it must be added irrevocably to the matching (the query-commit model). Second, the timeout of a node $v$ upper-bounds the number of edges incident to $v$ that can be probed. The goal is to maximize the expected weight of the constructed matching.

For this problem, Bansal et al. (Algorithmica 2012) provided a $0.33$-approximation algorithm for bipartite graphs and a $0.25$-approximation for general graphs. We improve the approximation factors to $0.39$ and $0.269$, respectively.

The main technical ingredient in our result is a novel way of probing edges according to a not-uniformly-random permutation. Patching this method with an algorithm that works best for large-probability edges (plus additional ideas) leads to our improved approximation factors.

at October 19, 2020 11:27 PM UTC

Authors: Suman K. Bera, Noujan Pashanasangi, C. Seshadhri
Download: PDF
Abstract: Counting homomorphisms of a constant sized pattern graph $H$ in an input graph $G$ is a fundamental computational problem. There is a rich history of studying the complexity of this problem, under various constraints on the input $G$ and the pattern $H$. Given the significance of this problem and the large sizes of modern inputs, we investigate when near-linear time algorithms are possible. We focus on the case when the input graph has bounded degeneracy, a commonly studied and practically relevant class for homomorphism counting. It is known from previous work that for certain classes of $H$, $H$-homomorphisms can be counted exactly in near-linear time in bounded degeneracy graphs. Can we precisely characterize the patterns $H$ for which near-linear time algorithms are possible?

We completely resolve this problem, discovering a clean dichotomy using fine-grained complexity. Let $m$ denote the number of edges in $G$. We prove the following: if the largest induced cycle in $H$ has length at most $5$, then there is an $O(m\log m)$ algorithm for counting $H$-homomorphisms in bounded degeneracy graphs. If the largest induced cycle in $H$ has length at least $6$, then (assuming standard fine-grained complexity conjectures) there is a constant $\gamma > 0$, such that there is no $o(m^{1+\gamma})$ time algorithm for counting $H$-homomorphisms.

at October 19, 2020 12:00 AM UTC

Authors: Pierre Bourhis, Alejandro Grez, Louis Jachiet, Cristian Riveros
Download: PDF
Abstract: In the last years, enumeration algorithms with bounded delay have attracted a lot of attention for several data management tasks. Given a query and the data, the task is to preprocess the data and then enumerate all the answers to the query one by one and without repetitions. This enumeration scheme is typically useful when the solutions are treated on the fly or when we want to stop the enumeration once the pertinent solutions have been found. However, with the current schemes, there is no restriction on the order how the solutions are given and this order usually depends on the techniques used and not on the relevance for the user.

In this paper we study the enumeration of monadic second order logic (MSO) over words when the solutions are ranked. We present a framework based on MSO cost functions that allows to express MSO formulae on words with a cost associated with each solution. We then demonstrate the generality of our framework which subsumes, for instance, document spanners and regular complex event processing queries and adds ranking to them. The main technical result of the paper is an algorithm for enumerating all the solutions of formulae in increasing order of cost efficiently, namely, with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this algorithm is based on using functional data structures, in particular, by extending functional Brodal queues to suit with the ranked enumeration of MSO on words.

at October 19, 2020 11:25 PM UTC

Authors: Adrian de Wynter
Download: PDF
Abstract: We present a greedy algorithm for solving binary classification problems in situations where the dataset is either too small or not fully representative of the problem being solved, and obtaining more data is not possible. This algorithm is of particular interest when training small models that have trouble generalizing. It relies on a trained model with loose accuracy constraints, an iterative hyperparameter pruning procedure, and a function used to generate new data. Analysis on correctness and runtime complexity under ideal conditions and an extension to deep neural networks is provided. In the former case we obtain an asymptotic bound of $O\left(|\Theta^2|\left(\log{|\Theta|} + |\theta^2| + T_f\left(| D|\right)\right) + \bar{S}|\Theta||{E}|\right)$, where $|{\Theta}|$ is the cardinality of the set of hyperparameters $\theta$ to be searched; $|{E}|$ and $|{D}|$ are the sizes of the evaluation and training datasets, respectively; $\bar{S}$ and $\bar{f}$ are the inference times for the trained model and the candidate model; and $T_f({|{D}|})$ is a polynomial on $|{D}|$ and $\bar{f}$. Under these conditions, this algorithm returns a solution that is $1 \leq r \leq 2(1 - {2^{-|{\Theta}|}})$ times better than simply enumerating and training with any $\theta \in \Theta$. As part of our analysis of the generating function we also prove that, under certain assumptions, if an open cover of $D$ has the same homology as the manifold where the support of the underlying probability distribution lies, then $D$ is learnable, and viceversa.

at October 19, 2020 12:00 AM UTC

Authors: Evan Sala, Joe Sawada, Abbas Alhakim
Download: PDF
Abstract: The greedy Prefer-same de Bruijn sequence construction was first presented by Eldert et al.[AIEE Transactions 77 (1958)]. As a greedy algorithm, it has one major downside: it requires an exponential amount of space to store the length $2^n$ de Bruijn sequence. Though de Bruijn sequences have been heavily studied over the last 60 years, finding an efficient construction for the Prefer-same de Bruijn sequence has remained a tantalizing open problem. In this paper, we unveil the underlying structure of the Prefer-same de Bruijn sequence and solve the open problem by presenting an efficient algorithm to construct it using $O(n)$ time per bit and only $O(n)$ space. Following a similar approach, we also present an efficient algorithm to construct the Prefer-opposite de Bruijn sequence.

at October 19, 2020 11:24 PM UTC

Authors: Thomas Kesselheim, Sahil Singla
Download: PDF
Abstract: We introduce online learning with vector costs (\OLVCp) where in each time step $t \in \{1,\ldots, T\}$, we need to play an action $i \in \{1,\ldots,n\}$ that incurs an unknown vector cost in $[0,1]^{d}$. The goal of the online algorithm is to minimize the $\ell_p$ norm of the sum of its cost vectors. This captures the classical online learning setting for $d=1$, and is interesting for general $d$ because of applications like online scheduling where we want to balance the load between different machines (dimensions).

We study \OLVCp in both stochastic and adversarial arrival settings, and give a general procedure to reduce the problem from $d$ dimensions to a single dimension. This allows us to use classical online learning algorithms in both full and bandit feedback models to obtain (near) optimal results. In particular, we obtain a single algorithm (up to the choice of learning rate) that gives sublinear regret for stochastic arrivals and a tight $O(\min\{p, \log d\})$ competitive ratio for adversarial arrivals.

The \OLVCp problem also occurs as a natural subproblem when trying to solve the popular Bandits with Knapsacks (\BwK) problem. This connection allows us to use our \OLVCp techniques to obtain (near) optimal results for \BwK in both stochastic and adversarial settings. In particular, we obtain a tight $O(\log d \cdot \log T)$ competitive ratio algorithm for adversarial \BwK, which improves over the $O(d \cdot \log T)$ competitive ratio algorithm of Immorlica et al. [FOCS'19].

at October 19, 2020 11:29 PM UTC

Some research I’ve been doing led me to consider the (prism,\(K_{3,3}\))-minor-free graphs. It’s not always easy to go from forbidden minors to the graphs that forbid them, or vice versa, but in this case I think there’s a nice characterization, which I’m posting here because it doesn’t fit into the research writeup: these are the graphs whose nontrivial triconnected components are \(K_5\), wheel graphs, or the graph \(K_5-e\) of the triangular bipyramid. The illustration below shows an example of a graph with this structure, with its nontrivial triconnected components colored red and yellow. There’s a simpler and more geometric way to say almost the same thing: the only convex polyhedra that do not have two vertex-disjoint faces are the pyramids and the triangular bipyramid.

A (prism, K_{3,3})-minor-free graph, with its nontrivial triconnected components colored red and yellow

Some definitions:

  • Here by the prism graph I mean the graph of the triangular prism. Any other prism has this one as a minor, and so is irrelevant as a forbidden minor. However, the pyramids in this structure can have any polygon as their base, corresponding to wheel graphs with arbitrarily many vertices.

  • \(K_{3,3}\) is a complete bipartite graph with three vertices on each side of its bipartition, famous as the utility graph, one of the two forbidden minors for planar graphs. The triangular prism graph and \(K_{3,3}\) are the only two 3-regular graphs with six vertices.

The prism graph and K_{3,3}

  • The triconnected components of a graph are the graphs associated with the nodes of its SPQR tree, or of the SPQR trees of its biconnected components. These are cycle graphs, dipole multigraphs, or 3-connected graphs, and by “nontrivial” I mean the ones that are not cycles or dipoles. A triconnected component might not be a subgraph of the given graph, because it can have additional edges that correspond to paths in the given graph. For instance, subdividing the edges of any graph into paths, or more generally replacing edges by arbitrary series-parallel graphs, does not change its set of nontrivial triconnected components.

  • I’m using “face” in the usual three-dimensional meaning, a two-dimensional subset of the boundary of the polyhedron. For higher-dimensional polytopes, “face” has a different meaning that also includes vertices and edges, and “facet” would be used to refer to the \((d-1)\)-dimensional faces, but using that terminology seems overly pedantic here.

Sketch of proof of the characterization of polyhedra without two disjoint faces: Consider any polyhedron without disjoint faces. If one face shares an edge with all the others, it’s a Halin graph, a graph formed by linking the leaves of a tree into a cycle; if the tree is a star, it’s a pyramid, and otherwise contracting all but one of the interior edges of the tree, and then all but four of the cycle edges, will produce a prism minor. In the remaining case, some two faces share only a vertex \(v\), which must have degree four or more. Each face that is disjoint from \(v\) must touch all that faces incident to \(v\), which can only happen when there is one face disjoint from \(v\) (a pyramid) or two faces disjoint from \(v\), neither of which has an edge disjoint from the other one (a bipyramid).

Sketch of a lemma that every convex polyhedron with two disjoint faces has a prism minor: glue a pyramidal cap into each of the two faces, producing a larger convex polyhedron which by either Steinitz’s theorem or Balinski’s theorem is necessarily 3-connected, and find three vertex-disjoint paths between the apexes of the attached pyramids. The parts of these paths outside the two glued pyramids, together with the boundaries of the two faces, form a subdivision of a prism.

Sketch of proof of the characterization of (prism,\(K_{3,3}\))-minor-free graphs: The nontrivial triconnected components are exactly the maximal triconnected minors of the given graph, so if either of the two triconnected forbidden minors is to be found in the given graph, it will be found in one of the triconnected components. \(K_5\) and the triangular bipyramid are too small to have one of the forbidden minors. The only 3-connected minors of the pyramid graphs are smaller pyramids, obtained by contracting one of the cycle edges of the pyramid, so these also do not have a forbidden minor. Therefore the graphs of the stated form are all (prism,\(K_{3,3}\))-minor-free.

In the other direction, suppose that a graph is (prism,\(K_{3,3}\))-minor-free. Each triconnected component is a minor, so it must also be (prism,\(K_{3,3}\))-minor-free. What can these components look like? Forbidding \(K_{3,3}\) as a minor rules out nonplanar components other than \(K_5\), by a theorem of Wagner1 and Hall.2 So the remaining components that we need to consider are triconnected planar graphs with no prism minor. These cannot have two disjoint faces by the lemma, and so they can only be pyramids or the triangular bipyramid.

  1. K. Wagner. Über eine Erweiterung des Satzes von Kuratowski. Deutsche Mathematik, 2:280–285, 1937. 

  2. D. W. Hall. A note on primitive skew curves. Bulletin of the American Mathematical Society, 49(12):935–936, 1943. doi:10.1090/ S0002-9904-1943-08065-2

(Discuss on Mastodon)

by David Eppstein at October 18, 2020 05:06 PM UTC

Let $f: \{0,1\}^n \to \{0, 1\}$ be a boolean function, and let $f_\land (x, y) = f(x \land y)$ denote the AND-function of $f$, where $x \land y$ denotes bit-wise AND. We study the deterministic communication complexity of $f_\land$ and show that, up to a $\log n$ factor, it is bounded by a polynomial in the logarithm of the real rank of the communication matrix of $f_\land$. This comes within a $\log n$ factor of establishing the log-rank conjecture for AND-functions with no assumptions on $f$. Our result stands in contrast with previous results on special cases of the log-rank conjecture, which needed significant restrictions on $f$ such as monotonicity or low $\mathbb{F}_2$-degree. Our techniques can also be used to prove (within a $\log n$ factor) a lifting theorem for AND-functions, stating that the deterministic communication complexity of $f_\land$ is polynomially-related to the AND-decision tree complexity of $f$. The results rely on a new structural result regarding boolean functions $f:\{0, 1\}^n \to \{0, 1\}$ with a sparse polynomial representation, which may be of independent interest. We show that if the polynomial computing $f$ has few monomials then the set system of the monomials has a small hitting set, of size poly-logarithmic in its sparsity. We also establish extensions of this result to multi-linear polynomials $f:\{0,1\}^n \to \mathbb{R}$ with a larger range.

at October 18, 2020 02:44 PM UTC

The next TCS+ talk will take place this coming Wednesday, October 21th at 1:00 PM Eastern Time (10:00 AM Pacific Time, 19:00 Central European Time, 17:00 UTC). Aayush Jain from UCLA will speak about “Indistinguishability Obfuscation from Well-Founded Assumptions” (abstract below).

You can reserve a spot as an individual or a group to join us live by signing up on the online form. Due to security concerns, registration is required to attend the interactive talk. (The link to the YouTube livestream will also be posted on our website on the day of the talk, so people who did not sign up will still be able to watch the talk live.) As usual, for more information about the TCS+ online seminar series and the upcoming talks, or to suggest a possible topic or speaker, please see the website.

Abstract: We present a construction of an indistinguishability obfuscation scheme, whose security rests on the subexponential hardness of four well-founded assumptions. We show the existence of an indistinguishability Obfuscation scheme for all circuits assuming sub-exponential security of the following assumptions:

  • The Learning with Errors (LWE) assumption with arbitrarily small subexponential modulus-to-noise ratio,
  • The SXDH assumption with respect to bilinear groups of prime order p,
  • Existence of a Boolean Pseudorandom Generator (PRG) in \mathsf{NC}^0 with arbitrary polynomial stretch, that is, mapping n bits to n^{1+\tau} bits, for any constant \tau>0.
  • The Learning Parity with Noise (LPN) assumption over \mathbb{Z}_p with error-rate \ell^{-\delta}, where \ell is the dimension of the secret and \delta>0 is an arbitrarily small constant.
    Further, assuming only polynomial security of these assumptions, there exists a compact public-key functional encryption scheme for all circuits.

The main technical novelty is the introduction and construction of a cryptographic pseudorandom generator that we call a Structured-Seed PRG (sPRG), assuming LPN over \mathbb{Z}_p and PRGs in \mathsf{NC}^0. During the talk, I will discuss how structured seed PRGs have evolved from different notions of novel pseudorandom generators proposed in the past few years, and how an interplay between different areas of theoretical computer science played a major role in providing valuable insights leading to this work. Time permitting, I will go into the details of how to construct sPRGs.

Joint work with Huijia (Rachel) Lin and Amit Sahai

by plustcs at October 16, 2020 06:33 AM UTC

by David Eppstein at October 15, 2020 10:15 PM UTC

The Public Broadcasting Service (PBS) launched fifty years ago this month in the United States. The New York Times talks about its fifty reasons how the network mattered. I'll throw in my thoughts.

I was just slightly too old for shows like Sesame Street, Electric Company, Mr. Rogers and Zoom, not that that stopped me from watching them. My kids grew up on Barney and Friends. My daughter even had a toy Barney that interacted with the show, which went as well as you'd expect

PBS introduced me to those great British TV shows for young nerds like me including Monty Python and Doctor Who. I wasn't into Nova but did watch Carl Sagan's Cosmos religiously in high school.

My favorite PBS show was the American Experience, short documentaries about US culture. I remember learning about this history of Coney Island and the quiz show scandals before Robert Redford made a movie about it.

Siskel and Ebert got their start on PBS and became my go to source for movie reviews.

In 1987 PBS broadcasted Ivy League football games. One Saturday I sat down expecting to watch my alma mater and instead got supreme court hearings. Only on PBS could Cornell football get Borked.

by Lance Fortnow ( at October 15, 2020 01:00 PM UTC

Following last year’s successful launch, we are happy to announce the second edition of the conference on Information-Theoretic Cryptography (ITC).

The call for papers for ITC 2021 is out, and, to cheer you up during lockdowns, we prepared a short theme song  

Feel free to add your own verse 😉

The submission deadline is February 1st. Please submit your best work to ITC 2021! We hope to see many of you there!

by Boaz Barak at October 14, 2020 10:05 PM UTC

My latest post on this blog was on December 30th 2019. It seems like a lifetime away. The rate at which paradigm shifting events have been happening in 2020 is staggering. And it might very well be that the worst of 2020 is ahead of us, especially for those of us currently in the USA.

When I started communicating online broadly (blog, twitter) I promised myself to keep it strictly about science (or very closely neighboring topics), so the few lines above is all I will say about the current worldwide situation.

In other news, as is evident from the 10 months hiatus in blogging, I have taken elsewhere (at least temporarily) my need for rapid communication about theorems that currently excite me. Namely to youtube. Since the beginning of the pandemic I have been recording home videos of what would have been typically blog posts, with currently 5 such videos:

  1. A law of robustness for neural networks : I explain the conjecture we recently made that, for random data, any interpolating two-layers neural network must have its Lipschitz constant larger than the squareroot of the ratio between the size of the data set and the number of neurons in the network. This would prove that overparametrization is *necessary* for robustness.
  2. Provable limitations of kernel methods : I give the proof by Zeyuan Allen-Zhu and Yuanzhi Li that there are simple noisy learning tasks where *no kernel* can perform well while simple two-steps procedures can learn.
  3. Memorization with small neural networks : I explain old (classical combinatorial) and new (NTK style) construction of optimally-sized interpolating two-layers neural networks.
  4. Coordination without communication : This video is the only one in the current series where I don’t talk at all about neural networks. Specifically it is about the cooperative multiplayer multiarmed bandit problem. I explain the strategy we devised with Thomas Budzinski to solve this problem (for the stochastic version) without *any* collision at all between the players.
  5. Randomized smoothing for certified robustness : Finally, in the first video chronologically, I explain the only known technique for provable robustness guarantees in neural networks that can scale up to large models.

The next video will be about basic properties of tensors, and how it can be used for smooth interpolation (in particular in the context of our law of robustness conjecture). After that, we will see, maybe more neural networks, maybe more bandits, maybe some non-convex optimization ….

Stay safe out there!

by Sebastien Bubeck at October 14, 2020 03:14 AM UTC

Image may contain: 1 person, eyeglasses and closeup

[If you’re not American, or you’re American but a masochist who enjoys the current nightmare, this post won’t be relevant to you—sorry!]

Until recently, this blog had a tagline that included “HOLD THE NOVEMBER US ELECTION BY MAIL.” So I thought I should warn readers that circumstances have changed in ways that have important practical implications over the next few weeks. It’s no longer that we don’t know whether Trump and Pence will acknowledge a likely loss—rather, it’s that we know they won’t. They were repeatedly asked; we all heard their answers.

That means that the best case, the ideal scenario, is already without precedent in the country’s 240-year history. It’s a president who never congratulates the winner, who refuses to meet him or coordinate a transfer of power, who skips the inauguration, and who’s basically dragged from the White House on January 20, screaming to his supporters (and continuing to scream until his dying breath) that the election was faked.

As I said, that banana-republic outcome is now the best case. But it’s also plausible that Trump simply declares himself the winner on election night, because the mail-in votes, urban votes, yet-to-be-counted votes, or any other votes that trend the wrong way are fake; social media and the Murdoch press amplify this fantasy; Trump calls on Republican-controlled state legislatures to set aside the “rigged” results and appoint their own slates of electors; the legislatures dutifully comply; and the Supreme Court A-OKs it all. If you think none of that could happen, read this Atlantic article from a few weeks ago, carefully to the end, and be more terrified than you’ve ever been in your life. And don’t pretend that you know what would happen next.

I know, I know, I’m mentally ill, it’s Trump Derangement Syndrome, I see Nazis behind every corner just because they killed most of my relatives, a little global pandemic here and economic collapse there and riots and apocalyptic fires and resurgent fascism and I act as though it’s the whole world coming to an end. A few months from now, after everything has gone swimmingly, this post will still be here and you can come back and tell me how crazy I was. I accept that risk.

For now, though, the best chance to avert a catastrophe is for Trump not merely to lose, but lose in a landslide that’s already clear by election night. Which means: as Michelle Obama advised already in August, put on your mask, brave the virus, and vote in person if you can—especially if you live in a state that’s in play, and that won’t start tallying mail-in ballots till after election day. If your state allows it, and if early votes will be counted by election night (check this!), vote early, when the lines are shorter. That’s what Dana and I did this morning; Texas going blue on election night would be one dramatic way to foreclose shenanigans. If you can’t vote in person, or if your state counts mail-in ballots earlier, then vote by mail or drop-box, but do it now, so you have a chance to fix any problems well before Election Day. (Note that, even in normal circumstances—which these aren’t—a substantial fraction of all mail-in ballots get rejected because of trivial errors.) I welcome other tips in the comments, from the many readers more immersed in this stuff than I am.

And if this post helped spur you in any way, please say so in the comments. It will improve my mood, thereby helping me finish my next post, which will be on the Continuum Hypothesis.

Update: It’s always fascinating to check my comments and see the missives from parallel universes, where Trump is a normal candidate who one might decide to vote for based on normal criteria, rather than what he himself has announced he is: a knife to the entire system that underlies such decisions. For a view from this universe, see (e.g.) today’s Nature editorial.

Another Update: If it allays anyone’s fears, I was pleasantly surprised by the level of pandemic preparedness when Dana and I went to vote. It was in a huge, cavernous gym on the UT campus, the lines were very short, masks and 6ft distancing were strictly enforced, and finger-coverings and hand sanitizer were offered to everyone.

Unrelated Update (10/16): For those who are interested, here’s a new podcast with me and Matt Asher, where we talk about the use of quantum mechanics (especially Bell inequality violations) to generate certified random numbers.

by Scott at October 14, 2020 01:35 AM UTC


My last post on CH mentioned that Hugh Woodin used to think NOT(CH) but now thinks CH. In both cases his reasons have some math content to them. Also, note that Hugh Woodin seems to believe that CH somehow HAS an answer. Kurt Godel also thought CH HAS an answer. It has been said that he could have announced  his result that CH is consistent by saying  L is THE model, and the problem is now solved. 

Should we care what Hugh Woodin and Kurt Godel think about CH?

YES- they have both studied the issue A LOT. If you think CH should have an answer, then surely you would care what they think. 

NO-  CH has no answer so there opinions are no better than mine. If you think CH does not have an answer then you might think this; however, I think you should still be at least INTERESTED in what people who have thought about the problem A LOT have to say, even if you will disagree with them.

But with MATH there are people who clearly know more than you on topics you care about, so it is worth hearing what they have to say. 


Recently Dwayne THE ROCK Johnson (by Wikipedia: actor, producer, businessman, and former professional wrestler) ENDORSED Joe Biden. Should we care about his opinion? Maybe, if wrestling fans and former pro wrestler tend to be Republicans, so this may indicate a shift. I do not know if this is the case. 

Robert De Niro was in favor of impeaching Donald Trump. He also said that Trump was like a Gangster. He would know because he was in the movie GOODFELLOWS and later THE IRISHMAN (about Jimmy Hoffa). To be fair I do not think he said that is how he would know. Even so, I don't think I care what he thinks, unless he has some specialized knowledge I do not know about. 

David Frum is a republican who had a break with the party NOT over Donald Trump, but over Obamacare- which you may recall was originally a CONSERVATIVE response to Hillarycare by the Heritage Foundation.  He has a good article on this here. Because he is an intelligent  republican in favor of Obamacare (or some version of it) he is worth listening to.

In POLITICS its trickier- who is worth listening to and why. For all I know, THE ROCK has made a detailed study of the Republican and Democratic platforms (actually this cannot be true since the Republicans did not have a platform this time). 


Tom Selleck (Actor-Magnum PI a while back, Blue Bloods now)  does commercials for reverse mortgages. A while back I asked a group of people WHY he is doing them. Here were some answers and reactions

a) He needs the money. Not likely, he seems to have done well and does not seem to have the kind of bad habits (e.g., drugs) that need money. Maybe he has expensive tastes (my only expensive tastes is in fine European Kit Kat bars--- which actually are not that expensive). 

b) He likes doing commercials. Maybe.

c) He believes in the product. At this, everyone cracked up in laughter.

This raises a more general point: Why does ANYONE believe ANY commercial since we KNOW the actor is being PAID to say it. I ask non rhetorically as always. 

by gasarch ( at October 12, 2020 05:49 PM UTC

Here it is—enjoy! (I strongly recommend listening at 2x speed.)

We recorded it a month ago—outdoors (for obvious covid reasons), on a covered balcony in Austin, as it drizzled all around us. Topics included:

  • Whether the universe is a simulation
  • Eugene Goostman, GPT-3, the Turing Test, and consciousness
  • Why I disagree with Integrated Information Theory
  • Why I disagree with Penrose’s ideas about physics and the mind
  • Intro to complexity theory, including P, NP, PSPACE, BQP, and SZK
  • The US’s catastrophic failure on covid
  • The importance of the election
  • My objections to cancel culture
  • The role of love in my life (!)

Thanks so much to Lex for his characteristically probing questions, apologies as always for my verbal tics, and here’s our first podcast for those who missed that one.

by Scott at October 12, 2020 02:38 PM UTC

Our congratulations on the 2020 Nobel Prize in Physics

Composite crop of src1, src2

Roger Penrose, Reinhard Genzel, and Andrea Ghez have won the 2020 Nobel Prize in Physics. The prize is divided half to Penrose for theoretical work and half to Genzel and Ghez for finding a convincing and appreciably large practical example.

Today we congratulate the winners and give further musings on the nature of knowledge and the role of theory.

The physics Nobel has always had the rule that it cannot be for a theory alone, no matter how beautiful and how many mathematical discoveries follow from its development. Stephen Hawking’s theory of black-hole radiation is almost universally accepted, despite its association with paradox, yet it was said that only an empirical confirmation such as mini-black holes being discovered to explode in an accelerator core would have brought it a Nobel. The official citation to Sir Roger says that his prize is:

“for the discovery that black hole formation is a robust prediction of the general theory of relativity.”

What is a “robust” prediction? The word strikes us as having overtones of necessity. Necessary knowledge is the kind we deal with in mathematics. The citation to Genzel and Ghez stays on empirical grounds:

“for the discovery of a supermassive compact object at the centre of our galaxy.”

The “object” must be a black hole—given relativity and its observed gravitational effects, it cannot be otherwise. Among many possible witnesses for the reality of black holes—one being the evident origin of the gravitational waves whose detection brought the 2017 Nobel—the centers of galaxies are hefty examples. The combination of these citations opens several threads we’d like to discuss.

The Proof Horizon of a Black Hole

Dick and I are old enough to remember when black holes had the status of conjecture. One of my childhood astronomy books stated that the Cygnus X-1 X-ray source was the best known candidate for a black hole. In 1974, Hawking bet Kip Thorne that it was not a black hole. The bet lasted until 1990, when Hawking conceded. He wrote the following in his famous book, A Brief History of Time:

This was a form of insurance policy for me. I have done a lot of work on black holes, and it would all be wasted if it turned out that black holes do not exist. But in that case, I would have the consolation of winning my bet. … When we made the bet in 1975, we were 80% certain that Cygnus X-1 was a black hole. By now [1988], I would say that we are about 95% certain, but the bet has yet to be settled.

In the 1980s, I was a student and then postdoc in Penrose’s department, so I was imbued with the ambience of black holes and never had a thought about doubting their existence. I even once spent an hour with John Wheeler, who coined the term “black hole,” when Penrose delegated me to accompany Wheeler to Oxford’s train station for his return to London. But it seems from the record that the progression to regarding black holes as proven entities was as gradual as many argue the act of crossing a large black hole’s event horizon to be. Although the existence of a central black hole from data emanating from Sagittarius had been proposed at least as far back as 1971, the work by Ghez and then Genzel cited for their prize began in 1995. The official announcement for Riccardo Giacconi’s share of the 2002 physics Nobel stated:

“He also detected sources of X-rays that most astronomers now consider to contain black holes.”

This speaks lingering doubt at least about where black holes might be judged to exist, if not their existence at all.

However their time of confirmation might be pinpointed, it is the past five years that have given by far the greatest flood of evidence, including the first visual image of a black hole last year. The fact of their presence in our universe is undeniable. But necessity is a separate matter, and with Penrose this goes back to 1964.

Relativity and Necessity

We have mentioned Kurt Gödel’s solution to the equations of general relativity (GR) in which time travel is possible. This does not mean that time travel must be possible, or that it is possible in our universe. A “solution” to GR is more like a model in logic: it may satisfy a theory’s axioms but have other properties that are contingent (unless the theory is categorical, meaning that all of its models are isomorphic). Gödel’s model has a negative value for Einstein’s cosmological constant; the 2011 physics Nobel went to the discovery that in our universe the constant has a tiny positive value. GR also allows solutions in which some particles (called tachyons) travel faster than light.

That GR has solutions allowing black holes had been known from its infancy in work by Karl Schwarzschild and Johannes Droste. There are also solutions without black holes; a universe with no mass is legal in GR in many ways besides the case of special relativity. Penrose took the opposite tack, of giving minimal conditions under which black holes are necessary. Following this article, we list them informally as follows:

  1. Sufficiently large concentrations of mass exerting gravity exist.
  2. Gravity always attracts, never repels.
  3. No physical effect can travel faster than light.
  4. Gravity determines how light bends and moves.
  5. The space-time manifold is metrically complete.

Penrose showed that any system obeying these properties and evolving in accordance with GR must develop black holes. He showed this without any symmetry assumptions on the system. Thus he derived black holes as a prediction with the force of a theorem derived from minimal axioms.

His 1965 paper actually used a proof by contradiction. He derived five properties needed in order for the system to avoid forming a singularity. Then he showed they are mutually inconsistent—a proof by contradiction. Here is the crux of his paper:

[ Snip from paper ]

In the diagram, time flows up. The point in a nutshell—a very tight nutshell—is that once a surface flows inside the cylinder at the Schwarzschild radius then light and any other motion from it can go only inward toward a singularity. The analysis is possible without the kind of symmetry assumption that had been used to tame the algebraic complexity of the equations of GR. The metric completeness mandates a singularity apart from any symmetries; a periodic equilibrium is ruled out by analysis of Cauchy surfaces.

Necessary For Us?

Like Richard Feynman’s famous diagrams for quantum field theory, Penrose developed his diagrams as tools for shortcutting the vicissitudes of GR. We could devote entire other posts to his famous tiles and triangle and other combinatorial inventions. His tools enable quantifying black-hole formation from observations in our universe.

The question of necessity, however, pertains to other possible universes. Let us take for granted that GR and quantum theory are facets of a physical theory that governs the entire cosmos—the long-sought “theory of everything”—and let us also admit the contention of inflationary theorists that multiple universes are a necessary consequence of any inflation theory. The question remains, are black holes necessary in those universes?

It is possible that those universes might not satisfy axiom 1 above, or might have enough complexity for existence of black holes but not large-scale formation of them. The question then becomes whether black holes must exist in any universe rich enough for sentient life forms such as ourselves to develop. This is a branch of the anthropic principle.

Lee Smolin proposed a mechanism via which black holes engender new universes and so propagate the complexity needed for their large-scale formation. Since complexity also attends the development of sentient life forms, this would place our human existence in the wake of consequence, as opposed to the direction of logic when reasoning by the anthropic principle.

A Little More About Science

The 2020 Nobel Prize in Chemistry was awarded this week to Jennifer Doudna and Emmanuelle Charpentier for their lead roles in developing the CRISPR gene-editing technology, specifically around the protein Cas9.

We argue that two more different types of results cannot be found:

{\bullet } Penrose shows that black holes and general relativity are connected, which is a math result. We still cannot create black holes in a lab to experiment with—or maybe we could but should be very afraid of going anywhere near doing so. It was not clear that there could ever be a real application of this result.

{\bullet } Charpentier and Doudna discover that an existing genetic mechanism could be used to edit genetic material. Clearly this can and was experimented on in labs. Also clear that there are applications of this result. Actually it is now a standard tool used in countless labs. There even are patent battles over the method.

We like the fact that Nobels are given for such diverse type of research. It is not just that one is for astrophysics and one for chemistry. It is that Nobels can be given for very different types of research. We think this is important.

But wait. These results do have something in common, something that sets them apart from any research we can do in complexity theory. Both operate like this:

Observe something important from nature. Something that is there independent of us. Then in Penrose’s case explain why it is true. Then in Charpentier and Doudna’s case, use it to solve some important problems.

We wonder if anything like this could be done in our research world—say in complexity theory?

Open Problems

Besides our congratulations to all those mentioned in this post, Ken expresses special thanks to Sir Roger among other Oxford Mathematical Institute fellows for the kindness recorded here.

[changed note about massless universe]

by RJLipton+KWRegan at October 11, 2020 07:19 PM UTC

We prove a general structural theorem for a wide family of local algorithms, which includes property testers, local decoders, and PCPs of proximity. Namely, we show that the structure of every algorithm that makes $q$ adaptive queries and satisfies a natural robustness condition admits a sample-based algorithm with $n^{1- 1/O(q^2 \log^2 q)}$ sample complexity, following the definition of Goldreich and Ron (TOCT 2016). We prove that this transformation is nearly optimal. Our theorem also admits a scheme for constructing privacy-preserving local algorithms. Using the unified view that our structural theorem provides, we obtain results regarding various types of local algorithms, including the following. - We strengthen the state-of-the-art lower bound for relaxed locally decodable codes, obtaining an exponential improvement on the dependency in query complexity; this resolves an open problem raised by Gur and Lachish (SODA 2020). - We show that any (constant-query) testable property admits a sample-based tester with sublinear sample complexity; this resolves a problem left open in a work of Fischer, Lachish, and Vasudev (FOCS 2015) by extending their main result to adaptive testers. - We prove that the known separation between proofs of proximity and testers is essentially maximal; this resolves a problem left open by Gur and Rothblum (ECCC 2013, Computational Complexity 2018) regarding sublinear-time delegation of computation. Our techniques strongly rely on relaxed sunflower lemmas and the Hajnal–Szemerédi theorem.

at October 10, 2020 11:21 AM UTC

We identify several genres of search problems beyond NP for which existence of solutions is guaranteed. One class that seems especially rich in such problems is PEPP (for "polynomial empty pigeonhole principle"), which includes problems related to existence theorems proved through the union bound, such as finding a bit string that is far from all codewords, finding an explicit rigid matrix, as well as a problem we call Complexity, capturing Complexity Theory's quest. When the union bound is generous, in that solutions constitute at least a polynomial fraction of the domain, we have a family of seemingly weaker classes $\alpha$-PEPP, which are inside FP}$^{\text{NP}}|$poly. Higher in the hierarchy, we identify the constructive version of the Sauer-Shelah lemma and the appropriate generalization of PPP that contains it. The resulting total function hierarchy turns out to be more stable than the polynomial hierarchy: it is known that, under oracles, total functions within FNP may be easy, but total functions a level higher may still be harder than FP$^{\text{NP}}$.

at October 08, 2020 10:31 PM UTC

The determinant is a canonical VBP-complete polynomial in the algebraic complexity setting. In this work, we introduce two variants of the determinant polynomial which we call $StackDet_n(X)$ and $CountDet_n(X)$ and show that they are VP and VNP complete respectively under $p$-projections. The definitions of the polynomials are inspired by a combinatorial characterisation of the determinant developed by Mahajan and Vinay (SODA 1997). We extend the combinatorial object in their work, namely clow sequences, by introducing additional edge labels on the edges of the underlying graph. The idea of using edge labels is inspired by the work of Mengel (MFCS 2013).

at October 08, 2020 10:27 PM UTC

After two blog posts earlier this year on Chebyshev and Jacobi polynomials, I am coming back to orthogonal polynomials, with Hermite polynomials.

This time, in terms of applications to machine learning, no acceleration, but some interesting closed-form expansions in positive-definite kernel methods.

Definition and first properties

There are many equivalent ways to define Hermite polynomials. A natural one is through the so-called Rodrigues’ formula: $$H_k(x) = (-1)^k e^{x^2} \frac{d^k}{d x^k}\big[ e^{-x^2} \big],$$ from which we can deduce \(H_0(x) = 1\), \(H_1(x) =\ – e^{x^2} \big[ -2x e^{-x^2} \big] = 2x\), \(H_2(x) = e^{x^2} \big[ (-2x)^2e^{-x^2} -2 e^{-x^2} \big] = 4x^2 – 2\), etc.

Other simple properties which are consequences of the definition (and can be shown by recursion) are that \(H_k\) is a polynomial of degree \(k\), with the same parity as \(k\), and with a leading coefficient equal to \(2^k\).

Orthogonality for Gaussian distribution. Using integration by parts, one can show (see end of the post) that for \(k \neq \ell\), we have $$\int_{-\infty}^{+\infty} \!\!\!H_k(x) H_\ell(x) e^{-x^2} dx =0, $$ and that for \(k=\ell\), we have $$\int_{-\infty}^{+\infty} \!\!\! H_k(x)^2 e^{-x^2}dx = \sqrt{\pi} 2^k k!.$$

In other words, the Hermite polynomials are orthogonal for the Gaussian distribution with mean \(0\) and variance \(\frac{1}{2}\). Yet in other words, defining the Hermite functions as \( \displaystyle \psi_k(x) = (\sqrt{\pi} 2^k k!)^{-1/2} H_k(x) e^{-x^2/2}\), we obtain an orthonormal basis of \(L_2(dx)\). As illustrated below, the Hermite functions, as the index \(k\) increases, have an increasing “support” (the support is always the entire real line, but most of the mass is concentrated in centered balls of increasing sizes, essentially at \(\sqrt{k}\)) and, like cosines and sines, an increasingly oscillatory behavior.

Plot of Hermite functions \(\psi_k(x) = (\sqrt{\pi} 2^k k!)^{-1/2} H_k(x) e^{-x^2/2}\), from \(k=0\) to \(k=20\).

Among such orthonormal bases, the Hermite functions happen to be diagonalizing the Fourier tranform operator. In other words, the Fourier transform of \(\psi_k\) (for the definition making it an isometry of \(L_2(dx)\)) is equal to $$ \mathcal{F}(\psi_k)(\omega)  = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} \psi_k(x) e^{- i \omega x} dx = (-i)^k \psi_k(\omega).$$ (note that the eigenvalues are all of unit modulus as we have an isometry). See a proof at the end of the post. I am not aware of any applications of this property in machine learning or statistics (but there are probably some).

Recurrence. In order to compute Hermite polynomials, the following recurrence relation is the most useful $$ H_{k+1}(x) = 2x H_k(x) \ – 2k H_{k-1}(x). \tag{1}$$ Such recursions are always available for orthogonal polynomials (see [4]), but it takes here a particularly simple form (see a proof at the end of the post).

Generating function. The following property is central in many proofs of properties of Hermite polynomials: for all \(t \in \mathbb{R}\), we have $$\sum_{k=0}^\infty \frac{t^k}{k!} H_k(x) =e^{ 2xt \ – \ t^2}, \tag{2}$$ with a proof at the end of the post based on the residue theorem.

Further (less standard) properties

For the later developments, we need other properties which are less standard (there are many other interesting properties, which are not useful for this post, see here).

Mehler formula. For \(|\rho| < 1\), it states: $$ \exp \Big( – \frac{\rho}{1- \rho^2} (x-y)^2\Big) = \sqrt{1-\rho^2} \sum_{k=0}^\infty \frac{\rho^k}{2^k k!} H_k(x) H_k(y) \exp \Big( – \frac{\rho}{1+\rho} (x^2 + y^2) \Big).$$ The proof is significantly more involved; see [1] for details (with a great last sentence: “Prof. Hardy tells me that he has not seen his proof in print, though the inevitability of the successive steps makes him think that it is unlikely to be new”). Note that we will in fact obtain a new proof from the relationship with kernel methods (see below).

Expectation for Gaussian distributions. We will need this property for \(|\rho|<1\) (see proof at the end of the post), which corresponds to the expectation of \(H_k(x)\) for \(x\) distributed as a non-centered Gaussian distribution: $$\int_{-\infty}^\infty H_k(x) \exp\Big( – \frac{(x-\rho y)^2}{1-\rho^2} \Big)dx= \sqrt{\pi} \rho^k \sqrt{1-\rho^2} H_k (y). \tag{3}$$

Given the relationship with the Gaussian distribution, it is no surprise that Hermite polynomials pop up whenever Gaussians are used, as distributions or kernels. Before looking into it, let’s first give a brief review of kernel methods.

From positive-definite kernel to Hilbert spaces

Given a prediction problem with inputs in a set \(\mathcal{X}\), a traditional way of parameterizing real-valued functions on \(\mathcal{X}\) is to use positive-definite kernels.

A positive-definite kernel is a function \(K: \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) such that for all sets \(\{x_1,\dots,x_n\}\) of \(n\) elements of \(\mathcal{X}\), the “kernel matrix” in \(\mathbb{R}^{n \times n}\) composed of pairwise evaluations is symmetric positive semi-definite. This property happens to be equivalent to the existence of a Hilbert feature space \(\mathcal{H}\) and a feature map \(\varphi: \mathcal{X} \to \mathcal{H}\) such that $$K(x,x’) = \langle \varphi(x), \varphi(x’) \rangle_{\mathcal{H}},$$ with an elegant constructive proof [15].

This allows to define the space of linear functions on the features, that is, functions of the form $$f(x) = \langle f, \varphi(x) \rangle_{\mathcal{H}},$$ for \(f \in \mathcal{H}\).

This space is often called the reproducing kernel Hilbert space (RKHS) associated to the kernel \(K\) (we can prove that it is indeed uniquely defined). In such a space, we can also define the squared norm of the function \(f\), namely \(\| f\|_{\mathcal{H}}^2\), which can be seen as a specific regularization term in kernel methods.

The space satisfies the so-called reproducing property (hence its name): \(f(x) = \langle f, K(\cdot,x) \rangle_{\mathcal{H}}\). In other words, the feature \(\varphi(x)\) is the kernel function evaluated at \(x\), that is, \(\varphi(x) = K(\cdot,x)\). These spaces have been a source of many developments in statistics [5] and machine learning [6, 7].

Orthonormal basis. A difficulty in working with infinite-dimensional Hilbert spaces of functions is that it is sometimes hard to understand what functions are actually considered. One simple way to enhance understanding of the regularization property is to have an orthonormal basis (in very much the same way as the Fourier basis), as we can then identify \(\mathcal{H}\) to the space of squared-integrable sequences.

For kernel-based Hilbert spaces, if we have an orthonormal basis \((g_k)_{k \geqslant 0}\) of the Hilbert space \(\mathcal{H}\), then, by decomposing \(\varphi(x)\) in the basis, we have $$\varphi(x) = \sum_{k =0}^\infty \langle \varphi(x), g_k \rangle_\mathcal{H} g_k,$$ we get $$K(x,y) = \langle \varphi(y), \varphi(x) \rangle = \sum_{k =0}^\infty \langle \varphi(x), g_k \rangle_\mathcal{H} \langle \varphi(y), g_k \rangle_\mathcal{H} =\sum_{k=0}^\infty g_k(x) g_k(y), \tag{4}$$ that is, we have an expansion of the kernel as an infinite sum (note here, that we ignore summability issues).

Among orthonormal bases, some are more interesting than others. The ones composed of eigenfunctions for particular operators are really more interesting, in particular for the covariance operator that we now present, and their use in statistical learning theory.

Analyzing ridge regression through covariance operators

The most classical problem where regularization by RKHS norms occurs is ridge regression, where, given some observations \((x_1,y_1),\dots,(x_n,y_n) \in \mathcal{X} \times \mathbb{R}\), one minimizes with respect to \(f \in \mathcal{H}\): $$ \frac{1}{n} \sum_{i=1}^n \big( y_i \ – \langle f, \varphi(x_i) \rangle_{\mathcal{H}} \big)^2 + \lambda \| f\|_{\mathcal{H}}^2.$$

In finite dimensions, the convergence properties are characterized by the (non-centered) covariance matrix \(\Sigma = \mathbb{E} \big[ \varphi(x) \otimes \varphi(x) \big]\), where the expectation is taken with respect to the underlying distribution of the observations \(x_1,\dots,x_n\) (which are assumed independently and identically distributed for simplicity). If \(\mathcal{H} = \mathbb{R}^d\), then \(\Sigma\) is a \(d \times d\) matrix.

For infinite-dimensional \(\mathcal{H}\), the same expression \(\Sigma = \mathbb{E} \big[ \varphi(x) \otimes \varphi(x) \big]\) defines a linear operator from \(\mathcal{H}\) to \(\mathcal{H}\), so that for \(f,g \in \mathcal{H}\), we have $$\langle f, \Sigma g \rangle_{\mathcal{H}} = \mathbb{E} \big[ \langle f, \varphi(x)\rangle_{\mathcal{H}}\langle g, \varphi(x)\rangle_{\mathcal{H}}\big] = \mathbb{E} \big[ f(x) g(x) \big].$$

The generalization property of ridge regression has been thoroughly studied (see, e.g., [8, 9]), and if there exists \(f_\ast \in \mathcal{H}\) such that \(y_i = \langle f_\ast, \varphi(x_i) \rangle + \varepsilon_i\) for a noise \(\varepsilon_i\) which is independent of \(x_i\), with zero mean and variance equal to \(\sigma^2\), then the expected error on unseen data is asymptotically upper-bounded by $$\sigma^2 + \lambda \| f_\ast\|_{\mathcal{H}}^2 + \frac{\sigma^2}{n} {\rm tr} \big[ \Sigma ( \Sigma + \lambda I)^{-1} \big].$$ The first term \(\sigma^2\) is the best possible expected performance, the term \(\lambda \| f_\ast\|_{\mathcal{H}}^2\) is usually referred to as the bias term and characterizes the bias introduced by regularizing towards zero, while the third term \(\frac{\sigma^2}{n} {\rm tr} \big[ \Sigma ( \Sigma + \lambda I)^{-1} \big]\) is the variance term, which characterizes the loss in performance due to the observation of only \(n\) observations.

The quantity \({\rm df}(\lambda) = {\rm tr} \big[ \Sigma ( \Sigma + \lambda I)^{-1} \big]\) is often referred to as the degrees of freedom [10]. When \(\lambda\) tends to infinity, then \({\rm df}(\lambda)\) tends to zero; when \(\lambda\) tends to zero, then \({\rm df}(\lambda)\) tends to the number of non-zero eigenvalues of \(\Sigma\). Thus, in finite dimension, this typically leads to the underlying dimension. Given the usual variance term in \(\sigma^2 \frac{d}{n}\) for ordinary least-squares with \(d\)-dimensional features, \({\rm df}(\lambda)\) is often seen as an implicit number of parameters for kernel ridge regression.

In infinite dimensions, under mild assumptions, there are infinitely many eigenvalues for \(\Sigma\), which form a decreasing sequence \((\lambda_i)_{i \geqslant 0}\) that tends to zero (and is summable, with a sum equal to the trace of \(\Sigma\)). The rate of such a decay is key to understanding the generalization capabilities of kernel methods. With the following classical types of decays:

  • Polynomial decays: If \(\lambda_i \leqslant \frac{C}{(i+1)^{\alpha}}\) for \(\alpha > 1\), then one can upper bound the sum by an integral as $$ {\rm tr} \big[ \Sigma ( \Sigma + \lambda I)^{-1} \big] = \sum_{i=0}^\infty \frac{\lambda_i}{\lambda_i + \lambda} \leqslant \sum_{i=1}^\infty \frac{1}{1 + \lambda i^\alpha / C} \leqslant \int_0^\infty \frac{1}{1+\lambda t^\alpha / C} dt.$$ With the change of variable \(u = \lambda t^\alpha / C\), we get that \({\rm df}(\lambda) = O(\lambda^{-\alpha})\). We can then balance bias and variance with \(\lambda \sim n^{-\alpha/(\alpha+1)}\) and an excess risk proportional to \(n^{-\alpha/(\alpha+1)}\). This type of decay is typical of Sobolev spaces.
  • Exponential decays: If \(\lambda_i \leqslant {C}e^{-\alpha i}\), for some \(\alpha >0\), we have $$ {\rm tr} \big[ \Sigma ( \Sigma + \lambda I)^{-1} \big] \leqslant \sum_{i=0}^\infty \frac{{C}e^{-\alpha i}}{ \lambda + {C}e^{-\alpha i}} \leqslant \int_{0}^\infty \frac{{C}e^{-\alpha t}}{ \lambda + {C}e^{-\alpha t}}dt.$$ With the change of variable \(u = e^{-\alpha t}\), we get an upper bound $$\int_{0}^1 \frac{C}{\alpha}\frac{1}{ \lambda + {C}u}du = \frac{1}{\alpha}\big[ \log(\lambda + C) \ – \log (\lambda) \big] = \frac{1}{\alpha} \log \big( 1+\frac{C}{\lambda} \big).$$ We can then balance bias and variance with \(\lambda \sim 1/n \) and an excess risk proportional to \((\log n) / n \), which is very close to the usual parametric (finite-dimensional) rate in \(O(1/n)\). We will see an example of this phenomenon for the Gaussian kernel.

In order to analyze the generalization capabilities, we consider a measure \(d \mu\) on \(\mathcal{X}\), and the following (non-centered) covariance operator defined above as $$\mathbb{E} \big[ \varphi(x) \otimes \varphi(x) \big],$$ which is now an self-adjoint operator from \(\mathcal{H}\) to \(\mathcal{H}\) with a finite trace. The traditional empirical estimator \(\hat\Sigma = \frac{1}{n} \sum_{i=1}^n \varphi(x_i) \otimes \varphi(x_i)\), whose eigenvalues are the same as the eigenvalues of \(1/n\) times the \(n \times n\) kernel matrix of pairwise kernel evaluations (see simulation below).

Characterizing eigenfunctions. If \((g_k)\) is the eigenbasis associated to the eigenfunctions of \(\Sigma\), then it has to be an orthogonal family that span the entire space \(\mathcal{H}\) and such that \(\Sigma g_k = \lambda_k g_k\). Applying it to \(\varphi(y) = K(\cdot,y)\), we get $$ \langle K(\cdot,y), \Sigma g_k \rangle_{\mathcal{H}} = \mathbb{E} \big[ K(x,y) g_k(x) \big] = \lambda_k \langle g_k, \varphi(y)\rangle_\mathcal{H} = \lambda_k g_k(y),$$ which implies that the functions also have to be eigenfunctions of the self-adjoint so-called integral operator \(T\) defined on \(L_2(d\mu)\) as \(T f(y) = \int_{\mathcal{X}} K(x,y) f(y) d\mu(y)\). Below, we will check this property. Note that this other notion of integral operator (defined on \(L_2(d\mu)\) and not in \(\mathcal{H}\)), which has the same eigenvalues and eigenfunctions, is important to deal with mis-specified models (see [9]). Note that the eigenfunctions \(g_k\) are orthogonal for both dot-products in \(L_2(d\mu)\) and \(\mathcal{H}\), but that the normalization to unit norm differs. If \(\| g_k \|_{L_2(d\mu)}=1\) for all \(k \geqslant 0\), then we have \( \| g_k \|^2_\mathcal{H}= \lambda_k^{-1} \langle g_k, \Sigma g_k \rangle_\mathcal{H} = \lambda_k^{-1}\mathbb{E} [ g_k(x)^2] =\lambda_k^{-1}\) , and thus, \(\| \lambda_k^{1/2} g_k \|_{\mathcal{H}}=1\), and we have the kernel expansion from an orthonormal basis of \(\mathcal{H}\): $$K(x,y) = \sum_{k=0}^\infty\lambda_k g_k(x) g_k(y),$$ which will lead to a new proof for Mehler formula.

Orthonormal basis for the Gaussian kernel

Hermite polynomials naturally lead to orthonormal basis of some reproducing kernel Hilbert spaces (RKHS). For simplicity, I will focus on one-dimensional problems, but this extends to higher dimension.

Translation-invariant kernels. We consider a function \(q: \mathbb{R} \to \mathbb{R}\) which is integrable, with Fourier transform (note the different normalization than before) which is defined for all \(\omega \in \mathbb{R}\) because of the integrability: $$\hat{q}(\omega) = \int_{\mathbb{R}} q(x) e^{-i \omega x} dx.$$ We consider the kernel $$K(x,y) = q(x-y).$$ It can be check that as soon as \(\hat{q}(\omega) \in \mathbb{R}_+\) for all \(\omega \in \mathbb{R}\), then the kernel \(K\) is positive-definite.

For a translation-invariant kernel, we can write using the inverse Fourier transform formula: $$K(x,y) = q(x-y) = \frac{1}{2\pi} \int_{\mathbb{R}} \hat{q}(\omega) e^{i \omega ( x- y)} d \omega = \int_{\mathbb{R}} \varphi_\omega(x)^* \varphi_\omega(y) d \omega,$$ with \(\varphi_\omega(x) = \sqrt{\hat{q}(\omega) / (2\pi) } e^{i\omega x}\). Intuitively, for a function \(f: \mathbb{R} \to \mathbb{R}\), with \(\displaystyle f(x) = \frac{1}{2\pi} \int_{\mathbb{R}} \hat{f}(\omega)e^{i\omega x} d\omega = \int_{\mathbb{R}} \frac{\hat{f}(\omega) }{\sqrt{2 \pi \hat{q}(\omega)}}\varphi_\omega(x) d\omega\), which is a “dot-product” between the family \((\varphi_\omega(x))_\omega\) and \(\Big( \frac{\hat{f}(\omega) }{\sqrt{2 \pi \hat{q}(\omega)}} \Big)_\omega\), the squared norm \(\| f\|_{\mathcal{H}}^2\) is equal to the corresponding “squared norm” of \(\Big( \frac{\hat{f}(\omega) }{\sqrt{2 \pi \hat{q}(\omega)}}\Big)_\omega\), and we thus have $$ \| f\|_{\mathcal{H}}^2 = \int_{\mathbb{R}} \Big| \frac{\hat{f}(\omega) }{\sqrt{2 \pi \hat{q}(\omega)}} \Big|^2 d\omega = \frac{1}{2\pi} \int_{\mathbb{R}} \frac{ | \hat{f}(\omega) |^2}{\hat{q}(\omega)} d\omega,$$ where \(\hat{f}\) is the Fourier transform of \(f\). While the derivation above is not rigorous, the last expression is.

In this section, I will focus on the Gaussian kernel defined as \(K(x,y) = q(x-y) = \exp \big( – \alpha ( x- y )^2 \big)\), for which \(\displaystyle \hat{q}(\omega)= \sqrt{\frac{\pi}{\alpha}} \exp\big( – \frac{\omega^2}{4 \alpha} \big)\).

Given that \(\displaystyle \frac{1}{\hat{q}(\omega)} = \sqrt{\frac{\alpha}{\pi}} \exp\big( \frac{\omega^2}{4 \alpha} \big)= \sqrt{\frac{\alpha}{\pi}} \sum_{k=0}^\infty \frac{\omega^{2k}}{(4 \alpha)^k k!} \), the penalty \(\|f\|_\mathcal{H}^2\) is a linear combination of squared \(L_2\)-norm of \(\omega^k \hat{f}(\omega)\), which is the squared \(L_2\)-norm of the \(k\)-th derivative of \(f\). Thus, functions in the RKHS are infinitely differentiable, and thus very smooth (this implies that to have the fast rate \((\log n) / n \) above, the optimal regression function has to be very smooth).

Orthonormal basis of the RKHS. As seen in Eq. (4), an expansion in an infinite sum is necessary to obtain an orthonormal basis. We have: $$K(x,y) = e^{-\alpha x^2} e^{-\alpha y^2} e^{2 \alpha x y} = e^{-\alpha x^2} e^{-\alpha y^2} \sum_{k=0}^\infty \frac{ (2\alpha)^k}{k!} x^k y^k.$$ Because of Eq. (4), with \(g_k(x) = \sqrt{ \frac{(2\alpha)^k}{k!}} x^k \exp \big( – \alpha x^2 \big)\), we have a good candidate for an orthonornal basis. Let us check that this is the case. Note that the expansion above alone cannot be used as a proof that \((g_k)\) is an orthonormal basis of \(\mathcal{H}\).

Given the function \(f_k(x) = x^k \exp \big( – \alpha x^2 \big)\), we can compute its Fourier transform as $$ \hat{f}_k(\omega) = i^{-k} ( 4 \alpha)^{-k/2} \sqrt{\frac{\pi}{\alpha}} H_k \Big( \frac{\omega}{\sqrt{4 \alpha}} \Big) \exp\big( – \frac{\omega^2}{4 \alpha} \big) .$$ Indeed, we have, from Rodrigues’ formula, $$H_k \Big( \frac{\omega}{\sqrt{4 \alpha}} \Big) \exp\big( – \frac{\omega^2}{4 \alpha} \big) =(-1)^k (4 \alpha)^{k/2} \frac{d^k}{d \omega^k}\big[ \exp\big( – \frac{\omega^2}{4 \alpha} \big) \big],$$ and thus its inverse Fourier transform is equal to \((ix)^k\) times the one of \((-1)^k (4 \alpha)^{k/2} \exp\big( – \frac{\omega^2}{4 \alpha} \big)\), which is thus equal to \((-i)^k (4 \alpha)^{k/2} \sqrt{ \alpha / \pi }e^{-\alpha x^2} \), which leads to the Fourier transform formula above.

We can now compute the RKHS dot products, to show how to obtain the orthonormal basis described in [11]. This leads to $$ \langle f_k, f_\ell \rangle = \frac{1}{2\pi} \sqrt{\frac{\pi}{\alpha}} ( 4 \alpha)^{-k/2}( 4 \alpha)^{-\ell/2} \int_{\mathbb{R}} H_k \Big( \frac{\omega}{\sqrt{4 \alpha}} \Big) H_\ell \Big( \frac{\omega}{\sqrt{4 \alpha}} \Big) \exp\big( – \frac{\omega^2}{4 \alpha} \big) d\omega,$$ which leads to, with a change of variable $$ \langle f_k, f_\ell \rangle = \frac{1}{2\pi} \sqrt{\frac{\pi}{\alpha}} ( 4 \alpha)^{-k/2}( 4 \alpha)^{-\ell/2} \sqrt{4 \alpha} \int_{\mathbb{R}} H_k (u) H_\ell (u) \exp(-u^2) du,$$ which is equal to zero if \(k \neq \ell\), and equal to \(\frac{1}{2\pi} \sqrt{\frac{\pi}{\alpha}} ( 4 \alpha)^{-k} \sqrt{4 \alpha} \sqrt{\pi} 2^k k! = ( 2 \alpha)^{-k} k!\) if \(k = \ell\). Thus the sequence \((f_k)\) is an orthogonal basis of the RKHS, and the sequence \((g_k)\) defined as \(g_k(x) = \sqrt{ \frac{(2\alpha)^k}{k!}} f_k(x)\) is an orthonormal basis of the RKHS, from which, using the expansion as in Eq. (4), we recover the expansion: $$K(x,y) = \sum_{k=0}^\infty g_k(x) g_k(y) = e^{-\alpha x^2} e^{-\alpha y^2} \sum_{k=0}^\infty \frac{ (2\alpha)^k}{k!} x^k y^k.$$

This expansion can be used to approximate the Gaussian kernel by finite-dimensional explicit feature spaces, by just keeping the first basis elements (see an application to optimal transport in [12], with an improved behavior using an adaptive low-rank approximation through the Nyström method in [13]).

Eigenfunctions for the Gaussian kernels

In order to obtain explicit formulas for the eigenvalues of the covariance operator, we need more than a mere orthonormal basis, namely an eigenbasis.

An orthogonal basis will now be constructed with arguably better properties as it is also an orthonormal basis for both the RKHS and \(L_2(d\mu)\) for a Gaussian measure, that diagonalizes the integral operator associated to this probability measure, as well as the covariance operator.

As seen above, we simply need an orthogonal family \((f_k)_{k \geqslant 0}\), such that given a distribution \(d\mu\), \((f_k)_{k \geqslant 0}\) is a family in \(L_2(d\mu)\) such that $$\int_{\mathbb{R}} f_k(x) K(x,y) d\mu(x) = \lambda_k f_k(y), \tag{5}$$ for eigenvalues \((\lambda_k)\). In the next paragraph, we will do exactly this for the Gaussian kernel \(K(x,y) = e^{-\alpha (x-y)^2}\) for \(\alpha = \frac{\rho}{1- \rho^2}\) for some \(\rho \in (0,1)\); this particular parameterization in \(\rho\) is to make the formulas below not (too) complicated.

With \(f_k(x) = \frac{1}{\sqrt{N_k}} H_k(x) \exp \Big( – \frac{\rho}{1+\rho} x^2 \Big)\), where \(N_k = {2^k k!} \sqrt{ \frac{1-\rho}{1+\rho}}\), then \((f_k)_{k \geqslant 0}\) is an orthonormal basis for \(L_2(d\mu)\) for \(d\mu\) the Gaussian distribution with mean zero and variance \(\frac{1}{2} \frac{1+\rho}{1-\rho}\) (this is a direct consequence of the orthogonality property of Hermite polynomials).

Moreover, the moment of the Hermite polynomial in Eq. (3) exactly leads to Eq. (5) for the chosen kernel and \(\lambda_k = (1-\rho) \rho^k\). Since the eigenvalues sum to one, and the trace of \(\Sigma\) is equal to one (as a consequence of \(K(x,x)=1\)), the family \((f_k)\) has to be a basis of \(\mathcal{H}\).

From properties of the eigenbasis, since \((f_k)\) is an orthonormal eigenbasis of \(L_2(d\mu)\) and the eigenvalues are \(\lambda_k = (1-\rho)\rho^k\), we get: $$ K(x,y) = \exp \Big( – \frac{\rho}{1- \rho^2} (x-y)^2\Big) = \sum_{k=0}^\infty (1-\rho)\rho^k f_k(x) f_k(y),$$ which is exactly the Mehler formula, and thus we obtain an alternative proof.

We then get an explicit basis and the exponential decay of eigenvalues, which was first outlined by [2]. See an application to the estimation of the Poincaré constant in [14] (probably a topic for another post in a few months).

Experiments. In order to showcase the exact eigenvalues of the expectation \(\Sigma\) (for the correct combination of Gaussian kernel and Gaussian distribution), we compare the eigenvalues with the ones of the empirical covariance operator \(\hat\Sigma\), for various values of the number of observations. We see that as \(n\) increases, the empirical eigenvalues match the exact ones for higher \(k\).

Eigenvalues of the covariance operator \(\Sigma\) (“expectation”) compared to the ones of the empirical covariance operator \(\hat\Sigma\), averaged over 20 replications (“empirical”), for several values of \(n\).


In this post, I only presented applications of Hermite polynomials to the Gaussian kernel, but these polynomials appear in many other areas of applied mathematics, for other types of kernels within machine learning such as dot-product kernels [3], in random matrix theory (see here), in statistics for Edgeworth expansions, and of course for Gauss-Hermite quadrature.

Acknowledgements. I would like to thank Loucas Pillaud-Vivien and Alessandro Rudi for proofreading this blog post and making good clarifying suggestions.


[1] George Neville Watson. Notes on Generating Functions of Polynomials: (2) Hermite PolynomialsJournal of the London Mathematical Society, 8, 194-199, 1933.
[2] Huaiyu Zhu, Christopher K. I. Williams, Richard Rohwer, and Michal Morciniec. Gaussian regression and optimal finite dimensional linear models. In Neural Networks and Machine Learning. Springer-Verlag, 1998.
[3] A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances In Neural Information Processing Systems, 2016.
[4] Gabor Szegö. Orthogonal polynomials. American Mathematical Society, 1939.
[5] Grace Wahba. Spline models for observational data. Society for Industrial and Applied Mathematics, 1990.
[6] Bernhard Schölkopf, Alexander J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.
[7] John Shawe-Taylor, Nello Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004.
[8] Andrea Caponnetto, Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics 7.3: 331-368, 2007.
[9] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Vol. 1. No. 10. Springer series in statistics, 2001.
[10] Trevor Hastie and Robert Tibshirani. Generalized Additive Models. Chapman & Hall, 1990.
[11] Ingo Steinwart, Don Hush, and Clint Scovel. An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernelsIEEE Transactions on Information Theory, 52.10:4635-4643, 2006.
[12] Jason Altschuler, Francis Bach, Alessandro Rudi, Jonathan Niles-Weed. Approximating the quadratic transportation metric in near-linear time. Technical report arXiv:1810.10046, 2018.
[13] Jason Altschuler, Francis Bach, Alessandro Rudi, Jonathan Niles-Weed. Massively scalable Sinkhorn distances via the Nyström methodAdvances in Neural Information Processing Systems (NeurIPS), 2019.
[14] Loucas Pillaud-Vivien, Francis Bach, Tony Lelièvre, Alessandro Rudi, Gabriel Stoltz. Statistical Estimation of the Poincaré constant and Application to Sampling Multimodal DistributionsProceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.
[15] Nachman Aronszajn. Theory of Reproducing Kernels. Transactions of the American Mathematical Society, 68(3): 337–404, 1950.

Proof of properties of Hermite polynomials

In this small appendix, I give “simple” proofs (that sometimes require knowledge of complex analysis) to the properties presented above.

Generating function. We have, using residue theory, $$H_k(x)=(-1)^k e^{x^2} \frac{d^k}{d x^k}\big[ e^{-x^2} \big] = (-1)^k \frac{k!}{2i\pi} e^{x^2} \oint_\gamma \frac{e^{-z^2}}{(z-x)^{k+1}}dz, $$ where \(\gamma\) is a contour in the complex plane around \(x\). This leads to, for any \(t\) (here, we ignore on purpose the summability issues, for more details, see [4, Section 5.5]): $$ \sum_{k=0}^\infty \frac{t^k}{k!} H_k(x) = \frac{1}{2i\pi} e^{x^2} \oint_\gamma \frac{e^{-z^2}}{z-x} \sum_{k=0}^\infty \frac{t^k} {(x-z)^{k}}dz, $$ which can be simplified using the sum of the geometric series, leading to $$\frac{1}{2i\pi} e^{x^2} \oint_\gamma \frac{e^{-z^2}}{z-x} \frac{z-x}{z-x- t} dz = \frac{1}{2i\pi} e^{x^2} \oint_\gamma \frac{e^{-z^2}} {z-x- t} dz.$$ Using the first-order residue at \(x+t\). This is thus equal to \(e^{x^2-(t+x)^2} = e^{-t^2 + 2tx}\), which is exactly the generating function statement from Eq. (2).

Orthogonality for Gaussian distribution. We can prove through integration by parts, but there is a nicer proof through the generating function. Indeed, with $$ a_{k \ell} = \int_{-\infty}^{+\infty} e^{-x^2} H_k(x) H_\ell(x) dx, $$ for \(k, \ell \geqslant 0\), we get $$\sum_{k,\ell = 0}^\infty a_{k \ell} \frac{t^k u^\ell}{k! \ell!} = \int_{-\infty}^{+\infty}e^{-x^2}\Big( \sum_{k,\ell = 0}^\infty a_{k \ell} \frac{t^k u^\ell}{k! \ell!} H_k(x) H_\ell(x) \Big) dx.$$ Using the generating function, this leads to $$\sum_{k,\ell = 0}^\infty a_{k \ell} \frac{t^k u^\ell}{k! \ell!} = \int_{-\infty}^{+\infty} e^{-x^2 + 2xu-u^2 + 2xt – t^2} dx= e^{2uv} \int_{-\infty}^{+\infty} e^{-(x-u-v)^2}dx, $$ which can be computed explicitly using normalization constants of the Gaussian distribution, as \( \sqrt{\pi} e^{2uv} = \sqrt{\pi} \sum_{k=0}^\infty \frac{ (2 u v)^k}{k!},\) leading to all desired orthogonality relationships using the uniqueness of all coefficients for factors \(t^k u^\ell\).

Recurrence relationship. Taking the derivative of the generating function with respect to \(t\), one gets \( \displaystyle (2x-2t) e^{2tx-t^2} = \sum_{k=0}^\infty \frac{t^{k-1}}{(k-1)!} H_k(x),\) which is equal to (using again the generating function) \(\displaystyle \sum_{k=0}^\infty \frac{t^{k}}{k!} 2x H_k(x) \ – \sum_{n=0}^\infty \frac{t^{k+1}}{k!} 2 H_k(x).\) By equating the coefficients for all powers of \(t\), this leads to the desired recursion in Eq. (1).

Fourier transform. Again using the generating function, written $$ e^{-x^2/2 + 2xt – t^2} = \sum_{k=0}^\infty \frac{t^k}{k!} e^{-x^2/2} H_k(x), $$ we can take Fourier transforms and use the fact that the Fourier transform of \(e^{-x^2/2}\) is itself (for the chosen normalization), and then equate coefficients for all powers of \(t\) to conclude (see more details here).

Expectation for Gaussian distributions. We finish the appendix by proving Eq. (3). We consider computing for any \(t\), $$\sum_{k=0}^\infty \rho^k \frac{t^k}{k!} H_k (y) = e^{2\rho t y – \rho^2 t^2},$$ using the generating function from Eq. (2). We then compute $$A=\int_{-\infty}^\infty \exp\Big( – \frac{(x-\rho y)^2}{1-\rho^2} \Big) \sum_{k=0}^\infty \frac{t^k}{k!} H_k(x) dx = \int_{-\infty}^\infty \exp\Big( – \frac{(x-\rho y)^2}{1-\rho^2} \Big) \exp( 2tx – t^2) dx.$$ We then use \( \frac{(x-\rho y)^2}{1-\rho^2} – 2tx + t^2 = \frac{x^2}{1-\rho^2} – \frac{2x[ t(1-\rho^2) + \rho y]}{1-\rho^2} + t^2 + \frac{\rho^2 y^2}{1-\rho^2}\), leading to $$A = \sqrt{\pi} \sqrt{1-\rho^2} \exp\Big( -t^2 – \frac{\rho^2 y^2}{1-\rho^2} +(1-\rho^2) \big( t + \frac{\rho y}{1-\rho^2} \big)^2 \Big) = \sqrt{\pi} \sqrt{1-\rho^2} e^{2\rho t y – \rho^2 t^2}.$$ By equating powers of \(t\), this leads to Eq. (3).

by Francis Bach at October 08, 2020 07:33 PM UTC

Random walks on expanders are a central and versatile tool in pseudorandomness. If an arbitrary half of the vertices of an expander graph are marked, known Chernoff bounds for expander walks imply that the number $M$ of marked vertices visited in a long $n$-step random walk strongly concentrates around the expected $n/2$ value. Surprisingly, it was recently shown that the parity of $M$ also has exponentially small bias. Is there a common unification of these results? What other statistics about $M$ resemble the binomial distribution (the Hamming weight of a random $n$-bit string)? To gain insight into such questions, we analyze a simpler model called the sticky random walk. This model is a natural stepping stone towards understanding expander random walks, and we also show that it is a necessary step. The sticky random walk starts with a random bit and then each subsequent bit independently equals the previous bit with probability $(1+\lambda)/2$. Here $\lambda$ is the proxy for the expander's (second largest) eigenvalue. Using Krawtchouk expansion of functions, we derive several probabilistic results about the sticky random walk. We show an asymptotically tight $\Theta(\lambda)$ bound on the total variation distance between the (Hamming weight of the) sticky walk and the binomial distribution. We prove that the correlation between the majority and parity bit of the sticky walk is bounded by $O(n^{-1/4})$. This lends hope to unifying Chernoff bounds and parity concentration, as well as establishing other interesting statistical properties, of expander random walks.

at October 08, 2020 02:03 PM UTC

In certain complexity-theoretic settings, it is notoriously difficult to prove complexity separations which hold almost everywhere, i.e., for all but finitely many input lengths. For example, a classical open question is whether $\mathrm{NEXP} \subset \mathrm{i.o.-}\mathrm{NP}$; that is, it is open whether nondeterministic exponential time computations can be simulated on infinitely many input lengths by $\mathrm{NP}$ algorithms. This difficulty also applies to Williams' algorithmic method for circuit lower bounds [Williams, J. ACM 2014]. In particular, although [Murray and Williams, STOC 2018] proved $\mathrm{NTIME}[2^{\mathrm{polylog}(n)}] \not\subset \mathrm{ACC}^0$, it has remained an open problem to show that $\mathrm{E}^{\mathrm{NP}}$ ($2^{O(n)}$ time with an $\mathrm{NP}$ oracle) is not contained in $\mathrm{i.o.-}\mathrm{ACC}^0$. In this paper, we show how many infinitely-often circuit lower bounds proved by the algorithmic method can be adapted to establish almost-everywhere lower bounds. - We show there is a function $f \in \mathrm{E}^{\mathrm{NP}}$ such that for all sufficiently large input lengths $n$ and $\varepsilon \leq o(1)$, $f$ cannot be $(1/2+2^{-n^{\varepsilon}})$-approximated by $2^{n^\varepsilon}$-size $\mathrm{ACC}^0$ circuits on inputs of length $n$, improving lower bounds in [Chen and Ren, STOC 2020] and [Viola, ECCC 2020]. - We construct rigid matrices in $\mathrm{P}^{\mathrm{NP}}$ for all but finitely many inputs, rather than infinitely often as in [Alman and Chen, FOCS 2019] and [Bhangale et al., FOCS 2020]. - We show there are functions in $\mathrm{E}^{\mathrm{NP}}$ requiring constant-error probabilistic degree at least $\Omega(n/\log^2 n)$ for all large enough $n$, improving an infinitely-often separation of [Viola, ECCC 2020]. Our key to proving almost-everywhere worst-case lower bounds is a new ``constructive'' proof of an NTIME hierarchy theorem proved by [Fortnow and Santhanam, CCC 2016], where we show for every ``weak'' nondeterminstic algorithm (with smaller running-time and short witness), a ``refuter algorithm'' exists that can construct ``bad'' inputs for the hard language. We use this refuter algorithm to construct an almost-everywhere hard function. To extend our lower bounds to the average case, we prove a new XOR Lemma based on approximate linear sums, and combine it with the PCP-of-proximity applications developed in [Chen and Williams, CCC 2019] and [Chen and Ren, STOC 2020]. As a byproduct of our new XOR Lemma, we obtain a nondeterministic pseudorandom generator for poly-size $\mathrm{ACC}^0$ circuits with seed length $\mathrm{polylog}(n)$, which resolves an open question in [Chen and Ren, STOC 2020].

at October 08, 2020 01:53 PM UTC

I have been thinking about CH lately for two reasons

1) I reread the article

Hilbert's First Problem: The Continuum Hypothesis by Donald Martin from Proceedings of Symposia  in Pure Mathematics: Mathematical developments arising from Hilbert Problems. 1976. (For a book review of the symposia and, The Honor Class, also about Hilbert's problems, see here.)

The article takes the point of view that CH CAN have an answer. He discusses large cardinals (why assuming they exist is plausible, but alas, that assumption does not seem to resolve CH) and Projective Det.  (why assuming it is true is plausible, but alas, that assumption does not seem to resolve CH).

(A set A \subseteq {0,1}^omega is DETERMINED if either Alice or Bob has a winning strategy in the following non-fun game: they alternate picking bits a_1, b_1, a_2, b_2, ... with Alice going first. If a_1 b_1 a_2 b_2... IS IN A then Alice wins, IF NOT then Bob wins. Martin showed that all Borel sets are determined. Proj Det is the statement that all projections of Borel sets are determined. AD is the axiom that ALL sets A are determined. It contradicts AC.)

But what really inspired this post is the last paragraph:

Throughout the latter part of my discussion, I have been assuming a naive and uncritical attitude towards CH. While this is in fact my attitude, I by no means wish to dismiss the opposite viewpoint.  Those that argue that the concept of set is not sufficiently clear to fix the truth-value of CH have a position that is at present difficult to assail. As long as no new axiom is found which decides CH, their case will continue to grow stronger, and our assertions that the meaning of CH is clear will sound more and more empty.

2) Scott Aaronson mentioned in a blog post (see here) that  he has read and understood the proof that CH is independent of set theory.

SO, this seemed like a good time to revisit thoughts on CH.

 I took a very short poll, just two people, about CH: Stephen Fenner (in a perfect world he would be a set theorists) and Scott Aaronson (having JUST read the proof that CH is ind.  he has thought about it recently).

Here are some thoughts of theirs and mine

1) All three of us are Platonists with regard to the Naturals (I was surprised to find recently that there are people who are not!) but not with regard to the reals.  So we would be OKAY with having CH have no answer.

2) All three of us  agree that it would be nice if SOME axiom was both

a) Intuitively appealing or aesthetically appealing ,  and

b) resolved CH.

I always thought that (a) would be the hard part-- or at least getting everyone (not sure who we are talking about) to AGREE on a new axiom. But even getting an axiom to resolve CH seems hard.  Large cardinals don't seem to do it, and various forms of Determinacy don't seem to do it.

Scott reminded me of Freiling's Axiom of Symmetry (see here) which IS intuitive and DOES resolve CH (its false) though there are problems with it--- a minor variant   of it contradicts AC (I am QUITE FINE with that since AC implies Banach-Tarski which Darling says shows `Math is broken'.)

Stephen recalled some of Hugh Woodin's opinions of CH, but Hugh seems to have changed his mind from NOT(CH): 2^{aleph_0} = aleph_2, to CH:  2^{aleph_0} = aleph_1.(See here.)

3) All three of would be okay with V=L, though note that this would put many set theorists out of work. All the math that applies to the real world would still be intact.  I wonder if in an alternative history the reaction to Russell's paradox would be a formulation of set theory where V=L. We would KNOW that CH is true, KNOW that AC is true. We would know a lot about L but less about forcing.

4) Which Geometry is true: Euclidian, Riemannian, others? This is now regarded as a silly question: Right Tool, Right Job! If you build a bridge use Euclid. If you are doing astronomy use Riemann. Might Set Theory go the same way? It would be AWESOME if Scott Aaronson found some quantum thing where assuming 2^{aleph_0} = aleph_2 was the right way to model it.

5) If I was more plugged into the set theory community I might do a poll of set theorists, about CH. Actually, someone sort-of already has. Penelope Maddy has two excellent and readable articles where she studies what set theorists believe and why.

Believing The Axioms Ihere

Believing The Axioms IIhere

Those articles were written in 1988. I wonder if they need an update.

by gasarch ( at October 08, 2020 01:52 PM UTC

Apologies dear readers for the late posting. The beginning of the school year is always frenzied, and the pandemic has only added to that frenzy. We have an exciting September, with four papers on graph property testing, one two papers on distribution testing, and one paper that connects both topics.

(Ed: we normally scan through ECCC and arXiv, but are happy to post about papers that appear elsewhere. Thanks to the reader who pointed out a relevant COLT 2020 paper.)

Estimation of Graph Isomorphism Distance in the Query World by Sourav Chakraborty, Arijit Ghosh, Gopinath Mishra, and Sayantan Sen (ECCC). Graph isomorphism is about as fundamental as it gets, and this papers studies approximating the graph isomorphism distance for dense graphs. There is a known graph \(G_k\) (with \(n\) vertices). The algorithm is given query access to an input graph \(G_u\) and needs to approximate the number of edge inserts/deletes required to make the graphs isomorphic. This is the tolerant testing version; the property testing version is known to be doable in \(\widetilde{O}(\sqrt{n})\) queries (Matsliah-Fischer). The main insight of this paper is to relate the tolerant testing complexity to a distribution testing problem. Consider distributions over the \(\{0,1\}^n\) defined by multisets of \(n\) hypercube points. Our aim is to estimate the earthmover distance between a known distribution and an unknown distribution. Interestingly, the query model is different: one can sample the underlying multisets without replacement. It turns out that the optimal complexity of this problem is (upto polylog factors) is the same as the optimal complexity of tolerant testing of graph isomorphism. A direct corollary is that the isomorphism distance can be approximated upto additive \(\epsilon n^2\) using \(\widetilde{O}(n)\) samples. This equivalence also gives an alternate proof for lower bounds for property testing graph isomorphism.

Robustly Self-Ordered Graphs: Constructions and Applications to Property Testing by Oded Goldreich and Avi Wigderson (ECCC). Let’s start from the application. The aim is to prove the following property testing lower bounds for the bounded-degree graph setting: an exponential separation between tolerant and vanilla testing, and finding an efficiently decidable property (in polynomial time) that cannot be property tested in sublinear time. For binary strings, results of this form are known. Can these be “ported” to the bounded-degree graph world? Can we construct graphs such that adjacency queries reduce to bit queries in strings? Naturally, one can simply represent the adjacency list as a string and treat graph queries as bit queries. But the problem is that of isomorphisms: different bit strings could represent the same graph and therefore, the different bit strings must have the same status with respect to the underlying property. The key insight in this paper is to introduce robustly self-ordered graphs, as a tool to port bit string property testing lower bounds to bounded-degree graphs. Such graphs essentially have a unique (identity) automorphism, even after a few edge insert/deletes. The actual definition is more technical, but that is the essence. The main result is an explicit construction of such graphs, from which the lower bound can be ported directly through a convenient lemma.

Modifying a Graph’s Degree Sequence and the Testablity of Degree Sequence Properties by Lior Gishboliner (arXiv). A sequence of numbers \(D = (d_1, d_2, \ldots, d_n)\) is graphic if there exists an undirected graph on \(n\) vertices whose degrees are precisely the numbers of the sequence. Graphical sequences have been characterized by classic results of Erdös-Gállai and Havel-Hakimi. This paper first proves the following theorem. Suppose a graphic sequence \(D’\) has \(l_1\)-distance at most \(\delta\) from the degree sequence \(D\) of a graph \(G\). Then, there exists a graph \(G’\) with degree sequence \(D’\) such that the (dense graph) distance between \(G\) and \(G’\) is \(O(\sqrt{\delta})\). This theorem is used to prove an interesting property testing result. Let \(\mathcal{D}\) be a subset of graphic sequences that are closed under permutation. Let \(\mathcal{G}\) be the set of graphs that have a degree sequence in \(\mathcal{D}\). Then \(\mathcal{G}\) can be tested in \(poly(1/\epsilon)\) queries.

Sampling an Edge Uniformly in Sublinear Time by Jakub Têtek (arXiv). In the general model for sublinear algorithms on graphs, an important choice is whether one allows uniform random edge queries. A natural question is whether such queries can simulated efficiently, using only random vertex, degree, and neighbor queries. This problem appears somewhat implicitly in previous sublinear subgraph counting algorithms, and Eden-Ron-Rosenbaum study it explicitly. They prove that one can sample from an \(\epsilon\)-approximate uniform distribution (over edges) using \(O(n/\sqrt{\epsilon m})\) samples. The problem of sampling from exactly the uniform distribution is left open. Until this paper. The main result shows that by modifying the Eden-Ron-Rosenbaum algorithm parameters, one can generate edge samples from an \(\epsilon\)-approximate uniform distribution using \(O((n/\sqrt{m})\log \epsilon^{-1})\) samples. The exact uniform distribution is achieved by setting \(\epsilon = 1/n\), to get a sample complexity of \(O((n\log n)/\sqrt{m})\).

Faster Property Testers in a Variation of the Bounded Degree Model by Isolde Adler and Polly Fahey (arXiv). The setting of bounded-degree graph property testing naturally extends to bounded-degree relational databases, which can be thought of as “directed” hypergraphs. This is an interesting new direction of research, that combines property testing with database theory (see Adler-Harwath and Chen-Yoshida). One of the main contributions of this work is to consider another notion of distance: edge and vertex inserts/deletes. This is a natural extension, and we can now compare distances between graphs/databases with different numbers of vertices. The main result is that, under this notion of distance, a large class of properties can be tested in constant running time on databases with bounded degree and treewidth. Specifically, any property expressible in Counting Monadic Second-Order Logic (CMSO) can be tested in constant time. Previous results by Alder-Harwath showed that such properties can be tested (under the standard distance notion) in constant queries, but polylogarithmic time.

Optimal Testing of Discrete Distributions with High Probability by Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, John Peebles, and Eric Price (arXiv, ECCC). The focus of this paper is distribution testing in the “high probability” regime, where we wish the error of the tester to be \(< \delta\). Typically, most results just get an error of at most \(1/3\), from which standard probabilistic boosting would tack on an extra \(O(\log 1/\delta)\) factor. In standard TCS settings, one doesn’t focus on optimizing this dependence, but in statistics, there is significant focus on the optimal sample complexity. And indeed, for practical applications, it is crucial to have sharp bounds on the right number of samples required for hypothesis testing. The paper also argues that getting the optimal sample complexity requires new algorithms, even for uniformity testing. There are optimal results given for closeness and independence testing. The optimal sample complexity only pays a multiplicative factor of \(\log^{1/3} (1/\delta)\) or \(\log^{1/2}(1/\delta)\) over the optimal bound for constant error (with other additive terms depending on \(\log(1/\delta)\)).

Bessel Smoothing and Multi-Distribution Property Estimation by Yi Hao and Ping Li (COLT 2020). Let us consider some standard (tolerant) distribution testing questions, phrases as approximation algorithms. Given sample access to two distributions \(p\) and \(q\) over \([n]\), we may wish to estimate the \(l_1\)-distance, \(l_2\)-distance, relative entropy, etc. between these distributions. One can phrases this problem abstractly as estimating \(\sum_{i \in [n]} f(p_i, q_i)\), where \(f\) is some explicit function. This papers shows that for any 1-Lipschitz function \(f\) that satisfies some “regularity” property, the sum \(\sum_{i \in [n]} f(p_i, q_i)\) can be \(\epsilon\)-approximated with \(O(\epsilon^{-3}n/\sqrt{\log n})\) samples (apologies to the authors to replacing their \(k\) with the more familiar \(n\) for our readers). Thus, we can get sublinear sampling complexity for a very general class of estimation problems. Moreover, this was actually the simplest setting consider in the paper. One can deal with such functions of \(d\) distributions, not just two distributions. One of the corollaries of the theorems is a sublinear tolerant tester for the property of being a mixture of distributions.

by Seshadhri at October 07, 2020 11:20 PM UTC

The next TCS+ talk will take place this coming Wednesday, October 14th at 1:00 PM Eastern Time (10:00 AM Pacific Time, 19:00 Central European Time, 17:00 UTC). Jayadev Acharya from Cornell University will speak about “Distributed Statistical Inference under Local Information Constraints ” (abstract below).

You can reserve a spot as an individual or a group to join us live by signing up on the online form. Due to security concerns, registration is required to attend the interactive talk. (The link to the YouTube livestream will also be posted on our website on the day of the talk, so people who did not sign up will still be able to watch the talk live.) As usual, for more information about the TCS+ online seminar series and the upcoming talks, or to suggest a possible topic or speaker, please see the website.

Abstract: We consider statistical inference tasks in a distributed setting where access to data samples is subjected to strict “local constraints,” through a unified framework that captures communication limitations and (local) privacy constraints as special cases. We study estimation (learning) and goodness-of-fit (testing) for both discrete and high-dimensional distributions. Our goal is to understand how the sample complexity increases under the information constraints.

In this talk we will provide an overview of this field and a sample of some of our results. We will discuss the role of (public) randomness and interactivity in information-constrained inference, and make a case for thinking about randomness and interactivity as resources.

The work is part of a long-term ongoing collaboration with Clément Canonne (IBM Research) and Himanshu Tyagi (IISc), and includes works done with Cody Freitag (Cornell), Yanjun Han (Stanford), Yuhan Liu (Cornell), and Ziteng Sun (Cornell).

by plustcs at October 07, 2020 08:11 PM UTC

NeurIPS 2020 is the biggest conference on machine learning, with tons of content on differential privacy in many different forms. We were able to find two workshops, a competition, and 31 papers. This was just going off the preliminary accepted papers list, so it’s possible that we might have missed some papers on differential privacy – please let us know! We will update this post later, once all the conference material (papers and videos) are publicly available.




by Gautam Kamath at October 07, 2020 04:30 PM UTC

The next Foundations of Data Science virtual talk will take place on Friday, Oct 09th at 10:00 AM Pacific Time (1:00 pm Eastern Time, 18:00 Central European Time, 17:00 UTC).  Alexandr Andoni from Columbia University will speak about “Approximating Edit Distance in Near-Linear Time”.

Abstract: Edit distance is a classic measure of similarity between strings, with applications ranging from computational biology to coding. Computing edit distance is also a classic dynamic programming problem, with a quadratic run-time solution, often taught in the “Intro to Algorithms” classes. Improving this runtime has been a decades-old challenge, now ruled likely-impossible using tools from the modern area of fine-grained complexity. We show how to approximate the edit distance between two strings in near-linear time, up to a constant factor. Our result completes a research direction set forth in the breakthrough paper of [Chakraborty, Das, Goldenberg, Koucky, Saks; FOCS’18], which showed the first constant-factor approximation algorithm with a (strongly) sub-quadratic running time.

Joint work with Negev Shekel Nosatzki, available at

Please register here to join the virtual talk.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

by dstheory at October 07, 2020 03:44 PM UTC

Science is good too

Emil Faber is the pretend founder of the pretend Faber College. The 1978 movie Animal House starts with a close-up of Faber’s statue, which has the inscription, Knowledge Is Good.

Today, Ken and I thought we might talk about knowledge, science, mathematics, proofs, and more.

The phrase on Faber’s pedestal is meant to be a joke, as is the subtitle we added saying the same about science. But there is some truth to both of them. From the cause of climate change to the best response to the current pandemic to sports predictions there is much interest in science. Science is good, indeed.


What is science and what are methods of creating knowledge via science? There is a whole world on the philosophy of science. The central questions are: What is science? What methods are used to create new science? Is science good?—just kidding.

We are not experts on the philosophy of science. But there seem to be three main ways to create scientific knowledge.

{\bullet } Experiments: This is the classic one. Think about the testing of a candidate vaccine to stop the pandemic.

{\bullet } Computational Experiments: This is relatively new. Think computer simulations of how climate change is effected by the methods of creating energy—for example. wind vs. coal.

{\bullet } Mathematical Proofs: This is the one we focus on here at GLL. Think proofs that some algorithm works or that there is no algorithm that can work unless…

Mathematical Proofs

We are interested in creating knowledge via proving new theorems. This is how we try to create knowledge. Our science is based not on experiments and not on simulations but mostly on the theorem-proof method. Well not exactly. We do use experiments and simulations. For example, the field of quantum algorithms uses both of these.

However, math proofs are the basis of complexity theory. This means that we need to create proofs and then check that they are correct. The difficulty of checking a proof is based on who created them:

  • You did—checking your own work.

  • Someone else did—refereeing for a journal.

  • Someone in your class did—grading exams.

  • Some graduate student did—mentoring.

  • Someone on the web who claims a major result like {\mathsf{P < NP}} did—debugging.

  • And so on.

My Favorite Checking Method

My favorite tool for checking is this trick: Suppose that we have a proof {P} that demonstrates {A \implies X} is true. Sometimes it is possible to show that there is a proof {Q} that proves {A \implies Y} where:

  1. The proof {Q} is based on changing the claimed proof {P}.

  2. The proof {Q} demonstrates {A \implies Y}, and;

  3. The statement {Y} does not follow from {A}.

One way this commonly arises is when {P} as a proof did not use all of the assumptions in {A}. Thus {P} really proves more that {X} and it proves {Y}. But we note that {Y} is not a consequence of {A}.

For example, consider the Riemann hypothesis. Suppose that we claim that we have a proof that

\displaystyle  \sum_{n=1}^{\infty} \frac{1}{n^{s}} \neq 0

follows from the usual axioms of math plus {\Re(s) > 1/2}. Sounds great. But suppose this is based on an argument that assumes that

\displaystyle  \sum_{n=1}^{\infty} \frac{1}{n^{s}} = 0

and manipulates the summation, eventually yielding a contradiction, without using the condition {\Re(s) > 1/2}. This is a problem, since there are {s} with {\Re(s) = 1/2} so that the sum is zero. This is an example of the above method of checking.

A New Checking Method

From time to time claims are made of resolutions to famous conjectures. Think {\mathsf{P = NP}}. These claims have all been wrong to date. So most researchers are reluctant to take time to check any new claims. Why would you take the effort to try and find the bug that is likely there?

I wonder if there could be a method that is based on competition. For concreteness, suppose Alice and Bob are two researchers who both claim a resolution to the {\mathsf{P}} versus {\mathsf{NP}} problem. Alice has a lower bound argument that {\mathsf{P < NP}} and Bob has an upper bound that {\mathsf{P = NP}}. Could we have them play a “game”?

Give their papers to each other. Have them try to find a flaw in each other’s paper.

They are highly motivated. Could we argue that if they cannot find any flaw then we would be slightly more motivated to look at the papers?

This might work even if they both claim {\mathsf{P = NP}}. Ken and I, personally, have had more claims of {\mathsf{P = NP}} brought to our attention. Even in this case they would be highly motivated: the awards, the prizes, the praise will go to the one who is correct.

Possible Extensions

One difference in our situation from classic empirical science is the nature of gaps in knowledge. For example, one of the big current controversies in physics is over the existence of dark matter. The Wikipedia article we just linked seems to date mostly to years around 2012 when dark matter was more widely accepted than strikes us today (see also this and this). There are cases where two competing theories are incompatible yet the available data do not suffice to find a fault in either.

Whereas, with claimed proofs of incompatible statements, such as {\mathsf{P < NP}} and {\mathsf{P = NP}}, at least one must have a demonstrable error. The statements themselves may have barriers all the way up to undecidability, but that does not matter to judging the proffered proofs.

The method may be more applicable in life sciences where the gap is gathering sufficient field or lab observations. For a topical example, consider claims about the risk or safety of human gatherings amid the pandemic. One extreme is represented by the extraordinary claim, which is evidently quite excessive, that the Sturgis motorcycle rally in August led to over 250,000 Covid-19 cases. The other extreme would be analyses used to justify gatherings with minimal precautions. The extremes cannot coexist. The means to arbitrate between them are available in principle but require costly social effort for contact tracing and testing as well as resolving mathematical issues between epidemiological models.

Open Problems

What do you think of our new checking method? Should it be more widely employed for evaluating claims and hypotheses?

by rjlipton at October 07, 2020 04:54 AM UTC

David Blackwell was still roaming the corridors of UC Berkeley’s stat department during my postdoc years, circa 2009. Jake Abernethy, Sasha Rakhlin, Peter Bartlett and myself were discussing his results, and his seminal contributions to prediction theory were already well known.

David Blackwell
James Hannan

At that time, still the early days of online convex optimization, we were contemplating everything from adaptive gradient methods to bandit convex optimization. Blackwell’s famed approachability theorem was looming in the background, considered to be one of the strongest theorems in the ML-theorists toolkit.

I’ve recently added a new draft chapter to to V2.0 of Introduction to Online Convex Optimization, about the connection between OCO and Blackwell approachability: they are algorithmically equivalent! More precisely, an efficient algorithm for approachability gives rise to an efficient algorithm for OCO with sublinear regret, and vice versa.

Details are in the chapter draft, and I recommend this music for reading it: (the song is called “together we won despite everything”, dedicated to music and science)


by Elad Hazan at October 06, 2020 12:11 AM UTC