Theory of Computing Blog Aggregator

A covering code is a set of codewords with the property that the union of balls, suitably defined, around these codewords covers an entire space. Generally, the goal is to find the covering code with the minimum size codebook. While most prior work on covering codes has focused on the Hamming metric, we consider the problem of designing covering codes defined in terms of insertions and deletions. First, we provide new sphere-covering lower bounds on the minimum possible size of such codes. Then, we provide new existential upper bounds on the size of optimal covering codes for a single insertion or a single deletion that are tight up to a constant factor. Finally, we derive improved upper bounds for covering codes using $R\geq 2$ insertions or deletions. We prove that codes exist with density that is only a factor $O(R \log R)$ larger than the lower bounds for all fixed $R$. In particular, our upper bounds have an optimal dependence on the word length, and we achieve asymptotic density matching the best known bounds for Hamming distance covering codes.
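For intuition, the analogous sphere-covering lower bound in the Hamming metric (not the insertion/deletion metric the paper studies) can be computed directly. This is an illustrative sketch only; function names are ours:

```python
from math import comb, ceil

def hamming_ball_volume(n: int, r: int) -> int:
    # Number of binary words within Hamming distance r of a fixed word.
    return sum(comb(n, i) for i in range(r + 1))

def sphere_covering_lower_bound(n: int, r: int) -> int:
    # Any radius-r covering code C must satisfy |C| * V(n, r) >= 2^n,
    # since the balls around the codewords must cover all 2^n words.
    return ceil(2 ** n / hamming_ball_volume(n, r))

print(sphere_covering_lower_bound(7, 1))   # 16, met by the perfect Hamming code
print(sphere_covering_lower_bound(10, 1))  # 94
```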

at December 08, 2019 09:03 AM UTC

Two postdoctoral positions, each for one year renewable to a second, are available to work in Luca Trevisan’s group at Bocconi University on topics related to average-case analysis of algorithms, approximation algorithms, and combinatorial constructions.

The positions have a very competitive salary and relocation benefits. Funding for travel is available.


by shacharlovett at December 07, 2019 11:11 PM UTC

I am recruiting for two postdoctoral positions, each for one year renewable to a second, to work with me at Bocconi University on topics related to average-case analysis of algorithms, approximation algorithms, and combinatorial constructions.

The positions have a very competitive salary and relocation benefits. Funding for travel is available.

Application information is at this link. The deadline is December 15. If you apply, please also send me an email (L.Trevisan at to let me know.

by luca at December 07, 2019 11:07 PM UTC

The query model offers a concrete setting where quantum algorithms are provably superior to randomized algorithms. Beautiful results by Bernstein-Vazirani, Simon, Aaronson, and others presented partial Boolean functions that can be computed by quantum algorithms making far fewer queries than their randomized analogs. To date, separations of $O(1)$ vs. $\sqrt{N}$ between quantum and randomized query complexities remain the state of the art (where $N$ is the input length), leaving open the question of whether $O(1)$ vs. $N^{1/2+\Omega(1)}$ separations are possible. We answer this question in the affirmative. Our separating problem is a variant of the Aaronson-Ambainis $k$-fold Forrelation problem. We show that our variant: (1) Can be solved by a quantum algorithm making $2^{O(k)}$ queries to the inputs. (2) Requires at least $\tilde{\Omega}(N^{2(k-1)/(3k-1)})$ queries for any randomized algorithm. For any constant $\varepsilon>0$, this gives an $O(1)$ vs. $N^{2/3-\varepsilon}$ separation between the quantum and randomized query complexities of partial Boolean functions. Our proof is Fourier-analytic and uses new bounds on the Fourier spectrum of classical decision trees, which could be of independent interest. Looking forward, we conjecture that the Fourier bounds could be further improved in a precise manner, and show that such conjectured bounds imply optimal $O(1)$ vs. $N^{1-\varepsilon}$ separations between the quantum and randomized query complexities of partial Boolean functions.

at December 07, 2019 02:25 PM UTC

Computer Science and Engineering at the University of Michigan currently invites applications for multiple tenure-track and teaching faculty (lecturer) positions. We seek exceptional candidates at all levels in all areas across computer science and computer engineering. We also have a targeted search for an endowed professorship in theoretical computer science (the Fischer Chair).


by shacharlovett at December 07, 2019 03:53 AM UTC

We show that the size of any regular resolution refutation of a Tseitin formula $T(G,c)$ based on a graph $G$ is at least $2^{\Omega(tw(G)/\log n)}$, where $n$ is the number of vertices in $G$ and $tw(G)$ is the treewidth of $G$. For constant-degree graphs there is a known upper bound of $2^{O(tw(G))}$ [AR11, GTT18], so our lower bound is tight up to a logarithmic factor in the exponent. To prove this result, we show that any regular resolution proof of a Tseitin formula $T(G,c)$ of size $S$ can be converted into a read-once branching program of size $S^{O(\log n)}$ computing a satisfiable Tseitin formula $T(G,c')$. We then show that any read-once branching program computing a satisfiable Tseitin formula $T(G,c')$ has size at least $2^{\Omega(tw(G))}$; the latter improves a recent result of Glinskih and Itsykson [GI19].

at December 07, 2019 01:51 AM UTC

Authors: Philipp Bamberger, Fabian Kuhn, Yannic Maus
Download: PDF
Abstract: We show that the $(degree+1)$-list coloring problem can be solved deterministically in $O(D \cdot \log n \cdot \log^3 \Delta)$ rounds in the CONGEST model, where $D$ is the diameter of the graph, $n$ is the number of nodes, and $\Delta$ is the maximum degree. Combined with the network decomposition algorithm of Rozhon and Ghaffari, this implies the first efficient deterministic, that is, $\text{poly}\log n$-time, CONGEST algorithms for the $(\Delta+1)$-coloring and $(degree+1)$-list coloring problems. Previously, the best known algorithm required $2^{O(\sqrt{\log n})}$ rounds and was not based on network decompositions.

Our results also imply deterministic $O(\log^3 \Delta)$-round algorithms in MPC and the CONGESTED CLIQUE.

at December 07, 2019 12:00 AM UTC

Authors: Xi Li, Mingyou Wu, Hanwu Chen
Download: PDF
Abstract: In this work, we consider the application of continuous-time quantum walks (CTQW) to the Maximum Clique (MC) problem. Performing a CTQW on a graph generates distinct periodic probability amplitudes for different vertices. We show that the intensity of the probability amplitude at certain frequencies indeed reveals the clique structure of some special kinds of graphs. We then propose recursive algorithms, with time complexity $O(N^5)$ on classical computers, for finding the maximum clique. We have experimented on random graphs in which each edge exists with probability 0.3, 0.5, or 0.7. Although we found no counterexamples on random graphs, we do not know whether these algorithms are universal.
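As a rough illustration of the walk itself (not of the paper's clique algorithms), a CTQW evolves an initial vertex state by $e^{-iAt}$ with the adjacency matrix $A$ as the Hamiltonian. A minimal numpy sketch on a 4-clique:

```python
import numpy as np

def ctqw_amplitudes(adj: np.ndarray, t: float, start: int) -> np.ndarray:
    # Continuous-time quantum walk: |psi(t)> = exp(-i A t) |start>,
    # with the adjacency matrix A as the Hamiltonian.
    w, v = np.linalg.eigh(adj.astype(float))
    U = v @ np.diag(np.exp(-1j * w * t)) @ v.conj().T
    psi0 = np.zeros(adj.shape[0], dtype=complex)
    psi0[start] = 1.0
    return U @ psi0

# Walk on the complete graph K4 (a 4-clique), starting at vertex 0.
A = np.ones((4, 4)) - np.eye(4)
psi = ctqw_amplitudes(A, t=1.0, start=0)
print(np.abs(psi) ** 2)  # probabilities over vertices; they sum to 1
```

By symmetry of $K_4$, vertices 1, 2, and 3 carry equal probability at every time $t$.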

at December 07, 2019 12:00 AM UTC

Authors: Peter Bubenik, Alex Elchesen
Download: PDF
Abstract: We undertake a formal study of persistence diagrams and their metrics. We show that barcodes and persistence diagrams together with the bottleneck distance and the Wasserstein distances are obtained via universal constructions and thus have corresponding universal properties. In addition, the 1-Wasserstein distance satisfies Kantorovich-Rubinstein duality. Our constructions and results apply to any metric space with a distinguished basepoint. For example, they can also be applied to multiparameter persistence modules.

at December 07, 2019 12:00 AM UTC

Authors: Ondřej Benedikt, István Módos, Zdeněk Hanzálek
Download: PDF
Abstract: This paper addresses a single-machine scheduling problem with non-preemptive jobs, minimizing the total electricity cost. Two of the latest trends in energy-aware scheduling are considered, namely variable energy pricing and the power-saving states of a machine. The jobs and the machine states are scheduled jointly to achieve the highest possible savings. Although this problem has been addressed in the literature before, the reported results of the state-of-the-art method show that optimal solutions can be found only for instances with up to 35 jobs and 209 intervals within 3 hours of computation. We propose an elegant pre-processing technique, called SPACES, for computing the optimal switching of the machine states with respect to the energy costs. The optimal switchings correspond to shortest paths in an interval-state graph that describes all possible transitions between the machine states over time. This idea allows us to implement efficient integer linear programming and constraint programming models of the problem while preserving optimality. The efficiency of the models lies in the simplified representation of the optimal switching. The results of the experiments show that our approach outperforms the existing state-of-the-art exact method: on a set of benchmark instances with varying sizes and different state transition graphs, it finds optimal solutions even for large instances with up to 190 jobs and 1277 intervals within an hour of computation.
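The shortest-path view of state switching can be sketched as follows. The interval-state graph, state names, and energy costs below are entirely hypothetical, not taken from the paper's benchmarks:

```python
import heapq

def cheapest_state_path(graph, source, target):
    # Dijkstra over an (interval, state) graph: nodes are (interval, machine
    # state) pairs, edge weights are the energy costs of staying in or
    # switching between states across consecutive intervals.
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")

# Toy instance: two intervals, states "on"/"sleep", made-up energy prices.
g = {
    (0, "on"):    [((1, "on"), 5.0), ((1, "sleep"), 2.0)],
    (0, "sleep"): [((1, "on"), 4.0), ((1, "sleep"), 1.0)],
    (1, "on"):    [((2, "on"), 3.0)],
    (1, "sleep"): [((2, "on"), 6.0)],
}
print(cheapest_state_path(g, (0, "on"), (2, "on")))  # 8.0
```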

at December 07, 2019 12:00 AM UTC

Authors: Thomas Fernique
Download: PDF
Abstract: We consider circle packings in the plane with circles of sizes $1$, $r\simeq 0.834$ and $s\simeq 0.651$. These sizes are algebraic numbers which allow a compact packing, that is, a packing in which each hole is formed by three mutually tangent circles. Compact packings are believed to maximize the density when there are possible. We prove that it is indeed the case for these sizes. The proof should be generalizable to other sizes which allow compact packings and is a first step towards a general result.

at December 07, 2019 12:00 AM UTC

Authors: Benjamin Coleman, Anshumali Shrivastava
Download: PDF
Abstract: Kernel density estimation is a simple and effective method that lies at the heart of many important machine learning applications. Unfortunately, kernel methods scale poorly for large, high dimensional datasets. Approximate kernel density estimation has a prohibitively high memory and computation cost, especially in the streaming setting. Recent sampling algorithms for high dimensional densities can reduce the computation cost but cannot operate online, while streaming algorithms cannot handle high dimensional datasets due to the curse of dimensionality. We propose RACE, an efficient sketching algorithm for kernel density estimation on high-dimensional streaming data. RACE compresses a set of N high dimensional vectors into a small array of integer counters. This array is sufficient to estimate the kernel density for a large class of kernels. Our sketch is practical to implement and comes with strong theoretical guarantees. We evaluate our method on real-world high-dimensional datasets and show that our sketch achieves 10x better compression compared to competing methods.
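A minimal sketch of the counter-array idea, under the assumption that the kernel is the collision-probability kernel of signed random projections (an LSH family); class name, parameter names, and sizes here are illustrative, not the paper's implementation:

```python
import numpy as np

class RaceSketch:
    # Each of `reps` repetitions hashes a vector with signed random
    # projections into a row of counters; the density estimate at a query
    # is the mean of the counters it lands in, normalized by N.
    def __init__(self, dim, reps=50, bits=4, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((reps, bits, dim))
        self.counts = np.zeros((reps, 2 ** bits), dtype=np.int64)
        self.n = 0

    def _hash(self, x):
        # One LSH bucket per repetition: `bits` hyperplane signs -> integer.
        signs = (self.proj @ x > 0).astype(int)
        return signs @ (2 ** np.arange(signs.shape[1]))

    def add(self, x):
        self.counts[np.arange(len(self.counts)), self._hash(x)] += 1
        self.n += 1

    def query(self, x):
        idx = self._hash(x)
        return self.counts[np.arange(len(self.counts)), idx].mean() / self.n

rng = np.random.default_rng(1)
center = rng.standard_normal(10)
sketch = RaceSketch(dim=10)
for row in center + 0.1 * rng.standard_normal((500, 10)):
    sketch.add(row)
# Density along the data direction should exceed density in the opposite one.
print(sketch.query(center) > sketch.query(-center))  # True
```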

at December 07, 2019 12:00 AM UTC

Authors: Jeff Erickson, Ivor van der Hoog, Tillmann Miltzow
Download: PDF
Abstract: We propose a new paradigm for robust geometric computations that complements the classical fixed precision paradigm and the exact geometric computation paradigm. We provide a framework where we study algorithmic problems under smoothed analysis of the input, the relaxation of the problem requirements, or the witness of a recognition problem. Our framework specifies a widely applicable set of prerequisites that make real RAM algorithms suitable for smoothed analysis. We prove that suitable algorithms can (under smoothed analysis) be robustly executed with expected logarithmic bit-precision. This shows in a formal way that inputs which need high bit-precision are contrived and that these algorithms are likely robust for realistic input. Interestingly, our techniques generalize to problems with a natural notion of resource augmentation (geometric packing, the art gallery problem) and recognition problems (recognition of realizable order types or disk intersection graphs).

Our results also have theoretical implications for some ER-hard problems: These problems have input instances where their real verification algorithm requires at least exponential bit-precision which makes it difficult to place these ER-hard problems in NP. Our results imply for a host of ER-complete problems that this exponential bit-precision phenomenon comes from nearly degenerate instances.

It is not evident that problems that have a real verification algorithm belong to ER. Therefore, we conclude with a real RAM analogue to the Cook-Levin Theorem. This gives an easy proof of ER-membership, as real verification algorithms are much more versatile than ETR-formulas.

at December 07, 2019 12:00 AM UTC

Authors: P. Mirabal, J. Abreu, D. Seco
Download: PDF
Abstract: Strings are a natural representation of biological data such as DNA, RNA, and protein sequences. The problem of finding a string that summarizes a set of sequences has direct applications in relative compression algorithms for genome and proteome analysis, where reference sequences need to be chosen. Median strings have been used as representatives of a set of strings in different domains. However, several formulations of these problems are NP-complete. Alternatively, heuristic approaches that iteratively refine an initial coarse solution by applying edit operations have been proposed. Recently, we investigated the selection of the optimal edit operations to speed up convergence without spoiling the quality of the approximated median string. We propose a novel algorithm that outperforms state-of-the-art heuristic approximations to the median string in terms of convergence speed, by estimating the effect of a perturbation on the minimization of the expressions that define the median string. We present a corpus of comparative experiments to validate these results.
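To make the notion concrete, here is the simpler set median (the member of the set minimizing the total edit distance to the others), which is often the coarse initial solution that perturbation-based heuristics then refine. A minimal sketch, not the paper's algorithm:

```python
def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein dynamic program, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution/match
        prev = cur
    return prev[-1]

def set_median(strings):
    # The member of the set minimizing the sum of edit distances to all
    # others; true median strings need not belong to the set.
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))

print(set_median(["ACGT", "ACGA", "AGGT", "TTTT"]))  # ACGT
```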

at December 07, 2019 12:00 AM UTC

Authors: Tianhao Wang, Min Xu, Bolin Ding, Jingren Zhou, Cheng Hong, Zhicong Huang, Ninghui Li, Somesh Jha
Download: PDF
Abstract: When collecting information, local differential privacy (LDP) alleviates privacy concerns of users because their private information is randomized before being sent to the central aggregator. However, LDP results in loss of utility due to the amount of noise that is added to each individual data item. To address this issue, recent work introduced an intermediate server with the assumption that this intermediate server did not collude with the aggregator. Using this trust model, one can add less noise to achieve the same privacy guarantee; thus improving the utility.

In this paper, we investigate this multiple-party setting of LDP. We first analyze the threat model and identify potential adversaries. We then make observations about existing approaches and propose new techniques that achieve a better privacy-utility tradeoff than existing ones. Finally, we perform experiments to compare different methods and demonstrate the benefits of using our proposed method.
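For a concrete sense of the basic LDP primitive such work builds on, here is the classic binary randomized-response mechanism with an unbiased debiasing step. This is an illustration of plain local DP, not the paper's multi-party protocol:

```python
import math
import random

def randomized_response(bit: int, epsilon: float) -> int:
    # Report the true bit with probability e^eps / (e^eps + 1), otherwise
    # flip it; the likelihood ratio of any output is at most e^eps,
    # so this satisfies eps-local differential privacy.
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return bit if random.random() < p else 1 - bit

def debias(reports, epsilon):
    # Unbiased estimate of the true mean: E[report] = mu*(2p-1) + (1-p).
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return (sum(reports) / len(reports) - (1 - p)) / (2 * p - 1)

random.seed(0)
true_bits = [1] * 300 + [0] * 700                 # true mean 0.3
noisy = [randomized_response(b, epsilon=1.0) for b in true_bits]
print(debias(noisy, 1.0))                         # close to 0.3
```

Smaller epsilon means more flipping, hence more noise per item; this is exactly the utility loss that the intermediate-server trust model aims to reduce.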

at December 07, 2019 11:27 PM UTC

The Department of Computer Science at The George Washington University invites applications for two tenure track positions at the Assistant, Associate or Full Professor level, beginning as early as Fall 2020. One position focuses on Machine Learning and related areas; the other position welcomes all areas of theoretical and applied computer science.


by shacharlovett at December 06, 2019 09:33 PM UTC

If anyone asks you: how can I scale my State Machine Replication (Blockchain) system? You should answer back with a question: what is your bottleneck? Is it Data, Consensus or Execution? Data: Shipping the commands to all the replicas. For example, if a block contains 1MB of commands, then you...

at December 06, 2019 05:05 PM UTC

A pseudo-deterministic algorithm is a (randomized) algorithm which, when run multiple times on the same input, with high probability outputs the same result on all executions. Classic streaming algorithms, such as those for finding heavy hitters, approximate counting, $\ell_2$ approximation, and finding a nonzero entry in a vector (for turnstile algorithms), are not pseudo-deterministic. For example, in the case of finding a nonzero entry in a vector, for any known low-space algorithm $A$ there exists a stream $x$ such that running $A$ twice on $x$ (using different randomness) would with high probability result in two different entries as the output. In this work, we study whether it is inherent that these algorithms output different values on different executions; that is, we ask whether these problems have low-memory pseudo-deterministic algorithms. For instance, we show that there is no low-memory pseudo-deterministic algorithm for finding a nonzero entry in a vector (given in a turnstile fashion), and also that there is no low-dimensional pseudo-deterministic sketching algorithm for $\ell_2$ norm estimation. We also exhibit problems which do have low-memory pseudo-deterministic algorithms but no low-memory deterministic algorithm, such as outputting a nonzero row of a matrix, or outputting a basis for the row-span of a matrix. Finally, we investigate multi-pseudo-deterministic algorithms: algorithms which with high probability output one of a few options. We show the first lower bounds for such algorithms. This implies that there are streaming problems for which every low-space algorithm must have inputs with many valid outputs, each with a significant probability of being output.

at December 06, 2019 02:44 PM UTC

We hit the mother-lode of property testing papers this month. Stick with us, as we cover 10 (!) papers that appeared online in November.

Testing noisy linear functions for sparsity, by Xue Chen, Anindya De, and Rocco A. Servedio (arXiv). Given samples from a noisy linear model \(y = w\cdot x + \mathrm{noise}\), test whether \(w\) is \(k\)-sparse, or far from being \(k\)-sparse. This is a property testing version of the celebrated sparse recovery problem, whose sample complexity is well-known to be \(O(k\log n)\), where the data lies in \(\mathbb{R}^n\). This paper shows that the testing version of the problem can be solved (tolerantly) with a number of samples independent of \(n\), assuming certain technical conditions: the coordinates of \(x\) are i.i.d. and non-Gaussian, and the noise distribution is known to the algorithm. Surprisingly, all of these conditions are needed; otherwise the dependence on \(n\) is \(\tilde \Omega(\log n)\), essentially the same as for the recovery problem.

Pan-Private Uniformity Testing, by Kareem Amin, Matthew Joseph, Jieming Mao (arXiv). Differentially private distribution testing has now seen significant study, in both the local and central models of privacy. This paper studies distribution testing in the pan-private model, which is intermediate: the algorithm receives samples one by one in the clear, but it must maintain a differentially private internal state at all time steps. The sample complexity turns out to be qualitatively intermediate to the two other models: testing uniformity over \([k]\) requires \(\Theta(\sqrt{k})\) samples in the central model, \(\Theta(k)\) samples in the local model, and this paper shows that \(\Theta(k^{2/3})\) samples are necessary and sufficient in the pan-private model.

Almost Optimal Testers for Concise Representations, by Nader Bshouty (ECCC). This work gives a unified approach for testing for a plethora of different classes which possess some sort of sparsity. These classes include \(k\)-juntas, \(k\)-linear functions, \(k\)-terms, various types of DNFs, decision lists, functions with bounded Fourier degree, and much more.

Unified Sample-Optimal Property Estimation in Near-Linear Time, by Yi Hao and Alon Orlitsky (arXiv). This paper presents a unified approach for estimating several distribution properties with both near-optimal time and sample complexity, based on piecewise-polynomial approximation. Some applications include estimators for Shannon entropy, power sums, distance to uniformity, normalized support size, and normalized support coverage. More generally, results hold for all Lipschitz properties, and consequences include high-confidence property estimation (outperforming the “median trick”) and differentially private property estimation.

Testing linear-invariant properties, by Jonathan Tidor and Yufei Zhao (arXiv). This paper studies property testing of functions which are, in a formal sense, definable by restrictions to subspaces of bounded degree. This class of functions is a broad generalization of testing whether a function is linear, or a degree-\(d\) polynomial (for constant \(d\)). The algorithm is the oblivious one, which simply repeatedly takes random restrictions and tests whether the property is satisfied or not (similar to the classic linearity test of BLR, along with many others).
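The BLR linearity test mentioned above can itself be sketched in a few lines (the trial count is an illustrative choice):

```python
import random

def blr_test(f, n, trials=100, rng=random.Random(0)):
    # BLR linearity test over GF(2)^n: pick random x, y and check
    # f(x) XOR f(y) == f(x XOR y). Linear functions always pass; functions
    # far from linear are rejected with constant probability per trial.
    for _ in range(trials):
        x = rng.getrandbits(n)
        y = rng.getrandbits(n)
        if f(x) ^ f(y) != f(x ^ y):
            return False
    return True

parity = lambda x: bin(x).count("1") % 2          # linear over GF(2)
majority = lambda x: int(bin(x).count("1") > 2)   # far from linear for n=5
print(blr_test(parity, 5), blr_test(majority, 5))  # True False
```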

Approximating the Distance to Monotonicity of Boolean Functions, by Ramesh Krishnan S. Pallavoor, Sofya Raskhodnikova, Erik Waingarten (ECCC). This paper studies the following fundamental question in tolerant testing: given a Boolean function on the hypercube, test whether it is \(\varepsilon’\)-close or \(\varepsilon\)-far from monotone. It is shown that there is a non-adaptive polynomial query algorithm which can solve this problem for \(\varepsilon’ = \varepsilon/\tilde \Theta(\sqrt{n})\), implying an algorithm which can approximate distance to monotonicity up to a multiplicative \(\tilde O(\sqrt{n})\) (addressing an open problem by Sesh). They also give a lower bound demonstrating that improving this approximating factor significantly would necessitate exponentially-many queries. Interestingly, this is proved for the (easier) erasure-resilient model, and also implies lower bounds for tolerant testing of unateness and juntas.

Testing Properties of Multiple Distributions with Few Samples, by Maryam Aliakbarpour and Sandeep Silwal (arXiv). This paper introduces a new model for distribution testing. Generally, we are given \(n\) samples from a distribution which is either (say) uniform or far from uniform, and we wish to test which is the case. The authors here study the problem where we are given a single sample from \(n\) different distributions which are either all uniform or far from uniform, and we wish to test which is the case. By additionally assuming a structural condition in the latter case (it is argued that some structural condition is necessary), they give sample-optimal algorithms for testing uniformity, identity, and closeness.

Random Restrictions of High-Dimensional Distributions and Uniformity Testing with Subcube Conditioning, by Clément L. Canonne, Xi Chen, Gautam Kamath, Amit Levi, and Erik Waingarten (ECCC, arXiv). By now, it is well-known that testing uniformity over the \(n\)-dimensional hypercube requires \(\Omega(2^{n/2})\) samples — the curse of dimensionality quickly makes this problem intractable. One option is to assume that the distribution is product, which causes the complexity to drop to \(O(\sqrt{n})\). This paper instead assumes one has stronger access to the distribution — namely, one can receive samples conditioned on being from some subcube of the domain. With this, the paper shows that the complexity drops to the near-optimal \(\tilde O(\sqrt{n})\) samples. The related problem of testing whether a distribution is either uniform or has large mean is also considered.

Property Testing of LP-Type Problems, by Rogers Epstein, Sandeep Silwal (arXiv). An LP-Type problem (also known as a generalized linear program) is an optimization problem sharing some properties with linear programs. More formally, they consist of a set of constraints \(S\) and a function \(\varphi\) which maps subsets of \(S\) to some totally ordered set, such that \(\varphi\) possesses monotonicity and locality properties. This paper considers the problem of testing whether \(\varphi(S) \leq k\), or whether at least an \(\varepsilon\)-fraction of constraints in \(S\) must be removed for \(\varphi(S) \leq k\) to hold. This paper gives an algorithm with query complexity \(O(\delta/\varepsilon)\), where \(\delta\) is a dimension measure of the problem. This is applied to testing problems for linear separability, smallest enclosing ball, smallest intersecting ball, and smallest-volume annulus. The authors also provide lower bounds for some of these problems.

Near-Optimal Algorithm for Distribution-Free Junta Testing, by Xiaojin Zhang (arXiv). This paper presents an (adaptive) algorithm for testing juntas, in the distribution-free model with one-sided error. The query complexity is \(\tilde O(k/\varepsilon)\), which is nearly optimal. Algorithms with this sample complexity were previously known under the uniform distribution, or with two-sided error, but this is the first paper to achieve it in the distribution-free model with one-sided error.

by Gautam Kamath at December 06, 2019 04:41 AM UTC

By Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever

This is a lightly edited and expanded version of the following post on the OpenAI blog about the following paper. While I usually don’t advertise my own papers on this blog, I thought this might be of interest to theorists, and a good follow up to my prior post. I promise not to make a habit out of it. –Boaz

TL;DR: Our paper shows that double descent occurs in conventional modern deep learning settings: visual classification in the presence of label noise (CIFAR 10, CIFAR 100) and machine translation (IWSLT’14 and WMT’14). As we increase the number of parameters in a neural network, initially the test error decreases, then increases, and then, just as the model is able to fit the train set, it undergoes a second descent, again decreasing as the number of parameters increases. This behavior also extends over train epochs, where a single model undergoes double-descent in test error over the course of training. Surprisingly (at least to us!), we show these phenomena can lead to a regime where “more data hurts”—training a deep network on a larger train set actually performs worse.


Open a statistics textbook and you are likely to see warnings against the danger of “overfitting”: If you are trying to find a good classifier or regressor for a given set of labeled examples, you would be well-advised to steer clear of having so many parameters in your model that you are able to completely fit the training data, because you risk not generalizing to new data.

The canonical example for this is polynomial regression. Suppose that we get n samples of the form (x, p(x)+noise) where x is a real number and p(x) is a cubic (i.e. degree 3) polynomial. If we try to fit the samples with a degree-1 polynomial (a linear function), then we would get many points wrong. If we fit with just the right degree, we get a very good predictor. However, as the degree grows further, the fit gets worse, until the degree is large enough to fit all the noisy training points, at which point the regressor is terrible, as shown in this figure:

It seems that the higher the degree, the worse things are. But what happens if we go even higher? It seems like a crazy idea: why would we increase the degree beyond the number of samples? But it corresponds to the practice of having many more parameters than training samples in modern deep learning. Just like in deep learning, when the degree is larger than the number of samples, there is more than one polynomial that fits the data, but we choose a specific one: the one found by running gradient descent.

Here is what happens if we do this for degree 1000, fitting a polynomial using gradient descent (see this notebook):

We still fit all the training points, but now we do so in a more controlled way that actually tracks the ground truth quite closely. We see that, despite what we learn in statistics textbooks, sometimes overfitting is not that bad, as long as you go “all in” rather than “barely overfitting” the data. That is, overfitting doesn’t hurt us if we take the number of parameters to be much larger than what is needed to just fit the training set; in fact, as we see in deep learning, larger models are often better.
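A scaled-down sketch of this kind of experiment (smaller degree and sample count than the notebook's). Since gradient descent started at zero converges to the minimum-norm fit for this overparameterized least-squares problem, we compute that fit directly via the pseudoinverse:

```python
import numpy as np
from numpy.polynomial import legendre

# Noisy samples of a cubic, fit with a degree-100 polynomial (degree >> n).
rng = np.random.default_rng(0)
n, degree = 20, 100
x = np.sort(rng.uniform(-1, 1, n))
y = x ** 3 - x + 0.05 * rng.standard_normal(n)

# Legendre features for numerical stability; pinv gives the min-norm
# interpolant, the solution gradient descent from zero converges to.
features = legendre.legvander(x, degree)          # n x (degree + 1)
coef = np.linalg.pinv(features) @ y

train_mse = np.mean((features @ coef - y) ** 2)
grid = np.linspace(x.min(), x.max(), 200)
test_mse = np.mean((legendre.legvander(grid, degree) @ coef
                    - (grid ** 3 - grid)) ** 2)
print(train_mse, test_mse)  # train error ~ 0: the noisy points are interpolated
```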

The above is not a novel observation. Belkin et al. called this phenomenon “double descent,” and it goes back to even earlier works. In this new paper we (Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever) extend the prior works and report on a variety of experiments showing that double descent is widely prevalent across several modern deep neural networks and several natural tasks, such as image recognition (on the CIFAR 10 and CIFAR 100 datasets) and language translation (on the IWSLT’14 and WMT’14 datasets). As we increase the number of parameters in a neural network, initially the test error decreases, then increases, and then, just as the model becomes able to fit the train set, it undergoes a second descent, again decreasing as the number of parameters grows. Moreover, double descent extends beyond the number of parameters to other measures of “complexity,” such as the number of training epochs.

The take-away from our work (and the prior works it builds on) is that neither the classical statisticians’ conventional wisdom that “too large models are worse” nor the modern ML paradigm that “bigger models are always better” always holds. Rather, it all depends on whether you are on the first or the second descent. Furthermore, these insights allow us to generate natural settings in which even the age-old adage that “more data is always better” is violated!

In the rest of this blog post we present a few sample results from this recent paper.

Model-wise Double Descent

We observed many cases in which, just like in the polynomial interpolation example above, the test error undergoes a “double descent” as we increase the complexity of the model. The figure below demonstrates one such example: we plot the test error as a function of the complexity of the model for ResNet18 networks. The complexity of the model is the width of the layers, and the dataset is CIFAR10 with 15% label noise. Notice that the peak in test error occurs around the “interpolation threshold”: when the models are just barely large enough to fit the train set. In all cases we’ve observed, changes which affect the interpolation threshold (such as changing the optimization algorithm, changing the number of train samples, or varying the amount of label noise) also affect the location of the test error peak correspondingly.

We found that the double descent phenomenon is most prominent in settings with added label noise; without it, the peak is much smaller and easy to miss. Adding label noise amplifies this general behavior and allows us to investigate it easily.

Sample-Wise Nonmonotonicity

Using the model-wise double descent phenomenon we can obtain examples where training on more data actually hurts. To see this, let’s look at the effect of increasing the number of train samples on the test error vs. model size graph. The below plot shows Transformers trained on a language-translation task (with no added label noise):

On the one hand, (as expected) increasing the number of samples generally shifts the curve downwards towards lower test error. On the other hand, it also shifts the curve to the right: since more samples require larger models to fit, the interpolation threshold (and hence, the peak in test error) shifts to the right. For intermediate model sizes, these two effects combine, and we see that training on 4.5x more samples actually hurts test performance.

Epoch-Wise Double Descent

There is a regime where training longer reverses overfitting. Let’s look closer at the experiment from the “Model-wise Double Descent” section, and plot Test Error as a function of both model-size and number of optimization steps. In the plot below to the right, each column tracks the Test Error of a given model over the course of training. The top horizontal dotted-line corresponds to the double-descent of the first figure. But we can also see that for a fixed large model, as training proceeds test error goes down, then up and down again—we call this phenomenon “epoch-wise double-descent.”

Moreover, if we plot the Train error of the same models and the corresponding interpolation contour (dotted line) we see that it exactly matches the ridge of high test error (on the right).

In general, the peak of test error appears systematically when models are just barely able to fit the train set.

Our intuition is that for models at the interpolation threshold, there is effectively only one model that fits the train data, and forcing it to fit even slightly noisy or mis-specified labels destroys its global structure. That is, there are no "good models" which both interpolate the train set and perform well on the test set. However, in the over-parameterized regime, there are many models that fit the train set, and among them are "good models" which both interpolate the train set and perform well on the distribution. Moreover, the implicit bias of SGD leads it to such "good" models, for reasons we don't yet understand.

The above intuition is theoretically justified for linear models, via a series of recent works including [Hastie et al.] and [Mei-Montanari]. We leave fully understanding the mechanisms behind double descent in deep neural networks as an important open question.

Commentary: Experiments for Theory

The experiments above are especially interesting (in our opinion) because of how they can inform ML theory: any theory of ML must be consistent with “double descent.” In particular, one ambitious hope for what it means to “theoretically explain ML” is to prove a theorem of the form:

“If the distribution satisfies property X and architecture/initialization satisfies property Y, then SGD trained on ‘n’ samples, for T steps, will have small test error with high probability”

For values of X, Y, n, T, “small” and “high” that are used in practice.

However, these experiments show that these properties are likely more subtle than we may have hoped for, and must be non-monotonic in certain natural parameters.

This rules out even certain natural “conditional conjectures” that we may have hoped for, for example the conjecture that

“If SGD on a width W network works for learning from ‘n’ samples from distribution D, then SGD on a width W+1 network will work at least as well”

Or the conjecture

“If SGD on a certain network and distribution works for learning with ‘n’ samples, then it will work at least as well with n+1 samples”

It also appears to conflict with a “2-phase” view of the trajectory of SGD, as an initial “learning phase” and then an “overfitting phase” — in particular, because the overfitting is sometimes reversed (at least, as measured by test error) by further training.

Finally, the fact that these phenomena are not specific to neural networks, but appear to hold fairly universally for natural learning methods (linear/kernel regression, decision trees, random features) gives us hope that there is a deeper phenomenon at work, and we are yet to find the right abstraction.

We especially thank Mikhail Belkin and Christopher Olah for helpful discussions throughout this work. The polynomial example is inspired in part by experiments in [Muthukumar et al.].

by preetum at December 05, 2019 05:00 PM UTC

The Computer Science Department at UT Austin invites applications for a Postdoctoral Fellow in theoretical computer science for the 2020-21 academic year. The Fellow will work with Dana Moshkovitz and David Zuckerman on pseudorandomness and computational complexity. Review of applicants will begin on January 15, but applications will be accepted until the position is filled.


by shacharlovett at December 05, 2019 02:24 PM UTC

Applications are invited for a postdoctoral position in algorithms in the School of Informatics at the University of Edinburgh. The position is for a period of up to 3 years, and can start at any time in 2020.


by shacharlovett at December 05, 2019 10:53 AM UTC


A central question in the study of interactive proofs is the relationship between private-coin proofs, where the verifier is allowed to hide its randomness from the prover, and public-coin proofs, where the verifier's random coins are sent to the prover. In this work, we study transformations from private-coin proofs to public-coin proofs, which preserve (up to polynomial factors) the running time of both the prover and the verifier. We re-consider this question in light of the emergence of doubly-efficient interactive proofs, where the honest prover is required to run in polynomial time. Can every private-coin doubly-efficient interactive proof be transformed into a public-coin doubly-efficient proof? We show that, assuming one-way functions exist, there is no black-box private-coin to public-coin transformation for doubly-efficient interactive proofs. Our main result is a (loose) converse: every doubly-efficient proof has a function such that if this function is invertible, then the proof can be efficiently transformed. Turning our attention back to classic interactive proofs, where the honest prover need not run in polynomial time, we show a similar result for constant-round general interactive proofs. Finally, we demonstrate a condition that suffices for efficiency preserving emulation of any protocol (even if the number of rounds is polynomial and the honest prover is unbounded). For a given interactive proof, if for every transcript prefix it is possible to efficiently approximate the number of consistent verifier random coins, then there is an efficiency-preserving transformation.

at December 05, 2019 01:56 AM UTC

We show that there are degree-$d$ polynomials over $\mathbb{F}_{2}$ with correlation $\Omega(d/\sqrt{n})$ with the majority function on $n$ bits. This matches the $O(d/\sqrt{n})$ bound by Smolensky.

at December 04, 2019 04:53 PM UTC

The Algorithms and Randomness Center (ARC) at Georgia Institute of Technology is seeking multiple postdoctoral fellows starting Fall 2020. ARC has faculty associated with multiple departments including CS, Math, ISyE and EE. The candidate will work on any aspect of algorithms, optimization, broadly interpreted, and collaborate with ARC faculty.


by shacharlovett at December 04, 2019 02:57 PM UTC

Bopuifs fodszqujpo qspcmfn.

“Unsung Entrepreneur” source

Adolph Ochs was the owner of the New York Times. In 1897 he created the paper’s slogan, “All the News That’s Fit to Print.” We at GLL would like some suggestions on our own slogan. Send us your ideas. Please no suggestion of “All the news that fits we print,” as that is already out there.

Today Ken and I wish to comment on a recent article in the NYT that was on end-to-end encryption.

The article leads by saying:

A Justice Department official hinted on Monday that a yearslong fight over encrypted communications could become part of a sweeping investigation of big tech companies.

Of course, end-to-end encryption scrambles messages so that only the sender and receiver can decode the message. Other methods are weaker: some only encrypt messages as they enter part of the network. This means that one must trust the network to keep your message secret. Thus the end-to-end method reduces the number of parties that one must trust.

In 1912, Ochs was a party to encryption that was literally end-to-end on the globe. The New York Times had bought exclusive American rights to report Roald Amundsen’s expedition to the South Pole. When Amundsen returned to Hobart, Tasmania, he sent a coded cable to his brother Leon who was acting as conduit to the Times and the London Daily Chronicle. The brother pronounced the coast clear for Amundsen to communicate directly to the papers. The stories were still quickly plagiarized once the first one appeared in the Times, and Ochs had to defend his rights with lawsuits.

The Usual Problem

There is an ongoing interest in using end-to-end encryption to protect more and more of our messages. And this interest leads to several hard problems.

The main one addressed by the NYT article is: Does this type of encryption protect bad actors? Many believe that encryption makes it impossible to track criminals. Many in law enforcement, for example, wish to have the ability to access any messages, at least on a court order. Some countries are likely to make this the law—that is, they will insist that they always can access any message. A followup NYT article described debates within Interpol about these matters.

Another Problem

The above problem is not what we wish to talk about today. We want to raise another problem.

How do we know that our messages are being properly encrypted?

We could check that our app is in end-to-end mode. The app will say “yes”. The problem is that this does not prove anything. The deeper question is: how do we know that messages are correctly encrypted? Indeed.

Suppose that we are told that the message {M} has been sent to another person as the encrypted message {E(M)}. How do we know that this has been done? Several issues come to mind:

• The app could lie. It could, for example, say it is encrypting your message when it did not.

• The app could mislead. It could send an encrypted message and also send the clear message to whoever it wishes.

• The app could be wrong. It could think the message was properly encrypted when, for example, the key is a weak one.

• The app's method could be flawed. The method used might be vulnerable to known or unknown attacks.

Authenticated encryption seems to cover only part of the need. It can confirm the identity of the sender and that the ciphertext has not been tampered with. This is, however, a far cry from verifying that the encryption itself is proper and free of holes that could make it easy to figure out. Our point is also aside from problems with particular end-to-end implementations such as those considered in this 2017 paper.
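The gap can be seen concretely in a toy sketch (standard-library only, and the "cipher" is deliberately broken): an encrypt-then-MAC pipeline authenticates whatever bytes the cipher hands it, so the tag verifies perfectly even when the "encryption" step did nothing at all.

```python
import hmac
import hashlib

def mac(key, ciphertext):
    # encrypt-then-MAC: authenticate whatever bytes the cipher produced
    return hmac.new(key, ciphertext, hashlib.sha256).digest()

def verify(key, ciphertext, tag):
    # constant-time comparison to avoid timing side channels
    return hmac.compare_digest(tag, mac(key, ciphertext))

key = b"shared-mac-key"
plaintext = b"attack at dawn"
ciphertext = plaintext            # a broken "cipher" that does nothing at all
tag = mac(key, ciphertext)

print(verify(key, ciphertext, tag))  # True: authentication is intact...
print(ciphertext)                    # ...yet the message travels in the clear
```

The receiver can confirm the sender and detect tampering, but nothing in the verification step certifies that the encryption itself was sound.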

Open Problems

Bopuifs fodszqujpo qspcmfn was encrypted with the simple key

\displaystyle  a \rightarrow b, b \rightarrow c, c \rightarrow d, \dots

The point of this silly example is that it might have been encrypted by a harder method, but it was only encrypted by a trivial substitution method. Nevertheless, Google could not figure it out:
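For completeness, the trivial substitution above can be undone in a few lines by shifting every letter back one place:

```python
def shift_back(text, k=1):
    # decrypt a substitution mapping a -> b, b -> c, ... by shifting
    # each letter back k places; non-letters pass through unchanged
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base - k) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

print(shift_back("Bopuifs fodszqujpo qspcmfn."))  # -> Another encryption problem.
```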

by RJLipton+KWRegan at December 04, 2019 03:03 AM UTC

On Dec 8, 1919 Julia Robinson was born, so today is close to her 100th birthday (she passed away at
the age of 65 on July 30, 1985).

So, time for some facts about her:

1) She got her PhD under Tarski, where she proved the undecidability of the theory of the rationals.

2) She is probably best known for her work on Hilbert's tenth problem (which we call H10)

In today's terminology, H10 would be stated as:

Find an algorithm that will, given p in Z[x_1,...,x_n], determine whether it has an integer solution.

Hilbert posed it to inspire deep research in Number Theory. There are some cases that are
solvable (the topic of a later blog post) but the general problem is undecidable. This is not what Hilbert was aiming for. I wonder if he would be happy with the resolution.

The Davis-Putnam-Robinson paper showed that the decision problem for exponential diophantine equations was undecidable. It was published in 1961. The paper is here.  Martin Davis predicted that the proof that H10 was undecidable would be by a young Russian mathematician. He was proven correct when Yuri Matiyasevich supplied the missing piece needed to complete the proof.  

I often read `H10 was resolved by Davis-Putnam-Robinson and Matiyasevich' or sometimes they put all four names in alphabetical order. I like that--- it really was a joint effort.

3) She was inducted (a proof by induction?) into the National Academy of Sciences in 1975.

4) She was elected president of the American Math Society in 1982.

5) She received a MacArthur Fellowship in 1985 (often called the MacArthur genius award).
At the time it was worth $60,000. It's now $625,000.

6) She also did work in Game Theory. Her paper An Iterative Method of Solving a Game, which is
here, is a proof from the book according to Paul Goldberg's comment on this post.

7) The Julia Robinson Math Festival is named in her honor (hmmm- is that a tautology?) Its purpose is to inspire K-12 students to get involved in math. For more on it see here.

8) (ADDED LATER) Commenter David Williamson pointed out that Julia Robinson did work on the transportation problem. See his comment and his pointer to the paper.

(ADDED LATER) When I hear Julia Robinson I think Hilbert's 10th problem.  I suspect many of you do the same. However, looking at items 6 and 8 above, one realizes that she did research in non-logic branches of math as well.

by GASARCH at December 02, 2019 11:01 PM UTC

Applications are invited for a postdoc position hosted by Hsin-Hao Su at Boston College. Areas of specific interest include, but are not limited to, distributed graph algorithms, local algorithms, dynamic graph algorithms, gossip algorithms, and MPC algorithms. The position can start at any time in 2020 after February. The length of the position is up to two years.


by shacharlovett at December 02, 2019 10:37 PM UTC

This job ad could be of interest to you or to someone you know. Feel free to spread it as you see fit. Thanks!
The Department of Computer Science at Reykjavik University invites applications for several full-time faculty positions.

We are looking for energetic, highly qualified academics who are eager to develop their own research programs, strengthen existing research within the department and build bridges with industry. Of particular interest are candidates in the areas of machine learning, data science, computer security and software systems, broadly construed, but exceptionally qualified candidates from all areas of computer science are encouraged to apply.

Candidates should have a proven international research record that is commensurate with the level of the position for which they apply. The successful applicants will play a full part in the teaching and administrative activities of the department; in particular, they will teach courses and supervise students at both graduate and undergraduate level. Applicants having a demonstrated history of excellence in teaching are preferred. A PhD in computer science or a related field is required.

Salary and rank are commensurate with experience. An appointment at assistant-professor level is permanent track, with the expectation that a successful candidate will qualify for promotion to associate professor within six years.

The positions are open until filled, with the earliest available starting date in August 2020. Later starting dates can be negotiated.

The review of the applications will begin on Monday, 17 February 2020, and will continue until the positions are filled.

See for further information on how to apply for the position and on the required documents.

About the department, Reykjavik University and Iceland

The Department of Computer Science at Reykjavik University has about 650 full-time-equivalent students and 22 faculty members. It provides an excellent working environment that encourages and supports independence in academic endeavours, and in which a motivated academic can have impact at all levels. The department offers undergraduate and graduate programs in computer science and software engineering, as well as a combined-degree undergraduate program in discrete mathematics and computer science. The doctoral program within the department received its ministerial accreditation in 2009 and has been active ever since.

The department is home to several research centres producing high-quality collaborative research in areas such as artificial intelligence, financial technology, language technology, software systems, and theoretical computer science, among others; for more information on those research centres, see

On the Times Higher Education rankings for 2020, Reykjavík University is ranked in first place along with six other universities for the average number of citations per faculty. Overall, according to that list, RU is ranked among the 300 best universities world-wide, 52nd out of all universities established fewer than 50 years ago, 14th among universities with fewer than 5000 students, and first among Icelandic universities.

Iceland is well known for its breathtaking natural beauty, with volcanoes, geysers, hot springs, lava fields and glaciers offering a dramatic landscape. It is consistently ranked as one of the best places in the world to live. It offers a high quality of life, is one of the safest places in the world, has high gender equality, and strong health-care and social-support systems. It is in fourth position in the 2019 UN World Happiness Report, which ranks the world's countries by their happiness level.

For further information about the Department of Computer Science at Reykjavik University and its activities, see

by Luca Aceto at December 02, 2019 10:35 PM UTC

Applications are invited for a postdoctoral position in theoretical computer science in the School of Computer and Communication Sciences at EPFL. The position is for a period of up to two years, and comes with a competitive salary as well as a generous travel allowance.


by shacharlovett at December 02, 2019 08:34 PM UTC

The College of Information and Computer Sciences at the University of Massachusetts Amherst invites applications for tenure-track faculty in Theoretical Computer Science at the Associate and Assistant Professor levels. Exceptional candidates at other ranks may be considered.


by shacharlovett at December 02, 2019 06:13 PM UTC

  1. Two weeks ago, I blogged about the claim of Nathan Keller and Ohad Klein to have proven the Aaronson-Ambainis Conjecture. Alas, Keller and Klein tell me that they’ve now withdrawn their preprint (though it may take another day for that to show up on the arXiv), because of what looks for now like a fatal flaw, in Lemma 5.3, discovered by Paata Ivanishvili. (My own embarrassment over having missed this flaw is slightly mitigated by most of the experts in discrete Fourier analysis having missed it as well!) Keller and Klein are now working to fix the flaw, and I wholeheartedly wish them success.
  2. In unrelated news, I was saddened to read that Virgil Griffith—cryptocurrency researcher, former Integrated Information Theory researcher, and onetime contributor to Shtetl-Optimized—was arrested at LAX for having traveled to North Korea to teach the DPRK about cryptocurrency, against the admonitions of the US State Department. I didn’t know Virgil well, but I did meet him in person at least once, and I liked his essays for this blog about how, after spending years studying IIT under Giulio Tononi himself, he became disillusioned with many aspects of it and evolved to a position not far from mine (though not identical either).
    Personally, I despise the North Korean regime for the obvious reasons—I regard it as not merely evil, but cartoonishly so—and I’m mystified by Virgil’s apparently sincere belief that he could bring peace between the North and South by traveling to North Korea to give a lecture about blockchain. Yet, however world-historically naïve he may have been, his intentions appear to have been good. More pointedly—and here I’m asking not in a legal sense but in a human one—if giving aid and comfort to the DPRK is treasonous, then isn’t the current occupant of the Oval Office a million times guiltier of that particular treason (to say nothing of others)? It’s like, what does “treason” even mean anymore? In any case, I hope some plea deal or other arrangement can be worked out that won’t end Virgil’s productive career.

by Scott at December 02, 2019 03:31 PM UTC

Ariel Yadin’s homepage contains great lecture notes on harmonic functions on groups and on various other topics.

I have a lot of things to discuss and to report; exciting developments in the analysis of Boolean functions; much to report on algebraic, geometric and probabilistic combinatorics following our Oberwolfach summer meeting; much to tell about our Kazhdan seminar on quantum computation and symplectic geometry; a lot of exciting math and TCS activities in Israel; exciting things that Avi Wigderson told me on non-commutative optimization and moment maps; and, of course, last but not least, the exciting Google supremacy demonstration that I most recently wrote about in my post Gil’s Collegial Quantum Supremacy Skepticism FAQ. In particular, the unbelievable local-to-global fidelity Formula (77), and (NEW) a poem by Peter Shor for quantum computer skeptics. More poems are most welcome!

With all these excitements, plans, and blog duties, it looks like this is the right time to take a pause for a Test Your Intuition post. (Based on chats with Asaf Nachmias, Jonathan Hermon and Itai Benjamini.)

Approaching the uniform distribution on the discrete cube with a lazy simple random walk.

Consider the discrete cube as a graph: the vertices are all 0-1 vectors of length n; two vertices are adjacent if they differ in exactly one coordinate.

A (lazy) simple random walk is described as follows: you start at the all-0 vertex. At each step, when you are at vertex v, you stay where you are with probability 1/2 and, with probability 1/2, move to a neighbor of v chosen uniformly at random.

Your position after t steps is a random variable describing a probability distribution D_t on the vertices of the discrete cube. Now, let's fix once and for all the value of \epsilon to be 0.1.

Test your intuition: How many steps T(n) does it take until D_t is \epsilon-close to the uniform distribution U in total variation distance? (The total variation distance is 1/2 the L^1 distance.)
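For modest n one can compute T(n) exactly before guessing the asymptotics, by exploiting the symmetry of the cube: D_t depends only on the Hamming weight of the current vertex, so it suffices to track a distribution on {0, ..., n}. A sketch:

```python
from math import comb

def tv_from_uniform(n, t):
    # p[k] = probability the lazy walk sits at Hamming weight k after t steps
    p = [0.0] * (n + 1)
    p[0] = 1.0
    for _ in range(t):
        q = [0.0] * (n + 1)
        for k in range(n + 1):
            q[k] += 0.5 * p[k]                        # hold with probability 1/2
            if k < n:
                q[k + 1] += 0.5 * p[k] * (n - k) / n  # flip a 0-coordinate
            if k > 0:
                q[k - 1] += 0.5 * p[k] * k / n        # flip a 1-coordinate
        p = q
    # uniform distribution puts mass comb(n, k)/2**n on each weight class k
    return 0.5 * sum(abs(p[k] - comb(n, k) / 2 ** n) for k in range(n + 1))

def T(n, eps=0.1):
    # TV distance to stationarity is non-increasing in t, so the first
    # crossing below eps is well-defined
    t = 0
    while tv_from_uniform(n, t) > eps:
        t += 1
    return t

for n in [4, 8, 16]:
    print(n, T(n))
```

Running this for a few values of n lets you check your guess empirically before peeking at the answer.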

Other measures of proximity

We can also ask: How many steps M(n) does it take until D_t(x) is close to the uniform distribution for every x? Namely, for every x,

(1-\epsilon)/2^n \le D_t(x) \le (1+\epsilon)/2^n

For this question there is a simple analysis based on the coupon collector problem.

We can also consider intermediate measures of proximity, like the entropy:

How many steps H(n) does it take until the entropy of D_t is \epsilon-close to n, the entropy of the uniform distribution?

Let me try now to test your more detailed intuition: For the public opinion poll below we say that X behaves like Y if their ratio tends to one as n tends to infinity, and that X is really smaller than Y if their ratio X/Y tends to a limit smaller than 1.

NEW: Share your knowledge

Let’s try something new: “Share your knowledge (SYK):” What other distances between probability distributions do you recommend? Tell us about them!

A theorem by Ajtai-Komlos-Szemeredi and a problem by Itai Benjamini

Ajtai, Komlos and Szemeredi proved that when you choose every edge of the discrete n-cube with probability greater than 1/n a giant component emerges! Now, choose every edge with probability 2/n  and start a simple random walk from a vertex of the giant component. Itai Benjamini conjectured that it will take roughly n^2 steps to approximately reach the stationary distribution. This seems very difficult.


by Gil Kalai at December 02, 2019 06:15 AM UTC

We show exponential lower bounds on resolution proof length for pigeonhole principle (PHP) formulas and perfect matching formulas over highly unbalanced, sparse expander graphs, thus answering the challenge to establish strong lower bounds in the regime between balanced constant-degree expanders as in [Ben-Sasson and Wigderson '01] and highly unbalanced, dense graphs as in [Raz '04] and [Razborov '03, '04]. We obtain our results by revisiting Razborov's pseudo-width method for PHP formulas over dense graphs and extending it to sparse graphs. This further demonstrates the power of the pseudo-width method, and we believe it could potentially be useful for attacking also other longstanding open problems for resolution and other proof systems.

at December 02, 2019 01:04 AM UTC

June 15-19, 2020 Imperial College of London (UK) Registration deadline: March 31, 2020 With this workshop we would like to promote the interaction between the following five fields: Berkovich spaces Tropical geometry p-adic differential equations Arithmetic D-modules and representations of p-adic Lie groups Arithmetic applications of p-adic local systems While the first two are … Continue reading Tropical Geometry, Berkovich Spaces, Arithmetic D-Modules and p-adic Local Systems

by shacharlovett at December 01, 2019 04:59 PM UTC

I’d like to ramble about another program that our group is running. We are searching for unusually promising high-school-age students and mentor each in a serious research project. We started doing this already before our REU program. However, I never advertised this program before, because I felt that we were still learning how to do […]

by Adam Sheffer at December 01, 2019 03:27 PM UTC

We revisit the fundamental problem of determining seed length lower bounds for strong extractors and natural variants thereof. These variants stem from a ``change in quantifiers'' over the seeds of the extractor: While a strong extractor requires that the average output bias (over all seeds) is small for all input sources with sufficient min-entropy, a somewhere extractor only requires that there exists a seed whose output bias is small. More generally, we study what we call probable extractors, which on input a source with sufficient min-entropy guarantee that a large enough fraction of seeds have small enough associated output bias. Such extractors have played a key role in many constructions of pseudorandom objects, though they are often defined implicitly and have not been studied extensively. Prior known techniques fail to yield good seed length lower bounds when applied to the variants above. Our novel approach yields significantly improved lower bounds for somewhere and probable extractors. To complement this, we construct a somewhere extractor that implies our lower bound for such functions is tight in the high min-entropy regime. Surprisingly, this means that a random function is far from an optimal somewhere extractor in this regime. The techniques that we develop also yield an alternative, simpler proof of the celebrated optimal lower bound for strong extractors originally due to Radhakrishnan and Ta-Shma (SIAM J. Discrete Math., 2000).

at December 01, 2019 02:42 PM UTC

by David Eppstein at November 30, 2019 10:10 AM UTC

Bitcoin’s underlying consensus protocol, now known as Nakamoto consensus, is an extremely simple and elegant solution to the Byzantine consensus problem. One may expect this simple protocol to come with a simple security proof. But that turns out not to be the case. The Bitcoin white paper did not provide...

at November 29, 2019 09:05 PM UTC