Theory of Computing Blog Aggregator

We study from the proof complexity perspective the (informal) proof search problem: Is there an optimal way to search for propositional proofs? We note that for any fixed proof system there exists a time-optimal proof search algorithm. Using classical proof complexity results about reflection principles we prove that a time-optimal proof search algorithm exists w.r.t. all proof systems iff a p-optimal proof system exists. To characterize precisely the time proof search algorithms need for individual formulas we introduce a new proof complexity measure based on algorithmic information concepts. In particular, to a proof system P we attach {\bf information-efficiency function} $i_P(\tau)$ assigning to a tautology a natural number, and we show that: - $i_P(\tau)$ characterizes time any $P$-proof search algorithm has to use on $\tau$ and that for a fixed $P$ there is such an information-optimal algorithm, - a proof system is information-efficiency optimal iff it is p-optimal, - for non-automatizable systems $P$ there are formulas $\tau$ with short proofs but having large information measure $i_P(\tau)$. We isolate and motivate the problem to establish {\em unconditional} super-logarithmic lower bounds for $i_P(\tau)$ where no super-polynomial size lower bounds are known. We also point out connections of the new measure with some topics in proof complexity other than proof search.

at April 13, 2021 07:34 AM UTC

Prediction is very difficult, especially if it’s about the future—Niels Bohr

Boston Globe “Uncertainty” source

Lisa Randall is a professor of theoretical physics at Harvard. Her research has touched on many of the basic questions of modern physics: supersymmetry, Standard Model observables, cosmological inflation, baryogenesis, grand unified theories, and general relativity. She has also written popular books about her work and science in general. Thus she has a handle on aspects of science that overlap my expertise—not to mention those of her sister Dana Randall, whom I have known as a colleague for many years.

Today, Ken and I thought we would talk about recent developments in particle physics, and their connection to two topics dear to us.

Randall’s most recent popular book is Dark Matter and the Dinosaurs. The idea she advances is that the periodic extinctions in Earth’s history may have been caused when the solar system passes through a plane of dark matter within our galaxy. But dark matter and also dark energy have come under increasing recent doubt, even from their original formulator. Maybe Niels Bohr’s quote should also say:

Prediction is very difficult, especially if it’s about the past.

Randall’s previous book, Knocking on Heaven’s Door, is most relevant to this post. The 1973 Bob Dylan song title it pinches describes the feeling of doing frontier physical science. Insofar as her own work is mostly theoretical, much of it connects to feelings we have in computer science—especially complexity lower bounds where the door mostly feels slammed shut.

But the book is also about the practice of experimental science—not only how to gather knowledge but when and how we can have confidence in it. Its long middle part is titled, “Machines, Measurements, and Probability.” All three elements are foremost in considering a new development that involves two measurements taken 20 years apart.


Last Tuesday’s New York Times highlighted a potential discovery in particle physics. It was in their Tuesday science section.

The result is an experimental discovery that could show that the current model of matter is wrong.

“This is our Mars rover landing moment,” said Chris Polly, a physicist at the Fermi National Accelerator Laboratory, or Fermilab, in Batavia, Ill., who has been working toward this finding for most of his career.

Indeed. It is a Mars landing moment. They both involved many people, lots of exotic machinery, lots of money, many years. Say three billion dollars or so for Mars. Say nearly the same amount for muons—the annual budget for Fermilab is over one half billion dollars. It was certainly enough to reassemble and upgrade a huge accelerator ring that was first used at Brookhaven National Lab on Long Island in 2001:

NY Times src

The study used the muon to probe the Standard Model of physics. Muons are useful because they are charged like an electron, which helps control them in an accelerator. Yet their mass is roughly 207 times larger than an electron. The same charge helps control their motion and the large mass makes collisions more interesting. As Polly stated in Natalie Wolchover’s story for Quanta:

“[I]f you’re looking for particles that could explain the missing mass of the universe—dark matter—or you’re looking for [supersymmetry], that’s where the muon has a unique role.”

Theory and Jealousy

Computer science theory is so different from high end physics. We are closer to the type of research that Randall does. We involve few people, no exotic machinery, and small amounts of money. Maybe the closest attribute we have to high-end research is we also take years and years.

Perhaps we are also jealous of high-end physics. Not just for money, but for the ability of particle physicists to get announcements into the New York Times. Polly said the following about the ending day of the muon experiment two decades ago:

When we revealed the results, people from all over the world flew in to visit the lab. These experiments take decades to build and analyze, so you don’t get to go to very many of these events. We did a little “Drumroll, please” and then had the postdoc managing the spreadsheet hit the button to show it on the projector. Lo and behold, you could see that there was still a three-sigma discrepancy!

At the time he was a graduate assistant assigned to machinery for measuring particle energies. He had fixed a problem where someone had touched a component with bare hands and thereby ruined its sheathing. All such problems were meant to be ironed-out by the drum-roll event. But all this raises two further interesting issues that connect the muon results with issues we think about in computer science. Let’s look at them next.

Three to Four Sigma

In an experimental science one must be aware that results are not exact. They are samples from some random process. Flip a coin 10 times in a row. If they all come up heads what does that mean? Could be the coin is fair but this happens about one time in a thousand. Or the coin is biased. Or something else.

Flip a muon many times. That is sample some muon experiment. The outcome is from a random process. Some of it comes from properties of the natural processes themselves and others from incidentals of the measurement apparatus. How do we decide if the experiment means what we think it does?

The theory developed by Carl Gauss and others before and after to delineate the normal distribution was largely prompted by analysis of measurement errors to begin with. This yields the “rule of three,” about the percentage of values that naturally lie within an interval estimate in a normal distribution: 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively.

The question is, how to assess cases where the measurement result is well outside these intervals—when can we conclude it is more than a deviation by natural chance? In social sciences a result is “significant” provided it lies outside two-sigma. In particle physics, there is a convention of a five-sigma effect (99.99994% confidence) being required to qualify as a discovery. No Nobel prize for less.

The situation with the muons has an extra factor of repeated measurements—but there have been only two measurements so far:

Composite of src1, src2

The blue line in the left figure is the original Brookhaven measurement; the red is the new one. There is also theoretical uncertainty in the calculation of the Standard Model prediction, and that combines with the measurement error bars to give the sigma baseline. The chart at right normalizes the deviation to parts per billion—the measurements need to be incredibly fine. This scale appears to be about 25% under the current {\sigma}-scale (it shows about {2.8} for Brookhaven compared to its {3.7\sigma} after revised uncertainty) but it is close enough to get the picture.

Although the new Fermilab result by itself deviates slightly less from the Standard Model, it corroborates the earlier measurement. It is not fully independent from it, but the combination is enough to raise the current claimed deviation to about {4.2} sigmas. This is well above social science level but below Nobel level. This is with respect to the probability that the effects are real.

The social significance of {4.2} is that it is above the “{3+}” level where hoped-for anomalies have subsequently disappeared for reasons chalked up to natural chance. This is because physicists around the world do many a hundredfold amount of hopeful measurements. Some measurements get initial “bumps” up just because of the numbers. But {4.2} reduces the natural frequency under 1-in-40,000. This is why reproducing measurements is so important, why the new Fermilab team devoted all the expense and effort. A more independent measurement on other machines could give a higher boost that might get over the {5.0} line. Time to break out the wallets and hammers?

The {4.2} is not, however, beyond the realm of recent experience with apparatus faults and modeling error. On the latter, there is still some doubt about the theoretical prediction for the muon’s magnetic moment. In any event, the muon results are exciting but still below what is required for a true discovery. Time will tell.

The second factor we draw attention to concerns the human “hoping” directly.


In any experimental science one must also be aware that people are not unbiased. Scientists have much invested in the outcome of their experiments. Think jobs, tenure even, funding, and more. So a big physics experiment like the muon one must be careful. They follow standard practice to perform blinded data analysis.

This surprised me. This blinding is a crypto-type protocol, which is something we computer scientists study. The muon team performed a protocol that protected against cheating. Here is how they did it:

In this case, the master clock that keeps track of the muons’ wobble had been set to a rate unknown to the researchers. The figure was sealed in envelopes that were locked in the offices at Fermilab and the University of Washington in Seattle.

In a ceremony on Feb. 25 that was recorded on video and watched around the world on Zoom, Dr. Polly opened the Fermilab envelope and David Hertzog from the University of Washington opened the Seattle envelope. The number inside was entered into a spreadsheet, providing a key to all the data, and the result popped out to a chorus of wows.

“That really led to a really exciting moment, because nobody on the collaboration knew the answer until the same moment,” said Saskia Charity, a Fermilab postdoctoral fellow who has been working remotely from Liverpool, England, during the pandemic.

This mechanism for blinding suggests possible crypto questions. They hid the master clock rate. Can this be modeled as one of our crypto problems? Can we prove some security bounds? If they claim that hiding the rate protects against cheating then they should be able to make this claim precise. The discovery of gravitational waves used a blind injection scheme tailored for that experiment. How can this be generalized?

Open Problems

We have discussed two aspects that involve soft numbers rather than hard machines and hard-shelled particles. Perhaps they are interesting new problems for us? What do you think?

by rjlipton at April 12, 2021 11:11 PM UTC

A secret-sharing scheme allows to distribute a secret $s$ among $n$ parties such that only some predefined ``authorized'' sets of parties can reconstruct the secret, and all other ``unauthorized'' sets learn nothing about $s$. The collection of authorized/unauthorized sets can be captured by a monotone function $f:\{0,1\}^n\rightarrow \{0,1\}$. In this paper, we focus on monotone functions that all their min-terms are sets of size $a$, and on their duals -- monotone functions whose max-terms are of size $b$. We refer to these classes as $(a,n)$-upslices and $(b,n)$-downslices, and note that these natural families correspond to monotone $a$-regular DNFs and monotone $(n-b)$-regular CNFs. We derive the following results. 1. (General downslices) Every downslice can be realized with total share size of $1.5^{n+o(n)}<2^{0.585 n}$. Since every monotone function can be cheaply decomposed into $n$ downslices, we obtain a similar result for general access structures improving the previously known $2^{0.637n+o(n)}$ complexity of Applebaum, Beimel, Nir and Peter (STOC 2020). We also achieve a minor improvement in the exponent of linear secrets sharing schemes. 2. (Random mixture of upslices) Following Beimel and Farras (TCC 2020) who studied the complexity of random DNFs with constant-size terms, we consider the following general distribution $F$ over monotone DNFs: For each width value $a\in [n]$, uniformly sample $k_a$ monotone terms of size $a$, where $k=(k_1,\ldots,k_n)$ is an arbitrary vector of non-negative integers. We show that, except with exponentially small probability, $F$ can be realized with share size of $2^{0.5 n+o(n)}$ and can be linearly realized with an exponent strictly smaller than $2/3$. Our proof also provides a candidate distribution for ``exponentially-hard'' access structure. We use our results to explore connections between several seemingly unrelated questions about the complexity of secret-sharing schemes such as worst-case vs. average-case, linear vs. non-linear and primal vs. dual access structures. We prove that, in at least one of these settings, there is a significant gap in secret-sharing complexity.

at April 12, 2021 05:17 PM UTC

 Alice works at a charity that puts together bag and box lunches for children.

They all wear masks and they are 12 feet apart and very careful, and nobody there has gotten COVID.

Then Alice gets here first COVID shot and says:

I am not going to work for that charity until I have had my second shot and waited  4 weeks so I am immune. 

She is really scared of getting COVID NOW that  she is on the verge of being immune. 

Is that logical? She was not scared before. So does it make sense to be scared now? I see where she is coming from emotionally, but is there a logical argument for her viewpoint? I ask nonrhetorically.

bill g. 

by gasarch ( at April 12, 2021 04:12 AM UTC

Authors: Robert Cardona, Eva Miranda, Daniel Peralta-Salas
Download: PDF
Abstract: In this article we construct a compact Riemannian manifold of high dimension on which the time dependent Euler equations are Turing complete. More precisely, the halting of any Turing machine with a given input is equivalent to a certain global solution of the Euler equations entering a certain open set in the space of divergence-free vector fields. In particular, this implies the undecidability of whether a solution to the Euler equations with an initial datum will reach a certain open set or not in the space of divergence-free fields. This result goes one step further in Tao's programme to study the blow-up problem for the Euler and Navier-Stokes equations using fluid computers. As a remarkable spin-off, our method of proof allows us to give a counterexample to a conjecture of Moore dating back to 1998 on the non-existence of analytic maps on compact manifolds that are Turing complete.

at April 12, 2021 10:37 PM UTC

Authors: Sanjana Dey, Ramesh K. Jallu, Subhas C. Nandy
Download: PDF
Abstract: In this article, we study the Euclidean minimum spanning tree problem in an imprecise setup. The problem is known as the \emph{Minimum Spanning Tree Problem with Neighborhoods} in the literature. We study the problem where the neighborhoods are represented as non-crossing line segments. Given a set ${\cal S}$ of $n$ disjoint line segments in $I\!\!R^2$, the objective is to find a minimum spanning tree (MST) that contains exactly one end-point from each segment in $\cal S$ and the cost of the MST is minimum among $2^n$ possible MSTs. We show that finding such an MST is NP-hard in general, and propose a $2\alpha$-factor approximation algorithm for the same, where $\alpha$ is the approximation factor of the best-known approximation algorithm to compute a minimum cost Steiner tree in an undirected graph with non-negative edge weights. As an implication of our reduction, we can show that the unrestricted version of the problem (i.e., one point must be chosen from each segment such that the cost of MST is as minimum as possible) is also NP-hard. We also propose a parameterized algorithm for the problem based on the "separability" parameter defined for segments.

at April 12, 2021 10:40 PM UTC

Authors: Paul Beaujean, Florian Sikora, Florian Yger
Download: PDF
Abstract: Feature generation is an open topic of investigation in graph machine learning. In this paper, we study the use of graph homomorphism density features as a scalable alternative to homomorphism numbers which retain similar theoretical properties and ability to take into account inductive bias. For this, we propose a high-performance implementation of a simple sampling algorithm which computes additive approximations of homomorphism densities. In the context of graph machine learning, we demonstrate in experiments that simple linear models trained on sample homomorphism densities can achieve performance comparable to graph neural networks on standard graph classification datasets. Finally, we show in experiments on synthetic data that this algorithm scales to very large graphs when implemented with Bloom filters.

at April 12, 2021 10:38 PM UTC

Authors: Amnon Rosenmann
Download: PDF
Abstract: The $k$-cardinality assignment problem asks for finding a maximal (minimal) weight of a matching of cardinality $k$ in a weighted bipartite graph $K_{n,n}$, $k \leq n$. The algorithm of Gassner and Klinz from 2010 for the parametric assignment problem computes in time $O(n^3)$ the set of $k$-cardinality assignments for those integers $k \leq n$ which refer to "essential" terms of a corresponding maxpolynomial. We show here that one can extend this algorithm and compute in a second stage the other "semi-essential" terms in time $O(n^2)$, which results in a time complexity of $O(n^3)$ for the whole sequence of $k=1,...,n$-cardinality assignments. The more there are assignments left to be computed at the second stage the faster the two-stage algorithm runs. In general, however, there is no benefit for this two-stage algorithm on the existing algorithms, e.g. the simpler network flow algorithm based on the successive shortest path algorithm which also computes all the $k$-cardinality assignments in time $O(n^3)$.

at April 12, 2021 10:40 PM UTC

Oded Goldreich is a theoretical computer scientist at the Weizmann Institute in Rehovot, Israel. He’s best known for helping to lay the rigorous foundations of cryptography in the 1980s, through seminal results like the Goldreich-Levin Theorem (every one-way function can be modified to have a hard-core predicate), the Goldreich-Goldwasser-Micali Theorem (every pseudorandom generator can be made into a pseudorandom function), and the Goldreich-Micali-Wigderson protocol for secure multi-party computation. I first met Oded more than 20 years ago, when he lectured at a summer school at the Institute for Advanced Study in Princeton, barefoot and wearing a tank top and what looked like pajama pants. It was a bracing introduction to complexity-theoretic cryptography. Since then, I’ve interacted with Oded from time to time, partly around his firm belief that quantum computing is impossible.

Last month a committee in Israel voted to award Goldreich the Israel Prize (roughly analogous to the US National Medal of Science), for which I’d say Goldreich had been a plausible candidate for decades. But alas, Yoav Gallant, Netanyahu’s Education Minister, then rather non-gallantly blocked the award, solely because he objected to Goldreich’s far-left political views (and apparently because of various statements Goldreich signed, including in support of a boycott of Ariel University, which is in the West Bank). The case went all the way to the Israeli Supreme Court (!), which ruled two days ago in Gallant’s favor: he gets to “delay” the award to investigate the matter further, and in the meantime has apparently sent out invitations for an award ceremony next week that doesn’t include Goldreich. Some are now calling for the other winners to boycott the prize in solidarity until this is righted.

I doubt readers of this blog need convincing that this is a travesty and an embarrassment, a shanda, for the Netanyahu government itself. That I disagree with Goldreich’s far-left views (or might disagree, if I knew in any detail what they were) is totally immaterial to that judgment. In my opinion, not even Goldreich’s belief in the impossibility of quantum computers should affect his eligibility for the prize. 🙂

Maybe it would be better to say that, as far as his academic colleagues in Israel and beyond are concerned, Goldreich has won the Israel Prize; it’s only some irrelevant external agent who’s blocking his receipt of it. Ironically, though, among Goldreich’s many heterodox beliefs is a total rejection of the value of scientific prizes (although Goldreich has also said he wouldn’t refuse the Israel Prize if offered it!).

In unrelated news, the 2020 Turing Award has been given to Al Aho and Jeff Ullman. Aho and Ullman have both been celebrated leaders in CS for half a century, having laid many of the foundations of formal languages and compilers, and having coauthored one of CS’s defining textbooks with John Hopcroft (who already received a different Turing Award).

But again there’s a controversy. Apparently, in 2011, Ullman wrote to an Iranian student who wanted to work with him, saying that as “a matter of principle,” he would not accept Iranian students until the Iranian government recognized Israel. Maybe I should say that I, like Ullman, am both a Jew and a Zionist, but I find it hard to imagine the state of mind that would cause me to hold some hapless student responsible for the misdeeds of their birth-country’s government. Ironically, this is a mirror-image of the tactics that the BDS movement has wielded against Israeli academics. Unlike Goldreich, though, Ullman seems to have gone beyond merely expressing his beliefs, actually turning them into a one-man foreign policy.

I’m proud of the Iranian students I’ve mentored and hope to mentor more. While I don’t think this issue should affect Ullman’s Turing Award (and I haven’t seen anyone claim that it should), I do think it’s appropriate to use the occasion to express our opposition to all forms of discrimination. I fully endorse Shafi Goldwasser’s response in her capacity as Director of the Simons Institute for Theory of Computing in Berkeley:

As a senior member of the computer science community and an American-Israeli, I stand with our Iranian students and scholars and outright reject any notion by which admission, support, or promotion of individuals in academic settings should be impeded by national origin or politics. Individuals should not be conflated with the countries or institutions they come from. Statements and actions to the contrary have no place in our computer science community. Anyone experiencing such behavior will find a committed ally in me.

As for Al Aho? I knew him fifteen years ago, when he became interested in quantum computing, in part due to his then-student Krysta Svore (who’s now the head of Microsoft’s quantum computing efforts). Al struck me as not only a famous scientist but a gentleman who radiated kindness everywhere. I’m not aware of any controversies he’s been involved in and never heard anyone say a bad word about him.

Anyway, this seems like a good occasion to recognize some foundational achievements in computer science, as well as the complex human beings who produce them!

by Scott at April 09, 2021 06:15 PM UTC

Interactive error correcting codes are codes that encode a two party communication protocol to an error-resilient protocol that succeeds even if a constant fraction of the communicated symbols are adversarially corrupted, at the cost of increasing the communication by a constant factor. What is the largest fraction of corruptions that such codes can protect against? If the error-resilient protocol is allowed to communicate large (constant sized) symbols, Braverman and Rao (STOC, 2011) show that the maximum rate of corruptions that can be tolerated is $1/4$. They also give a binary interactive error correcting protocol that only communicates bits and is resilient to $1/8$ fraction of errors, but leave the optimality of this scheme as an open problem. We answer this question in the negative, breaking the $1/8$ barrier. Specifically, we give a binary interactive error correcting scheme that is resilient to $5/39 > 1/8$ fraction of adversarial errors. Our scheme builds upon a novel construction of binary list-decodable interactive codes with small list size.

at April 09, 2021 04:15 AM UTC

Scott Aaronson wrote last month about the hype over quantum computing. I'd thought I'd drop a few stories.

I was once asked to review a grant proposal (outside the US) that claimed it would find a quantum algorithm for NP-hard problems. I wrote a scathing review but the grant was funded because I failed to prove that it was impossible. I replied that they should fund my research to teleport people from Chicago to Paris because they couldn't prove I couldn't do it. I never got a response.

I was at an NSF sponsored meeting on quantum computing. I suggested, as a complexity theorist, that we need to explore the limits of quantum computing. A senior researcher said we shouldn't mention that in the report or it might hurt our chances of funding the field if they think quantum computing might not be a complete success.

I went to a Microsoft Faculty Research Summit which had a big focus on quantum computing. I complained of the quantum computing hype. My friends in the field denied the hype. Later at the summit a research head said that Microsoft will solve world hunger with quantum computing.

I was meeting with a congressional staffer who had worked on the National Quantum Initiative which coincidentally was being announced that day. I said something about high risk, high reward. He looked shocked--nobody had told him before that quantum computing is a speculative technology.

Quantum computing has generated a large number of beautiful and challenging scientific questions. Thinking about quantum has helped generate classical complexity and algorithmic results. But quantum computing having a real-world impact in the near or mid-term is unlikely. Most scientists I know working directly in quantum research are honest about the limitations and challenges in quantum computing. But somehow that message is not often getting to the next layers up, the policy makers, the research managers, the university administrators, the media and the venture capitalists. 

But who knows, maybe some quantum heuristic that doesn't need much entanglement will change the world tomorrow. I can't prove it's impossible.

by Lance Fortnow ( at April 08, 2021 12:57 PM UTC

Amazon Advertising opens applications for early career scientists. The new program, which offers full-time two-year positions, is aimed at recent PhD graduates who want to innovate, publish, and have their work impact millions of customers. The application deadline is May 14.

by metoo ( at April 08, 2021 04:56 AM UTC

Can you trust a model whose designer had access to the test/holdout set? This implicit question in Dwork et al 2015 launched a new field, adaptive data analysis. The question referred to the fact that in many scientific settings as well as modern machine learning (with its standardized datasets like CIFAR, ImageNet etc.) the model designer has full access to the holdout set and is free to ignore the

(Basic Dictum of Data Science) “Thou shalt not train on the test/holdout set.”

Furthermore, even researchers who scrupulously follow the Basic Dictum may be unknowingly violating it when they take inspiration (and design choices) from published works by others who presumably published only the best of the many models they evaluated on the test set.

Dwork et al. showed that if the test set has size $N$, and the designer is allowed to see the error of the first $i-1$ models on the test set before designing the $i$’th model, then a clever designer can use so-called wacky boosting (see this blog post) to ensure the accuracy of the $t$’th model on the test set as high as $\Omega(\sqrt{t/N})$. In other words, the test set could become essentially useless once $t \gg N$, a condition that holds in ML, whereby in popular datasets (CIFAR10, CIFAR100, ImageNet etc.) $N$ is no more than $100,000$ and the total number of models being trained world-wide is well in the millions if not higher (once you include hyperparameter searches).

Meta-overfitting Error (MOE) of a model is the difference between its average error on the test data and its expected error on the full distribution. (It is closely related to false discovery rate in statistics.)

This blog post concerns our new paper, which gives meaningful upper bounds on this sort of trouble for popular deep net architectures, whereas prior ideas from adaptive data analysis gave no nontrivial estimates. We call our estimate Rip van Winkle’s Razor which combines references to Occam’s Razor and the mythical person who fell asleep for 20 years.

drawing Rip Van Winkle wakes up from 20 years of sleep, clearly needing a Razor

Adaptive Data Analysis: Brief tour

It is well-known that for a model trained without ever querying the test set, MOE scales (with high probability over choice of the test set) as $1/\sqrt{N}$ where $N$ is the size of the test set. Furthermore standard concentration bounds imply that even if we train $t$ models without ever referring to the test set (in other words, using proper data hygiene) then the maximum meta-overfitting error among the $t$ models scales whp as $O(\sqrt{\log(t)/ N})$. The trouble pinpointed by Dwork et al. can happen only if models are designed adaptively, with test error of the previous models shaping the design of the next model.

Adaptive Data Analysis has come up with many good practices for honest researchers to mitigate such issues. For instance, Dwork et al. showed that using Differential Privacy on labels while evaluating models can lower MOE. Or the Ladder mechanism helps in Kaggle-like settings where the test dataset resides on a server that can choose to answers only a selected subset of queries, which essentially takes away the MOE issue.

For several good practices matching lower bounds exist showing a way to construct cheating models with MOE matching the upper bound.

However such recommended best practices do not help with understanding the MOE in the performance numbers of a new model since there is no guarantee that the inventors never tuned models using the test set, or didn’t get inspiration from existing models that may have been designed that way. Thus statistically speaking the above results still give no reason to believe that a modern deep net such as ResNet152 has low MOE.

Recht et al. 2019 summed up the MOE issue in a catchy title: Do ImageNet Classifiers Generalize to ImageNet? They tried to answer their question experimentally by creating new test sets from scratch –we discuss their results later.

MOE bounds and description length

The starting point of our work is the following classical concentration bounds:

Folklore Theorem With high probability over the choice of a test set of size $N$, the MOE of all models with description length at most $k$ bits is $O(\sqrt{k/N})$.

At first sight this doesn’t seem to help us because one cannot imagine modern deep nets having a short description. The most obvious description involves reporting values of the net parameters, which requires millions or even hundreds of millions of bits, resulting in a vacuous upper bound on MOE.

Another obvious description would be the computer program used to produce the model using the (publicly available) training and validation sets. However, these programs usually rely on imported libraries through layers of encapsulation and so the effective program size is pretty large as well.

Rip van Winkle’s Razor

Our new upper bound involves a more careful definition of Description Length: it is the smallest description that allows a referee to reproduce a model of similar performance using the (universally available) training and validation datasets.

While this phrasing may appear reminiscent of the review process for conferences and journals, there is a subtle difference with respect to what the referee can or cannot be assumed to know. (Clearly, assumptions about the referee can greatly affect description length —e.g, a referee ignorant of even basic calculus might need a very long explanation!)

Informed Referee: “Knows everything that was known to humanity (e.g., about deep learning, mathematics,optimization, statistics etc.) right up to the moment of creation of the Test set.”

Unbiased Referee: Knows nothing discovered since the Test set was created.

Thus Description Length of a model is the number of bits in the shortest description that allows an informed but unbiased referee to reproduce the claimed result.

Note that informed referees let descriptions get shorter. Unbiased require longer descriptions that rule out any statistical “contamination” due to any interaction whatsoever with the test set. For example, momentum techniques in optimization were well-studied before the creation of ImageNet test set, so informed referees can be expected to understand a line like “SGD with momentum 0.9.” But a line like “Use Batch Normalization” cannot be understood by unbiased referees since conceivably this technique (invented after 2012) might have become popular precisely because it leads to better performance on the test set of ImageNet.

By now it should be clear why the estimate is named after “Rip van Winkle”: the referee can be thought of as an infinitely well-informed researcher who went into deep sleep at the moment of creation of the test set, and has just been woken up years later to start refereeing the latest papers. Real-life journal referees who luckily did not suffer this way should try to simulate the idealized Rip van Winkle in their heads while perusing the description submitted by the researcher.

To allow as short a description as possible the researcher is allowed to compress the description of their new deep net non-destructively using any compression that would make sense to Rip van Winkle (e.g., Huffman Coding). The description of the compression method itself is not counted towards the description length – provided the same method is used for all papers submitted to Rip van Winkle. To give an example, a technique appearing in a text known to Rip van Winkle could be succinctly referred to using the book’s ISBN number and page number.

Estimating MOE of ResNet-152

As an illustration, here we provide a suitable description allowing Rip van Winkle to reproduce a mainstream ImageNet model, ResNet-152, which achieves $4.49\%$ top-5 test error.

The description consists of three types of expressions: English phrases, Math equations, and directed graphs. In the paper, we describe in detail how to encode each of them into binary strings and count their lengths. The allowed vocabulary includes primitive concepts that were known before 2012, such as CONV, MaxPool, ReLU, SGD etc., as well as a graph-theoretic notation/shorthand for describing net architecture. The newly introduced concepts including Batch-Norm, Layer, Block are defined precisely using Math, English, and other primitive concepts.

drawing Description for reproducing ResNet-152

According to our estimate, the length of the above description is $1032$ bits, which translates into a upper bound on meta-overfitting error of merely $5\%$! This suggests the real top-5 error of the model on full distribution is at most $9.49\%$. In the paper we also provide a $980$-bit long description for reproducing DenseNet-264, which leads to $5.06\%$ upper bound on its meta-overfitting error.

Note that the number $5.06$ suggests higher precision than actually given by the method, since it is possible to quibble about the coding assumptions that led to it. Perhaps others might use a more classical coding mechanism and obtain an estimate of $6\%$ or $7\%$.

But the important point is that unlike existing bounds in Adaptive Data Analysis, there is no dependence on $t$, the number of models that have been tested before, and the bound is non-vacuous.

Empirical evidence about lack of meta-overfitting

Our estimates indicate that the issue of meta-overfitting on ImageNet for these mainstream models is mild. The reason is that despite the vast number of parameters and hyper-parameters in today’s deep nets, the information content of these models is not high given knowledge circa 2012.

Recently Recht et al. tried to reach an empirical upper bound on MOE for ImageNet and CIFAR-10. They created new tests sets by carefully replicating the methodology used for constructing the original ones. They found that error of famous published models of the past seven years is as much as 10-15% higher on the new test set as compared to the original. On the face of it, this seemed to confirm a case of bad meta-overfitting. But they also presented evidence that the swing in test error was due to systemic effects during test set creation. For instance, a comparable swing happens also for models that predated the creation of ImageNet (and thus were not overfitted to the ImageNet test set). A followup study of a hundred Kaggle competitions used fresh, identically distributed test sets that were available from the official competition organizers. The authors concluded that MOE does not appear to be significant in modern ML.


To us the disquieting takeaway from Recht et al.’s results was that estimating MOE by creating a new test set is rife with systematic bias at best, and perhaps impossible, especially in datasets concerning rare or one-time phenomena (e.g., stock prices). Thus their work still left a pressing need for effective upper bounds on meta-overfitting error. Our Rip van Winkle’s Razor is elementary, and easily deployable by the average researcher. We hope it becomes part of the standard toolbox in Adaptive Data Analysis.

at April 07, 2021 09:00 PM UTC

In this blog post, I outline how existing advertising platforms do not prevent high-stakes ads from reaching different demographics at different rates. The post then describes how pushing this responsibility down to advertisers rather than addressing it at the platform leaves manipulating a complex system to those least aware of the system’s inner workings. She then proposes a simpler, more unified solution to this problem: advertising slots should be either targetable or untargetable, and high-stakes ads should be in the untargeted segment. Finally, the post concludes with a discussion of how this segmentation need not cost these systems substantial revenue if reserve prices are used appropriately.

by jamiemorgenstern at April 07, 2021 08:47 PM UTC

The next TCS+ talk will take place this coming Wednesday, April 14th at 1:00 PM Eastern Time (10:00 AM Pacific Time, 19:00 Central European Time, 17:00 UTC). Andrea Lincoln from UC Berkeley will speak about “New Techniques for Proving Fine-Grained Average-Case Hardness” (abstract below).

You can reserve a spot as an individual or a group to join us live by signing up on the online form. Due to security concerns, registration is required to attend the interactive talk. (The recorded talk will also be posted on our website afterwards, so people who did not sign up will still be able to watch the talk)

As usual, for more information about the TCS+ online seminar series and the upcoming talks, or to suggest a possible topic or speaker, please see the website.

Abstract: In this talk I will cover a new technique for worst-case to average-case reductions. There are two primary concepts introduced in this talk: “factored” problems and a framework for worst-case to average-case fine-grained (WCtoACFG) self reductions.

We will define new versions of OV, kSUM and zero-k-clique that are both worst-case and average-case fine-grained hard assuming the core hypotheses of fine-grained complexity. We then use these as a basis for fine-grained hardness and average-case hardness of other problems. Our hard factored problems are also simple enough that we can reduce them to many other problems, e.g. to edit distance, k-LCS and versions of Max-Flow. We further consider counting variants of the factored problems and give WCtoACFG reductions for them for a natural distribution.

To show hardness for these factored problems we formalize the framework of [Boix-Adsera et al. 2019] that was used to give a WCtoACFG reduction for counting k-cliques. We define an explicit property of problems such that if a problem has that property one can use the framework on the problem to get a WCtoACFG self reduction. In total these factored problems and the framework together give tight fine-grained average-case hardness for various problems including the counting variant of regular expression matching.

Based on joint work with Mina Dalirrooyfard and Virginia Vassilevska Williams.

by plustcs at April 06, 2021 08:51 PM UTC

TL;DR: The summer school we are organizing is looking for TAs. Please forward this to your students as well as any departmental mailing lists.

Are you passionate about teaching? Or about increasing diversity within TCS? If so, we need your help!

The committee for advancement of theoretical computer science (CATCS) is organizing an online summer course that will take place on May 31 till June 4, 2021. New horizons in theoretical computer science is a week-long online summer school which will expose undergraduates to exciting research areas in the area of theoretical computer science and its applications. The school will contain several mini-courses from top researchers in the field. We particularly encourage participants from groups that are currently under-represented in TCS. See for more details.

We are looking for TAs to help run the school.

TAs will have the following responsibilities:
        • Plan team building and ice breaking activities and social events for the summer school
        • Lead small groups during the week
        • Monitor questions in chat during lectures
        • Work with one of the instructors to prepare one homework
        • Grade homework
        • Provide mentorship to students
        • Possibly assist with reviewing applications and other technical/admin aspects of running the school

The time commitment will be ~20 hours during the week of May 31-June 4; ~5-10 hours prior to that week; and ~2-3 hours following that week. We are hoping to pay an amount of $500 to each TA (please note that international students will need a CPT for this).

To apply for a TA position, please fill in the application form at  by April 15, 2021. Please also have a faculty advisor send a short recommendation to Please ask them to use the subject “TA recommendation for <<Your Name>>”.

Course organizers: Boaz Barak (Harvard), Shuchi Chawla (UT Austin), Madhur Tulsiani (TTI-Chicago)

Current list of confirmed instructors: Antonio Blanca (Penn State University), Ashia Wilson (MIT), Jelani Nelson (UC Berkeley), Nicole Immorlica (Microsoft Research), Yael Kalai (Microsoft research).

Please email with any questions.

by Boaz Barak at April 06, 2021 05:32 PM UTC

CIFellows is a general CRA/CCC program for matching postdocs with advisors, and supports 2-year postdocs. Applications will open mid-April, and are due early May. It is competitive across all areas of CS.


by shacharlovett at April 06, 2021 01:33 PM UTC

Although the prime numbers are rigidly determined, they somehow feel like experimental data—Tim Gowers

Finnish Scientific Courage Award source

Kaisa Matomäki is a Finnish mathematician working in number theory. She has some terrific results on prime numbers—results that have won several important prizes including the 2021 Ruth Lyttle Satter Prize: It is presented to a woman who has made an outstanding contribution to mathematics research. The prize citation specifically mentions a 2015 paper with Maksym Radziwill that contributed to Terence Tao’s resolution of the Erdős discrepancy problem—and indeed we highlighted Tao’s followup work with her and Radziwill in our 2015 post on this.

Today, Ken and I thought we would combine one of her new deep theorems with some shallow observations.

The set of primes is denoted by {\mathsf{PRIMES}}, as usual. As complexity theorists we have known that for decades that the Boolean circuit complexity of {\mathsf{PRIMES}} is polynomial. This follows from the existence of a polynomial random algorithm for {\mathsf{PRIMES}} due to Robert Solovay and Volker Strassen, and then applying Len Adleman’s connection between such algorithms and Boolean circuit complexity.

Matomäki views {\mathsf{PRIMES}} in a different way. As an analytic number theorist she is concerned with the density and structure of primes, not so much the complexity of recognizing or generating a prime.


Exact results on the set of primes are hard to come by.

Twin Primes: We know that there are infinitely many {p} such that {p} and {p+2} are prime. But not proved.

Goldbach: We know that every even number from {4} onward is the sum of two primes. But not proved.

Density of Primes: We known that the gap between consecutive primes is at most {\dots} But not proved.

The Riemann Hypothesis attempts to say more about density of the primes than the Prime Number Theorem does. It has been proved equivalent to some statements about approximate growth rates, yet even these forms have not been touched.

Instead of trying for better approximate results about primes, can we learn more by relaxing the notion of “prime” itself? For each {k \geq 1}, define {P_k} to be the set of numbers that are products of at most {k} primes. Then {P_1} is the set of primes (not the prime powers) and {P_2} is the set of products of two primes—which include the Blum integers of special interest to cryptography. Collectively the {P_k} sets represent different levels of being “almost prime.” The motivating question is:

How well and in what ways does the structure of the sets of almost-primes model the structure of the primes?

In 2010, John Friedlander and Henryk Iwaniec wrote a monograph titled Opera de Cribro, which is Latin for “Works of the Sieve.” They proved certain results about {P_{13}} and speculated whether their methods improve them to work at least for {P_3}. Then they say, “It would be interesting to get integers with at most two prime divisors.” This is where Matomäki comes in—with a second kind of relaxing, one applying to “almost all” members of {P_2}.

New Result

Matomäki has a brand new paper (revised two weeks ago from a December original) titled, “Almost primes in almost all very short intervals.” It follows on from a paper by one of her students, Joni Teräväinen, titled “Almost all primes in almost all short intervals.”

Her main result is the following theorem.

Theorem 1 Suppose {h_X \rightarrow \infty} as {X \rightarrow \infty} and put {\Delta_X = h\log X}. Then for almost all {X} and {x \in (\frac{X}{2},X]}, the interval {(x-\Delta_X,x]} contains a {P_2}.

Roger Heath-Brown had established the corresponding result for primes (that is, for {P_1}) but only assuming simultaneously the Riemann hypothesis and the pair correlation conjecture for the zeros of the Riemann zeta function. With no hypotheses, the presence of primes in almost all intervals is known only when the intervals have length {X^{\Omega(1)}}—concretely, {X^{1/20}} is known. Thus, Matomäki has traded off the use of conjectures for the relaxed notion of prime.

The “almost call” is in the sense of average density of {x} whose short interval is populated. This average instead of worst case mirrors our difficulties with circuit lower bounds. We can easily show that on average Boolean function have huge circuit lower bounds. But worst case bounds for specific functions is beyond reach. Hmmm.

Lower Bounds

We previously proved:

Theorem 2 (Lower Bound) For any {\epsilon>0}, the depth of a DeMorgan boolean circuit for {\mathsf{PRIMES}} is at least {(2-\epsilon)\log_2 n + O(1)}.

Recall a DeMorgan formula {F} on {n} variables {x_1,\dots,x_n} is a binary tree whose leaves are labeled with variables or their negations, and whose internal nodes are labeled with either {\vee} for OR and {\wedge} for AND gates.

Here is a high level view of this result. See here for details.

Assume there is a DeMorgan Boolean circuit for {\mathsf{PRIMES}} is at most {(2-\epsilon)\log_2 n + O(1)} depth. We can use the random restriction method to set lots of the inputs to random values {0/1}.

We claim that with high probability the input looks like

\displaystyle  x_1,\dots,x_m, x_{m+1},\dots,x_n.

The right-most bits

\displaystyle  x_{m+1},\dots,x_{n}

are left unset, and the other bits to the left have some bits randomly set. Moreover, the number of bits in the above right is order {n^{\epsilon}}.

Now set all the bits in the left group that are unset also to random values {0/1}. Leave the right group unset.

The point of this is that if we assume that the formula had size at most order {n^{2-\epsilon}}, then we have that the formula is likely to be constant. But this contradicts the density of primes and composites.

Open Problems

Can Matomäki’s theorem be used to get stronger lower bounds on the complexity of {\mathsf{PRIMES}}? Can it go beyond the quadratic limit?

by KWRegan at April 06, 2021 04:43 AM UTC

After several attempts, I finally found the energy to start writing a book. It grew out of lecture notes for a graduate class I taught last semester. I make the draft available so that I can get feedback before a (hopefully) final effort next semester.

The goal of the book is to present old and recent results in learning theory, for the most widely-used learning architectures. This book is geared towards theory-oriented students as well as students who want to acquire a basic mathematical understanding of algorithms used throughout machine learning and associated fields that are large users of learning methods, such as computer vision or natural language processing.

A particular effort is made to prove many results from first principles, while keeping the exposition as simple as possible. This will naturally lead to a choice of key results that show-case in simple but relevant instances the important concepts in learning theory. Some general results will also be presented without proofs. Of course the concept of first principles is subjective, and a good knowledge of linear algebra, probability theory and differential calculus will be assumed.

Moreover, I will focus on the part of learning theory that does not exist outside of algorithms that can be run in practice, and thus all algorithmic frameworks described in this book are routinely used. For most learning methods, some simple illustrative experiments are presented, with the plan to have accompanying code (Matlab, Julia, and Python) so that students can see for themselves that the algorithms are simple and effective in synthetic experiments.

This is not an introductory textbook on machine learning. There are already several good ones in several languages [1, 2]. Many topics are not covered, and many more are not covered in much depth. There are many good textbooks on learning theory that go deeper [3, 4, 5].

The choice of topics is arbitrary (and thus personal). Many important algorithmic frameworks are forgotten (e.g.,  reinforcement learning, unsupervised learning, etc.). Suggestions of extra themes are welcome! A few additional chapters are currently being written such as: ensemble learning, bandit optimization, probabilistic methods, structured prediction.

Help wanted!

This is still work in progress. In particular, there are still a lot of typos, probably some mistakes, and almost surely places where more details are needed; readers are most welcome to report them to me (and then get credit for it). Also, the bibliography is currently quite short and would benefit from some expansion (all suggestions welcome, in particular for giving proper credit).

Moreover, I am convinced that simpler mathematical arguments are possible in many places in the book. If you are aware of elegant and simple ideas that I have overlooked, please let me know.


[1] Chloé-Agathe Azencott. Introduction au Machine Learning. Dunod, 2019.
[2] Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 2020.
[3] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
[4] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[5] Andreas Christmann and Ingo Steinwart. Support Vector Machines. Springer, 2008.

by Francis Bach at April 05, 2021 08:18 AM UTC

A somewhat quieter month by recent standards. Three Two papers: graph property testing and quantum distribution testing. (Ed: The distribution testing paper was a revision of a paper we already covered in Sept 2020.)

Robust Self-Ordering versus Local Self-Ordering by Oded Goldreich (ECCC). In Nov 2020, we covered a paper that uses a tool called self-ordered graphs, that transferred bit string function lower bounds to graph property testing. Consider a labeled graph. A graph is self-ordered if its automorphism group only contains the identity element (it has no non-trivial isomorphisms). A graph is robustly self-ordered, if every permutation of the vertices leads to a (labeled) graph that is sufficiently “far” according to edit distance. Given a self-ordered graph \(G\), a local self-ordering procedure is the following. Given access to a copy \(G’\) of \(G\) and a vertex \(v \in V(G’)\), this procedure determines the (unique) vertex in \(V(G)\) that corresponds to \(v\) with sublinear queries to \(G\). In other words, it can locally “label” the graph. Intuitively, one would think that more robustly self-ordered graphs will be easier to locally label. This paper studies the relation between robust and local self-ordering. Curiously, this paper refutes the above intuition for bounded-degree graphs, and (weakly) confirms it for dense graphs. Roughly speaking, there are bounded degree graphs that are highly robustly self-ordered, for which any local self-ordering procedure requires \(\omega(\sqrt{n})\) queries. Moreover, there are bounded degree graphs with \(O(\log n)\)-query local self-ordering procedures, yet are not robustly self-ordered even for weak parameters. For dense graphs, the existence of fast non-adaptive local self-ordering procedures implies robust self-ordering.

Testing identity of collections of quantum states: sample complexity analysis by Marco Fanizza, Raffaele Salvia, and Vittorio Giovannetti (arXiv). This paper takes identity testing to the quantum setting. One should think of a \(d\)-dimensional quantum state as a \(d \times d\) density matrix (with some special properties). To learn the state entirely up to error \(\varepsilon\) would require \(O(\varepsilon^{-2} d^2)\) samples/measurements. A recent result of Badescu-O’Donnell-Wright proves that identity testing to a known state can be done significantly faster using \(O(\varepsilon^{-2} d)\) measurements. This paper takes this result a step further by consider a set of \(N\) quantum states. A “sample” is like a classical sample, where one gets a sample from a distribution of quantum states. The YES (“uniform”) case is when all the states are identical. The NO (“far from uniform”) case is when they are “far” from being the same state. This paper proves that \(O(\varepsilon^{-2}\sqrt{N}d)\) samples suffices for distinguishing these cases.

by Seshadhri at April 05, 2021 04:35 AM UTC

BILL: Lets say we didn't know that FACTORING NPC --> NP=coNP.
then what direction would we think Factoring in P or NPC? 

STUDENT: In P. After all, Number Theory is a deep subject and I can imagine some deep Theorem in it leading to FACTORING in P. 

BILL: That cuts both ways. I can imagine some deep theorem in NT being the key to showing 

FACTORING poly-reduces to  SAT

Deep mathematics HAS been used for algorithms. Factoring algs is one example, The graph minor theorem yielding MANY problems in P is another. Can you give me an example where deep mathematics has been used for an NPC reduction?

BILL: Oh. Gee. I do not know of any. 

STUDENT: If only you had some mechanism to ask the theory community. Maybe you could call it a web log, or weblog.

BILL: If only...

1) Are there any NPC reductions that use deep math? (I realize that the phrase `deep math' is not well defined,)

2) Are there other reductions that use deep math?

3) If P NE NP then: 
For all epsilon there is no approx-alg for MAX3SAT which yields  \ge  (7/8 + epsilon)OPT

For all delta  there is no approx-alg for CLIQ which yields > n^{-delta} OPT

No approx-alg for SET COVER which yields \ge (ln n + o(1))OPT. 

All of these proofs use  the PCP machinery or something similar. My impression is that the original PCP theorem, while long, hard, and impressive, did not use deep math. I have a vague memory of some paper or PhD thesis stating that the ONLY theorem needed was that a poly of degree d over a finite field has \le d roots. 

However to get the optimal lower bounds seemed to use some deep math. But I am more ASKING than telling. 

by gasarch ( at April 05, 2021 04:30 AM UTC

Update (April 5, 2021): So it turns out that Adam Chalcraft and Michael Greene already proved the essential result of this post back in 1994 (hat tip to commenter Dylan). Not terribly surprising in retrospect!

My son Daniel had his fourth birthday a couple weeks ago. For a present, he got an electric train set. (For completeness—and since the details of the train set will be rather important to the post—it’s called “WESPREX Create a Dinosaur Track”, but this is not an ad and I’m not getting a kickback for it.)

As you can see, the main feature of this set is a Y-shaped junction, which has a flap that can control which direction the train goes. The logic is as follows:

  • If the train is coming up from the “bottom” of the Y, then it continues to either the left arm or the right arm, depending on where the flap is. It leaves the flap as it was.
  • If the train is coming down the left or right arms of the Y, then it continues to the bottom of the Y, pushing the flap out of its way if it’s in the way. (Thus, if the train were ever to return to this Y-junction coming up from the bottom, not having passed the junction in the interim, it would necessarily go to the same arm, left or right, that it came down from.)

The train set also comes with bridges and tunnels; thus, there’s no restriction of planarity. Finally, the train set comes with little gadgets that can reverse the train’s direction, sending it back in the direction that it came from:

These gadgets don’t seem particularly important, though, since we could always replace them if we wanted by a Y-junction together with a loop.

Notice that, at each Y-junction, the position of the flap stores one bit of internal state, and that the train can both “read” and “write” these bits as it moves around. Thus, a question naturally arises: can this train set do any nontrivial computations? If there are n Y-junctions, then can it cycle through exp(n) different states? Could it even solve PSPACE-complete problems, if we let it run for exponential time? (For a very different example of a model-train-like system that, as it turns out, is able to express PSPACE-complete problems, see this recent paper by Erik Demaine et al.)

Whatever the answers regarding Daniel’s train set, I knew immediately on watching the thing go that I’d have to write a “paperlet” on the problem and publish it on my blog (no, I don’t inflict such things on journals!). Today’s post constitutes my third “paperlet,” on the general theme of a discrete dynamical system that someone showed me in real life (e.g. in a children’s toy or in biology) having more structure and regularity than one might naïvely expect. My first such paperlet, from 2014, was on a 1960s toy called the Digi-Comp II; my second, from 2016, was on DNA strings acted on by recombinase (OK, that one was associated with a paper in Science, but my combinatorial analysis wasn’t the main point of the paper).

Anyway, after spending an enjoyable evening on the problem of Daniel’s train set, I was able to prove that, alas, the possible behaviors are quite limited (I classified them all), falling far short of computational universality.

If you feel like I’m wasting your time with trivialities (or if you simply enjoy puzzles), then before you read any further, I encourage you to stop and try to prove this for yourself!

Back yet? OK then…

Theorem: Assume a finite amount of train track. Then after a linear amount of time, the train will necessarily enter a “boring infinite loop”—i.e., an attractor state in which at most two of the flaps keep getting toggled, and the rest of the flaps are fixed in place. In more detail, the attractor must take one of four forms:

I. a line (with reversing gadgets on both ends),
II. a simple cycle,
III. a “lollipop” (with one reversing gadget and one flap that keeps getting toggled), or
IV. a “dumbbell” (with two flaps that keep getting toggled).

In more detail still, there are seven possible topologically distinct trajectories for the train, as shown in the figure below.

Here the red paths represent the attractors, where the train loops around and around for an unlimited amount of time, while the blue paths represent “runways” where the train spends a limited amount of time on its way into the attractor. Every degree-3 vertex is assumed to have a Y-junction, while every degree-1 vertex is assumed to have a reversing gadget, unless (in IIb) the train starts at that vertex and never returns to it.

The proof of the theorem rests on two simple observations.

Observation 1: While the Y-junctions correspond to vertices of degree 3, there are no vertices of degree 4 or higher. This means that, if the train ever revisits a vertex v (other than the start vertex) for a second time, then there must be some edge e incident to v that it also traverses for a second time immediately afterward.

Observation 2: Suppose the train traverses some edge e, then goes around a simple cycle (meaning, one where no edges or vertices are reused), and then traverses e again, going in the same direction as the first time. Then from that point forward, the train will just continue around the same simple cycle forever.

The proof of Observation 2 is simply that, if there were any flap that might be in the train’s way as it continued around the simple cycle, then the train would already have pushed it out of the way its first time around the cycle, and nothing that happened thereafter could possibly change the flap’s position.

Using the two observations above, let’s now prove the theorem. Let the train start where it will, and follow it as it traces out a path. Since the graph is finite, at some point some already-traversed edge must be traversed a second time. Let e be the first such edge. By Observation 1, this will also be the first time the train’s path intersects itself at all. There are then three cases:

Case 1: The train traverses e in the same direction as it did the first time. By Observation 2, the train is now stuck in a simple cycle forever after. So the only question is what the train could’ve done before entering the simple cycle. We claim that at most, it could’ve traversed a simple path. For otherwise, we’d contradict the assumption that e was the first edge that the train visited twice on its journey. So the trajectory must have type IIa, IIb, or IIc in the figure.

Case 2: Immediately after traversing e, the train hits a reversing gadget and traverses e again the other way. In this case, the train will clearly retrace its entire path and then continue past its starting point; the question is what happens next. If it hits another reversing gadget, then the trajectory will have type I in the figure. If it enters a simple cycle and stays in it, then the trajectory will have type IIb in the figure. If, finally, it makes a simple cycle and then exits the cycle, then the trajectory will have type III in the figure. In this last case, the train’s trajectory will form a “lollipop” shape. Note that there must be a Y-junction where the “stick” of the lollipop meets the “candy” (i.e., the simple cycle), with the base of the Y aligned with the stick (since otherwise the train would’ve continued around and around the candy). From this, we deduce that every time the train goes around the candy, it does so in a different orientation (clockwise or counterclockwise) than the time before; and that the train toggles the Y-junction’s flap every time it exits the candy (although not when it enters the candy).

Case 3: At some point after traversing e in the forward direction (but not immediately after), the train traverses e in the reverse direction. In this case, the broad picture is analogous to Case 2. So far, the train has made a lollipop with a Y-junction connecting the stick to the candy (i.e. cycle), the base of the Y aligned with the stick, and e at the very top of the stick. The question is what happens next. If the train next hits a reversing gadget, the trajectory will have type III in the figure. If it enters a new simple cycle, disjoint from the first cycle, and never leaves it, the trajectory will have type IId in the figure. If it enters a new simple cycle, disjoint from the first cycle, and does leave it, then the trajectory now has a “dumbbell” pattern, type IV in the figure (also shown in the first video). There’s only one other situation to worry about: namely, that the train makes a new cycle that intersects the first cycle, forming a “theta” (θ) shaped trajectory. In this case, there must be a Y-junction at the point where the new cycle bumps into the old cycle. Now, if the base of the Y isn’t part of the old cycle, then the train never could’ve made it all the way around the old cycle in the first place (it would’ve exited the old cycle at this Y-junction), contradiction. If the base of the Y is part of the old cycle, then the flap must have been initially set to let the train make it all the way around the old cycle; when the train then reenters the old cycle, the flap must be moved so that the train will never make it all the way around the old cycle again. So now the train is stuck in a new simple cycle (sharing some edges with the old cycle), and the trajectory has type IIc in the figure.

This completes the proof of the theorem.

We might wonder: why isn’t this model train set capable of universal computation, of AND, OR, and NOT gates—or at any rate, of some computation more interesting than repeatedly toggling one or two flaps? My answer might sound tautological: it’s simply that the logic of the Y-junctions is too limited. Yes, the flaps can get pushed out of the way—that’s a “bit flip”—but every time such a flip happens, it helps to set up a “groove” in which the train just wants to continue around and around forever, not flipping any additional bits, with only the minor complications of the lollipop and dumbbell structures to deal with. Even though my proof of the theorem might’ve seemed like a tedious case analysis, it had this as its unifying message.

It’s interesting to think about what gadgets would need to be added to the train set to make it computationally universal, or at least expressively richer—able, as turned out to be the case for the Digi-Comp II, to express some nontrivial complexity class falling short of P. So for example, what if we had degree-4 vertices, with little turnstile gadgets? Or multiple trains, which could be synchronized to the millisecond to control how they interacted with each other via the flaps, or which could even crash into each other? I look forward to reading your ideas in the comment section!

For the truth is this: quantum complexity classes, BosonSampling, closed timelike curves, circuit complexity in black holes and AdS/CFT, etc. etc.—all these topics are great, but the same models and problems do get stale after a while. I aspire for my research agenda to chug forward, full steam ahead, into new computational domains.

PS. Happy Easter to those who celebrate!

by Scott at April 04, 2021 06:37 PM UTC

Motivated in part by applications in lattice-based cryptography, we initiate the study of the size of linear threshold (`$t$-out-of-$n$') secret-sharing where the linear reconstruction function is restricted to coefficients in $\{0,1\}$. We prove upper and lower bounds on the share size of such schemes. One ramification of our results is that a natural variant of Shamir's classic scheme [Comm. of ACM, 1979], where bit-decomposition is applied to each share, is optimal for when the underlying field has characteristic 2. Another ramification is that schemes obtained from some monotone formulae are optimal for certain threshold values when the field's characteristic is any constant. We prove our results by defining and investigating an equivalent variant of Karchmer and Wigderson's Monotone Span Programs [CCC, 1993]. We also study the complexity such schemes with the additional requirement that the joint distribution of the shares of any unauthorized set of parties is not only independent of the secret, but also uniformly distributed. This property is critical for security of certain applications in lattice-based cryptography. We show that any such scheme must use $\Omega(n\log n)$ field elements, regardless of the field. Moreover, for any field this is tight up to constant factors for the special cases where any $t=n-1$ parties can reconstruct, as well as for any threshold when the field characteristic is 2.

at April 04, 2021 06:36 AM UTC

Scribe notes by Benjamin Basseri and Richard Xu

Previous post: Inference and statistical physics Next post: TBD. See also all seminar posts and course webpage.

Alexander (Sasha) Rush is a professor at Cornell working in in Deep Learning / NLP. He applies machine learning to problems of text generation, summarizing long documents, and interactions between character and word-based models. Sasha is previously at Harvard, where he taught an awesome NLP class, and we are excited to have him as our guest! (Note: some of the figures in this lecture are taken from other papers or presentations.)

The first half of the talk will focus on how NLP works and what makes it interesting: a bird’s-eye view of the field. The second half of the talk will focus on current research.

Basics of NLP

Textual data has many different challenges that differ from computer vision (CV), since it is a human phenomenon. There are methods that work in computer vision / other ML models that just don’t work for NLP (e.g. GANs). As effective methods were found for computer vision around 2009-2014, we thought that these methods would also work well for NLP. While this was sometimes the case, it has not been true in general.

What are the difficulties of working with natural language? Language works at different scales:

word < phrase < sentence < document < ...

Here are examples of structure at each level:

  1. Zipf’s Law: The frequency of any word is inversely proportional to its popularity rank.
  2. Given the last n symbols, it is often possible to predict the next one (The Shannon Game).
  3. Linguists have found many rules about syntax and semantics of a language.
  4. In a document, we have lots of discourse between different sentences. For example, “it” or other pronouns are context dependent.

In NLP, we will talk about the syntax and semantics of a document. The syntax refers to how words can fit together, and semantics refers to the meaning of these words.

Language Modeling

There are many different NLP tasks such as sentiment analysis, question answering, named entity recognition, and translation. However, recent research shows that these tasks are often related to language modeling.

Language modeling, as explained in Shannon 1948, aims to answer the following question: Think of language as a stochastic process producing symbols. Given the last n symbols, can we predict the next one?

This question is challenging as is. Consider the following example:

A reading lamp on the desk shed glow on polished ___

There are many options: marble/desk/stone/engraving/etc., and it is already difficult to give a probability here. In general, language is hard to model because the next word can be connected to words from a long time ago.

Shannon proposes variants of Markov models to perform this prediction, based on the last couple characters or the context in general.

Since local context matters most, we assume that only the n most recent words matter. Then, we get the model

Measuring Performance

As we have seen in the generative models lecture, we can use cross entropy as a loss function for density estimation models. Given model density distribution q and true distribution p, the cross entropy (which equals negative expected log-likelihood) is defined as follows:

H(p, q) = - E_p \log q(w_t | w_1, \ldots , w_{t-1})

In NLP we tend to use the metric “perplexity”, which is the exponentiated negative cross entropy:

ppl(p, q) = \exp -H(p, q).

This corresponds to the equivalent vocabulary size of a uniformly distributed model. Lower perplexity means our model was closer to the underlying text distribution. As an example, the perplexity of the perfect dice-roll model would be 6.

Why do we care about perplexity anyway?

  1. With a good model we can determine the natural perplexity of a language, which is interesting.
  2. Many NLP questions are language modeling with conditioning. Speech recognition is language modeling conditioned on some sound signal, and translation is language modeling conditioned on text from another language.
  3. More importantly, we have found recent applications in transfer learning. That is, a language model can be trained on some (small) input data for a specific task. Then, such a model becomes effective at the given task!

A few years ago, the best perplexity on WSJ text was 150. Nowadays, it is about 20! To understand how we got here, we look at modern language modeling.

Predicting the next word

We start with the model

p(w_t | w_{1:t-1}; \theta) = softmax(\mathbf{W}\phi(w_{1:t-1}; \theta)).

(The softmax function maps a vector \vec{v} into the probability distribution \vec{p} such that p_i \propto \exp(v_i). That is, p_i = \exp(v_i)/\sum_j \exp(v_j). Note that this is a Boltzman distribution of the type we saw in the statistical physics and variational algorithms lecture)

We call W the output word embeddings, \phi is some neural basis (e.g. \phi are all but the final layers of a neural net with weights \theta, and W is the final layer). However, this means to use softmax to predict requires computing softmax over every word in your vocabulary (tens or hundreds of thousands). This was often infeasible until GPU computing became available.

As an aside, why not predict characters instead of words? The advantage is that there are much fewer characters than words. However, computation with characters is slower. Empirically, character-based models tend to perform worse than word-based. However, character-based models can handle words outside the vocabulary.

Byte-pair encoding offers a bridge between character and word models. This greedily builds up new tokens as repetitive patterns are found in the original text.

In the last decade NLP has seen a few dominant architectures, all using SGD but with varying bases. First, we must cast the words as one-hot vectors, then embed them into vector space:
x_t = Vw_t,
where w_t is the one-hot encoded vector and V is the learned embedding transformation.


NNLM (Neural Network Language Model) is like a CNN. The model predicts on possibly multiple NN transformations:
\phi(w_{1:t-1}; \theta) = \sigma(U[x_{t-k-1} \oplus \ldots \oplus x_{t-1}]]),

where \oplus denotes concatenation, U is some convolutional filter and \sigma is the activation function. This has the benefit of learning fast. The matrices it learns also transfer well.

As an example, GloVe is a NNLM-inspired model. It stores the words in 300-dimensional space. When we project the some words to 2-dimensions using PCA, we find semantic information in the language model.

Language structure

Recurrent Models

A recurrent model uses a fixed set of previous words to predict the next word. A recurrent network uses all previous words:
\phi(w_{1:t-1}; \theta) = \sigma(U[x_{t-1}\oplus \phi(w_{1:t-2};\theta)]).

Previous information is ‘summarized’ in the \phi on the right, so this model uses finite memory. Below is an illustration of the recurrent neural network.

Since the recurrent model uses the full context, it is a more plausible model for how we really process language. Furthermore, the introduction of RNN saw drastically improved performance. In the graph below, the items in the chart are performances from previous NNLMs. The recurrent network performance is “off the chart”.

RNN Performance

However, the model grows with sequence length. This requires gradient flow to backpropagate over arbitrarily long sequences, and often required baroque network designs to facilitate longer sequences while avoiding exploding/vanishing gradients.

Attentional Models

To understand modern NLP we must look at attention. For a set of vectors v, keys k_1, \ldots, k_T and a query q, we define attention as

Att(q, k, v) = \sum_t \alpha_t v_t

where \alpha = softmax(q\cdot k_1 \oplus \ldots \oplus q \cdot k_T).

Here, \alpha can be considered a probability density of the model’s ‘memory’ of the sequence. The model decides which words are important by combining \alpha and v.

The attentional model can be fully autoregressive (use all previously seen words), and the query q can be learned or be specific to an input. In general, we have:
\phi(w_{1:t-1}; \theta) = \sigma(U[Att(q, x_{1:t-1}, x_{1:t-1})])

Since we condense all previous information into the attention mechanism, it is simpler to backpropagate.
In particular attention enables looking at information from a large context without paying for this in the depth of the network and hence in the depth of back-propagation you need to cover. (Boaz’s note: With my crypto background, attention reminds me of the design of block ciphers, which use linear operations to mix between far away parts of the inputs, and then apply non-liearity locally to each small parts.)

Note that attention is defined with respect to a set of vectors. There is no idea of positional information in the attentional model. How do we encode positional information for the model? One way to do this is using sinusoidal encoding in the keys. We store the word n as \cos(\pi n/k) for some period k. Notice that if we choose many different periods, then the cosine ways will almost never meet at the same point. As a result, only recent points will have high dot products between the different cosine values.



A transformer is a stacked attention model. Computation in one layer becomes query, keys and values for the next layer. This is a multiheaded attention model. We learn H projections for each query/key (generally between 8 and 32) then do softmax across these projections:
\alpha^h = softmax((W^{(h)}q)\cdot(V^{(h)}k_1)\oplus \ldots \oplus (W^{(h)}q)\cdot(V^{(h)}k_T)).

Transformer Architecture

These heads can be computed in parallel and can be implemented with batch matrix multiplication. As a result, transformers can be massively scaled and are extremely efficient in modern hardware. This has led these models to be very dominant in the field. Here are some example models:

  1. GPT-1,2,3 are able to generate text that is quite convincing to a human. They also handle the syntax and semantics of a language quite well.
  2. BERT is a transformer-based model that examines text both forwards and backwards in making its predictions. It works well with transfer fining tuning: train on a large data set, then take the feature representations and train a task on top of the learned representation.


In recent years we have had larger and larger models, from GPT1’s 110 million to GPT3’s 175 billion.

Scaling Up

On these massive scales, scaling has become a very interesting issue: how do we process more samples? How do we run distributed computation? How much autoregressive input should each model looks at? (Boaz’s note: To get a sense of how big these models are, GPT-3 was trained on about 1000B tokens. The total number of people that ever lived is about 100B, and only about half since the invention of the printing press, so this is arguably a non-negligible fraction of all text produced in human history.)

For a model like BERT, most of the cost still comes from feed-forward network — mostly matrix multiplications. These are tasks we are familiar with and can scale up.

One question is to have long-range attentional lookup, which is important for language modeling. For now, models often look at most 512 words back. Can we do longer range lookups? One approach to this is kernel feature attention.

Kernel Feature Attention

Recall that we have \alpha=\mathrm{softmax}(q\cdot k_i). Can we approximate this with some kernel K? The main approach is that \alpha\propto \exp(q\cdot k_i), which we approximate with the kernel \exp(v_1\cdot v_2). There is a rich literature on approximating K where
K(v_1, v_2) \approx \phi(v_1)\cdot\phi(v_2)

for some transfomration \phi. Then, we can try to approximate K with linear features.

Practically, transformers do well but are slower. For longer texts, we have faster models that do slightly worse. A recent model called performer is such an example.
LRA Performance

Scaling Down

Ultimately, we want to make models run on “non Google scale” hardware once it has been trained to a specific task. This can often require scaling down.

One approach is to prune weights according to their magnitude. However, since models are often overparameterized and weights do not move much, the weights that get pruned according to this method are usually the weights that were simply initialized closest to 0. In the diagram below, we can consider only leaving the orange weights and cutting out the gray.


Another approach is to mask out the weights that are unnecessary for a specific task, if you’re trying to ship a model for a specific task.

Research Questions

Two major lines of research dominate NLP today:

  1. Attention / Transformer architecture,
  2. Pretraining with language modeling as a deep learning paradigm.

We are also in a space race to produce efficient models with more parameters, given how much scaling has been effective.

The paper Which BERT classifies modern questions in NLP into the following categories:

  1. Tasks: Is (masked) language modeling the best task for pretraining? Language modeling emphasizes local information. We could imagine doing other types of denoising. See also DALL-E.
  2. Efficiency: We see that much fewer parameters are needed in practice after pruning. Does pruning lose something? Pruned models tend to do well on in-sample data, but out-of-sample data tends to make the pruned model do worse.
  3. Data: How does data used in pretraining impact task accuracy? How does the data of pretraining impact task bias?
  4. Interpretability: What does BERT know, and how can we determine this? Does interpretability need to come at a cost to performance?
  5. Multilinguality: Many languages don’t have the same amount of data as English. What methods apply when we have less data?


We have many questions asked during and after lecture. Here are some of the questions.

  1. Q: Should we say GANs fail at NLP or that other generative models are more advanced in NLP than in CV? A: One argument is that language is a human generated system, there are some inherent structures that help with generation. We can do language in left-to-right, but in CV this would be a lot more difficult. At the same time, this can change in the future!
  2. Q: Why are computer vision and NLP somewhat close to each other? A: classically, they are both perception-style tasks under AI. Also, around 2014 we had lots of ideas that come from porting CV ideas into NLP, and recently we have seen NLP ideas ported to CV.
  3. Q: Since languages have unique grammars, is NLP better at some languages? Do we have to translate language to an “NLP-effective” language and back? A: In the past, some languages are better. Ex: we used to struggle with Japanese to other languages but do well with English to other languages. However, modern models are extremely data driven, so we have needed much less hardcoding.
  4. Q: Have we done any scatter plot of the form (data available for language X, performance on X) to see if performance is just a function of available data? A: Not right now, but these plots can potentially be really cool! Multilinguality is a broad area of research in general.
  5. Q: What are some NLP techniques for low-resource languages? A: Bridging is commonly used. Iterative models (translate and translate back with some consistency) is also used to augment the data.
  6. Q: Do you think old-school parsers will make a comeback? A: Unlikely to deploy parsers, but the techniques of parsing is interesting.
  7. Q: Given the large number of possible “correct” answers, has there been work on which “contexts” are most informative? A: Best context is the closest context, which is expected. The other words matter but matter a lot less.
  8. Q: Is there any model that captures the global structure first (e.g. an outline) before writing sequentially, like humans do when they write longer texts? A: Currently no. Should be possible, but we do not have data about the global structure of writing.
  9. Q: Why is our goal density estimation? A: It is useful because it tells us how “surprising” the next word is. This is also related to why a language feels “fast” when you first learn it: because you are not familiar with the words, you cannot anticipate the next words.
  10. Q: Why is lower perplexity better? A: Recall from past talk that lower cross-entropy means less distance between p and q, and intuitive you have more “certainty”.
  11. Q: Is the reason LM so important because evaluations are syntax-focused? A: Evaluations are actually more semantically focused, but syntax and semantics are quite connected.
  12. Q: Do we see attentional models in CV? A: Yes, we have seen more use of transformer in CV. In a standard model we use data only recently, and here we get to work with data across space and time. As such, we will need to encode time positionally.
  13. Q: Why is attention generalized convolution? A: If you have attention with all mass in the local area, that’s probably like convolution.
  14. Q: How do we compare heads with depth? e.g. Is having 5 heads better than 5x depth? A: when we use heads we add a lot less parameters. As such, we can parallelize heads and increase performance.
  15. Q: Do you think transformers are the be-all end-all of NLP models? A: Maybe. To dethrone transformers, you have to both show similar work on small-scale and show that it can be scaled easily.
  16. Q: How does simplicity bias affect these transferrable models? A: surprising and we are not sure. In CV we found that the models quickly notice peculiarities in the data (e.g. how mechanical turks are grouped), but the models do work.
  17. Q: We get bigger models every year and better performance. Will this end soon? A: Probably not, as it seems like having more parameters helps it recognize some additional features.
  18. Q: If we prune models to the same size, will they have the same performance? A: for small models we can seem to prune them, but for the bigger models it is hard to run them in academica given the computational resource constraints.
  19. Q: When we try to remember something from a long time ago we would look up a textbook / etc. Have we had similar approaches in practice? A: transformer training is static at first, and tasks happen later. So, we have to decide how to throw away information before we train on the tasks.
  20. Q: Are better evaluation metrics an important direction for future research? A: Yes — this has been the case for the past few years in academia.
  21. Q: What is a benchmark/task where you think current models show deep lack of capability? A: During generation, models don’t seem to distinguish between information that makes it “sound good” and factually correct information.

by Boaz Barak at April 03, 2021 02:37 PM UTC

In the neighborhood where I live, fire safety regulations require the streets to be super-wide (so wide that two fire trucks can pass even with cars parked along both sides of the street), and to have even wider turnarounds at the ends of the culs-de-sac. To break up the resulting vast expanses of pavement, we have occasional islands of green, public gardens too small to name as a park. They come in several different types: medians to separate the incoming and outgoing lanes at junctions with larger roads,

Traffic island at Frost and Gabrielino, Irvine

barriers to separate small groups of houses from the main flow of the road,

Traffic island on Los Trancos Drive, Irvine

or giving some shape to the turnaround at the end of a cul-de-sac.

Traffic island on Harvey Court, Irvine

Most are ignored except by the community association’s gardeners and by passing cars, but I did find one set up as a neighborhood basketball court:

Traffic island on Perkins Court, Irvine

The rest of the gallery.

(Discuss on Mastodon)

by David Eppstein at April 02, 2021 04:34 PM UTC

Scribe notes by Franklyn Wang

Previous post: Robustness in train and test time Next post: Natural Language Processing (guest lecture by Sasha Rush). See also all seminar posts and course webpage.

lecture slides (pdf)lecture slides (Powerpoint with animation and annotation)video

Digression: Frequentism vs Bayesianism

Before getting started, we’ll discuss the difference between the two dominant schools of thought in probability: Frequentism and Bayesianism.

Frequentism holds that the probability of an event is the long-run average proportion of something that happens many, many times, so a coin has 50% probability of being heads if on average half of the flips are heads. One consequence of this framework is that a one-off event like Biden winning the election doesn’t have any probability as by definition they can only be observed once.

Bayesians reject this model of the world. For example, a famous Bayesian, Jaynes, even wrote that “probabilities are frequencies only in an imaginary universe.”

While these branches of thought are different, generally the answers to most questions are the same, so the distinction will not matter for this class. However, these branches of thought inspire different types of methods, which we now discuss.

For example, suppose that we have a probability distribution D that we can get samples from in the form of x. In statistics, our goal is to calculate a hypothesis h \in \mathcal{H} which minimizes \mathcal{L}_{D}(h) (possibly a loss function, but any minimand will do) where we estimate \mathcal{L}_D with our samples x.

A frequentist does this by defining a family \mathcal{D} of potential distributions, and we need to find a transformation \vec{x} \mapsto h(\vec{x}) which minimizes the cost for all D \in \mathcal{D}, where

These equations amount to saying that h is minimizing the worst-case loss over all distributions.

By contrast, a Bayesian approaches this task by assuming that D = D_{\vec{w}} where \vec{w} \sim P are latent variables sampled from a prior.

Then, let Q_{\vec{x}}(\vec{w}) = P(w | D_{\vec{w}} = x) be the posterior distribution of w conditioned on the observations \vec{x}. We now minimize

In fact, Bayesian approaches extend beyond this, as if you can sample from a posterior you can do more than just minimize a loss function.

In Bayesian inference, chosing the prior P is often very important. One classical approach is a maximum entropy prior, because it’s the least informative (hence, the most entropy) and requires the fewest assumptions.

These two minimization equations con sometimes lead to identical results, but often in practice they work out differently once we introduce computational constraints into the picture. In the frequentist approach, we generally constrain the family of mappings \vec{x} \rightarrow \vec{h} to be efficiently computable. In the Bayesian approach, we typically approximate the posterior instead and either use approximate sampling or a restricted hypothesis class for the posterior to be able to efficiently sample from it.

Part 1. An introduction to Statistical Physics

Compare water and ice. Water is hotter, and the molecules move around more. In ice, by contrast, the molecules are more stationary. When the temperature increases, the objects move more quickly, and when they decrease the objects have less energy and stop moving. There are also phase transitions, where certain temperatures cause qualitative discontinuities in behavior, like solid to liquid or liquid to gas.

See video from here:

An atomic state can be thought of as x \in {0, 1}^{n}. Now, how do we represent the system’s state? The crucial observation that makes statistical physics different from “vanilla physics” is the insight to represent the system state as a probability distribution p supported on {0, 1}^{n}, rather than in a single state.

Each atomic state has a negative energy function (which we will think of as a utility function) W: {0, 1}^{n} \rightarrow \mathbb{R}. In some sense, the system “wants” to have a high value of \mathbb{E}_{x \sim p}[ W(x)]. In addition, when the temperature is high, the system “wants” to have a high value of entropy.

Thus, an axiom of thermodynamics states that to find the true distribution, we need only look at the maximizer of

p = \arg\max_{q} \mathbb{E}_{x \sim q}[W(x)] + \tau \cdot H(q).

That is, it is the probability distribution which maximizes a linear combination of the expected negative energy and the entropy, with the temperature controlling the coefficient of entropy in this combination (the higher the temperature, the more the system prioritizes having high entropy).

The variational principle, which we prove later, states that the q which maximizes this satisfies

p(x) \propto \exp(\tau^{-1} \cdot W(x))

which is known as the Boltzmann Distribution. We often write this with a normalizing constant, so p(x) = \exp(\tau^{-1} \cdot W(x) - A_{\tau}(W)), where A_{\tau}(W) = \log (\int \exp(\tau^{1} \cdot W(x))).

Before proving the variational principle, we will go through some examples of statistical physics.

Example: Ising Model

In the Ising model, we have n magnets that are connected in a square grid. The atomic state is each x \in {\pm 1}^{n}, which represents “spin”. The value of W(x) is J\sum_{i \sim j} x_i x_j + J'\sum_{i} x_i, where i \sim j denotes that i and j are adjacent magnets. This encourages adjacent magnets to have aligned spins. One important concept, which we will return to later, is that the values of {x_i, x_ix_j} are sufficient to calculate the value of W(x) (these are known as sufficient statistics). Furthermore, if we wanted to calculate \mathbb{E}[W(x)], it would be enough to calculate the values of \mathbb{E}[x_i] and \mathbb{E}[x_i x_j] and then apply linearity of expectation. An illustration of an Ising model that we cool down slowly can be found here:

and a video can be found here.

This is what a low-temperature Ising model looks like – note that W(x) is high because almost all adjacent spins are aligned (hence the well-defined regions).

The Sherrington-Kirkpatrick model is a generalization of the Ising model to a random graph, which represents a disordered mean field model. Here, x \in {\pm 1}^{n} still represents spin, and W(x) = \sum w_{i, j} x_i x_j where w_{i, j} \sim \mathcal{N}(0, 1). The SK-model is deeply influential in statistical physics. We say that it is disordered because the utility function W is chosen at random and that it is a mean field model because, unlike the Ising model, there is no geometry in the sense that every pair of individual variables x_i and x_j are equally likely to be connected.

Our third example is the posterior distribution, where x is a hidden variable with a uniform prior. We now makes, say, k independent observations O_1, O_2, \ldots O_k. The probability of x is given by

p(x) \propto p(x\text{ satisfies } O_1) \cdot p(x\text{ satisfies } O_2) \ldots p(x\text{ satisfies } O_k)

so p(x) \propto \exp(\log \sum \mathbb{P}(x\text{ satisfies } O_i)). Note that the RHS is easy to calculate, but in practice the normalizing factor (also known as the partition function) can be difficult to calculate and often represents a large barrier, in the sense that computing the partition function makes many of these questions far easier.

Proof of the Variational Principle

Now, we prove the variational principle, which states that if p(x) = \exp(\tau^{-1} \cdot W(x) - A_\tau(W)), where A_\tau(W) = \log \int \exp(\tau^{-1} \cdot W(x)) is the normalizing factor, then

p = \arg\max_{q} \mathbb{E}_{x \sim q} W(x) + \tau \cdot H(q)

Before proving the variational principle, we make the following claim: if p is defined as p(x) = \exp(\tau^{-1} \cdot W(x) - A_\tau(W)) then

Proof of claim:
H(p) = -\int \log p(x) p(x) dx = -\int (\tau^{-1} \cdot W(x) - A_{\tau}(W)) p(x) dx
(where we plugged in the definition of p(x) in \log p(x)).

We can now rewrite this as

H(p) = -\tau^{-1} \cdot \mathbb{E}_{x \sim p} W(x) + \mathbb{E}_{x \sim p} A_{\tau}(W)

which means that

H(p) = -\tau^{-1} \cdot \mathbb{E}_{x \sim p} W(x) + A{\tau}(W)

using the fact that A_\tau(W) is a constant (independent of x). Multiplying by \tau and rearranging, we obtain the claim. \blacksquare

Proof of variational principle:
Given the claim, we can now prove the variational principle.
Let q be any distribution. We write

where the first equation uses our likelihood expression. This implies that \mathbb{E}_{x \sim q} W(x) + \tau \cdot H(q) is maximized at q = p, as desired. \blacksquare


  • When \tau = 1, we have A(W) = \max_{q} H(q) + \mathbb{E}_{x \sim q} W(x).
  • In particular, for every value of q, we have that

which we’ve seen before! Note that equality holds when q = p, but to approximate A we can simply take more tractable values of q.

In many situations, we can compute W(x), but can’t compute A_{\tau}(W), which stifles many applications. One upshot of this, though, is that we can calculate ratios of p(x) and p(x'), which is good enough for some applications, like Markov Chain Monte Carlo.

Markov Chain Monte Carlo (MCMC)

An important task in statistics is to sample from a distribution p, for very complicated values of p. MCMC does this by constructing a Markov Chain whose stationary distribution is p. The most common instantiation of MCMC is Metropolis-Hastings, which we now describe. First, we assume that there exists an undirected graph on the states x, so that x \sim x' iff x' \sim x and that for x \sim x', the probabilities of x and x' are similar in the sense that W(x)/W(x') is neither too small nor too large. Then the Metropolis-Hastings algortihm is as follows.

Metropolis-Hastings Algorithm:

  1. Draw x_0 at random.
  2. For i = 1, 2, \ldots, choose an arbitrary x' \sim x, and let

Then, x_t eventually is distributed as p! To show that this samples from p, we show that the stationary distribution is of this Markov chain is p. In this case, this turns out to be easy, since we can check the detailed balance conditions p(x' | x)p(x) = \min{(p(x), p(x'))} = p(x | x')p(x') so p is the stationary distribution of the Markov Chain.

The stationary distribution is often unique, so we’ve proven that this samples from p eventually. In MCMC algorithms, however, often an important question is how fast we converge to the stationary distribution. Often this is rather slow, which is especially dissappointing because if it were faster many very difficult problems could be solved much more easily, like generic optimization.
Indeed, there are examples where convergence to the stationary distribution would take exponential time.

Note: One way to make MCMC faster is to let x' be really close from x, because the likelihood ratio will be closer to 1 and x_i will spend less time stuck at its original location. There’s a tradeoff, however. If x' is too close to x, then the chain will not mix as quickly, where mixing is the convergence of x to the stationary distribution, because generally to mix the value x_t must get close to every point, and making each of x‘s steps smaller will make that more difficult.

Applications of MCMC: Simulated Annealing

One application of MCMC is in simulated annealing. Suppose that we have W: {0, 1}^{n} \rightarrow \mathbb{R}, and we want to find x = \arg \max_{x} W(x). The most direct attempt at solving this problem is creating a Markov Chain that simply samples from x \in \arg\max_{x} W(x). However, this is impractical, at least directly. It is like we have a domino that is far away from us, and we can’t knock it down. What do we do in this case? We put some dominoes in between us!

To this end, we now create a sequence of Markov chains supported on x, whose stationary distribution p_{\tau}(x) \sim W(x)^{1/\tau}, as \tau gets smaller and smaller. This corresponds to cooling a system from a high-temperature to a low-temperature. Essentially, we want to sample from p_0(x).

Simulated annealing lets us begin by sampling from p_{\infty}, which is uniform on the support. Then we can slowly reduce \tau from \infty to 0. When cooling a system, it will be helpful to think of two stages. First, a stage in which the object cools down. Second, a settling stage in which the system settles into a more stable state. The settling stage is simulated via MCMC on p_{\tau}(x), so the transition probability from x to x' is \min(1, \text{exp}(\tau^{-1} \cdot (W(x') - W(x)))).

Simulated Annealing Algorithm:

  1. Cooling Begin \tau at \infty (or very large), and lower the value of \tau to zero according to some schedule.
    1a. Settling Now, repeatedly move from x to x' with probability \min(1, \text{exp}(\tau^{-1} \cdot (W(x') - W(x)))).

Since simulated annealing is inspired by physical intuition, it turns out that its shortcomings can be interpreted physically too. Namely, when you cool something too quickly it becomes a glassy state instead of a ground state, which often causes simulated annealing to fail, and for this algorithm to get stuck in local minima. Note how this is fundamentally a failure mode with MCMC – it can’t mix quickly enough.

See this video for an illustration of simulated annealing.

Part 2: Bayesian Analysis

Note that the posterior of \vec{w} conditioned on x_1, \ldots x_n is

p(\vec{w} | x_1, \ldots x_n) = \frac{p(\vec{w})}{p(\vec{x})} \cdot p(x_1 | \vec{w}) p(x_2 | \vec{w}) \ldots p(x_n | \vec{w}) \propto \exp\left(\sum X_i(\vec{w})\right)

We now have p_{W}(x) = \exp(W(x) - A(W)), an exponential distribution. Now suppose that W(x) = \langle w, \hat{x} \rangle, where \hat{x} \in \mathbb{R}^{m} are the sufficient statistics of x. For example, the energy function could follow that of a tree, so W(x) = \sum_{(i, j) \in T} w_{i, j} x_i x_j. If you want to find the expected value of $latex \mathbb{E}{x \sim p{W}}[W(x)]&bg=ffffff$ its enough to know $latex \mu = \mathbb{E}{x \sim p{W} }[\hat{x}]&bg=ffffff$. (By following a tree, we mean that the undirected graphical model of the distribution is that of a tree.)

Sampling from a Tree-Structured Distribution

Now how do we sample from a tree-structured distribution? The most direct way is by calculating the marginals. While marginalization is generally quite difficult, it is much more tractable on certain types of graphs. For a sense of what this entails, suppose that we have the following tree-structured statistical model:

whose joint PDF is W(x) = \exp(\sum_{i, j \in E} -w_{i, j} (x_i - x_j)^2) where x \in {0, 1}^{n}.
(The unusual notation is chosen so that the relationships between the marginals is clean. The log-density in question is a Laplacian Quadratic Form.)

Then one can show that if the marginal of x_3 is \mu_3 = \mathbb{P}(x_3 = 1), then we have that

\mu_6 = \mu_3 \cdot \frac{e^{w_{3,6}}}{1 + e^{w_{3, 6}}} + (1 - \mu_3) \cdot \frac{1}{1 + e^{w_{3, 6}}}

This will give six linear equations in six unknowns, which we can solve. Once we can marginalize a variable, say, x_6, we can then simply sample from the marginal, then sample from the conditional distribution on x_6 to sample from the entire distribution. This method is known as belief propogation.

This algorithm (with some modifications) also works for general graphs, but what it represents on general graphs is not the exact marginal, but rather an extension of that graph to a tree.

General Exponential Distributions

Now let’s try to develop more theory for exponential distributions. Let the PDF of this distribution be p_W(x) = \exp(\langle w, \hat{x} \rangle - A(w)). A few useful facts about these distributions that often appear in calculations: \nabla A(w) = \mu and $latex \nabla^2 (A(w)) = \text{Cov}{p{W}}(x) \succeq 0&bg=ffffff$.

By the variational principle, we have that among all distributions with the same sufficient statistics as a given Boltzmann distribution, the true one maximizes the entropy, so

p_w = \arg\max_{\mathbb{E}_{q} \hat{x} = \mu} H(q).

An analagous idea gives us
\mu = \arg \max_{H(\mu) \ge H(p_{w})} \langle w, \mu \rangle,

so p_w is the maximum entropy distribution consistent with observations \mu.

The important consequence of these two facts is that in principle, w determines \mu and vice versa. (Up to fine print of using minimal sufficient statistics, which we ignore here.) We will refer to w as the “canonical parameter space” and \mu as the “mean representation space”. Now note that the first equation gives us a way of going from \mu to w, and the second equation gives us a way to go from w to \mu, at least information-theoretically.

Now, how do we do this algorithmically? Using w, if were able to sample from p_w (which is generally possible if we can evaluate A(w) then we can estimate the mean \mu through samples. We can also obtain \mu from A(w) by estimating \nabla A.

On the other hand, if you have \mu, to obtain w you first consider A^{\ast}(\mu) = \sup_{w \in \Omega} {\langle \mu, w \rangle - A(w)} and then note that the desired value is \arg\max_{w \in \Omega}{\langle \mu, w \rangle - A(w)}, so solving this problem boils down to estimating \nabla A^{\ast} efficiently, because setting it to zero will give us w.

Thus going from w to \mu requires estimating \nabla A, whereas going from \mu to w requires estimating \nabla A^{\ast}.

When p is a posterior distribution, the observed data typically gives us the weights w, and hence the inference problem becomes to use that to sample from the posterior.

Examples of Exponential Distributions

There are many examples of exponential distributions, which we now give.

  • High-Dimensional Normals: x\in \mathbb{R}^{d}, W(x) = -(x - \mu)^{\top} \Sigma^{-1} (x-\mu)
  • Ising Model: x\in {0, 1}^{d}, W(x) = \sum w_i x_I + \sum_{i, j \in E} w_{i, j} x_i x_j. A sufficient statistic for these is \hat{x} = (x, xx^{\top}), a fact which we will invoke repeatedly.
  • There’s many many more, including Gaussian Markov Random Fields, Latend Dirchlet Allocation, and Mixtures of Gaussians.

Going from canonical parameters to mean parameters

Now, we show how to go from w to \mu in a special case, namely in the mean-field approximation. The mean field approximation is the approximation of distributions by product distributions over x_1, \ldots, x_d \in {0, 1}^{d}, for which

\mathbb{P}(x_1, \ldots x_n) = \mathbb{P}(x_1)\mathbb{P}(x_2) \ldots \mathbb{P}(x_n).

Recall that the partition function can be computed as

A(w) = \max_{q \in \mathcal{Q}} \langle w, \mathbb{E}_{q} \hat{x} \rangle + H(q).

where \mathcal{Q} is the set of all probability distributions. If we instead write \mathcal{Q} as the set of product distributions (parametrized by \mu, or the probability that each variable is 1), we get

A(w) \ge \max_{\mu \in \mathcal{Q}} \sum w_i \mu_i + \sum w_{i,j} \mu_i \mu_j + \sum H(\mu_i),

where H(\mu_i) = -\mu_i \log \mu_i - (1 - \mu_i) \log{1 - \mu_i} and it now suffices to maximize the right hand function over the set K = { \mu\in [0,1]^d | \sum \mu_i = 1 }.

We can generally maximize concave functions over a convex set such as K. Unfortunately, this function is not concave. However, it is concave in every coordinate. This suggests the following algorithm: fix all but one variable and maximize over that variable, and repeat. This approach is known as Coordinate Ascent Variational Inference (CAVI), and its pseudocode is given below.

CAVI Algorithm

  1. Let \mu = 1/2 \cdot (1, 1, \ldots 1)
  2. Repatedly choose values of j in [n] (possibly at random, or in a loop.)
    2a. Update \mu_{j} = \arg\max_{x_j} (f(\mu_{-j}, x_j)) (where \mu_{-j} represents the non-j coordinates of \mu)

Part 3. Solution landscapes and the replica method

This part is best explained visually. Suppose that we have a Boltzman distribution. In the infinite temperature limit the this is the uniform distribution, distributed over the entire domain. As you decrease the temperature, the support’s size decreases. (Note: The gifs below are just cartoons. These pretend as if the probability distribution is always uniform over a subset. Also, in most cases of interest for learning, only higher order derivatives of entropy will be discontinuous.)

Sometimes there’s a discontinuity in entropy, and the entropy “suddenly drops” in a discontinuity. Often the entropy function itself is continuous, but the derivatives are not continuous at that point – a higher order phase transition.

Sometimes the geometry of the solution space undergoes a phase transition as well, with it “shattering” to several distinct clusters.

The replica method

Note: We will have an extended post on the replica method later on.

If you sampled x_1, \ldots x_n from p_{w}, where p_w is a high-dimensional distribution, you’d expect the distances to all be approximately the same. The overlap matrix, or

\begin{bmatrix} \langle x_1, x_1 \rangle & \cdots & \langle x_1, x_n \rangle \ \vdots & \ddots & \vdots \ \langle x_n, x_1 \rangle & \cdots & \langle x_n, x_n \rangle \end{bmatrix}

we approximate as

\begin{bmatrix} 1 & \cdots & q \ \vdots & \ddots & \vdots \ q& \cdots & 1 \end{bmatrix}

where q is some constant.

Suppose w comes from a probability distribution W. Then a very common problem is computing

\mathbb{E}_{w}[A(w)] = \mathbb{E}_{w}\left[ \log \int \exp(\langle w, \hat{x} \rangle)\right],

which is the expected free energy. However, it turns out to be much easier to find \log (\int E_{w} \exp(\langle w, \hat{x} \rangle)). So here’s what we do. We find

\mathbb{E}_{w} [A(w)] = \lim{n \rightarrow 0} \frac{\mathbb{E}_{w} [\text{exp}(n \cdot A(w))] - 1}{n}

This should already smell weird, because n is going to zero, an unusal notational choice. We can now write this as
\mathbb{E}{w} A(w) = \lim_{n \rightarrow 0} \frac{\mathbb{E}_{w}\left( \int{x} \exp(\langle w, \hat{x} \rangle)\right)^{n} - 1}{n}

which we can write as

\lim_{n \rightarrow 0} \frac{\int_{x_1, \ldots x_n} \mathbb{E}_{w} \exp(\langle w, \hat{x}_1 + \cdots + \hat{x}_n \rangle) - 1}{n}

Now these x_i‘s represent the replicas! Now, we’d hope \psi(n) = \int_{x_1, \ldots x_n} \mathbb{E}_w \exp(\langle w, \hat{x}_1 + \hat{x}_2 + \ldots + \hat{x}_n \rangle) is an analytic function, and we can now write this as \frac{\psi(n) - 1}{n} as n goes to zero.

Generally speaking, \mathbb{E}_{w} [\exp(w, \hat{x}_1 + \cdots + \hat{x}_n)] only depends on overlaps of Q, so we often guess the value of Q and calculate this expectation.


We’ll now give an example of how this is useful. Consider a spiked matrix/tensor model, where we observe Y so that Y = \lambda S + N, where S is the signal and N is the noise. Thus here we have

p(S', N' | Y) \propto \exp(\beta \langle Y - \lambda S', N' \rangle^2) \cdot p(S'),

and we want to analyze \mathbb{E}_{S' \sim p(\cdot | Y)} \langle S, S' \rangle^2 as a function of \lambda, which can be done by the replica method and exhibits a phase transition.

Other examples include Teacher-student models, where X, Y = (X, f_S(X) + N). Then, a recent work calculates the training losses and classification errors when training on this dataset, which closely match empirical values.

Symmetry breaking

Lastly, we’ll talk about replica symmetry breaking. Sometimes, discontinuities don’t just make the support smaller, but actually break the support into many different parts. For example, at this transition the circle becomes the three green circles.

When this happens, we can no longer make our overlap matrix assumption, as there are now asymmetries between the points. This leads to the overlap matrix being striped, in the fashion one would anticipate.

Sometimes, the support actually breaks into infinitely many parts, in a phenomenon called full replica symmetry breakng.

Finally, some historical notes. This “plugging in zero” trick was introduced by Parisi approximately 30 years ago, receiving a standing ovation at the ICM. Since then, some of those conjectures have been rigorously formalized, but many haven’t. It is still very impressive that Parisi was able to do it.

by Boaz Barak at April 02, 2021 03:04 PM UTC

Today I received an interesting email from our compliance office that is working on the accreditation of our PhD program in Statistics and Computer Science.

One of the requisites for accreditation is to have a certain number of affiliated faculty. To count as an affiliated faculty, however, one must pass certain minimal thresholds of research productivity, the same that are necessary to be promoted to Associate Professor, as quantified according to Italy’s well intentioned but questionably run initiative to conduct research evaluations using quantifiable parameters.

(For context, every Italian professor maintains a list of publications in a site run by the ministry. Although the site is linked to various bibliographic databases, one has to input each publication manually into a local site at one’s own university, then the ministry site fetches the data from the local site. The data in the ministry site is used for these research evaluations. At one point, a secretary and I spent long hours entering my publications from the past ten years, to apply for an Italian grant.)

Be that as it may, the compliance office noted that I did not qualify to be an affiliated faculty (or, for that matter, an Associate Professor) based on my 2016-2020 publication record. That would be seven papers in SoDA and two in FOCS: surely Italian Associate Professors are held to high standards! It turns out, however, that one of the criteria counts only journal publications.

Well, how about the paper in J. ACM and the two papers in SIAM J. on Computing published between 2016 and 2020? That would (barely) be enough, but one SICOMP paper has the same title of a SoDA paper (being, in fact, the same paper) and so the ministry site had rejected it. Luckily, the Bocconi administration was able to remove the SoDA paper from the ministry site, I added again the SICOMP version, and now I finally, if barely, qualify to be an Associate Professor and a PhD program affiliated faculty.

This sounds like the beginning of a long and unproductive relationship between me and the Italian system of research evaluation.

P.S. some colleagues at other Italian universities to whom I told this story argued that the Bocconi administration did not correctly apply the government rules, and that one should count conference proceedings indexed by Scopus; other colleagues said that indeed the government decree n. 589 of August 8, 2018, in article 2, comma 1, part a, only refers to journals. This of course only reinforces my impression that the whole set of evaluation criteria is a dumpster fire that is way too far gone.

by luca at April 02, 2021 02:09 PM UTC

In this post we describe a simple variant of Paxos (or Raft or any Lock-Commit) that is inspired by looking through the lens of HotStuff and Blockchain protocols. The most noticeable difference is that while Paxos and Raft aim to maintain a stable Primary/Leader (and change views infrequently), in Benign...

at April 02, 2021 09:54 AM UTC

Scribe notes by Praneeth Vepakomma

Previous post: Unsupervised learning and generative models Next post: Inference and statistical physics. See also all seminar posts and course webpage.

lecture slides (pdf)lecture slides (Powerpoint with animation and annotation)video

In this blog post, we will focus on the topic of robustness – how well (or not so well) do machine learning algorithms perform when either their training or testing/deployment data differs from our expectations.

We will cover the following areas:

  1. Math/stat refresher
  2. Multiplicative weights algorithm
  3. Robustness
    • train-time
      • robust statistics
      • robust mean estimation
      • data poisoning
    • test-time
      • distribution shifts
      • adversarial perturbations

📝Math/stat refresher

KL refresher

The KL-divergence between the probability distributions is the expectation of the log of probability ratios:

KL-divergence is always non-negative and can be decomposed as the difference between negative entropy and cross-entropy. The non-negativity property thereby implies that \forall distributions p,q, H(p,q) \geq H(p).

Introduction to concentration

Consider a Gaussian random variable X \sim N(\mu, \sigma^2). Concentration is the phenomenon that if you have i.i.d random variables X_1,X_2, \ldots,X_n with expectation \mu, then the empirical average \frac{\sum_i X_i}{n} is distributed approximately like N(\mu, \frac{1}{n})=\frac{1}{\sqrt{n}}N(\mu,1)
Therefore the standard deviation of the empricial average is smaller than the standard deviation of each of the Y_i‘s. The central limit theorem ensure this asymptotically w.r.t n, while non-asymptotic versions of this phenomenon can be seen via popular inequalities such as the Chernoff/Hoeffding/Bernstein styled inequalities. These roughly have the form P[|\sum_i X_i - \mu.n| \geq \epsilon n] \approx \exp(-\epsilon^2 n)

Matrix notations


We denote the spectral norm of a matrix \mathbf{A} by \left | \mathbf{A} \right| = \underset{| v| = 1}{\max}\left| \mathbf{A}v \right| = \lambda_{\max}(\mathbf{A}) and the Frobenius norm by \left | \mathbf{A} \right|_F = \sqrt{\sum\mathbf{A}_{ij}^2}.

PSD Ordering

We use the matrix ordering \mathbf{A} \preceq \mathbf{B} if v^T\mathbf{A} v \leq v^T\mathbf{B} v for all vectors v. We similarly refer to \mathbf{A} \in [a,b] \mathbf{I} if a\mathbf{I} \preceq \mathbf{A} \preceq b\mathbf{I}.

Matrix/vector valued random variables

Vector valued normals: If \mu \in \mathbb{R}^{d} is a vector, and \mathbf{V} \in \mathbb{R}^{d \times d} is a psd covariance matrix, with x \sim N(\mu, \mathbf{V}) normal over \mathbb{R}^{d} with

\mathbb{E}\left[x_{i}\right]=\mu_{i}, \;\; \mathbb{E}\left[\left(x_{i}-\mu_{i}\right)\left(x_{j}-\mu_{j}\right)\right]=\mathbf{V}_{i, j}

For every psd matrix V, there exists a Normal distribution where
\mathbb{E}\left[\left(x_{i}-\mu_{i}\right)\left(x_{j}-\mu_{j}\right)\right]=\mathbf{V}_{i, j}

Standard vector-valued normal: This refers to the case when \mathrm{x} \sim N\left(0^{d}, I_{d}\right) (or \left.\mathrm{x} \sim N(0, I)\right) and

\mathbb{E} x=\overrightarrow{0},\;\; \mathbb{E}\left[x x^{\top}\right]=I \quad\left(\begin{array}{l} \mathbb{E} x_{i}^{2}=1 \ \mathbb{E} x_{i} x_{j}=0 \text { for } i \neq j \end{array}\right)

Matrix concentration

For scalar random variables, we saw that if Y_{1}, \ldots, Y_{n} are i.i.d over \mathbb{R} bounded with expectation \mu then

\mathrm{Pr}\left[\left|\sum Y_{i}-\mu \cdot n\right| \geq \epsilon n\right] \approx \exp \left(-\epsilon^{2} n\right)

If the random variables were vectors instead, we can easily generalize this to n samples of Y_i \in \mathbb{R}^d. Generalizing to matrices is far more interesting, especially when the norm under consideration is the spectral norm instead of the Frobenius norm.

Matrix Bernstein inequality:

If Y_{1}, \ldots, Y_{n} i.i.d symmetric matrices in \mathbb{R}^{d \times d} with \mathbb{E} Y_{i}=\mu,\left|Y_{i}\right| \leq O(1) (i.e bounded in spectral norm), then

\mathrm{Pr}\left[\left|\sum Y_{i}-\mu \cdot n\right| \geq \epsilon n\right] \approx d \cdot \exp \left(-\epsilon^{2} n\right)

Note that \mu is a matrix in this case, instead. The norm under consideration is the spectral norm. Note that the difference in the inequality w.r.t the scalar case is the additional multiplicative factor of d or an additive factor of \log(d) in the exponent as in the equations above. Please refer to Tropp, 2015 Chapter 6 for formally precise statements.

Another property that follow is the the expected norm of the difference follows
\mathbb{E}\left|\sum Y_{i}-\mu\right| \leq O(\sqrt{n \log d})

There are some long-standing conjectures and results on cases when one can get rid of this additional factor of \log(d).

Trivia on log factors: For example as a tangent (as well as some trivia!) on some results of this flavor, where a \log factor was replaced by a constant, includes Spencer’s paper from 1985 titled ‘Six standard deviations suffice’, which showed that a constant of 6 would suffice in certain bounds where a naive result would instead give a \log(n) dependency. The Kadison-singer problem (by Spielman-Srivastava) and the Paulsen problem are two other examples of works in this flavor.

Random matrices

For a random d \times d symmetric matrix matrix \mathbf{A} with \mathbf{A}_{i,j} \sim N(0,1), the spectrum of eigenvalues is distributed according to Wigner semi-circle law on a support between [-2\sqrt{d},+2\sqrt{d}] as shown in the figure below. Note that most mass is observed close to 0.

Marchenko-Pastur distribution (for random empirical covariance matrices)

For \mathbf{X = \frac{1}{n} AA^T, A} \in \mathbb{R}^{d\times n}, \mathbf{A}_{ij} \sim N(0,1)
we have \mathbf{X} is the empirical estimate for covariance of x_1,\ldots,x_n \sim  N(0,\mathbf{I}_d)
the eigenvalues are distributed according to the Marchenko-Pastur distribution. In the case when, d < n , the eigenvalues will be bounded away from zero. When d \approx n , then it has a lot more mass close to 0 (way more than the semi-circle law). When d > n , then there is a spike of d-n eigenvalues at 0 and the rest are bounded away from 0.

Connections to the stability of linear regression
In the regime, when d \approx n, linear regression is most unstable, which is the case when most of the eigenvalues of the empirical covariance are close to 0. When d>n, although linear regression is over-parametrized, the solution is not exact, but the approximate solution is still pretty stable as in the case of d < n. When d\gg n, the condition number is infinite, and therefore we use a pseudo-inverse as opposed to the usual inverse in computing the solution for linear regression. This ignores the subspace on which the matrix has 0 eigenvalues while the inverse is performed. The condition number of the subspace of the non-zero eigenvectors is finite.

Digression: Multiplicative Weights Algorithm

Note: Its variants/connections include techniques/topics of, Follow The Regularized Leader / Regret Minimization / Mirror Descent. Elad Hazan’s lecture notes on online convex optimization and Arora, Hazan, Kale’s article on multiplicative and matrix multiplicative weights are a great reading source for these topics.

The following is a general setup in online optimization and online learning.

Setup: n possible actions a_{1}, \ldots, a_{n}

At time t=1,2, \ldots, T, we incur loss L_{i, t} for action i at time t. After this action, we also learn the loss for actions we did not take. (This is referred to as the experts model in online learning, in contrast to the bandit model where we only learn the loss for the action taken; the experts model is an easier setup than the bandit model.)

The following is the overall approach for learning to optimize in this online setup.

Overall Approach:

  • Initialize p_{0} distribution over action space [n]. We then take a step based on this distribution and observe the incurred loss.
  • Then the distribution is updated as p_{t+1} by letting p_{t+1}(i) \propto p_{t}(i) \exp \left(-\eta L_{i, t}\right). i.e, if an action gave a bad outcome in terms of loss, we downweight it’s probability for the next draw of action to be taken. \eta is the penalty terms that governs the aggresiveness of the downweighting.

The hope is that this approach converges to a “good aggregation or strategy”. We measure the quality of the strategy using regret.

The difference between the cost we incur and the optimal action (or probability distribution over actions) in hindsight is known as the (average) regret.

Note that we compare the loss we incur with the best loss over a fixed (i.e., nonadaptive) probability distribution over the set of possible actions. It can be shown that the best such loss can always be achieved by a distribution that puts all the weight on a single action.

The following theorem bounds the regret.

The ‘prior ignorance’ term captures how good the initialization is. The ‘sensitivity per step’ term captures how aggressive the exploration is (i.e this term governs the potential difference between loss incurred at p_t versus p_{t-1}). As \eta also occurs in the denominator of the prior ignorance term, it needs to be carefully chosen to balance the two terms properly.

The first inequality of this theorem is always true for any p\ast, p_0 whether p\ast is optimal strategy or not. The second (right-most) inequality is true when optimal p\ast is the delta function (which it will be in optimal strategy), p_0 is initialized to uniform distribution, and \eta is set to be \sqrt{\frac{\log(n)}{t}}. Note that in this case, the divergence \Delta_{KL}(p\ast | p_0) = \log n.

To state the ‘overall approach’, more precisely, there needs to be a normalization Z_{t} at every step as below
p_{t+1}(i)=p_{t}(i) \exp \left(-\eta L_{i, t}\right) / Z_{t}
The above can be rearranged as follows upon taking log
L_{t, i}=\frac{1}{\eta} \log \frac{p_{t}(i)}{p_{t+1}(i) \cdot Z_{i}}

By substituting this in below

we get the following expanded version of what the regret amounts to be:

The last equation above is due to the telescoping sum where adjacent terms cancel out except the first and last terms.

The last inequality here is because the cross-entropy is always at least as large as the entropy.

So now we have this so far

PF: Regret \leq \frac{\Delta_{K L}\left(p^{\ast} | p_{0}\right)}{\eta \cdot T}+\frac{1}{\eta \cdot T} \sum_{t} \Delta_{K L}\left(p_{t}, p_{t+1}\right)

Now there is this property that if p, q s.t. p(i) \propto q(i) \rho_{i} for \rho_{i} \in[1-\eta, 1+\eta] then \Delta_{K L}(p | q) \leq O\left(\eta^{2}\right)
Therefore upon substituting this, we prove the upper bound stated over the regret.

This subclaim that was used can be proved as follows

Generalization of multiplicative weights: Follow The Regularized Leader (FTRL)

When the set K of actions is convex (set of probability distributions on discrete actions is convex),

At time t+1, it makes a choice x_{t+1} \in K and learns cost function L_{t+1}: K \rightarrow \mathbb{R} (so the setup is again like the experts model unlike the bandit model).

Now the new action in FTRL is based on the following optimization which is a regularized (with R(x)) loss:
FTRL: x_{t+1}=\arg \min_{x \in R}\left(R(x)+\sum_{i=1}^{t} L_{i}(x)\right)

In this case, the regret bound is given by the following theorem.

x^\ast refers to the optimal choice. This indeed has a similar flavor to the theorem we saw for multiplicative weights. To be precise when
K={\text { set of all distributions on action space }[n]}, \text {and} \; R(x)=-\frac{1}{\eta} H(x) we have that the multiplicate weights becomes FTRL. Similarly there is a view that connects multiplicative weights to mirror descent. (Nisheeth Vishnoi has notes on this.)

Train-Time Robustness

We now look at some robustness issues specific to the training phase in machine learning.

For example, during training, there could be adversarially poisoned examples in the training dataset that damage the trained model’s performance.

Setup: Suppose 1-\epsilon samples are generated from a genuine data distribution and \epsilon samples are maliciously chosen from an arbitrary distribution. i.e, in formal notation: x_{1}, \ldots, x_{(1-\epsilon) n} \sim X \subseteq \mathbb{R}^{d}, \;\; \text {and } x_{(1-\epsilon) n+1}, \ldots, x_{n} are arbitrary. (While for convenience we assume here that the last \epsilon n items are maliciously chosen, the learning algorithm does not get the data in order.)

Assume \left|x_{i}\right|^{2} \approx 1 for i<(1-\epsilon) n

Let us start with the task of estimating the mean under this setup.

Mean estimation: Estimate \mu = \mathbb{E}X

Noiseless case: In the noiseless case, the best estimate for the mean under many measures is the empirical mean \hat{\mu}=\frac{1}{n}\sum_i X_i

Standard concentration results show that for d=1 |\hat{\mu}-\mu| \leq O(1 / \sqrt{n}) and for a general d that |\hat{\mu}-\mu| \leq O(\sqrt{d / n}).

Adversarial case: If there is even a single malicious sample (that can be arbitrarily large or small) then it is well known that the empirical mean can give an arbitrarily bad approximation of the true population mean. For d=1 we can use the empirical median \mu^{\ast}=\mathrm{sort}\left(x_{1}, \ldots, x_{n}\right)_{n / 2}. This is guaranteed to lie in the \left(\frac{1}{2}-\epsilon, \frac{1}{2}+\epsilon\right) quantile of real data. This is most optimal because of the property that X \sim N(\mu, 1),\left|\mu-\mu^{\ast}\right| \leq O(\epsilon+1 / \sqrt{n}).

In a higher-dimension, this story instead goes as follows:

Setup: x_{1}, \ldots, x_{(1-\epsilon) n} \sim X \subseteq \mathbb{R}^{d}, \quad x_{(1-\epsilon) n+1}, \ldots, x_{n} arbitrary.

Median of coordinates (a starter solution): When (d \geq 1), we have \mu_{i}^{\ast}=\mathrm{sort}\left(x_{1, i}, \ldots, x_{n, i}\right)_{n / 2} per coordinate as an initial naive solution. (i.e a median of coordinates). In this case, say if we have X \sim N(\mu, I), then an adversary can perurb the data to make \epsilon fraction of the points be (\mu_1+M,\mu_2+M,\ldots,\mu_d+M) where M is some large number. This will shift in each coordinate i the distribution to have median \mu_i + \Omega(\epsilon) instead of median \mu_i, and hence make \mu^{\ast} \approx \mu + c\cdot (\epsilon, \ldots, \epsilon) which implies that \left|\mu^{\ast}-\mu\right| \approx c\cdot \epsilon \sqrt{d} for some constant c>0.

The obvious next question is: Can we do better? i.e, can we avoid paying this dimension-dependent \sqrt{d} price?

Yes we can! We can use the ‘Tukey Median’!.

What is a Tukey median?

Informally via the picture below, if you have a Tukey median (red), no matter which direction you take from it and look at the half-space, it exactly partitions the data into about half the # of points.

Formal definition of Tukey median is given as follows (fixing some parameters):

A Tukey median* of x_{1}, \ldots, x_{n} \in \mathbb{R}^d is a vector \mu^{\ast} \in \mathbb{R}^d s.t. for every nonzero v \in \mathbb{R}^{d}

A Tukey median need not always exist for a given data. However, we will show that if the data is generated at random according to a nice distribution, and then at most \epsilon fraction of it is perturbed, a Tukey median will exist with high probability. In fact, the true population mean will be such a median.

Existence of Tukey median

THM: If x_{1}, \ldots, x_{(1-\epsilon) n} \sim N(\mu, I) and \sqrt{d / n} \ll \epsilon then

  1. Tukey median \mu^{\ast} exists and
  2. For every Tukey median \mu^{\ast} , \left|\mu-\mu^{\ast}\right| \leq O(\epsilon).

Together 1. and 2. mean that if we search over all vectors and output the first Tukey median that we find, then (a) this process will terminate and (b) its output will be a good approximation for the population mean. In particular, we do not have to pay the extra \sqrt{d} cost needed in the median of coordinates!

** Proof of 1 (existence):**
The mean itself is the Tukey median in this case because, \forall v \neq 0, if we define Y_{1}=\mathrm{sign}\left(\left\langle x_{1}-\mu, v\right\rangle\right), \ldots, Y_{k}=\mathrm{sign}\left(\left\langle x_{k}-\mu, v\right\rangle\right) then these are i.i.d ±1 vars of mean zero, and thus:

\mathrm{Pr}\left[\sum Y_{i}>\epsilon k\right]<\exp \left(-\epsilon^{2} n\right) \ll \exp (-d) for \sqrt{d / n} \ll \epsilon and k=(1-\epsilon)n. Through discretization, we can pretend (losing some constants) that there are only 2^{O(d)} unit vectors in \mathbb{R}^d. Hence we can use the union bound to conclude that for every unit v, if we restrict attention to the “good” (i.e., unperturbed) vectors x_1,\ldots, x_k, then the fraction of them satisfying \langle x_i -\mu , v \rangle >0 will be in 1/2 \pm \epsilon. Since the adversary can perturb at most \epsilon n vectors, the overall frection of x_i‘s such that \langle x_i -\mu , v \rangle >0 will be in 1/2 \pm 2\epsilon. QED(1)

** Proof of 2:**
Let \mu be the population mean (i.e., the “good” x_i‘s are distributed according to N(\mu,I)). Suppose for simplicity, and toward a contradiction, that \left|\mu-\mu^{\ast}\right| = 10 \epsilon.

Let v=\mu-\mu^{\ast}. Then,
\left\langle x_{i}-\mu^{\ast}, v /|v|\right\rangle=\left\langle x_{i}-\mu+v, v /|v|\right\rangle=\left\langle x_{i}-\mu, v /|v|\right\rangle +|v| = \left\langle x_{i}-\mu, v /|v|\right\rangle +10 \epsilon.

Note that \left\langle x_{i}-\mu, v /|v|\right\rangle is distributed as N(0,1), and so we get that \langle x_i - \mu^* , v/|v|\rangle is distributed as N(10\epsilon,1).
Hence, if Y_{i}=\mathrm{sign}\left(\left\langle x_{i}-\mu^{\ast}, v /|v|\right\rangle\right) then \mathrm{Pr}\left[Y_{i}=-1\right]=\mathrm{Pr}[N(10\epsilon ,1) < 0] \leq 1 / 2-5 \epsilon.
This implies that via a similar concentration argument as above, for every v, there will be with high probability at most 1/2 - 4\epsilon fraction of i‘s such that Y_{i}=-1, contradicting our assumption that \mu^\ast was a Tukey median. QED(2)

Exactly computing the Tukey median is NP-hard, but efficient algorithms for robust mean estimation of normals and other distributions exist as referred to in Jerry Li’s lecture notes. In particular, we can use the following approach using spectral signatures and filtering.

Spectral Signatures can efficiently certify that a given vector is in fact a robust mean estimator.
Let x_{1}, \ldots, x_{(1-\epsilon) n} \sim N(\mu, I) and x_{(1-\epsilon) n+1}, \ldots, x_{n} arbitrary
Let \hat{\mu}=\frac{1}{n} \sum_{i=1}^{n} x_{i} be the empirical mean and \widehat{\Sigma}=\frac{1}{n} \sum_{i-1}^{n}\left(x_{i}-\hat{\mu}\right)\left(x_{i}-\hat{\mu}\right)^{\top} be the empirical co-variance. Then we can bound the error of \hat{\mu} as follows:

Claim: |\hat{\mu}-\mu| \leq O\left(\sqrt{\frac{d}{n}}+\sqrt{\epsilon}|\hat{\Sigma}|\right)

In other words, if the spectral norm of the empirical covariance matrix is small, then the empirical mean is a good estimator for the population mean.

Note{ }^{\ast}: If all points are from N(\mu, I) then |\hat{\Sigma}|=O(\sqrt{d / n}).

The Proof for this claim is given below

Explanation: Here, we assume the \mu=0 for simplicity without loss of generality. The norm can be split into additive terms on good points (green text above) and malicious points (red text above). The first term of this inequality follows from standard concentration. For the second (red) term, we can modify it by adding and subtracting \hat{\mu}. We can then apply the Cauchy-Schwartz (cs) inequality to prove it. Upon rearranging the terms and dividing by the norm of \mu, we get the desired result.

Filtering is an approach to turn the certificate into an algorithm to actually find the estimator. The idea is that the same claim holds for non-uniform reweighting of data points to estimate the empirical mean and covariance. Hence we can use a violation of the spectral norm condition (the existence of a large eigenvalue) to assign “blame” to some data points and down-weigh their contributions until we reach a probability distribution over points that on the one hand is spread on roughly 1-O(\epsilon) fraction of the points and on the other hand leads to a weighted empirical covariance matrix with small spectral norm. The above motivates the following robust mean estimation algorithm in the online case as below:

Explanation: We first compute the mean and covariance based on uniform weighting based, and the certificate is checked via the spectral norm of the covariance. If the quality isn’t good enough, then the blame is given to the largest eigenvector of the covariance that contributes to the most error. The weighting is now improved from the uniform initialization via the multiplicative weights styled update as given in step 3. Jerry Li and Jacob Steinhardt have wonderful lecture notes on these topics.
The algorithm above is computationally efficient while the bounds are not that tight.

SoS algorithms: Another approach is via sum of squares algorithms where the guarantees for statistical robustness are much tighter, but they are computationally not very efficient although they are polynomial time. A hybrid approach might as well give a balance of these based on the problem at hand to bridge this gap between computational efficiency vs. statistical efficiency.

A list of relevant references is given below.

Test-time robustness

We now cover robustness issues with distribution shift and adversarial data poisoning during the testing phase in machine learning.

Warm-up with pictures

As shown in Steinhardt, Koh, Liang, 2017 the images below illustrate how poisoning samples can drastically alter the performance of a classifier from good to bad.

Shafahi , Huang, Najibi, Suciu, Studer, Dumitras, Goldstein, 2018 showed how poisoning images look perfectly fine to the human perception, while they flip the model’s performance where a fish is considered to be a dog and vice versa as shown below.

Another problem with regards to test-time robustness is the issue of domain shift where the distribution of test data is different from the distribution of training data.

\underset{x,y \sim D}{\mathbb{E}} L(f(x),y) vs. \underset{x,y \sim D'}{\mathbb{E}} L(f(x),y) . If L is bounded, then if Lipschitz constant is known then distances like earth mover’s distance, T.V distance can be used to bound this. But one needs to be quite careful or these bounds are too large or close to vacuous.

For example, images of dogs taken say in a forest vs. images of dogs taken on roads have huge distance in measures like the ones mentioned above, as many pixels (although not important pixels) across the images are very different and there could be a classifier that performs terribly across the test while it does good on train. But magically, there appears to be a linear relationship between the accuracy on D vs accuracy on D'. Typically one would expect that the line would be below the x=y (45 degree) line.

i.e say if it was trained on cifar 10 (D) and tested on cifar10.2 (D') then finding this line to be a bit lower than the x=y line is not surprising but this linear correlation is quite surprising. It is even more surprising if the datasets are different-say in the case when performance on photos vs illustrations or cartoons of these photos was considered.

What are the potential reasons (intuitively)?

i) Overfitting on cifar test,

ii) Fewer human annotators per image introduces skew towards hardness of dataset,

iii) Running out of images by human annotators as they might end up choosing images that are easier to annotate.

If we achieve better accuracy on D' than we achieved on D, it is a strong indicator of a drop in hardness. If the errors are in the other direction, then this reasoning isn’t as clear.

A toy theoretical model

Here is a toy model can demonstrate this surprising linear relationship between accuracies under domain shift.
Let us do a thought experiment where things are linear. Let us assume there was a true vector cat direction in terms of the representation(feature) as shown in the cartoon below. Let there be some correlated idiosyncratic correlations. An example, idiosyncrasy is say due to cats tending to be photographed more in indoors than in outdoors.

Consider x to be a point that needs to be labeled.

In some dataset D consider the probability that x is labeled as a cat is proportional to the following exponential as:
\mathrm{Pr}[x labeled cat ] \propto \exp (\beta\langle x, C A T+\alpha I\rangle)

where \alpha to denote the theidiosyncratic correlation factor. This is the exponent of the dot product of the image to be labeled x with the CAT direction and the idiosyncratic direction. The same can be done for a dataset D' as

Dataset D^{\prime}: \mathrm{Pr}[x labeled cat ] \propto \exp \left(\beta^{\prime}\left\langle x, C A T+\alpha^{\prime}I^{\prime}\right\rangle\right)

Intuitively \beta' is like the signal to noise ratio. That is, if \beta' < \beta then D' is a harder dataset to classify than D.

So in this toy model, the best accuracy that can be reached for any linear classifier is given by the following, where the softmax of the RHS is the classification probability.

A c c_{D}(C)=\beta\langle C, C A T\rangle+\beta \alpha\langle C, I\rangle
A c c_{D \prime}(C)=\beta^{\prime}\langle C, C A T\rangle+\beta^{\prime} \alpha^{\prime}\left\langle C, I^{\prime}\right\rangle
The first term of this accuracy is the universal and transferable part and the second term is the idiosyncratic part.

The following is an For the A c c_{D \prime}(C) we assume that if C is trained on D then we assume \approx 0. This is because if C is trained on D, then idiosyncratic directions of D,D' are orthogonal to each other. So the \approx 0 and the accuracy will be just this term of \beta^{\prime}\langle C, C A T\rangle.

If the model is learnt by gradient descent then the gradient direction will always be proportional to CAT + \alpha I + Noise as the gradient is of the form
\nabla(\beta\langle C, C A T\rangle+\alpha \beta\langle C, I\rangle)=\beta \cdot C A T+\beta \alpha \cdot I
So, if C is trained on D, then C \propto C A T+\alpha \cdot I+ Noise =\frac{\gamma}{|C A T+\alpha \cdot I|^{2}}(C A T+\alpha \cdot I)+ Noise. Here, \beta is given by \frac{\gamma}{|C A T+\alpha \cdot I|^{2}}.
Therefore the accuracies will be as follows

A c c_{D}(C)=\beta\langle C, C A T+\alpha \cdot I\rangle=\beta \gamma
A c c_{D^{\prime}}(C)=\beta^{\prime}\left\langle C, C A T+\alpha \cdot I^{\prime}\right\rangle=\beta^{\prime} \gamma \cdot \frac{|C A T|^{2}}{|C A T|^{2}+\alpha^{2}|I|^{2}}

Therefore, we see this form of a linear relationship:
A c c_{D^{\prime}}(C)=\frac{\beta^{\prime}}{\beta\left(1+\theta^{2}\right)} \cdot \mathrm{Acc}_{D}(C)

Note that:

  • \beta^{\prime} / \beta<1 iff D^{\prime} harder than D
  • \theta^{2} grows with idiosyncratic component of D

Although this is a toy theoretical model, it explains a linear relationship. However, finding a model that explains this linear relationship in real-life will be an interesting project to think of.

Adversarial perturbations

We now move to the last (yet another active) topic of our blog: adversarial perturbations. As an introductory example (taken from this tutorial), the hog image x was originally classified as a hog with \approx 99.6 \% probability. A small amount of noise \Delta was added to get the x+\Delta image which to the human eye perceptually looks indistinguishable. That said the model now ends up misclassifying the noised hog to something that is not a hog.

Should we be surprised with the efficacy of such adversarial perturbations? Originally-certainly yes, but not as much now in hindsight!

In this example it turns out that the magnitude of each element satisfies \left|\Delta_{i}\right| \approx \frac{1}{64}\left|x_{i}\right| and the 2-norm of this vector is roughly as |\Delta| \approx \frac{1}{64}|x|. Note that the Resnet50 model outputs r(x) at the penultimate layer has a dimension d=2048. We scale |x| such that |r(x)| \approx \sqrt{d}. There is a Lipschitz constant L w.r.t the preservation of the following norms: |r(x+\Delta)-r(x)| \approx L|\Delta|. The final classification decision is made by looking at C \cdot H O G where HOG is a unit vector such that H O G: unit vector s.t. \mathrm{Pr}[x is h o g] \propto\langle H O G, r(x)\rangle. We assume some randomness in the decision as being \sqrt{1-c^{2} / d} \cdot N(0, I) for simplicity. So we have
r(x)=C \cdot H O G+\sqrt{1-c^{2} / d} \cdot N(0, I) where C=\langle x, HOG \rangle.We now have the probability that x is not a hog as

\mathrm{Pr}[x is not hog ]=\frac{\exp (-C)}{\exp (C)+\exp (-C)}.

As we know that the observed probability of x being not a hog is 0.0996, we can calculate that C \approx 3.
Upon normalizing |r(x)|^2 to be d=2048, We can expect the following square of this dot product w.r.t the representation r(x) as:\langle r(x), H O G\rangle^{2} \approx 3 \approx |r(x)|^2 / 700
Therefore the norm of the projection of r(x) to the HOG direction is given by \left|r(x)_{H O G}\right| \approx \frac{1}{25}|r(x)|.

So say if the 2048 vector had one larger element which accounts for the HOG direction, although it still accounts for a small proportion of its total norm. Therefore it wouldn’t need too much noise to flip one class to its wrong class label as shown in the cartoons below.

So if the Lipschitz constant is greater than around 2.5 or 3, then a fraction of 1/25 is enough to zero out the hog direction (as L \gg \frac{64}{25} \approx 2.5).

Robust loss

What are some strategies for training neural networks that are robust to perturbations?

  • A set of transformation that do not change the true nature of the data such as for e.g:
    t_{\Delta}(x)=x+\Delta where the set is

i.e, they only perturb the image or data sample by atmost \epsilon upon applying that transformation.
Now, given a loss function \mathcal{L}: Y \times Y \rightarrow \mathbb{R} and a classifier f: X \rightarrow Y, a robust loss function of f at point (x, y) is defined as

Robust training:

Given x_{1}, y_{1}, \ldots, x_{n}, y_{n} and arobust loss function, a robust training would involve the minimization of this loss which gives us:

Now for the subgoal of finding \nabla_{f} \max_{t \in \mathcal{T}} \mathcal{L}\left(f\left(t(x{i})\right), y_{i}\right) for the sake of optimization, invoking Danskin’s Theorem will greatly help. The theorem basically says that if g(f, t) is nice (diff, continuous) and if \mathcal{T} is compact then we have that:

i.e for any function g(f,t), that depends on f and t as long as the function space \mathcal{T} from which t needs to be chosen is nice (diff, continuous) and compact, then to find the gradient of the maximum of the function g(f,t) then by the theorem one can find maximizer t^* for any particular f, then that can give the required gradient (after some fine print that is given in note below).
Note: The paper below extends to the case when t^{\ast}(f) non unique though there is other fine print. See Appendix A of (Madry Makelov Schmidt Tsipras Vladu 2017).

On the empirical side, there seems to be a trade-off. If one wants to achieve adversarial robustness, it can be achieved by letting go of some model accuracy. See discussion here and the papers cited below.

by Boaz Barak at April 01, 2021 11:14 PM UTC

This is embarrassing to admit but after a few badly timed trades on GameStop options I find myself a bit tight on money. To raise some cash, I reluctantly decided to sell one of my prized possessions, one of my theorems.

For Sale: Boolean formula satisfiability cannot be solved in both logarithmic space and quasilinear time. For a more formal and slightly more general statement and a proof, see this paper.

Bidding starts at 12 BTC (about $705,000). 

The winning bid, upon verified payment, will receive:

  • The ability to give the theorem the name of your choice such as your own name, your best friend's mother's name or "Teddy McTheoremface".
  • A non-fungible token (NFT) attesting ownership of the theorem and the name you have chosen for it.
  • Anyone citing this result will be required to note that you own it and use the name you chose above. You cannot, however, limit the use of the theorem or receive compensation for its use. 
  • By virtue of owning this theorem you will a Fortnow number of zero. This immediately gives you an Erdős number of 2. If you have previously written a paper with Paul Erdős then both of us will now have an Erdős number of 1.
Frequently Asked Questions

Q: Why this theorem?

A: The theorem is in one of my few solely authored papers and I can't afford to share the proceeds of the sale. 

Q: Doesn't Ryan Williams and others have stronger theorems?

A: The results are incomparable. Ryan gives bounds on a single algorithm with low time and space. My theorem allows different machines for the time and space bounds.

Also, to the best of my knowledge, Ryan's theorem is not for sale.

Q: Doesn't the proof of the theorem rely on other people's theorems such as Nepomnjascii? Shouldn't he get some of the value from this sale?

A: I'm not selling the proof of the theorem, just the theorem itself.

Q: If I purchase this theorem will I get to write next year's April Fools post?

A: No.

by Lance Fortnow ( at April 01, 2021 01:22 PM UTC

We call any consistent and sufficiently powerful formal theory that enables to algorithmically in polynomial time verify whether a text is a proof \textbf{efficiently verifiable mathematics} (ev-mathematics). We study the question whether nondeterminism is more powerful than determinism for polynomial time computations in the framework of ev-mathematics. Our main results are as follows. \\ "$\P \subsetneq \NP$ or for any deterministic, polynomial time compression algorithm $A$ there exists a nondeterministic, polynomial time compression machine $M$ that reduces infinitely many binary strings logarithmically stronger than $A$." \\ "$\P \subsetneq \NP$ or f-time resource bounded Kolmogorov complexity of any binary string $x$ can be computed in deterministic polynomial time for each polynomial time constructible function $f$."

at April 01, 2021 12:52 PM UTC

All non-metallic UK currency to feature Computing and Data Science pioneers

Cropped from BBC source

Charles Babbage and Ada Lovelace will be reunited. Their work on designing and programming a computing device, a century before any machine was built, is being honored by their appearance on the reverse of the 5-pound note in a new series authorized by the Bank of England. Royal Statistical Society Fellow Florence Nightingale will appear on the 10, EDSAC creator Maurice Wilkes on the 20, and Alan Turing on the 50.

Today, April 1, we are delighted to convey this news and show previews of the banknotes.

We have mentioned Babbage and Lovelace several times on this blog, the latter first here. Our 2015 post on her work on his project sought to engage a scholarly consensus that tends to minimize it. We argue that it should be judged on the plane of a doctoral or masters advisory relationship. Our point is that she greatly amplified the technical content of his work as shown to the scientific community.

The design shown above makes a leap from their Victorian world to our online age. Being “online” back then probably meant being reached by the new railroad system centered on London. The one indelible connection between our world and theirs is shown by the obverse of the banknote: both worlds have a long-serving British queen.

Nightingale On The 10

Besides a long life (1820–1910), Nightingale has a long entry as a founder of statistics. Her actions and influence extended for more than five decades after her iconic “Lady With the Lamp” ministry to soldiers of the Crimean War depicted on the 10-pound note:

Nightingale 200 Years source

The picture at right is based on a photo by William Kilburn, one of few series of photos she allowed to be taken. We have not found a definitive word on what she is holding—not even from this long article—but we would like to believe it is a letter or paper regarding medical practice. Already in 1854, the Oxford Chronicle and Reading Gazette noted her bringing much more than a lamp:

In a knowledge of the ancient languages and of the higher branches of mathematics, in general art, science, and literature, her attainments are extraordinary. There is scarcely a modern language which she does not understand; and she speaks French, German, and Italian as fluently as her native English.

Not only did she popularize the pie chart and invent other statistical visualization techniques, she was among the first people, period, to bring statistical inference into policy. This included demonstrating with data the higher exposure of nurses to multiple diseases among those they tended. She was inducted into the Royal Statistical Society in 1858.

Her medical conduct principles have been applied expressly during the pandemic, as recognized by the time of the 200th anniversary of her birth last May 12, and field hospitals in the UK for Covid-19 treatment are named for her. Her recognition on the currency was considered a way to honor first-line caregivers and was a unanimous vote of the Bank of England commission. It has the earliest release date of all the banknotes.

Wilkes On The 20

Maurice Wilkes had an even longer life, 1913 to 2010. This marks a stark contrast to Turing. Wilkes was the second winner, in 1967, of the Turing Award. The Turing Award citation said he was

“the builder and designer of the EDSAC, the first computer with an internally stored program.”

He was thus the first realizer of the full vision of Babbage and Lovelace, although many claim the stored-program distinction for the Manchester Baby one year earlier, in 1948.

There was controversy about the order of him and Turing. The banknotes were intended to go in chronological order. Turing’s famous paper conceiving the computer dates to 1936–37. Wilkes’s first brush with the project that became the EDSAC came in 1945, when he spent an iconic 24 hours with a loaned copy of John von Neumann’s First EDVAC Report that he was not able to mimeograph. However, Wilkes was involved with analog computing devices before 1937 and his EDSAC reached full operation in May 1949, whereas Turing’s own main project, the ACE, launched its pilot version only in 1950.

The ultimate determiner was that the 20-pound note was already recently revised to feature the artist Joseph Turner in February 2020. The Turner issue will have a five-year term, so Wilkes will appear in 2025. The banknote design has not yet been executed.

Turing On The 50

Of course, this is the highlight for us, and we are delighted to be able to show the Turing design in full glory. It was formally unveiled last week:

Belfast Telegraph source

The Pilot ACE machine is shown overlaid with Turing machine code from his 1936 paper. Turing tapes with binary code appear in three places. The big one represents his June 23, 1912 birth date—exactly how, we leave as a puzzle. We’ll just hint that all our friends across the Pond write dates differently from how we do. The new banknotes are being released on Turing’s birthday this coming June.

The GCHQ, which grew out of Turing’s employers at Bletchley Park, has just released its own puzzle challenge, which references Easter-egg features of the Turing banknote. The binary code puzzle is 11th of 12, but to our surprise, when you click it, it gives the answer to the birth date part right away. Then it proceeds to something far more cryptic—well, it’s from GCHQ after all.

Other Notes

The Turing release will also complete the conversion from paper to polymer of all Bank of England notes, which are also the only series issued for Wales. Scotland and Northern Ireland issue their own pound-denominated banknotes with different designs. They have yet to convert to polymer, and that will delay any consideration of adopting the new series.

Turing’s family on his father’s side came from Scotland, so there is some sentiment for the Royal Bank of Scotland adopting the design. There is also discussion of recognizing Robin Milner and/or Donald Michie, though both were born in England.

Over in Ireland, both sides of the Northern Ireland border are considering the statistician William Gosset, who as Head Brewer of Guinness over a century ago conceived the Student t-test among many innovations. However, Gosset was also born in England. The Bushmills distillery in Northern Ireland has already appeared on their 5-pound note.

In fact, Gosset and Milner and Michie were all originally nominated alongside Turing for the 50-pound note. We spot Bertrand Russell and William Tutte and Karen Spärck-Jones on the list. In physics, we notice Paul Dirac and John Bell and Stephen Hawking, and also the astronomer Vera Rubin—who was only American, no? Bell was from Northern Ireland.

Sir Michael Atiyah is not on the list because he was living at the time—no living person other than the monarch may appear on a banknote. This law does not apply to e-currency however, and this leads to what for us is the most exciting aspect of the Bank of England initiative.

Semi-Fungible Tokens

The following is a GLL exclusive. It was conveyed to us by Dr. Lofa Polir, who since our story a year ago has resided in England. It comes from a confluence of several facts:

  • National banks regularly issue limited editions of currency apart from the main series.
  • Government banks have largely refrained from direct involvement with digital currencies, but observe them closely.
  • Large sections of the public have come to ascribe intrinsic value to non-fungible tokens (NFTs), which come with a blockchain-verifiable certificate of unique ownership.
  • A non-fungible token by-definition cannot be used as money but only bought and sold as a (unique) commodity.
  • Playing cards in games such as Magic: the Gathering are regularly traded online to the extent of melding with cryptocurrencies or verging on being currency themselves.

The Bank of England recently voted to create an active response to this situation: the semi-fungible token (SFT). Like Bitcoin and other cryptocurrencies, SFTs limit their supply, but unlike them, SFTs are maintained by a central body and backed by its assets. An SFT is fungible by dint of having a fixed redemption rate to standard currency, but its own price may go higher. As with NFTs, each SFT is associated to a piece of digital artwork.

The new issues are based on the Royal Mail’s 2015 Inventive Britain collection, whose honorees include Sir Tim Berners-Lee for the World Wide Web. The main stamps were abstract, but associated issues are not subject to the law against likenesses of living persons:

Composite of src1, src2

Berners-Lee will be the first honoree in the SFT series. We are not authorized to reveal the art design before its release. But Lofa sent us a list of nominees that includes Geoffrey Hinton, Tony Hoare, and Leslie Valiant on the CS side, plus David Spiegelhalter and Demis Hassabis in data science. Unlike with the banknotes, the Bank of England can “catch them all.” Imagine, then, you will soon be able to pay for groceries with cards of some of your favorite computer and data scientists—at least if they are British and you buy the groceries in Britain.

Open Problems

If you see a market for NFTs, do you see one for SFTs—or are they both April Fools?

[added a sentence to the description of SFTs.]

by KWRegan at April 01, 2021 08:36 AM UTC

As is often the case, not all of these are links; they’re copies of posts that I made over on my Mastodon account because they were too short to make full posts here.

by David Eppstein at March 31, 2021 05:54 PM UTC

The New York Times is a great newspaper: it is also No Fun—Molly Ivins

The Gray Lady, as the New York Times is usually called, believes in her real slogan:

All the news that’s fit to print.

This slogan started about 115 years ago, and is on the masthead of the paper.

Today I thought, for something different, we would supply just news.

That is links to blogs on computer theory and related math. Links that are current, that are interesting, and that are fit to refer to. All the blogs that {\dots}

The Links

Our rule is we include a link provided it is current. It must have a recent post—say at least this year. See all for a complete list. But we decided to emphasis those that are up to date.

What’s new by Terence Tao
It is here. A recent article is “Boosting the van der Corput inequality using the tensor power trick”.

Gower’s Weblog by Timothy Gowers
It is here. A recent article is “Leicester mathematics under threat again”.

Not Even Wrong by Peter Woit
It is here. A recent article is “The Future of Fundamental Physics”.

Shtetl-Optimized by Scott Aaronson
It is here. A recent article is “QC ethics and hype: the call is coming from inside the house”.

Some Plane Truths by Adam Sheffer
It is here. A recent article is “The 2021 Polymath Jr Program”.

Computational Complexity by Lance Fortnow and Bill Gasarch
It is here. A recent article is “The key to my Taylor series problem: Buddy can you spare a penny, nickel, dime, or quarter”.

11011110 by David Eppstein
It is here. A recent article is “More mathematics books by women”.

Decentralized Thoughts by Ittai Abraham, Ittay Eyal, Kartik Nayak, Ling Ren, Alin Tomescu, Avishay Yanai
It is here. A recent article is “Living with Asynchrony: the Gather protocol”.

Foundation of Data Science by dstheory
It is here. A recent article previews a talk this Thu. Apr, 1 (at 1pm ET) by Ingrid Daubechies on “Discovering low-dimensional manifolds in high-dimensional data sets.”

Machine Learning Research Blog by Francis Bach
It is here. A recent article “Going beyond least-squares – II : Self-concordant analysis for logistic regression”.

Combinatorics and more by Gil Kalai
It is here. A recent article “What is mathematics (or at least, how it feels)”.

Kamathematics by Gautam Kamath
It is here. A recent article “Learning Theory Alliance and Mentoring Workshop”.

Process Algebra Diary by Luca Aceto
It is here. A recent article “Article by Sergey Kitaev and Anthony Mendes in Jeff Remmel’s memory”.

in Theory by Luca Trevisan
It is here. A recent article is an announcement for accepted papers STOC 2021.

“Marge, I agree with you — in theory. In theory, communism works. In theory.” — Homer Simpson.

Algorithms, Nature, and Society by Nisheeth Vishnoi
It is here. A recent article is an announcement for FOCS 2022.

Open Problems

Did we leave any out? Did we include any we should have {\dots}

by rjlipton at March 30, 2021 06:00 PM UTC

Shuchi Chawla, Madhur Tulsiani, and I are organizing a new summer school aimed at exposing undergraduate students to research directions in theoretical computer science and its applications. The school will take place from May 31 till June 4, 2021. This first iteration will be online, but we hope it will become a recurring and lasting event.  See for the webpage. We already have 5 exciting instructors signed up – Antonio Blanca, Ashia Wlson, Jelani Nelson, Nicole Immorlica, and Yael Kalai. We plan to have both technical tutorials / mini-courses and networking/social events.

This course is aimed at undergraduate students, and in particular students from groups that are currently under-represented in our field. If you are a faculty member in a university, we would be grateful if you can spread the announcement below to both your colleagues, and any mailing lists for undergraduate students, and in particular chapters of affinity groups such as the National Society of Black Engineers, Society of Hispanic Professional Engineers, National Center for Women & Information Technology, etc.

The committee for advancement of theoretical computer science (CATCS) is organizing an online  summer course that will take place on May 31 till June 4, 2021. New horizons in theoretical computer science is a week-long online summer school which will expose undergraduates to exciting research areas in the area of theoretical computer science and its applications. The school will contain several mini-courses from top researchers in the field. The course is free of charge,and we welcome applications from undergraduates majoring in computer science or related fields. We particularly encourage applications from students that are members of groups that are currently under-represented in theoretical computer science. 

The course is intended for currently enrolled undergraduate students that are majoring in computer science or related fields. Students will be expected to be familiar with the material typically taught in introductory algorithms and discrete mathematics / mathematics for computer science courses. If you are unsure if you are prepared for the course, please write to us at

To apply for the course, please visit and fill in the application form by April 15, 2021

Course organizers: Boaz Barak (Harvard), Shuchi Chawla (UT Austin), Madhur Tulsiani (TTI-Chicago)

Current list of confirmed instructors: Antonio Blanca (Penn State University), Ashia Wilson (MIT), Jelani Nelson (UC Berkeley), Nicole Immorlica (Microsoft Research), Yael Kalai (Microsoft research).

Please email with any questions.

by Boaz Barak at March 29, 2021 02:57 PM UTC

Here's a neat result I heard about at virtual Dagstuhl last week, a new lower bound on the number of hyperplanes that cuts all the edges of a hypercube.

A n-dimensional hypercube has 2n vertices corresponding to the binary strings of length n. Edges are between two vertices that differ in exactly one bit, for a total of (n/2) 2n edges. Hypercubes played a large role in Hao Huang's recent proof of the sensitivity conjecture. 

A hyperplane is just a linear inequality in x1,…,xn the bits of the string corresponding to a vertex. An edge is cut if the inequality is true for one vertex and false for the other.

With n hyperplanes you can cut all the edges in two very different ways. 

  • The hyperplanes xi > 0 for each i. These are n orthogonal hyperplanes.
  • The hyperplanes x1+…+xn > i for each i from 0 to n-1. These are n parallel hyperplanes.
Do you need n hyperplanes to cut all the edges? Mike Paterson found 5 explicit hyperplanes that cuts all the edges of a 6-dimensional hypercube (see survey by Mike Saks). Scaling that up means you need 5n/6 hyperplanes to cut an n-dimensional hypercube. That remains the best known upper bound.

For the lower bound, in 1971 Patrick O'Neil showed that any hyperplane can cut at most O(n0.5) fraction of the edges (sharp by the hyperplane x1+…+xn > n/2). Thus you need at least O(n0.5) hyperplanes which for 50 years was the best known bound.

Gil Yehuda and Amir Yehudayoff have given a new lower bound of O(n0.57). The paper gives a O(n0.51) bound but Yehudayoff said in a talk last week the bound has been improved.

Yehudayoff didn't go into much details in his 20 minute talk but the proofs uses geometry, probability and antichains. 

The result has some applications to complexity, namely you need at least n1.57 wires in a depth-two threshold circuit for parity. But the main fun is the question itself, that we finally made progress and there is still a long way to go.

by Lance Fortnow ( at March 29, 2021 12:47 PM UTC

The next Foundations of Data Science virtual talk will take place on Thursday, April 1st at 10:00 AM Pacific Time (13:00 Eastern Time, 19:00 Central European Time, 17:00 UTC).  Ingrid Daubechies from Duke Univeristy will speak about “Discovering low-dimensional manifolds in high-dimensional data sets.”

Please register here to join the virtual talk.

Abstract: This talk reviews diffusion methods to identify low-dimensional manifolds underlying high-dimensional datasets, and illustrates that by pinpointing additional mathematical structure, improved results can be obtained. Much of the talk draws on a case study from a collaboration with biological morphologists, who compare different phenotypical structures to study relationships of living or extinct animals with their surroundings and each other. This is typically done from carefully defined anatomical correspondence points (landmarks) on e.g. bones; such landmarking draws on highly specialized knowledge. To make possible more extensive use of large (and growing) databases, algorithms are required for automatic morphological correspondence maps, without any preliminary marking of special features or landmarks by the user.

The series is supported by the NSF HDR TRIPODS Grant 1934846.

by dstheory at March 28, 2021 03:16 AM UTC

Three decades ago, Nisan constructed an explicit pseudorandom generator (PRG) that fools width-$n$ length-$n$ read-once branching programs (ROBPs) with error $\varepsilon$ and seed length $O(\log^2 n + \log n \cdot \log(1/\varepsilon))$ (Combinatorica 1992). Nisan's generator remains the best explicit PRG known for this important model of computation. However, a recent line of work starting with Braverman, Cohen, and Garg (Braverman, Cohen, and Garg SICOMP 2020; Chattopadhyay and Liao CCC 2020; Cohen, Doron, Renard, Sberlo, and Ta-Shma ECCC 2021; Pyne and Vadhan ECCC 2021) has shown how to construct *weighted* pseudorandom generators (WPRGs, aka pseudorandom pseudodistribution generators) with better seed lengths than Nisan's generator when the error parameter $\varepsilon$ is small. In this work, we present an explicit WPRG for width-$n$ length-$n$ ROBPs with seed length $O(\log^2 n + \log(1/\varepsilon))$. Our seed length eliminates $\log \log$ factors from prior constructions, and our generator completes this line of research in the sense that further improvements would require beating Nisan's generator in the standard constant-error regime. Our technique is a variation of a recently-discovered reduction that converts moderate-error PRGs into low-error WPRGs (Cohen et al. ECCC 2021; Pyne and Vadhan ECCC 2021). Our version of the reduction uses averaging samplers. We also point out that as a consequence of the recent work on WPRGs, any randomized space-$S$ decision algorithm can be simulated deterministically in space $O(S^{3/2} / \sqrt{\log S})$. This is a slight improvement over Saks and Zhou's celebrated $O(S^{3/2})$ bound (JCSS 1999). For this application, our improved WPRG is not necessary.

at March 27, 2021 08:27 AM UTC

The joy of knowing, understanding, and creating is common to all the sciences.—Vera Sós.

Dorit Aharonov is a researcher in quantum computation. She was a Ph.D. student of Michael Ben-Or and Avi Wigderson. Her 1999 thesis, entitled “Noisy Quantum Computation,” included her role with Ben-Or as one of several superposed groups who discovered the quantum fault tolerance threshold theorem.

Today Ken and I wish to send our congratulations to Avi and Laci—László Lovász—on winning the 2021 Abel Prize.

The prize citation is:

For their foundational contributions to theoretical computer science and discrete mathematics, and their leading role in shaping them into central fields of modern mathematics.


You might wonder why we are talking about Aharonov? She was a student of Avi and he was a student of mine—see this for more details. I am proud of Avi for his terrific results and also for his great students. I guess Aharonov is a grand-student of mine.

Lovász and Ken’s advisor, Dominic Welsh of Oxford, partnered with Dominic’s student Magnus Bordewich and Michael Freedman of Microsoft on a 2005 paper titled, “Approximate counting and quantum computation.” The paper translates criteria for approximating the Jones polynomial into simulations of quantum circuits. It references how topological quantum field theory (TQFT) implicitly embodies a quantum algorithm that meets the criteria, hence showing that the Jones approximation problem is complete for {\mathsf{BQP}} in a suitable promise-problem sense.

The genealogy of the Jones polynomial is that it was discovered by Vaughan Jones. Aharonov, with Jones and his student Zeph Landau, wrote a 2009 paper that makes the quantum algorithm for approximating the Jones polynomial explicit. To quote the converse of a statement in their abstract, removing the dependence on TQFT makes the algorithm “accessible to computer scientists.”

Aharonov’s work thus represents the mixing of elements of classical mathematics and computing theory (including quantum, i.e. non-classical, computing) that this year’s Abel Prize recognizes. Mixing is indeed the word, since results on the mixing rates of Markov chains by Lovász and many others undergird some of this work. This is true even more directly of a notable paper by her with Andris Ambainis, Julia Kempe, and Umesh Vazirani on quantum walks.

More Quantum

When we hosted the 2012 debate between Gil Kalai and Aram Harrow on the feasibility of large-scale quantum computing, Gil led off with the following photo of the physicist Robert Alicki, Ben-Or, Aharonov, and himself, which was taken in Jerusalem in 2005.

Aharonov wrote an influential long survey in 1998 titled “Quantum Computation,” which grew out of her thesis and is sometimes confused with it. Ken and I owe a small debt to diagrams she used in the survey, for which she in turn alluded to Richard Feynman’s path integrals and their visualizations:

We expanded on this kind of diagram in our textbook on quantum computing with MIT Press, whose second edition is scheduled to appear next month.

Abel Prize and Lower Bounds

Whereas the Fields Medals are awarded only every four years, the Abel Prize is annual. It commemorates the Norwegian mathematician Niels Abel. Although it has been awarded only since 2003, it was originally proposed in 1899 expressly as a counterpart to the Nobel Prizes—and fell through only because of the 1905 separation of Sweden and Norway. We most recently covered the 2019 prize going to Karen Uhlenbeck.

Abel of course is the mathematician who gave his name to abelian groups and much else. About his short life of twenty-seven years, Charles Hermite said:

Abel has left mathematicians enough to keep them busy for five hundred years.

What we note is that Abel’s most famous result is a lower bound type result. He showed the impossibility of solving the general quintic equation in radicals. Finding clever identities, clever inequities, clever algorithms is hard. But not as hard as showing that something cannot be done.

Abel’s result is perhaps like proving today, in the 21st century, a lower bound on the complexity of some concrete problem. Some of Avi’s and also Laci’s best work has been directed toward this quest: Prove some non-trivial lower bound. Avi has worked on numerous approaches, including the fusion method in 1993. Although its lack of breakthroughs has caused jibes about “cold fusion,” Avi noted the following in a recent talk at the Simons Institute:

I will review (and hope to revive) a collection of old works, which suggest obtaining circuit lower bounds by “fusing” many correct computations (of a circuit too small) into an incorrect computation.

In a couple of models, this viewpoint led to (slight) superlinear lower bounds, which nonetheless have not been improved upon for 25 years.

We hope that the Abel Prize award will promote efforts to achieve not-so-slightly super-linear lower bounds on circuit size, running time, algebraic size, quantum effort, anything like that…

Open Problems

We also tip our hat to others who have reported on the Abel Prize, including some of our fellow bloggers: Gil Kalai here, Scott Aaronson here, Lance Fortnow and Bill Gasarch here, plus Kevin Hartnett for Quanta here.

by RJLipton+KWRegan at March 26, 2021 04:22 PM UTC

A very useful tool in Asynchronus distributed computing is Reliable Broadcast, or simply called Broadcast. It allows a leader to send a message, knowing that all parties will eventually receive the same message, even if a malicious adversary control $f$ parties and $f<n/3$. Broadcast is deterministic and takes just a...

at March 26, 2021 10:54 AM UTC