Prediction is very difficult, especially if it’s about the future—Niels Bohr
Boston Globe “Uncertainty” source |
Lisa Randall is a professor of theoretical physics at Harvard. Her research has touched on many of the basic questions of modern physics: supersymmetry, Standard Model observables, cosmological inflation, baryogenesis, grand unified theories, and general relativity. She has also written popular books about her work and science in general. Thus she has a handle on aspects of science that overlap my expertise—not to mention those of her sister Dana Randall, whom I have known as a colleague for many years.
Today, Ken and I thought we would talk about recent developments in particle physics, and their connection to two topics dear to us.
Randall’s most recent popular book is Dark Matter and the Dinosaurs. The idea she advances is that the periodic extinctions in Earth’s history may have been caused when the solar system passes through a plane of dark matter within our galaxy. But dark matter and also dark energy have come under increasing recent doubt, even from their original formulator. Maybe Niels Bohr’s quote should also say:
Prediction is very difficult, especially if it’s about the past.
Randall’s previous book, Knocking on Heaven’s Door, is most relevant to this post. The 1973 Bob Dylan song title it pinches describes the feeling of doing frontier physical science. Insofar as her own work is mostly theoretical, much of it connects to feelings we have in computer science—especially complexity lower bounds where the door mostly feels slammed shut.
But the book is also about the practice of experimental science—not only how to gather knowledge but when and how we can have confidence in it. Its long middle part is titled, “Machines, Measurements, and Probability.” All three elements are foremost in considering a new development that involves two measurements taken 20 years apart.
Last Tuesday’s New York Times highlighted a potential discovery in particle physics. It was in their Tuesday science section.
The result is an experimental discovery that could show that the current model of matter is wrong.
“This is our Mars rover landing moment,” said Chris Polly, a physicist at the Fermi National Accelerator Laboratory, or Fermilab, in Batavia, Ill., who has been working toward this finding for most of his career.
Indeed. It is a Mars landing moment. They both involved many people, lots of exotic machinery, lots of money, many years. Say three billion dollars or so for Mars. Say nearly the same amount for muons—the annual budget for Fermilab is over one half billion dollars. It was certainly enough to reassemble and upgrade a huge accelerator ring that was first used at Brookhaven National Lab on Long Island in 2001:
NY Times src |
The study used the muon to probe the Standard Model of physics. Muons are useful because they are charged like an electron, which helps control them in an accelerator. Yet their mass is roughly 207 times larger than an electron. The same charge helps control their motion and the large mass makes collisions more interesting. As Polly stated in Natalie Wolchover’s story for Quanta:
“[I]f you’re looking for particles that could explain the missing mass of the universe—dark matter—or you’re looking for [supersymmetry], that’s where the muon has a unique role.”
Computer science theory is so different from high end physics. We are closer to the type of research that Randall does. We involve few people, no exotic machinery, and small amounts of money. Maybe the closest attribute we have to high-end research is we also take years and years.
Perhaps we are also jealous of high-end physics. Not just for money, but for the ability of particle physicists to get announcements into the New York Times. Polly said the following about the ending day of the muon experiment two decades ago:
When we revealed the results, people from all over the world flew in to visit the lab. These experiments take decades to build and analyze, so you don’t get to go to very many of these events. We did a little “Drumroll, please” and then had the postdoc managing the spreadsheet hit the button to show it on the projector. Lo and behold, you could see that there was still a three-sigma discrepancy!
At the time he was a graduate assistant assigned to machinery for measuring particle energies. He had fixed a problem where someone had touched a component with bare hands and thereby ruined its sheathing. All such problems were meant to be ironed-out by the drum-roll event. But all this raises two further interesting issues that connect the muon results with issues we think about in computer science. Let’s look at them next.
In an experimental science one must be aware that results are not exact. They are samples from some random process. Flip a coin 10 times in a row. If they all come up heads what does that mean? Could be the coin is fair but this happens about one time in a thousand. Or the coin is biased. Or something else.
Flip a muon many times. That is sample some muon experiment. The outcome is from a random process. Some of it comes from properties of the natural processes themselves and others from incidentals of the measurement apparatus. How do we decide if the experiment means what we think it does?
The theory developed by Carl Gauss and others before and after to delineate the normal distribution was largely prompted by analysis of measurement errors to begin with. This yields the “rule of three,” about the percentage of values that naturally lie within an interval estimate in a normal distribution: 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively.
The question is, how to assess cases where the measurement result is well outside these intervals—when can we conclude it is more than a deviation by natural chance? In social sciences a result is “significant” provided it lies outside two-sigma. In particle physics, there is a convention of a five-sigma effect (99.99994% confidence) being required to qualify as a discovery. No Nobel prize for less.
The situation with the muons has an extra factor of repeated measurements—but there have been only two measurements so far:
Composite of src1, src2 |
The blue line in the left figure is the original Brookhaven measurement; the red is the new one. There is also theoretical uncertainty in the calculation of the Standard Model prediction, and that combines with the measurement error bars to give the sigma baseline. The chart at right normalizes the deviation to parts per billion—the measurements need to be incredibly fine. This scale appears to be about 25% under the current -scale (it shows about for Brookhaven compared to its after revised uncertainty) but it is close enough to get the picture.
Although the new Fermilab result by itself deviates slightly less from the Standard Model, it corroborates the earlier measurement. It is not fully independent from it, but the combination is enough to raise the current claimed deviation to about sigmas. This is well above social science level but below Nobel level. This is with respect to the probability that the effects are real.
The social significance of is that it is above the “” level where hoped-for anomalies have subsequently disappeared for reasons chalked up to natural chance. This is because physicists around the world do many a hundredfold amount of hopeful measurements. Some measurements get initial “bumps” up just because of the numbers. But reduces the natural frequency under 1-in-40,000. This is why reproducing measurements is so important, why the new Fermilab team devoted all the expense and effort. A more independent measurement on other machines could give a higher boost that might get over the line. Time to break out the wallets and hammers?
The is not, however, beyond the realm of recent experience with apparatus faults and modeling error. On the latter, there is still some doubt about the theoretical prediction for the muon’s magnetic moment. In any event, the muon results are exciting but still below what is required for a true discovery. Time will tell.
The second factor we draw attention to concerns the human “hoping” directly.
In any experimental science one must also be aware that people are not unbiased. Scientists have much invested in the outcome of their experiments. Think jobs, tenure even, funding, and more. So a big physics experiment like the muon one must be careful. They follow standard practice to perform blinded data analysis.
This surprised me. This blinding is a crypto-type protocol, which is something we computer scientists study. The muon team performed a protocol that protected against cheating. Here is how they did it:
In this case, the master clock that keeps track of the muons’ wobble had been set to a rate unknown to the researchers. The figure was sealed in envelopes that were locked in the offices at Fermilab and the University of Washington in Seattle.
In a ceremony on Feb. 25 that was recorded on video and watched around the world on Zoom, Dr. Polly opened the Fermilab envelope and David Hertzog from the University of Washington opened the Seattle envelope. The number inside was entered into a spreadsheet, providing a key to all the data, and the result popped out to a chorus of wows.
“That really led to a really exciting moment, because nobody on the collaboration knew the answer until the same moment,” said Saskia Charity, a Fermilab postdoctoral fellow who has been working remotely from Liverpool, England, during the pandemic.
This mechanism for blinding suggests possible crypto questions. They hid the master clock rate. Can this be modeled as one of our crypto problems? Can we prove some security bounds? If they claim that hiding the rate protects against cheating then they should be able to make this claim precise. The discovery of gravitational waves used a blind injection scheme tailored for that experiment. How can this be generalized?
We have discussed two aspects that involve soft numbers rather than hard machines and hard-shelled particles. Perhaps they are interesting new problems for us? What do you think?
Alice works at a charity that puts together bag and box lunches for children.
They all wear masks and they are 12 feet apart and very careful, and nobody there has gotten COVID.
Then Alice gets here first COVID shot and says:
I am not going to work for that charity until I have had my second shot and waited 4 weeks so I am immune.
She is really scared of getting COVID NOW that she is on the verge of being immune.
Is that logical? She was not scared before. So does it make sense to be scared now? I see where she is coming from emotionally, but is there a logical argument for her viewpoint? I ask nonrhetorically.
bill g.
Authors: Robert Cardona, Eva Miranda, Daniel Peralta-Salas
Download: PDF
Abstract: In this article we construct a compact Riemannian manifold of high dimension
on which the time dependent Euler equations are Turing complete. More
precisely, the halting of any Turing machine with a given input is equivalent
to a certain global solution of the Euler equations entering a certain open set
in the space of divergence-free vector fields. In particular, this implies the
undecidability of whether a solution to the Euler equations with an initial
datum will reach a certain open set or not in the space of divergence-free
fields. This result goes one step further in Tao's programme to study the
blow-up problem for the Euler and Navier-Stokes equations using fluid
computers. As a remarkable spin-off, our method of proof allows us to give a
counterexample to a conjecture of Moore dating back to 1998 on the
non-existence of analytic maps on compact manifolds that are Turing complete.
Authors: Sanjana Dey, Ramesh K. Jallu, Subhas C. Nandy
Download: PDF
Abstract: In this article, we study the Euclidean minimum spanning tree problem in an
imprecise setup. The problem is known as the \emph{Minimum Spanning Tree
Problem with Neighborhoods} in the literature. We study the problem where the
neighborhoods are represented as non-crossing line segments. Given a set ${\cal
S}$ of $n$ disjoint line segments in $I\!\!R^2$, the objective is to find a
minimum spanning tree (MST) that contains exactly one end-point from each
segment in $\cal S$ and the cost of the MST is minimum among $2^n$ possible
MSTs. We show that finding such an MST is NP-hard in general, and propose a
$2\alpha$-factor approximation algorithm for the same, where $\alpha$ is the
approximation factor of the best-known approximation algorithm to compute a
minimum cost Steiner tree in an undirected graph with non-negative edge
weights. As an implication of our reduction, we can show that the unrestricted
version of the problem (i.e., one point must be chosen from each segment such
that the cost of MST is as minimum as possible) is also NP-hard. We also
propose a parameterized algorithm for the problem based on the "separability"
parameter defined for segments.
Authors: Paul Beaujean, Florian Sikora, Florian Yger
Download: PDF
Abstract: Feature generation is an open topic of investigation in graph machine
learning. In this paper, we study the use of graph homomorphism density
features as a scalable alternative to homomorphism numbers which retain similar
theoretical properties and ability to take into account inductive bias. For
this, we propose a high-performance implementation of a simple sampling
algorithm which computes additive approximations of homomorphism densities. In
the context of graph machine learning, we demonstrate in experiments that
simple linear models trained on sample homomorphism densities can achieve
performance comparable to graph neural networks on standard graph
classification datasets. Finally, we show in experiments on synthetic data that
this algorithm scales to very large graphs when implemented with Bloom filters.
Authors: Amnon Rosenmann
Download: PDF
Abstract: The $k$-cardinality assignment problem asks for finding a maximal (minimal)
weight of a matching of cardinality $k$ in a weighted bipartite graph
$K_{n,n}$, $k \leq n$. The algorithm of Gassner and Klinz from 2010 for the
parametric assignment problem computes in time $O(n^3)$ the set of
$k$-cardinality assignments for those integers $k \leq n$ which refer to
"essential" terms of a corresponding maxpolynomial. We show here that one can
extend this algorithm and compute in a second stage the other "semi-essential"
terms in time $O(n^2)$, which results in a time complexity of $O(n^3)$ for the
whole sequence of $k=1,...,n$-cardinality assignments. The more there are
assignments left to be computed at the second stage the faster the two-stage
algorithm runs. In general, however, there is no benefit for this two-stage
algorithm on the existing algorithms, e.g. the simpler network flow algorithm
based on the successive shortest path algorithm which also computes all the
$k$-cardinality assignments in time $O(n^3)$.
Oded Goldreich is a theoretical computer scientist at the Weizmann Institute in Rehovot, Israel. He’s best known for helping to lay the rigorous foundations of cryptography in the 1980s, through seminal results like the Goldreich-Levin Theorem (every one-way function can be modified to have a hard-core predicate), the Goldreich-Goldwasser-Micali Theorem (every pseudorandom generator can be made into a pseudorandom function), and the Goldreich-Micali-Wigderson protocol for secure multi-party computation. I first met Oded more than 20 years ago, when he lectured at a summer school at the Institute for Advanced Study in Princeton, barefoot and wearing a tank top and what looked like pajama pants. It was a bracing introduction to complexity-theoretic cryptography. Since then, I’ve interacted with Oded from time to time, partly around his firm belief that quantum computing is impossible.
Last month a committee in Israel voted to award Goldreich the Israel Prize (roughly analogous to the US National Medal of Science), for which I’d say Goldreich had been a plausible candidate for decades. But alas, Yoav Gallant, Netanyahu’s Education Minister, then rather non-gallantly blocked the award, solely because he objected to Goldreich’s far-left political views (and apparently because of various statements Goldreich signed, including in support of a boycott of Ariel University, which is in the West Bank). The case went all the way to the Israeli Supreme Court (!), which ruled two days ago in Gallant’s favor: he gets to “delay” the award to investigate the matter further, and in the meantime has apparently sent out invitations for an award ceremony next week that doesn’t include Goldreich. Some are now calling for the other winners to boycott the prize in solidarity until this is righted.
I doubt readers of this blog need convincing that this is a travesty and an embarrassment, a shanda, for the Netanyahu government itself. That I disagree with Goldreich’s far-left views (or might disagree, if I knew in any detail what they were) is totally immaterial to that judgment. In my opinion, not even Goldreich’s belief in the impossibility of quantum computers should affect his eligibility for the prize.
Maybe it would be better to say that, as far as his academic colleagues in Israel and beyond are concerned, Goldreich has won the Israel Prize; it’s only some irrelevant external agent who’s blocking his receipt of it. Ironically, though, among Goldreich’s many heterodox beliefs is a total rejection of the value of scientific prizes (although Goldreich has also said he wouldn’t refuse the Israel Prize if offered it!).
In unrelated news, the 2020 Turing Award has been given to Al Aho and Jeff Ullman. Aho and Ullman have both been celebrated leaders in CS for half a century, having laid many of the foundations of formal languages and compilers, and having coauthored one of CS’s defining textbooks with John Hopcroft (who already received a different Turing Award).
But again there’s a controversy. Apparently, in 2011, Ullman wrote to an Iranian student who wanted to work with him, saying that as “a matter of principle,” he would not accept Iranian students until the Iranian government recognized Israel. Maybe I should say that I, like Ullman, am both a Jew and a Zionist, but I find it hard to imagine the state of mind that would cause me to hold some hapless student responsible for the misdeeds of their birth-country’s government. Ironically, this is a mirror-image of the tactics that the BDS movement has wielded against Israeli academics. Unlike Goldreich, though, Ullman seems to have gone beyond merely expressing his beliefs, actually turning them into a one-man foreign policy.
I’m proud of the Iranian students I’ve mentored and hope to mentor more. While I don’t think this issue should affect Ullman’s Turing Award (and I haven’t seen anyone claim that it should), I do think it’s appropriate to use the occasion to express our opposition to all forms of discrimination. I fully endorse Shafi Goldwasser’s response in her capacity as Director of the Simons Institute for Theory of Computing in Berkeley:
As a senior member of the computer science community and an American-Israeli, I stand with our Iranian students and scholars and outright reject any notion by which admission, support, or promotion of individuals in academic settings should be impeded by national origin or politics. Individuals should not be conflated with the countries or institutions they come from. Statements and actions to the contrary have no place in our computer science community. Anyone experiencing such behavior will find a committed ally in me.
As for Al Aho? I knew him fifteen years ago, when he became interested in quantum computing, in part due to his then-student Krysta Svore (who’s now the head of Microsoft’s quantum computing efforts). Al struck me as not only a famous scientist but a gentleman who radiated kindness everywhere. I’m not aware of any controversies he’s been involved in and never heard anyone say a bad word about him.
Anyway, this seems like a good occasion to recognize some foundational achievements in computer science, as well as the complex human beings who produce them!
Scott Aaronson wrote last month about the hype over quantum computing. I'd thought I'd drop a few stories.
I was once asked to review a grant proposal (outside the US) that claimed it would find a quantum algorithm for NP-hard problems. I wrote a scathing review but the grant was funded because I failed to prove that it was impossible. I replied that they should fund my research to teleport people from Chicago to Paris because they couldn't prove I couldn't do it. I never got a response.
I went to a Microsoft Faculty Research Summit which had a big focus on quantum computing. I complained of the quantum computing hype. My friends in the field denied the hype. Later at the summit a research head said that Microsoft will solve world hunger with quantum computing.
I was meeting with a congressional staffer who had worked on the National Quantum Initiative which coincidentally was being announced that day. I said something about high risk, high reward. He looked shocked--nobody had told him before that quantum computing is a speculative technology.
Quantum computing has generated a large number of beautiful and challenging scientific questions. Thinking about quantum has helped generate classical complexity and algorithmic results. But quantum computing having a real-world impact in the near or mid-term is unlikely. Most scientists I know working directly in quantum research are honest about the limitations and challenges in quantum computing. But somehow that message is not often getting to the next layers up, the policy makers, the research managers, the university administrators, the media and the venture capitalists.
But who knows, maybe some quantum heuristic that doesn't need much entanglement will change the world tomorrow. I can't prove it's impossible.
Amazon Advertising opens applications for early career scientists. The new program, which offers full-time two-year positions, is aimed at recent PhD graduates who want to innovate, publish, and have their work impact millions of customers. The application deadline is May 14.
Can you trust a model whose designer had access to the test/holdout set? This implicit question in Dwork et al 2015 launched a new field, adaptive data analysis. The question referred to the fact that in many scientific settings as well as modern machine learning (with its standardized datasets like CIFAR, ImageNet etc.) the model designer has full access to the holdout set and is free to ignore the
(Basic Dictum of Data Science) “Thou shalt not train on the test/holdout set.”
Furthermore, even researchers who scrupulously follow the Basic Dictum may be unknowingly violating it when they take inspiration (and design choices) from published works by others who presumably published only the best of the many models they evaluated on the test set.
Dwork et al. showed that if the test set has size $N$, and the designer is allowed to see the error of the first $i-1$ models on the test set before designing the $i$’th model, then a clever designer can use so-called wacky boosting (see this blog post) to ensure the accuracy of the $t$’th model on the test set as high as $\Omega(\sqrt{t/N})$. In other words, the test set could become essentially useless once $t \gg N$, a condition that holds in ML, whereby in popular datasets (CIFAR10, CIFAR100, ImageNet etc.) $N$ is no more than $100,000$ and the total number of models being trained world-wide is well in the millions if not higher (once you include hyperparameter searches).
Meta-overfitting Error (MOE) of a model is the difference between its average error on the test data and its expected error on the full distribution. (It is closely related to false discovery rate in statistics.)
This blog post concerns our new paper, which gives meaningful upper bounds on this sort of trouble for popular deep net architectures, whereas prior ideas from adaptive data analysis gave no nontrivial estimates. We call our estimate Rip van Winkle’s Razor which combines references to Occam’s Razor and the mythical person who fell asleep for 20 years.
It is well-known that for a model trained without ever querying the test set, MOE scales (with high probability over choice of the test set) as $1/\sqrt{N}$ where $N$ is the size of the test set. Furthermore standard concentration bounds imply that even if we train $t$ models without ever referring to the test set (in other words, using proper data hygiene) then the maximum meta-overfitting error among the $t$ models scales whp as $O(\sqrt{\log(t)/ N})$. The trouble pinpointed by Dwork et al. can happen only if models are designed adaptively, with test error of the previous models shaping the design of the next model.
Adaptive Data Analysis has come up with many good practices for honest researchers to mitigate such issues. For instance, Dwork et al. showed that using Differential Privacy on labels while evaluating models can lower MOE. Or the Ladder mechanism helps in Kaggle-like settings where the test dataset resides on a server that can choose to answers only a selected subset of queries, which essentially takes away the MOE issue.
For several good practices matching lower bounds exist showing a way to construct cheating models with MOE matching the upper bound.
However such recommended best practices do not help with understanding the MOE in the performance numbers of a new model since there is no guarantee that the inventors never tuned models using the test set, or didn’t get inspiration from existing models that may have been designed that way. Thus statistically speaking the above results still give no reason to believe that a modern deep net such as ResNet152 has low MOE.
Recht et al. 2019 summed up the MOE issue in a catchy title: Do ImageNet Classifiers Generalize to ImageNet? They tried to answer their question experimentally by creating new test sets from scratch –we discuss their results later.
The starting point of our work is the following classical concentration bounds:
Folklore Theorem With high probability over the choice of a test set of size $N$, the MOE of all models with description length at most $k$ bits is $O(\sqrt{k/N})$.
At first sight this doesn’t seem to help us because one cannot imagine modern deep nets having a short description. The most obvious description involves reporting values of the net parameters, which requires millions or even hundreds of millions of bits, resulting in a vacuous upper bound on MOE.
Another obvious description would be the computer program used to produce the model using the (publicly available) training and validation sets. However, these programs usually rely on imported libraries through layers of encapsulation and so the effective program size is pretty large as well.
Our new upper bound involves a more careful definition of Description Length: it is the smallest description that allows a referee to reproduce a model of similar performance using the (universally available) training and validation datasets.
While this phrasing may appear reminiscent of the review process for conferences and journals, there is a subtle difference with respect to what the referee can or cannot be assumed to know. (Clearly, assumptions about the referee can greatly affect description length —e.g, a referee ignorant of even basic calculus might need a very long explanation!)
Informed Referee: “Knows everything that was known to humanity (e.g., about deep learning, mathematics,optimization, statistics etc.) right up to the moment of creation of the Test set.”
Unbiased Referee: Knows nothing discovered since the Test set was created.
Thus Description Length of a model is the number of bits in the shortest description that allows an informed but unbiased referee to reproduce the claimed result.
Note that informed referees let descriptions get shorter. Unbiased require longer descriptions that rule out any statistical “contamination” due to any interaction whatsoever with the test set. For example, momentum techniques in optimization were well-studied before the creation of ImageNet test set, so informed referees can be expected to understand a line like “SGD with momentum 0.9.” But a line like “Use Batch Normalization” cannot be understood by unbiased referees since conceivably this technique (invented after 2012) might have become popular precisely because it leads to better performance on the test set of ImageNet.
By now it should be clear why the estimate is named after “Rip van Winkle”: the referee can be thought of as an infinitely well-informed researcher who went into deep sleep at the moment of creation of the test set, and has just been woken up years later to start refereeing the latest papers. Real-life journal referees who luckily did not suffer this way should try to simulate the idealized Rip van Winkle in their heads while perusing the description submitted by the researcher.
To allow as short a description as possible the researcher is allowed to compress the description of their new deep net non-destructively using any compression that would make sense to Rip van Winkle (e.g., Huffman Coding). The description of the compression method itself is not counted towards the description length – provided the same method is used for all papers submitted to Rip van Winkle. To give an example, a technique appearing in a text known to Rip van Winkle could be succinctly referred to using the book’s ISBN number and page number.
As an illustration, here we provide a suitable description allowing Rip van Winkle to reproduce a mainstream ImageNet model, ResNet-152, which achieves $4.49\%$ top-5 test error.
The description consists of three types of expressions: English phrases, Math equations, and directed graphs. In the paper, we describe in detail how to encode each of them into binary strings and count their lengths. The allowed vocabulary includes primitive concepts that were known before 2012, such as CONV, MaxPool, ReLU, SGD etc., as well as a graph-theoretic notation/shorthand for describing net architecture. The newly introduced concepts including Batch-Norm, Layer, Block are defined precisely using Math, English, and other primitive concepts.
According to our estimate, the length of the above description is $1032$ bits, which translates into a upper bound on meta-overfitting error of merely $5\%$! This suggests the real top-5 error of the model on full distribution is at most $9.49\%$. In the paper we also provide a $980$-bit long description for reproducing DenseNet-264, which leads to $5.06\%$ upper bound on its meta-overfitting error.
Note that the number $5.06$ suggests higher precision than actually given by the method, since it is possible to quibble about the coding assumptions that led to it. Perhaps others might use a more classical coding mechanism and obtain an estimate of $6\%$ or $7\%$.
But the important point is that unlike existing bounds in Adaptive Data Analysis, there is no dependence on $t$, the number of models that have been tested before, and the bound is non-vacuous.
Our estimates indicate that the issue of meta-overfitting on ImageNet for these mainstream models is mild. The reason is that despite the vast number of parameters and hyper-parameters in today’s deep nets, the information content of these models is not high given knowledge circa 2012.
Recently Recht et al. tried to reach an empirical upper bound on MOE for ImageNet and CIFAR-10. They created new tests sets by carefully replicating the methodology used for constructing the original ones. They found that error of famous published models of the past seven years is as much as 10-15% higher on the new test set as compared to the original. On the face of it, this seemed to confirm a case of bad meta-overfitting. But they also presented evidence that the swing in test error was due to systemic effects during test set creation. For instance, a comparable swing happens also for models that predated the creation of ImageNet (and thus were not overfitted to the ImageNet test set). A followup study of a hundred Kaggle competitions used fresh, identically distributed test sets that were available from the official competition organizers. The authors concluded that MOE does not appear to be significant in modern ML.
To us the disquieting takeaway from Recht et al.’s results was that estimating MOE by creating a new test set is rife with systematic bias at best, and perhaps impossible, especially in datasets concerning rare or one-time phenomena (e.g., stock prices). Thus their work still left a pressing need for effective upper bounds on meta-overfitting error. Our Rip van Winkle’s Razor is elementary, and easily deployable by the average researcher. We hope it becomes part of the standard toolbox in Adaptive Data Analysis.
In this blog post, I outline how existing advertising platforms do not prevent high-stakes ads from reaching different demographics at different rates. The post then describes how pushing this responsibility down to advertisers rather than addressing it at the platform leaves manipulating a complex system to those least aware of the system’s inner workings. She then proposes a simpler, more unified solution to this problem: advertising slots should be either targetable or untargetable, and high-stakes ads should be in the untargeted segment. Finally, the post concludes with a discussion of how this segmentation need not cost these systems substantial revenue if reserve prices are used appropriately.
The next TCS+ talk will take place this coming Wednesday, April 14th at 1:00 PM Eastern Time (10:00 AM Pacific Time, 19:00 Central European Time, 17:00 UTC). Andrea Lincoln from UC Berkeley will speak about “New Techniques for Proving Fine-Grained Average-Case Hardness” (abstract below).
You can reserve a spot as an individual or a group to join us live by signing up on the online form. Due to security concerns, registration is required to attend the interactive talk. (The recorded talk will also be posted on our website afterwards, so people who did not sign up will still be able to watch the talk)
As usual, for more information about the TCS+ online seminar series and the upcoming talks, or to suggest a possible topic or speaker, please see the website.
Abstract: In this talk I will cover a new technique for worst-case to average-case reductions. There are two primary concepts introduced in this talk: “factored” problems and a framework for worst-case to average-case fine-grained (WCtoACFG) self reductions.
We will define new versions of OV, kSUM and zero-k-clique that are both worst-case and average-case fine-grained hard assuming the core hypotheses of fine-grained complexity. We then use these as a basis for fine-grained hardness and average-case hardness of other problems. Our hard factored problems are also simple enough that we can reduce them to many other problems, e.g. to edit distance, k-LCS and versions of Max-Flow. We further consider counting variants of the factored problems and give WCtoACFG reductions for them for a natural distribution.
To show hardness for these factored problems we formalize the framework of [Boix-Adsera et al. 2019] that was used to give a WCtoACFG reduction for counting k-cliques. We define an explicit property of problems such that if a problem has that property one can use the framework on the problem to get a WCtoACFG self reduction. In total these factored problems and the framework together give tight fine-grained average-case hardness for various problems including the counting variant of regular expression matching.
Based on joint work with Mina Dalirrooyfard and Virginia Vassilevska Williams.
TL;DR: The summer school we are organizing is looking for TAs. Please forward this to your students as well as any departmental mailing lists.
Are you passionate about teaching? Or about increasing diversity within TCS? If so, we need your help!
The committee for advancement of theoretical computer science (CATCS) is organizing an online summer course that will take place on May 31 till June 4, 2021. New horizons in theoretical computer science is a week-long online summer school which will expose undergraduates to exciting research areas in the area of theoretical computer science and its applications. The school will contain several mini-courses from top researchers in the field. We particularly encourage participants from groups that are currently under-represented in TCS. See https://boazbk.github.io/tcs-summerschool/ for more details.
We are looking for TAs to help run the school.
TAs will have the following responsibilities:
• Plan team building and ice breaking activities and social events for the summer school
• Lead small groups during the week
• Monitor questions in chat during lectures
• Work with one of the instructors to prepare one homework
• Grade homework
• Provide mentorship to students
• Possibly assist with reviewing applications and other technical/admin aspects of running the school
The time commitment will be ~20 hours during the week of May 31-June 4; ~5-10 hours prior to that week; and ~2-3 hours following that week. We are hoping to pay an amount of $500 to each TA (please note that international students will need a CPT for this).
To apply for a TA position, please fill in the application form at https://forms.gle/QCxLn8R81Ga4JQLH8 by April 15, 2021. Please also have a faculty advisor send a short recommendation to summer-school-admin@boazbarak.org. Please ask them to use the subject “TA recommendation for <<Your Name>>”.
Course organizers: Boaz Barak (Harvard), Shuchi Chawla (UT Austin), Madhur Tulsiani (TTI-Chicago)
Current list of confirmed instructors: Antonio Blanca (Penn State University), Ashia Wilson (MIT), Jelani Nelson (UC Berkeley), Nicole Immorlica (Microsoft Research), Yael Kalai (Microsoft research).
Please email summer-school-admin@boazbarak.org with any questions.
CIFellows is a general CRA/CCC program for matching postdocs with advisors, and supports 2-year postdocs. Applications will open mid-April, and are due early May. It is competitive across all areas of CS.
Website: https://cifellows2021.org/
Email: shachar.lovett@gmail.com
Although the prime numbers are rigidly determined, they somehow feel like experimental data—Tim Gowers
Finnish Scientific Courage Award source |
Kaisa Matomäki is a Finnish mathematician working in number theory. She has some terrific results on prime numbers—results that have won several important prizes including the 2021 Ruth Lyttle Satter Prize: It is presented to a woman who has made an outstanding contribution to mathematics research. The prize citation specifically mentions a 2015 paper with Maksym Radziwill that contributed to Terence Tao’s resolution of the Erdős discrepancy problem—and indeed we highlighted Tao’s followup work with her and Radziwill in our 2015 post on this.
Today, Ken and I thought we would combine one of her new deep theorems with some shallow observations.
The set of primes is denoted by , as usual. As complexity theorists we have known that for decades that the Boolean circuit complexity of is polynomial. This follows from the existence of a polynomial random algorithm for due to Robert Solovay and Volker Strassen, and then applying Len Adleman’s connection between such algorithms and Boolean circuit complexity.
Matomäki views in a different way. As an analytic number theorist she is concerned with the density and structure of primes, not so much the complexity of recognizing or generating a prime.
Exact results on the set of primes are hard to come by.
Twin Primes: We know that there are infinitely many such that and are prime. But not proved.
Goldbach: We know that every even number from onward is the sum of two primes. But not proved.
Density of Primes: We known that the gap between consecutive primes is at most But not proved.
The Riemann Hypothesis attempts to say more about density of the primes than the Prime Number Theorem does. It has been proved equivalent to some statements about approximate growth rates, yet even these forms have not been touched.
Instead of trying for better approximate results about primes, can we learn more by relaxing the notion of “prime” itself? For each , define to be the set of numbers that are products of at most primes. Then is the set of primes (not the prime powers) and is the set of products of two primes—which include the Blum integers of special interest to cryptography. Collectively the sets represent different levels of being “almost prime.” The motivating question is:
How well and in what ways does the structure of the sets of almost-primes model the structure of the primes?
In 2010, John Friedlander and Henryk Iwaniec wrote a monograph titled Opera de Cribro, which is Latin for “Works of the Sieve.” They proved certain results about and speculated whether their methods improve them to work at least for . Then they say, “It would be interesting to get integers with at most two prime divisors.” This is where Matomäki comes in—with a second kind of relaxing, one applying to “almost all” members of .
Matomäki has a brand new paper (revised two weeks ago from a December original) titled, “Almost primes in almost all very short intervals.” It follows on from a paper by one of her students, Joni Teräväinen, titled “Almost all primes in almost all short intervals.”
Her main result is the following theorem.
Theorem 1 Suppose as and put . Then for almost all and , the interval contains a .
Roger Heath-Brown had established the corresponding result for primes (that is, for ) but only assuming simultaneously the Riemann hypothesis and the pair correlation conjecture for the zeros of the Riemann zeta function. With no hypotheses, the presence of primes in almost all intervals is known only when the intervals have length —concretely, is known. Thus, Matomäki has traded off the use of conjectures for the relaxed notion of prime.
The “almost call” is in the sense of average density of whose short interval is populated. This average instead of worst case mirrors our difficulties with circuit lower bounds. We can easily show that on average Boolean function have huge circuit lower bounds. But worst case bounds for specific functions is beyond reach. Hmmm.
We previously proved:
Theorem 2 (Lower Bound) For any , the depth of a DeMorgan boolean circuit for is at least .
Recall a DeMorgan formula on variables is a binary tree whose leaves are labeled with variables or their negations, and whose internal nodes are labeled with either for OR and for AND gates.
Here is a high level view of this result. See here for details.
Assume there is a DeMorgan Boolean circuit for is at most depth. We can use the random restriction method to set lots of the inputs to random values .
We claim that with high probability the input looks like
The right-most bits
are left unset, and the other bits to the left have some bits randomly set. Moreover, the number of bits in the above right is order .
Now set all the bits in the left group that are unset also to random values . Leave the right group unset.
The point of this is that if we assume that the formula had size at most order , then we have that the formula is likely to be constant. But this contradicts the density of primes and composites.
Can Matomäki’s theorem be used to get stronger lower bounds on the complexity of ? Can it go beyond the quadratic limit?
After several attempts, I finally found the energy to start writing a book. It grew out of lecture notes for a graduate class I taught last semester. I make the draft available so that I can get feedback before a (hopefully) final effort next semester.
The goal of the book is to present old and recent results in learning theory, for the most widely-used learning architectures. This book is geared towards theory-oriented students as well as students who want to acquire a basic mathematical understanding of algorithms used throughout machine learning and associated fields that are large users of learning methods, such as computer vision or natural language processing.
A particular effort is made to prove many results from first principles, while keeping the exposition as simple as possible. This will naturally lead to a choice of key results that show-case in simple but relevant instances the important concepts in learning theory. Some general results will also be presented without proofs. Of course the concept of first principles is subjective, and a good knowledge of linear algebra, probability theory and differential calculus will be assumed.
Moreover, I will focus on the part of learning theory that does not exist outside of algorithms that can be run in practice, and thus all algorithmic frameworks described in this book are routinely used. For most learning methods, some simple illustrative experiments are presented, with the plan to have accompanying code (Matlab, Julia, and Python) so that students can see for themselves that the algorithms are simple and effective in synthetic experiments.
This is not an introductory textbook on machine learning. There are already several good ones in several languages [1, 2]. Many topics are not covered, and many more are not covered in much depth. There are many good textbooks on learning theory that go deeper [3, 4, 5].
The choice of topics is arbitrary (and thus personal). Many important algorithmic frameworks are forgotten (e.g., reinforcement learning, unsupervised learning, etc.). Suggestions of extra themes are welcome! A few additional chapters are currently being written such as: ensemble learning, bandit optimization, probabilistic methods, structured prediction.
This is still work in progress. In particular, there are still a lot of typos, probably some mistakes, and almost surely places where more details are needed; readers are most welcome to report them to me (and then get credit for it). Also, the bibliography is currently quite short and would benefit from some expansion (all suggestions welcome, in particular for giving proper credit).
Moreover, I am convinced that simpler mathematical arguments are possible in many places in the book. If you are aware of elegant and simple ideas that I have overlooked, please let me know.
[1] Chloé-Agathe Azencott. Introduction au Machine Learning. Dunod, 2019.
[2] Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 2020.
[3] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
[4] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[5] Andreas Christmann and Ingo Steinwart. Support Vector Machines. Springer, 2008.
A somewhat quieter month by recent standards. Three Two papers: graph property testing and quantum distribution testing. (Ed: The distribution testing paper was a revision of a paper we already covered in Sept 2020.)
Robust Self-Ordering versus Local Self-Ordering by Oded Goldreich (ECCC). In Nov 2020, we covered a paper that uses a tool called self-ordered graphs, that transferred bit string function lower bounds to graph property testing. Consider a labeled graph. A graph is self-ordered if its automorphism group only contains the identity element (it has no non-trivial isomorphisms). A graph is robustly self-ordered, if every permutation of the vertices leads to a (labeled) graph that is sufficiently “far” according to edit distance. Given a self-ordered graph \(G\), a local self-ordering procedure is the following. Given access to a copy \(G’\) of \(G\) and a vertex \(v \in V(G’)\), this procedure determines the (unique) vertex in \(V(G)\) that corresponds to \(v\) with sublinear queries to \(G\). In other words, it can locally “label” the graph. Intuitively, one would think that more robustly self-ordered graphs will be easier to locally label. This paper studies the relation between robust and local self-ordering. Curiously, this paper refutes the above intuition for bounded-degree graphs, and (weakly) confirms it for dense graphs. Roughly speaking, there are bounded degree graphs that are highly robustly self-ordered, for which any local self-ordering procedure requires \(\omega(\sqrt{n})\) queries. Moreover, there are bounded degree graphs with \(O(\log n)\)-query local self-ordering procedures, yet are not robustly self-ordered even for weak parameters. For dense graphs, the existence of fast non-adaptive local self-ordering procedures implies robust self-ordering.
Testing identity of collections of quantum states: sample complexity analysis by Marco Fanizza, Raffaele Salvia, and Vittorio Giovannetti (arXiv). This paper takes identity testing to the quantum setting. One should think of a \(d\)-dimensional quantum state as a \(d \times d\) density matrix (with some special properties). To learn the state entirely up to error \(\varepsilon\) would require \(O(\varepsilon^{-2} d^2)\) samples/measurements. A recent result of Badescu-O’Donnell-Wright proves that identity testing to a known state can be done significantly faster using \(O(\varepsilon^{-2} d)\) measurements. This paper takes this result a step further by consider a set of \(N\) quantum states. A “sample” is like a classical sample, where one gets a sample from a distribution of quantum states. The YES (“uniform”) case is when all the states are identical. The NO (“far from uniform”) case is when they are “far” from being the same state. This paper proves that \(O(\varepsilon^{-2}\sqrt{N}d)\) samples suffices for distinguishing these cases.
Update (April 5, 2021): So it turns out that Adam Chalcraft and Michael Greene already proved the essential result of this post back in 1994 (hat tip to commenter Dylan). Not terribly surprising in retrospect!
My son Daniel had his fourth birthday a couple weeks ago. For a present, he got an electric train set. (For completeness—and since the details of the train set will be rather important to the post—it’s called “WESPREX Create a Dinosaur Track”, but this is not an ad and I’m not getting a kickback for it.)
As you can see, the main feature of this set is a Y-shaped junction, which has a flap that can control which direction the train goes. The logic is as follows:
The train set also comes with bridges and tunnels; thus, there’s no restriction of planarity. Finally, the train set comes with little gadgets that can reverse the train’s direction, sending it back in the direction that it came from:
These gadgets don’t seem particularly important, though, since we could always replace them if we wanted by a Y-junction together with a loop.
Notice that, at each Y-junction, the position of the flap stores one bit of internal state, and that the train can both “read” and “write” these bits as it moves around. Thus, a question naturally arises: can this train set do any nontrivial computations? If there are n Y-junctions, then can it cycle through exp(n) different states? Could it even solve PSPACE-complete problems, if we let it run for exponential time? (For a very different example of a model-train-like system that, as it turns out, is able to express PSPACE-complete problems, see this recent paper by Erik Demaine et al.)
Whatever the answers regarding Daniel’s train set, I knew immediately on watching the thing go that I’d have to write a “paperlet” on the problem and publish it on my blog (no, I don’t inflict such things on journals!). Today’s post constitutes my third “paperlet,” on the general theme of a discrete dynamical system that someone showed me in real life (e.g. in a children’s toy or in biology) having more structure and regularity than one might naïvely expect. My first such paperlet, from 2014, was on a 1960s toy called the Digi-Comp II; my second, from 2016, was on DNA strings acted on by recombinase (OK, that one was associated with a paper in Science, but my combinatorial analysis wasn’t the main point of the paper).
Anyway, after spending an enjoyable evening on the problem of Daniel’s train set, I was able to prove that, alas, the possible behaviors are quite limited (I classified them all), falling far short of computational universality.
If you feel like I’m wasting your time with trivialities (or if you simply enjoy puzzles), then before you read any further, I encourage you to stop and try to prove this for yourself!
Back yet? OK then…
Theorem: Assume a finite amount of train track. Then after a linear amount of time, the train will necessarily enter a “boring infinite loop”—i.e., an attractor state in which at most two of the flaps keep getting toggled, and the rest of the flaps are fixed in place. In more detail, the attractor must take one of four forms:
I. a line (with reversing gadgets on both ends),
II. a simple cycle,
III. a “lollipop” (with one reversing gadget and one flap that keeps getting toggled), or
IV. a “dumbbell” (with two flaps that keep getting toggled).
In more detail still, there are seven possible topologically distinct trajectories for the train, as shown in the figure below.
Here the red paths represent the attractors, where the train loops around and around for an unlimited amount of time, while the blue paths represent “runways” where the train spends a limited amount of time on its way into the attractor. Every degree-3 vertex is assumed to have a Y-junction, while every degree-1 vertex is assumed to have a reversing gadget, unless (in IIb) the train starts at that vertex and never returns to it.
The proof of the theorem rests on two simple observations.
Observation 1: While the Y-junctions correspond to vertices of degree 3, there are no vertices of degree 4 or higher. This means that, if the train ever revisits a vertex v (other than the start vertex) for a second time, then there must be some edge e incident to v that it also traverses for a second time immediately afterward.
Observation 2: Suppose the train traverses some edge e, then goes around a simple cycle (meaning, one where no edges or vertices are reused), and then traverses e again, going in the same direction as the first time. Then from that point forward, the train will just continue around the same simple cycle forever.
The proof of Observation 2 is simply that, if there were any flap that might be in the train’s way as it continued around the simple cycle, then the train would already have pushed it out of the way its first time around the cycle, and nothing that happened thereafter could possibly change the flap’s position.
Using the two observations above, let’s now prove the theorem. Let the train start where it will, and follow it as it traces out a path. Since the graph is finite, at some point some already-traversed edge must be traversed a second time. Let e be the first such edge. By Observation 1, this will also be the first time the train’s path intersects itself at all. There are then three cases:
Case 1: The train traverses e in the same direction as it did the first time. By Observation 2, the train is now stuck in a simple cycle forever after. So the only question is what the train could’ve done before entering the simple cycle. We claim that at most, it could’ve traversed a simple path. For otherwise, we’d contradict the assumption that e was the first edge that the train visited twice on its journey. So the trajectory must have type IIa, IIb, or IIc in the figure.
Case 2: Immediately after traversing e, the train hits a reversing gadget and traverses e again the other way. In this case, the train will clearly retrace its entire path and then continue past its starting point; the question is what happens next. If it hits another reversing gadget, then the trajectory will have type I in the figure. If it enters a simple cycle and stays in it, then the trajectory will have type IIb in the figure. If, finally, it makes a simple cycle and then exits the cycle, then the trajectory will have type III in the figure. In this last case, the train’s trajectory will form a “lollipop” shape. Note that there must be a Y-junction where the “stick” of the lollipop meets the “candy” (i.e., the simple cycle), with the base of the Y aligned with the stick (since otherwise the train would’ve continued around and around the candy). From this, we deduce that every time the train goes around the candy, it does so in a different orientation (clockwise or counterclockwise) than the time before; and that the train toggles the Y-junction’s flap every time it exits the candy (although not when it enters the candy).
Case 3: At some point after traversing e in the forward direction (but not immediately after), the train traverses e in the reverse direction. In this case, the broad picture is analogous to Case 2. So far, the train has made a lollipop with a Y-junction connecting the stick to the candy (i.e. cycle), the base of the Y aligned with the stick, and e at the very top of the stick. The question is what happens next. If the train next hits a reversing gadget, the trajectory will have type III in the figure. If it enters a new simple cycle, disjoint from the first cycle, and never leaves it, the trajectory will have type IId in the figure. If it enters a new simple cycle, disjoint from the first cycle, and does leave it, then the trajectory now has a “dumbbell” pattern, type IV in the figure (also shown in the first video). There’s only one other situation to worry about: namely, that the train makes a new cycle that intersects the first cycle, forming a “theta” (θ) shaped trajectory. In this case, there must be a Y-junction at the point where the new cycle bumps into the old cycle. Now, if the base of the Y isn’t part of the old cycle, then the train never could’ve made it all the way around the old cycle in the first place (it would’ve exited the old cycle at this Y-junction), contradiction. If the base of the Y is part of the old cycle, then the flap must have been initially set to let the train make it all the way around the old cycle; when the train then reenters the old cycle, the flap must be moved so that the train will never make it all the way around the old cycle again. So now the train is stuck in a new simple cycle (sharing some edges with the old cycle), and the trajectory has type IIc in the figure.
This completes the proof of the theorem.
We might wonder: why isn’t this model train set capable of universal computation, of AND, OR, and NOT gates—or at any rate, of some computation more interesting than repeatedly toggling one or two flaps? My answer might sound tautological: it’s simply that the logic of the Y-junctions is too limited. Yes, the flaps can get pushed out of the way—that’s a “bit flip”—but every time such a flip happens, it helps to set up a “groove” in which the train just wants to continue around and around forever, not flipping any additional bits, with only the minor complications of the lollipop and dumbbell structures to deal with. Even though my proof of the theorem might’ve seemed like a tedious case analysis, it had this as its unifying message.
It’s interesting to think about what gadgets would need to be added to the train set to make it computationally universal, or at least expressively richer—able, as turned out to be the case for the Digi-Comp II, to express some nontrivial complexity class falling short of P. So for example, what if we had degree-4 vertices, with little turnstile gadgets? Or multiple trains, which could be synchronized to the millisecond to control how they interacted with each other via the flaps, or which could even crash into each other? I look forward to reading your ideas in the comment section!
For the truth is this: quantum complexity classes, BosonSampling, closed timelike curves, circuit complexity in black holes and AdS/CFT, etc. etc.—all these topics are great, but the same models and problems do get stale after a while. I aspire for my research agenda to chug forward, full steam ahead, into new computational domains.
PS. Happy Easter to those who celebrate!
Scribe notes by Benjamin Basseri and Richard Xu
Previous post: Inference and statistical physics Next post: TBD. See also all seminar posts and course webpage.
Alexander (Sasha) Rush is a professor at Cornell working in in Deep Learning / NLP. He applies machine learning to problems of text generation, summarizing long documents, and interactions between character and word-based models. Sasha is previously at Harvard, where he taught an awesome NLP class, and we are excited to have him as our guest! (Note: some of the figures in this lecture are taken from other papers or presentations.)
The first half of the talk will focus on how NLP works and what makes it interesting: a bird’s-eye view of the field. The second half of the talk will focus on current research.
Textual data has many different challenges that differ from computer vision (CV), since it is a human phenomenon. There are methods that work in computer vision / other ML models that just don’t work for NLP (e.g. GANs). As effective methods were found for computer vision around 2009-2014, we thought that these methods would also work well for NLP. While this was sometimes the case, it has not been true in general.
What are the difficulties of working with natural language? Language works at different scales:
word < phrase < sentence < document < ...
Here are examples of structure at each level:
In NLP, we will talk about the syntax and semantics of a document. The syntax refers to how words can fit together, and semantics refers to the meaning of these words.
There are many different NLP tasks such as sentiment analysis, question answering, named entity recognition, and translation. However, recent research shows that these tasks are often related to language modeling.
Language modeling, as explained in Shannon 1948, aims to answer the following question: Think of language as a stochastic process producing symbols. Given the last symbols, can we predict the next one?
This question is challenging as is. Consider the following example:
A reading lamp on the desk shed glow on polished ___
There are many options: marble/desk/stone/engraving/etc., and it is already difficult to give a probability here. In general, language is hard to model because the next word can be connected to words from a long time ago.
Shannon proposes variants of Markov models to perform this prediction, based on the last couple characters or the context in general.
Since local context matters most, we assume that only the most recent words matter. Then, we get the model
As we have seen in the generative models lecture, we can use cross entropy as a loss function for density estimation models. Given model density distribution and true distribution , the cross entropy (which equals negative expected log-likelihood) is defined as follows:
In NLP we tend to use the metric “perplexity”, which is the exponentiated negative cross entropy:
This corresponds to the equivalent vocabulary size of a uniformly distributed model. Lower perplexity means our model was closer to the underlying text distribution. As an example, the perplexity of the perfect dice-roll model would be 6.
Why do we care about perplexity anyway?
A few years ago, the best perplexity on WSJ text was 150. Nowadays, it is about 20! To understand how we got here, we look at modern language modeling.
We start with the model
(The softmax function maps a vector into the probability distribution such that . That is, . Note that this is a Boltzman distribution of the type we saw in the statistical physics and variational algorithms lecture)
We call W the output word embeddings, is some neural basis (e.g. are all but the final layers of a neural net with weights , and W is the final layer). However, this means to use softmax to predict requires computing softmax over every word in your vocabulary (tens or hundreds of thousands). This was often infeasible until GPU computing became available.
As an aside, why not predict characters instead of words? The advantage is that there are much fewer characters than words. However, computation with characters is slower. Empirically, character-based models tend to perform worse than word-based. However, character-based models can handle words outside the vocabulary.
Byte-pair encoding offers a bridge between character and word models. This greedily builds up new tokens as repetitive patterns are found in the original text.
In the last decade NLP has seen a few dominant architectures, all using SGD but with varying bases. First, we must cast the words as one-hot vectors, then embed them into vector space:
where is the one-hot encoded vector and is the learned embedding transformation.
NNLM (Neural Network Language Model) is like a CNN. The model predicts on possibly multiple NN transformations:
where denotes concatenation, is some convolutional filter and is the activation function. This has the benefit of learning fast. The matrices it learns also transfer well.
As an example, GloVe is a NNLM-inspired model. It stores the words in 300-dimensional space. When we project the some words to 2-dimensions using PCA, we find semantic information in the language model.
A recurrent model uses a fixed set of previous words to predict the next word. A recurrent network uses all previous words:
Previous information is ‘summarized’ in the on the right, so this model uses finite memory. Below is an illustration of the recurrent neural network.
Since the recurrent model uses the full context, it is a more plausible model for how we really process language. Furthermore, the introduction of RNN saw drastically improved performance. In the graph below, the items in the chart are performances from previous NNLMs. The recurrent network performance is “off the chart”.
However, the model grows with sequence length. This requires gradient flow to backpropagate over arbitrarily long sequences, and often required baroque network designs to facilitate longer sequences while avoiding exploding/vanishing gradients.
To understand modern NLP we must look at attention. For a set of vectors , keys and a query , we define attention as
where .
Here, can be considered a probability density of the model’s ‘memory’ of the sequence. The model decides which words are important by combining and .
The attentional model can be fully autoregressive (use all previously seen words), and the query can be learned or be specific to an input. In general, we have:
Since we condense all previous information into the attention mechanism, it is simpler to backpropagate.
In particular attention enables looking at information from a large context without paying for this in the depth of the network and hence in the depth of back-propagation you need to cover. (Boaz’s note: With my crypto background, attention reminds me of the design of block ciphers, which use linear operations to mix between far away parts of the inputs, and then apply non-liearity locally to each small parts.)
Note that attention is defined with respect to a set of vectors. There is no idea of positional information in the attentional model. How do we encode positional information for the model? One way to do this is using sinusoidal encoding in the keys. We store the word as for some period . Notice that if we choose many different periods, then the cosine ways will almost never meet at the same point. As a result, only recent points will have high dot products between the different cosine values.
A transformer is a stacked attention model. Computation in one layer becomes query, keys and values for the next layer. This is a multiheaded attention model. We learn projections for each query/key (generally between 8 and 32) then do softmax across these projections:
These heads can be computed in parallel and can be implemented with batch matrix multiplication. As a result, transformers can be massively scaled and are extremely efficient in modern hardware. This has led these models to be very dominant in the field. Here are some example models:
In recent years we have had larger and larger models, from GPT1’s 110 million to GPT3’s 175 billion.
On these massive scales, scaling has become a very interesting issue: how do we process more samples? How do we run distributed computation? How much autoregressive input should each model looks at? (Boaz’s note: To get a sense of how big these models are, GPT-3 was trained on about 1000B tokens. The total number of people that ever lived is about 100B, and only about half since the invention of the printing press, so this is arguably a non-negligible fraction of all text produced in human history.)
For a model like BERT, most of the cost still comes from feed-forward network — mostly matrix multiplications. These are tasks we are familiar with and can scale up.
One question is to have long-range attentional lookup, which is important for language modeling. For now, models often look at most 512 words back. Can we do longer range lookups? One approach to this is kernel feature attention.
Recall that we have . Can we approximate this with some kernel ? The main approach is that , which we approximate with the kernel . There is a rich literature on approximating where
for some transfomration . Then, we can try to approximate with linear features.
Practically, transformers do well but are slower. For longer texts, we have faster models that do slightly worse. A recent model called performer is such an example.
Ultimately, we want to make models run on “non Google scale” hardware once it has been trained to a specific task. This can often require scaling down.
One approach is to prune weights according to their magnitude. However, since models are often overparameterized and weights do not move much, the weights that get pruned according to this method are usually the weights that were simply initialized closest to 0. In the diagram below, we can consider only leaving the orange weights and cutting out the gray.
Another approach is to mask out the weights that are unnecessary for a specific task, if you’re trying to ship a model for a specific task.
Two major lines of research dominate NLP today:
We are also in a space race to produce efficient models with more parameters, given how much scaling has been effective.
The paper Which BERT classifies modern questions in NLP into the following categories:
We have many questions asked during and after lecture. Here are some of the questions.
In the neighborhood where I live, fire safety regulations require the streets to be super-wide (so wide that two fire trucks can pass even with cars parked along both sides of the street), and to have even wider turnarounds at the ends of the culs-de-sac. To break up the resulting vast expanses of pavement, we have occasional islands of green, public gardens too small to name as a park. They come in several different types: medians to separate the incoming and outgoing lanes at junctions with larger roads,
barriers to separate small groups of houses from the main flow of the road,
or giving some shape to the turnaround at the end of a cul-de-sac.
Most are ignored except by the community association’s gardeners and by passing cars, but I did find one set up as a neighborhood basketball court:
Scribe notes by Franklyn Wang
Previous post: Robustness in train and test time Next post: Natural Language Processing (guest lecture by Sasha Rush). See also all seminar posts and course webpage.
lecture slides (pdf) – lecture slides (Powerpoint with animation and annotation) – video
Before getting started, we’ll discuss the difference between the two dominant schools of thought in probability: Frequentism and Bayesianism.
Frequentism holds that the probability of an event is the long-run average proportion of something that happens many, many times, so a coin has 50% probability of being heads if on average half of the flips are heads. One consequence of this framework is that a one-off event like Biden winning the election doesn’t have any probability as by definition they can only be observed once.
Bayesians reject this model of the world. For example, a famous Bayesian, Jaynes, even wrote that “probabilities are frequencies only in an imaginary universe.”
While these branches of thought are different, generally the answers to most questions are the same, so the distinction will not matter for this class. However, these branches of thought inspire different types of methods, which we now discuss.
For example, suppose that we have a probability distribution that we can get samples from in the form of . In statistics, our goal is to calculate a hypothesis which minimizes (possibly a loss function, but any minimand will do) where we estimate with our samples .
A frequentist does this by defining a family of potential distributions, and we need to find a transformation which minimizes the cost for all , where
These equations amount to saying that is minimizing the worst-case loss over all distributions.
By contrast, a Bayesian approaches this task by assuming that where are latent variables sampled from a prior.
Then, let be the posterior distribution of conditioned on the observations . We now minimize
In fact, Bayesian approaches extend beyond this, as if you can sample from a posterior you can do more than just minimize a loss function.
In Bayesian inference, chosing the prior is often very important. One classical approach is a maximum entropy prior, because it’s the least informative (hence, the most entropy) and requires the fewest assumptions.
These two minimization equations con sometimes lead to identical results, but often in practice they work out differently once we introduce computational constraints into the picture. In the frequentist approach, we generally constrain the family of mappings to be efficiently computable. In the Bayesian approach, we typically approximate the posterior instead and either use approximate sampling or a restricted hypothesis class for the posterior to be able to efficiently sample from it.
Compare water and ice. Water is hotter, and the molecules move around more. In ice, by contrast, the molecules are more stationary. When the temperature increases, the objects move more quickly, and when they decrease the objects have less energy and stop moving. There are also phase transitions, where certain temperatures cause qualitative discontinuities in behavior, like solid to liquid or liquid to gas.
See video from here:
An atomic state can be thought of as . Now, how do we represent the system’s state? The crucial observation that makes statistical physics different from “vanilla physics” is the insight to represent the system state as a probability distribution supported on , rather than in a single state.
Each atomic state has a negative energy function (which we will think of as a utility function) . In some sense, the system “wants” to have a high value of . In addition, when the temperature is high, the system “wants” to have a high value of entropy.
Thus, an axiom of thermodynamics states that to find the true distribution, we need only look at the maximizer of
That is, it is the probability distribution which maximizes a linear combination of the expected negative energy and the entropy, with the temperature controlling the coefficient of entropy in this combination (the higher the temperature, the more the system prioritizes having high entropy).
The variational principle, which we prove later, states that the which maximizes this satisfies
which is known as the Boltzmann Distribution. We often write this with a normalizing constant, so , where .
Before proving the variational principle, we will go through some examples of statistical physics.
In the Ising model, we have magnets that are connected in a square grid. The atomic state is each , which represents “spin”. The value of is , where denotes that and are adjacent magnets. This encourages adjacent magnets to have aligned spins. One important concept, which we will return to later, is that the values of are sufficient to calculate the value of (these are known as sufficient statistics). Furthermore, if we wanted to calculate , it would be enough to calculate the values of and and then apply linearity of expectation. An illustration of an Ising model that we cool down slowly can be found here:
and a video can be found here.
This is what a low-temperature Ising model looks like – note that is high because almost all adjacent spins are aligned (hence the well-defined regions).
The Sherrington-Kirkpatrick model is a generalization of the Ising model to a random graph, which represents a disordered mean field model. Here, still represents spin, and where . The SK-model is deeply influential in statistical physics. We say that it is disordered because the utility function is chosen at random and that it is a mean field model because, unlike the Ising model, there is no geometry in the sense that every pair of individual variables and are equally likely to be connected.
Our third example is the posterior distribution, where is a hidden variable with a uniform prior. We now makes, say, independent observations . The probability of is given by
so . Note that the RHS is easy to calculate, but in practice the normalizing factor (also known as the partition function) can be difficult to calculate and often represents a large barrier, in the sense that computing the partition function makes many of these questions far easier.
Now, we prove the variational principle, which states that if , where is the normalizing factor, then
Before proving the variational principle, we make the following claim: if is defined as then
Proof of claim:
Write
(where we plugged in the definition of in ).
We can now rewrite this as
which means that
using the fact that is a constant (independent of ). Multiplying by and rearranging, we obtain the claim.
Proof of variational principle:
Given the claim, we can now prove the variational principle.
Let be any distribution. We write
where the first equation uses our likelihood expression. This implies that is maximized at , as desired.
Remarks:
which we’ve seen before! Note that equality holds when , but to approximate we can simply take more tractable values of .
In many situations, we can compute , but can’t compute , which stifles many applications. One upshot of this, though, is that we can calculate ratios of and , which is good enough for some applications, like Markov Chain Monte Carlo.
An important task in statistics is to sample from a distribution , for very complicated values of . MCMC does this by constructing a Markov Chain whose stationary distribution is . The most common instantiation of MCMC is Metropolis-Hastings, which we now describe. First, we assume that there exists an undirected graph on the states , so that iff and that for , the probabilities of and are similar in the sense that is neither too small nor too large. Then the Metropolis-Hastings algortihm is as follows.
Metropolis-Hastings Algorithm:
Then, eventually is distributed as ! To show that this samples from , we show that the stationary distribution is of this Markov chain is . In this case, this turns out to be easy, since we can check the detailed balance conditions so is the stationary distribution of the Markov Chain.
The stationary distribution is often unique, so we’ve proven that this samples from eventually. In MCMC algorithms, however, often an important question is how fast we converge to the stationary distribution. Often this is rather slow, which is especially dissappointing because if it were faster many very difficult problems could be solved much more easily, like generic optimization.
Indeed, there are examples where convergence to the stationary distribution would take exponential time.
Note: One way to make MCMC faster is to let be really close from , because the likelihood ratio will be closer to and will spend less time stuck at its original location. There’s a tradeoff, however. If is too close to , then the chain will not mix as quickly, where mixing is the convergence of to the stationary distribution, because generally to mix the value must get close to every point, and making each of ‘s steps smaller will make that more difficult.
One application of MCMC is in simulated annealing. Suppose that we have , and we want to find . The most direct attempt at solving this problem is creating a Markov Chain that simply samples from . However, this is impractical, at least directly. It is like we have a domino that is far away from us, and we can’t knock it down. What do we do in this case? We put some dominoes in between us!
To this end, we now create a sequence of Markov chains supported on , whose stationary distribution , as gets smaller and smaller. This corresponds to cooling a system from a high-temperature to a low-temperature. Essentially, we want to sample from .
Simulated annealing lets us begin by sampling from , which is uniform on the support. Then we can slowly reduce from to . When cooling a system, it will be helpful to think of two stages. First, a stage in which the object cools down. Second, a settling stage in which the system settles into a more stable state. The settling stage is simulated via MCMC on , so the transition probability from to is .
Simulated Annealing Algorithm:
Since simulated annealing is inspired by physical intuition, it turns out that its shortcomings can be interpreted physically too. Namely, when you cool something too quickly it becomes a glassy state instead of a ground state, which often causes simulated annealing to fail, and for this algorithm to get stuck in local minima. Note how this is fundamentally a failure mode with MCMC – it can’t mix quickly enough.
See this video for an illustration of simulated annealing.
Note that the posterior of conditioned on is
We now have , an exponential distribution. Now suppose that , where are the sufficient statistics of . For example, the energy function could follow that of a tree, so . If you want to find the expected value of $latex \mathbb{E}{x \sim p{W}}[W(x)]&bg=ffffff$ its enough to know $latex \mu = \mathbb{E}{x \sim p{W} }[\hat{x}]&bg=ffffff$. (By following a tree, we mean that the undirected graphical model of the distribution is that of a tree.)
Now how do we sample from a tree-structured distribution? The most direct way is by calculating the marginals. While marginalization is generally quite difficult, it is much more tractable on certain types of graphs. For a sense of what this entails, suppose that we have the following tree-structured statistical model:
,
whose joint PDF is where .
(The unusual notation is chosen so that the relationships between the marginals is clean. The log-density in question is a Laplacian Quadratic Form.)
Then one can show that if the marginal of is , then we have that
This will give six linear equations in six unknowns, which we can solve. Once we can marginalize a variable, say, , we can then simply sample from the marginal, then sample from the conditional distribution on to sample from the entire distribution. This method is known as belief propogation.
This algorithm (with some modifications) also works for general graphs, but what it represents on general graphs is not the exact marginal, but rather an extension of that graph to a tree.
Now let’s try to develop more theory for exponential distributions. Let the PDF of this distribution be . A few useful facts about these distributions that often appear in calculations: and $latex \nabla^2 (A(w)) = \text{Cov}{p{W}}(x) \succeq 0&bg=ffffff$.
By the variational principle, we have that among all distributions with the same sufficient statistics as a given Boltzmann distribution, the true one maximizes the entropy, so
An analagous idea gives us
so is the maximum entropy distribution consistent with observations .
The important consequence of these two facts is that in principle, determines and vice versa. (Up to fine print of using minimal sufficient statistics, which we ignore here.) We will refer to as the “canonical parameter space” and as the “mean representation space”. Now note that the first equation gives us a way of going from to , and the second equation gives us a way to go from to , at least information-theoretically.
Now, how do we do this algorithmically? Using , if were able to sample from (which is generally possible if we can evaluate then we can estimate the mean through samples. We can also obtain from by estimating .
On the other hand, if you have , to obtain you first consider and then note that the desired value is , so solving this problem boils down to estimating efficiently, because setting it to zero will give us .
Thus going from to requires estimating , whereas going from to requires estimating .
When is a posterior distribution, the observed data typically gives us the weights , and hence the inference problem becomes to use that to sample from the posterior.
There are many examples of exponential distributions, which we now give.
Now, we show how to go from to in a special case, namely in the mean-field approximation. The mean field approximation is the approximation of distributions by product distributions over , for which
Recall that the partition function can be computed as
where is the set of all probability distributions. If we instead write as the set of product distributions (parametrized by , or the probability that each variable is ), we get
where and it now suffices to maximize the right hand function over the set .
We can generally maximize concave functions over a convex set such as . Unfortunately, this function is not concave. However, it is concave in every coordinate. This suggests the following algorithm: fix all but one variable and maximize over that variable, and repeat. This approach is known as Coordinate Ascent Variational Inference (CAVI), and its pseudocode is given below.
CAVI Algorithm
This part is best explained visually. Suppose that we have a Boltzman distribution. In the infinite temperature limit the this is the uniform distribution, distributed over the entire domain. As you decrease the temperature, the support’s size decreases. (Note: The gifs below are just cartoons. These pretend as if the probability distribution is always uniform over a subset. Also, in most cases of interest for learning, only higher order derivatives of entropy will be discontinuous.)
Sometimes there’s a discontinuity in entropy, and the entropy “suddenly drops” in a discontinuity. Often the entropy function itself is continuous, but the derivatives are not continuous at that point – a higher order phase transition.
Sometimes the geometry of the solution space undergoes a phase transition as well, with it “shattering” to several distinct clusters.
Note: We will have an extended post on the replica method later on.
If you sampled from , where is a high-dimensional distribution, you’d expect the distances to all be approximately the same. The overlap matrix, or
we approximate as
where is some constant.
Suppose comes from a probability distribution . Then a very common problem is computing
which is the expected free energy. However, it turns out to be much easier to find . So here’s what we do. We find
This should already smell weird, because is going to zero, an unusal notational choice. We can now write this as
which we can write as
Now these ‘s represent the replicas! Now, we’d hope is an analytic function, and we can now write this as as goes to zero.
Generally speaking, only depends on overlaps of , so we often guess the value of and calculate this expectation.
We’ll now give an example of how this is useful. Consider a spiked matrix/tensor model, where we observe so that , where is the signal and is the noise. Thus here we have
and we want to analyze as a function of , which can be done by the replica method and exhibits a phase transition.
Other examples include Teacher-student models, where . Then, a recent work calculates the training losses and classification errors when training on this dataset, which closely match empirical values.
Lastly, we’ll talk about replica symmetry breaking. Sometimes, discontinuities don’t just make the support smaller, but actually break the support into many different parts. For example, at this transition the circle becomes the three green circles.
When this happens, we can no longer make our overlap matrix assumption, as there are now asymmetries between the points. This leads to the overlap matrix being striped, in the fashion one would anticipate.
Sometimes, the support actually breaks into infinitely many parts, in a phenomenon called full replica symmetry breakng.
Finally, some historical notes. This “plugging in zero” trick was introduced by Parisi approximately 30 years ago, receiving a standing ovation at the ICM. Since then, some of those conjectures have been rigorously formalized, but many haven’t. It is still very impressive that Parisi was able to do it.
Today I received an interesting email from our compliance office that is working on the accreditation of our PhD program in Statistics and Computer Science.
One of the requisites for accreditation is to have a certain number of affiliated faculty. To count as an affiliated faculty, however, one must pass certain minimal thresholds of research productivity, the same that are necessary to be promoted to Associate Professor, as quantified according to Italy’s well intentioned but questionably run initiative to conduct research evaluations using quantifiable parameters.
(For context, every Italian professor maintains a list of publications in a site run by the ministry. Although the site is linked to various bibliographic databases, one has to input each publication manually into a local site at one’s own university, then the ministry site fetches the data from the local site. The data in the ministry site is used for these research evaluations. At one point, a secretary and I spent long hours entering my publications from the past ten years, to apply for an Italian grant.)
Be that as it may, the compliance office noted that I did not qualify to be an affiliated faculty (or, for that matter, an Associate Professor) based on my 2016-2020 publication record. That would be seven papers in SoDA and two in FOCS: surely Italian Associate Professors are held to high standards! It turns out, however, that one of the criteria counts only journal publications.
Well, how about the paper in J. ACM and the two papers in SIAM J. on Computing published between 2016 and 2020? That would (barely) be enough, but one SICOMP paper has the same title of a SoDA paper (being, in fact, the same paper) and so the ministry site had rejected it. Luckily, the Bocconi administration was able to remove the SoDA paper from the ministry site, I added again the SICOMP version, and now I finally, if barely, qualify to be an Associate Professor and a PhD program affiliated faculty.
This sounds like the beginning of a long and unproductive relationship between me and the Italian system of research evaluation.
P.S. some colleagues at other Italian universities to whom I told this story argued that the Bocconi administration did not correctly apply the government rules, and that one should count conference proceedings indexed by Scopus; other colleagues said that indeed the government decree n. 589 of August 8, 2018, in article 2, comma 1, part a, only refers to journals. This of course only reinforces my impression that the whole set of evaluation criteria is a dumpster fire that is way too far gone.
Scribe notes by Praneeth Vepakomma
Previous post: Unsupervised learning and generative models Next post: Inference and statistical physics. See also all seminar posts and course webpage.
lecture slides (pdf) – lecture slides (Powerpoint with animation and annotation) – video
In this blog post, we will focus on the topic of robustness – how well (or not so well) do machine learning algorithms perform when either their training or testing/deployment data differs from our expectations.
We will cover the following areas:
The KL-divergence between the probability distributions is the expectation of the log of probability ratios:
KL-divergence is always non-negative and can be decomposed as the difference between negative entropy and cross-entropy. The non-negativity property thereby implies that distributions ,
Consider a Gaussian random variable . Concentration is the phenomenon that if you have i.i.d random variables with expectation , then the empirical average is distributed approximately like
Therefore the standard deviation of the empricial average is smaller than the standard deviation of each of the ‘s. The central limit theorem ensure this asymptotically w.r.t , while non-asymptotic versions of this phenomenon can be seen via popular inequalities such as the Chernoff/Hoeffding/Bernstein styled inequalities. These roughly have the form
We denote the spectral norm of a matrix by and the Frobenius norm by .
We use the matrix ordering if for all vectors . We similarly refer to if .
Vector valued normals: If is a vector, and is a psd covariance matrix, with normal over with
For every psd matrix V, there exists a Normal distribution where
Standard vector-valued normal: This refers to the case when (or and
For scalar random variables, we saw that if are i.i.d over bounded with expectation then
If the random variables were vectors instead, we can easily generalize this to samples of . Generalizing to matrices is far more interesting, especially when the norm under consideration is the spectral norm instead of the Frobenius norm.
If i.i.d symmetric matrices in with (i.e bounded in spectral norm), then
Note that is a matrix in this case, instead. The norm under consideration is the spectral norm. Note that the difference in the inequality w.r.t the scalar case is the additional multiplicative factor of d or an additive factor of in the exponent as in the equations above. Please refer to Tropp, 2015 Chapter 6 for formally precise statements.
Another property that follow is the the expected norm of the difference follows
There are some long-standing conjectures and results on cases when one can get rid of this additional factor of .
Trivia on log factors: For example as a tangent (as well as some trivia!) on some results of this flavor, where a factor was replaced by a constant, includes Spencer’s paper from 1985 titled ‘Six standard deviations suffice’, which showed that a constant of 6 would suffice in certain bounds where a naive result would instead give a dependency. The Kadison-singer problem (by Spielman-Srivastava) and the Paulsen problem are two other examples of works in this flavor.
For a random symmetric matrix matrix with , the spectrum of eigenvalues is distributed according to Wigner semi-circle law on a support between as shown in the figure below. Note that most mass is observed close to .
For
we have is the empirical estimate for covariance of
the eigenvalues are distributed according to the Marchenko-Pastur distribution. In the case when, , the eigenvalues will be bounded away from zero. When , then it has a lot more mass close to (way more than the semi-circle law). When , then there is a spike of eigenvalues at and the rest are bounded away from .
Connections to the stability of linear regression
In the regime, when , linear regression is most unstable, which is the case when most of the eigenvalues of the empirical covariance are close to . When , although linear regression is over-parametrized, the solution is not exact, but the approximate solution is still pretty stable as in the case of . When , the condition number is infinite, and therefore we use a pseudo-inverse as opposed to the usual inverse in computing the solution for linear regression. This ignores the subspace on which the matrix has eigenvalues while the inverse is performed. The condition number of the subspace of the non-zero eigenvectors is finite.
Note: Its variants/connections include techniques/topics of, Follow The Regularized Leader / Regret Minimization / Mirror Descent. Elad Hazan’s lecture notes on online convex optimization and Arora, Hazan, Kale’s article on multiplicative and matrix multiplicative weights are a great reading source for these topics.
The following is a general setup in online optimization and online learning.
Setup: possible actions
At time , we incur loss for action at time . After this action, we also learn the loss for actions we did not take. (This is referred to as the experts model in online learning, in contrast to the bandit model where we only learn the loss for the action taken; the experts model is an easier setup than the bandit model.)
The following is the overall approach for learning to optimize in this online setup.
Overall Approach:
The hope is that this approach converges to a “good aggregation or strategy”. We measure the quality of the strategy using regret.
The difference between the cost we incur and the optimal action (or probability distribution over actions) in hindsight is known as the (average) regret.
Note that we compare the loss we incur with the best loss over a fixed (i.e., nonadaptive) probability distribution over the set of possible actions. It can be shown that the best such loss can always be achieved by a distribution that puts all the weight on a single action.
The following theorem bounds the regret.
The ‘prior ignorance’ term captures how good the initialization is. The ‘sensitivity per step’ term captures how aggressive the exploration is (i.e this term governs the potential difference between loss incurred at versus ). As also occurs in the denominator of the prior ignorance term, it needs to be carefully chosen to balance the two terms properly.
The first inequality of this theorem is always true for any whether is optimal strategy or not. The second (right-most) inequality is true when optimal is the delta function (which it will be in optimal strategy), is initialized to uniform distribution, and is set to be . Note that in this case, the divergence .
To state the ‘overall approach’, more precisely, there needs to be a normalization at every step as below
The above can be rearranged as follows upon taking log
By substituting this in below
we get the following expanded version of what the regret amounts to be:
The last equation above is due to the telescoping sum where adjacent terms cancel out except the first and last terms.
The last inequality here is because the cross-entropy is always at least as large as the entropy.
So now we have this so far
PF: Regret
Now there is this property that if s.t. for then
Therefore upon substituting this, we prove the upper bound stated over the regret.
This subclaim that was used can be proved as follows
When the set K of actions is convex (set of probability distributions on discrete actions is convex),
At time , it makes a choice and learns cost function (so the setup is again like the experts model unlike the bandit model).
Now the new action in FTRL is based on the following optimization which is a regularized (with ) loss:
FTRL:
In this case, the regret bound is given by the following theorem.
refers to the optimal choice. This indeed has a similar flavor to the theorem we saw for multiplicative weights. To be precise when
we have that the multiplicate weights becomes FTRL. Similarly there is a view that connects multiplicative weights to mirror descent. (Nisheeth Vishnoi has notes on this.)
We now look at some robustness issues specific to the training phase in machine learning.
For example, during training, there could be adversarially poisoned examples in the training dataset that damage the trained model’s performance.
Setup: Suppose samples are generated from a genuine data distribution and samples are maliciously chosen from an arbitrary distribution. i.e, in formal notation: are arbitrary. (While for convenience we assume here that the last items are maliciously chosen, the learning algorithm does not get the data in order.)
Assume for
Let us start with the task of estimating the mean under this setup.
Mean estimation: Estimate
Noiseless case: In the noiseless case, the best estimate for the mean under many measures is the empirical mean
Standard concentration results show that for and for a general that .
Adversarial case: If there is even a single malicious sample (that can be arbitrarily large or small) then it is well known that the empirical mean can give an arbitrarily bad approximation of the true population mean. For we can use the empirical median . This is guaranteed to lie in the quantile of real data. This is most optimal because of the property that .
In a higher-dimension, this story instead goes as follows:
Setup: arbitrary.
Median of coordinates (a starter solution): When , we have per coordinate as an initial naive solution. (i.e a median of coordinates). In this case, say if we have then an adversary can perurb the data to make fraction of the points be where is some large number. This will shift in each coordinate the distribution to have median instead of median , and hence make which implies that for some constant .
The obvious next question is: Can we do better? i.e, can we avoid paying this dimension-dependent price?
Yes we can! We can use the ‘Tukey Median’!.
What is a Tukey median?
Informally via the picture below, if you have a Tukey median (red), no matter which direction you take from it and look at the half-space, it exactly partitions the data into about half the # of points.
Formal definition of Tukey median is given as follows (fixing some parameters):
A Tukey median* of is a vector s.t. for every nonzero
A Tukey median need not always exist for a given data. However, we will show that if the data is generated at random according to a nice distribution, and then at most fraction of it is perturbed, a Tukey median will exist with high probability. In fact, the true population mean will be such a median.
Existence of Tukey median
THM: If and then
Together 1. and 2. mean that if we search over all vectors and output the first Tukey median that we find, then (a) this process will terminate and (b) its output will be a good approximation for the population mean. In particular, we do not have to pay the extra cost needed in the median of coordinates!
** Proof of 1 (existence):**
The mean itself is the Tukey median in this case because, , if we define then these are i.i.d ±1 vars of mean zero, and thus:
for and . Through discretization, we can pretend (losing some constants) that there are only unit vectors in . Hence we can use the union bound to conclude that for every unit , if we restrict attention to the “good” (i.e., unperturbed) vectors , then the fraction of them satisfying will be in . Since the adversary can perturb at most vectors, the overall frection of ‘s such that will be in . QED(1)
** Proof of 2:**
Let be the population mean (i.e., the “good” ‘s are distributed according to ). Suppose for simplicity, and toward a contradiction, that .
Let . Then,
.
Note that is distributed as , and so we get that is distributed as .
Hence, if then .
This implies that via a similar concentration argument as above, for every , there will be with high probability at most fraction of ‘s such that , contradicting our assumption that was a Tukey median. QED(2)
Exactly computing the Tukey median is NP-hard, but efficient algorithms for robust mean estimation of normals and other distributions exist as referred to in Jerry Li’s lecture notes. In particular, we can use the following approach using spectral signatures and filtering.
Spectral Signatures can efficiently certify that a given vector is in fact a robust mean estimator.
Let and arbitrary
Let be the empirical mean and be the empirical co-variance. Then we can bound the error of as follows:
Claim:
In other words, if the spectral norm of the empirical covariance matrix is small, then the empirical mean is a good estimator for the population mean.
Note: If all points are from then .
The Proof for this claim is given below
Explanation: Here, we assume the for simplicity without loss of generality. The norm can be split into additive terms on good points (green text above) and malicious points (red text above). The first term of this inequality follows from standard concentration. For the second (red) term, we can modify it by adding and subtracting . We can then apply the Cauchy-Schwartz (cs) inequality to prove it. Upon rearranging the terms and dividing by the norm of , we get the desired result.
Filtering is an approach to turn the certificate into an algorithm to actually find the estimator. The idea is that the same claim holds for non-uniform reweighting of data points to estimate the empirical mean and covariance. Hence we can use a violation of the spectral norm condition (the existence of a large eigenvalue) to assign “blame” to some data points and down-weigh their contributions until we reach a probability distribution over points that on the one hand is spread on roughly fraction of the points and on the other hand leads to a weighted empirical covariance matrix with small spectral norm. The above motivates the following robust mean estimation algorithm in the online case as below:
Explanation: We first compute the mean and covariance based on uniform weighting based, and the certificate is checked via the spectral norm of the covariance. If the quality isn’t good enough, then the blame is given to the largest eigenvector of the covariance that contributes to the most error. The weighting is now improved from the uniform initialization via the multiplicative weights styled update as given in step 3. Jerry Li and Jacob Steinhardt have wonderful lecture notes on these topics.
The algorithm above is computationally efficient while the bounds are not that tight.
SoS algorithms: Another approach is via sum of squares algorithms where the guarantees for statistical robustness are much tighter, but they are computationally not very efficient although they are polynomial time. A hybrid approach might as well give a balance of these based on the problem at hand to bridge this gap between computational efficiency vs. statistical efficiency.
A list of relevant references is given below.
We now cover robustness issues with distribution shift and adversarial data poisoning during the testing phase in machine learning.
As shown in Steinhardt, Koh, Liang, 2017 the images below illustrate how poisoning samples can drastically alter the performance of a classifier from good to bad.
Shafahi , Huang, Najibi, Suciu, Studer, Dumitras, Goldstein, 2018 showed how poisoning images look perfectly fine to the human perception, while they flip the model’s performance where a fish is considered to be a dog and vice versa as shown below.
Another problem with regards to test-time robustness is the issue of domain shift where the distribution of test data is different from the distribution of training data.
vs. . If L is bounded, then if Lipschitz constant is known then distances like earth mover’s distance, T.V distance can be used to bound this. But one needs to be quite careful or these bounds are too large or close to vacuous.
For example, images of dogs taken say in a forest vs. images of dogs taken on roads have huge distance in measures like the ones mentioned above, as many pixels (although not important pixels) across the images are very different and there could be a classifier that performs terribly across the test while it does good on train. But magically, there appears to be a linear relationship between the accuracy on vs accuracy on . Typically one would expect that the line would be below the x=y (45 degree) line.
i.e say if it was trained on cifar 10 () and tested on cifar10.2 () then finding this line to be a bit lower than the line is not surprising but this linear correlation is quite surprising. It is even more surprising if the datasets are different-say in the case when performance on photos vs illustrations or cartoons of these photos was considered.
What are the potential reasons (intuitively)?
i) Overfitting on cifar test,
ii) Fewer human annotators per image introduces skew towards hardness of dataset,
iii) Running out of images by human annotators as they might end up choosing images that are easier to annotate.
If we achieve better accuracy on than we achieved on , it is a strong indicator of a drop in hardness. If the errors are in the other direction, then this reasoning isn’t as clear.
Here is a toy model can demonstrate this surprising linear relationship between accuracies under domain shift.
Let us do a thought experiment where things are linear. Let us assume there was a true vector cat direction in terms of the representation(feature) as shown in the cartoon below. Let there be some correlated idiosyncratic correlations. An example, idiosyncrasy is say due to cats tending to be photographed more in indoors than in outdoors.
Consider to be a point that needs to be labeled.
In some dataset consider the probability that is labeled as a cat is proportional to the following exponential as:
labeled cat
where to denote the theidiosyncratic correlation factor. This is the exponent of the dot product of the image to be labeled with the CAT direction and the idiosyncratic direction. The same can be done for a dataset as
Dataset labeled cat
Intuitively is like the signal to noise ratio. That is, if then is a harder dataset to classify than .
So in this toy model, the best accuracy that can be reached for any linear classifier is given by the following, where the softmax of the RHS is the classification probability.
The first term of this accuracy is the universal and transferable part and the second term is the idiosyncratic part.
The following is an For the we assume that if is trained on then we assume . This is because if is trained on , then idiosyncratic directions of are orthogonal to each other. So the and the accuracy will be just this term of .
If the model is learnt by gradient descent then the gradient direction will always be proportional to as the gradient is of the form
So, if is trained on , then Noise Noise. Here, is given by .
Therefore the accuracies will be as follows
Therefore, we see this form of a linear relationship:
Note that:
Although this is a toy theoretical model, it explains a linear relationship. However, finding a model that explains this linear relationship in real-life will be an interesting project to think of.
We now move to the last (yet another active) topic of our blog: adversarial perturbations. As an introductory example (taken from this tutorial), the hog image was originally classified as a hog with probability. A small amount of noise was added to get the image which to the human eye perceptually looks indistinguishable. That said the model now ends up misclassifying the noised hog to something that is not a hog.
Should we be surprised with the efficacy of such adversarial perturbations? Originally-certainly yes, but not as much now in hindsight!
In this example it turns out that the magnitude of each element satisfies and the 2-norm of this vector is roughly as . Note that the Resnet50 model outputs at the penultimate layer has a dimension . We scale such that . There is a Lipschitz constant w.r.t the preservation of the following norms: . The final classification decision is made by looking at where is a unit vector such that unit vector s.t. is . We assume some randomness in the decision as being for simplicity. So we have
where .We now have the probability that is not a hog as
is not hog .
As we know that the observed probability of being not a hog is 0.0996, we can calculate that .
Upon normalizing to be , We can expect the following square of this dot product w.r.t the representation as:
Therefore the norm of the projection of to the HOG direction is given by .
So say if the 2048 vector had one larger element which accounts for the HOG direction, although it still accounts for a small proportion of its total norm. Therefore it wouldn’t need too much noise to flip one class to its wrong class label as shown in the cartoons below.
So if the Lipschitz constant is greater than around 2.5 or 3, then a fraction of 1/25 is enough to zero out the hog direction (as ).
What are some strategies for training neural networks that are robust to perturbations?
i.e, they only perturb the image or data sample by atmost upon applying that transformation.
Now, given a loss function and a classifier , a robust loss function of at point is defined as
Given and arobust loss function, a robust training would involve the minimization of this loss which gives us:
Now for the subgoal of finding for the sake of optimization, invoking Danskin’s Theorem will greatly help. The theorem basically says that if is nice (diff, continuous) and if is compact then we have that:
i.e for any function , that depends on and as long as the function space from which needs to be chosen is nice (diff, continuous) and compact, then to find the gradient of the maximum of the function then by the theorem one can find maximizer for any particular , then that can give the required gradient (after some fine print that is given in note below).
Note: The paper below extends to the case when non unique though there is other fine print. See Appendix A of (Madry Makelov Schmidt Tsipras Vladu 2017).
On the empirical side, there seems to be a trade-off. If one wants to achieve adversarial robustness, it can be achieved by letting go of some model accuracy. See discussion here and the papers cited below.
This is embarrassing to admit but after a few badly timed trades on GameStop options I find myself a bit tight on money. To raise some cash, I reluctantly decided to sell one of my prized possessions, one of my theorems.
For Sale: Boolean formula satisfiability cannot be solved in both logarithmic space and quasilinear time. For a more formal and slightly more general statement and a proof, see this paper.
Bidding starts at 12 BTC (about $705,000).
The winning bid, upon verified payment, will receive:
All non-metallic UK currency to feature Computing and Data Science pioneers
Cropped from BBC source |
Charles Babbage and Ada Lovelace will be reunited. Their work on designing and programming a computing device, a century before any machine was built, is being honored by their appearance on the reverse of the 5-pound note in a new series authorized by the Bank of England. Royal Statistical Society Fellow Florence Nightingale will appear on the 10, EDSAC creator Maurice Wilkes on the 20, and Alan Turing on the 50.
Today, April 1, we are delighted to convey this news and show previews of the banknotes.
We have mentioned Babbage and Lovelace several times on this blog, the latter first here. Our 2015 post on her work on his project sought to engage a scholarly consensus that tends to minimize it. We argue that it should be judged on the plane of a doctoral or masters advisory relationship. Our point is that she greatly amplified the technical content of his work as shown to the scientific community.
The design shown above makes a leap from their Victorian world to our online age. Being “online” back then probably meant being reached by the new railroad system centered on London. The one indelible connection between our world and theirs is shown by the obverse of the banknote: both worlds have a long-serving British queen.
Besides a long life (1820–1910), Nightingale has a long entry as a founder of statistics. Her actions and influence extended for more than five decades after her iconic “Lady With the Lamp” ministry to soldiers of the Crimean War depicted on the 10-pound note:
Nightingale 200 Years source |
The picture at right is based on a photo by William Kilburn, one of few series of photos she allowed to be taken. We have not found a definitive word on what she is holding—not even from this long article—but we would like to believe it is a letter or paper regarding medical practice. Already in 1854, the Oxford Chronicle and Reading Gazette noted her bringing much more than a lamp:
In a knowledge of the ancient languages and of the higher branches of mathematics, in general art, science, and literature, her attainments are extraordinary. There is scarcely a modern language which she does not understand; and she speaks French, German, and Italian as fluently as her native English.
Not only did she popularize the pie chart and invent other statistical visualization techniques, she was among the first people, period, to bring statistical inference into policy. This included demonstrating with data the higher exposure of nurses to multiple diseases among those they tended. She was inducted into the Royal Statistical Society in 1858.
Her medical conduct principles have been applied expressly during the pandemic, as recognized by the time of the 200th anniversary of her birth last May 12, and field hospitals in the UK for Covid-19 treatment are named for her. Her recognition on the currency was considered a way to honor first-line caregivers and was a unanimous vote of the Bank of England commission. It has the earliest release date of all the banknotes.
Maurice Wilkes had an even longer life, 1913 to 2010. This marks a stark contrast to Turing. Wilkes was the second winner, in 1967, of the Turing Award. The Turing Award citation said he was
“the builder and designer of the EDSAC, the first computer with an internally stored program.”
He was thus the first realizer of the full vision of Babbage and Lovelace, although many claim the stored-program distinction for the Manchester Baby one year earlier, in 1948.
There was controversy about the order of him and Turing. The banknotes were intended to go in chronological order. Turing’s famous paper conceiving the computer dates to 1936–37. Wilkes’s first brush with the project that became the EDSAC came in 1945, when he spent an iconic 24 hours with a loaned copy of John von Neumann’s First EDVAC Report that he was not able to mimeograph. However, Wilkes was involved with analog computing devices before 1937 and his EDSAC reached full operation in May 1949, whereas Turing’s own main project, the ACE, launched its pilot version only in 1950.
The ultimate determiner was that the 20-pound note was already recently revised to feature the artist Joseph Turner in February 2020. The Turner issue will have a five-year term, so Wilkes will appear in 2025. The banknote design has not yet been executed.
Of course, this is the highlight for us, and we are delighted to be able to show the Turing design in full glory. It was formally unveiled last week:
Belfast Telegraph source |
The Pilot ACE machine is shown overlaid with Turing machine code from his 1936 paper. Turing tapes with binary code appear in three places. The big one represents his June 23, 1912 birth date—exactly how, we leave as a puzzle. We’ll just hint that all our friends across the Pond write dates differently from how we do. The new banknotes are being released on Turing’s birthday this coming June.
The GCHQ, which grew out of Turing’s employers at Bletchley Park, has just released its own puzzle challenge, which references Easter-egg features of the Turing banknote. The binary code puzzle is 11th of 12, but to our surprise, when you click it, it gives the answer to the birth date part right away. Then it proceeds to something far more cryptic—well, it’s from GCHQ after all.
The Turing release will also complete the conversion from paper to polymer of all Bank of England notes, which are also the only series issued for Wales. Scotland and Northern Ireland issue their own pound-denominated banknotes with different designs. They have yet to convert to polymer, and that will delay any consideration of adopting the new series.
Turing’s family on his father’s side came from Scotland, so there is some sentiment for the Royal Bank of Scotland adopting the design. There is also discussion of recognizing Robin Milner and/or Donald Michie, though both were born in England.
Over in Ireland, both sides of the Northern Ireland border are considering the statistician William Gosset, who as Head Brewer of Guinness over a century ago conceived the Student t-test among many innovations. However, Gosset was also born in England. The Bushmills distillery in Northern Ireland has already appeared on their 5-pound note.
In fact, Gosset and Milner and Michie were all originally nominated alongside Turing for the 50-pound note. We spot Bertrand Russell and William Tutte and Karen Spärck-Jones on the list. In physics, we notice Paul Dirac and John Bell and Stephen Hawking, and also the astronomer Vera Rubin—who was only American, no? Bell was from Northern Ireland.
Sir Michael Atiyah is not on the list because he was living at the time—no living person other than the monarch may appear on a banknote. This law does not apply to e-currency however, and this leads to what for us is the most exciting aspect of the Bank of England initiative.
The following is a GLL exclusive. It was conveyed to us by Dr. Lofa Polir, who since our story a year ago has resided in England. It comes from a confluence of several facts:
The Bank of England recently voted to create an active response to this situation: the semi-fungible token (SFT). Like Bitcoin and other cryptocurrencies, SFTs limit their supply, but unlike them, SFTs are maintained by a central body and backed by its assets. An SFT is fungible by dint of having a fixed redemption rate to standard currency, but its own price may go higher. As with NFTs, each SFT is associated to a piece of digital artwork.
The new issues are based on the Royal Mail’s 2015 Inventive Britain collection, whose honorees include Sir Tim Berners-Lee for the World Wide Web. The main stamps were abstract, but associated issues are not subject to the law against likenesses of living persons:
Composite of src1, src2 |
Berners-Lee will be the first honoree in the SFT series. We are not authorized to reveal the art design before its release. But Lofa sent us a list of nominees that includes Geoffrey Hinton, Tony Hoare, and Leslie Valiant on the CS side, plus David Spiegelhalter and Demis Hassabis in data science. Unlike with the banknotes, the Bank of England can “catch them all.” Imagine, then, you will soon be able to pay for groceries with cards of some of your favorite computer and data scientists—at least if they are British and you buy the groceries in Britain.
If you see a market for NFTs, do you see one for SFTs—or are they both April Fools?
[added a sentence to the description of SFTs.]
As is often the case, not all of these are links; they’re copies of posts that I made over on my Mastodon account because they were too short to make full posts here.
Borromean rings (\(\mathbb{M}\)), now a Good Article on Wikipedia. You can’t make the Borromean rings with geometric circles: as Tverberg observed, an inversion would make one of them a line. But then each of the other two circles would span an angle less than \(\pi\) as viewed from the line, leaving an unspanned direction along which the line could escape to infinity, contradicting the inseparability of the rings.
The same proof shows that Brunnian links cannot be made from four geometric circles. But the original proof that Borromean rings are not circular, by Freedman and Skora using 4d hyperbolic geometry, applies more generally to any Brunnian link no matter how many components it has.
Squares can be divided into \(45^\circ\)–\(60^\circ\)–\(75^\circ\) triangles (\(\mathbb{M}\)), as Ed Pegg posts on twitter. This is an interesting variant on something I looked at a long time ago, triangulating the square to minimize the maximum angle. But Pegg’s subdivision is not edge-to-edge. And the triangles on my page are not all similar. Do squares have edge-to-edge triangulations by similar acute triangles?
Science joins other publishers in making it easy for authors to change their names on past publications (\(\mathbb{M}\), via). Transgender scientists made this change necessary but I think it benefits all of us. I’ve seen plenty of name changes in the literature for other reasons (marriage, divorce, escaping prejudice, …) and making it easier to find your old papers is generally a good thing.
Cairo in Denmark? (\(\mathbb{M}\)). This is a screenshot from Google Maps of Havnepromenade in Copenhagen, where it meets the end of Cort Adelers Gade. I’m sad to have missed this spot when I was in Copenhagen a couple years ago; it’s a nice example of a Cairo pentagonal tiling.
My latest illustration trick: make complex non-Boolean combinations of shapes by overlaying simple shapes, exploding the overlay into simple regions, and unioning the regions into the shapes I want (\(\mathbb{M}\)). Exploding and unioning are single clicks in Illustrator’s pathfinder tool. I used two levels of this technique to make this diagram of a Brunnian link: once to make the shape of a single component, and again to make the over-under relation between components.
Real induction (\(\mathbb{M}\)). An induction principle for proving statements about all real numbers in intervals.
Recamán Sequence (\(\mathbb{M}\)). Linocut art by Josh Millard in a style drawn from a Numberphile video.
Quanta surveys recent developments in fast matrix multiplication (\(\mathbb{M}\)), including a paper from last SODA by Josh Alman and Virginia Williams improving the exponent from 2.3728639 to 2.3728596. Known barriers to the widely-expressed hope that the exponent can be reduced to 2 (see the SODA paper’s introduction) are, I think too specific to be convincing. On the other hand at this rate there’s still a long way from here to 2…
Printing algebraic numbers (\(\mathbb{M}\), via). So your code can handle symbolic representations of exact algebraic numbers rather than just approximating everything as complex/float. How do you get those representations out to human users in a readable way? From Fredrik Johansson, long-time developer of open symbolic algebra and exact real arithmetic software including SymPy and Calcium.
Fermat’s right triangle theorem (\(\mathbb{M}\)), now a Good Article on Wikipedia. As Fibonacci observed and Fermat proved (as his only surviving proof), integer right triangles cannot have an area that is a perfect square. One corollary of this is the exponent-4 case of Fermat’s Last Theorem.
Tom Harris asks: What’s your favorite OEIS sequence? Mine for this week at least is the Moser–De Bruijn sequence, OEIS A000695.
The most recent image of me on my site was from 2009 and getting pretty out-of-date. So I took a screenshot of myself as I look these days on Zoom (\(\mathbb{M}\)). This is far from the first time I’ve let my hair go at least this long, but I’m still looking forward to a time when I will feel more comfortable getting a haircut. It occurs to me that there has been a reversal: now, instead of showing off rebelliousness, I am visibly conforming to safety rules.
Fight over open data in rock climbing (\(\mathbb{M}\), via).
Weird probability distributions on dyadic rationals from a simple averaging process (\(\mathbb{M}\)). Start with the two-element multiset \(\{0,1\}\), repeatedly draw two-element samples (without replacement) from the multiset and include the average into the multiset. The result tends to cluster around some value, not the same value every time but itself drawn from a bimodal distribution, with the clustering sort of looking Cauchy but not quite. Interesting new MathOverflow question.
How many hyperplanes does it take to slice all the edges of an -dimensional hypercube (\(\mathbb{M}\))? Two if you can skim the ends of the edges or include entire edges in hyperplanes but more to really slice them. The obvious answer of axis-parallel cuts can be improved to \(5n/6\), and Lance Fortnow reports that the newest lower bound is \(\Omega(n^{0.57})\), by Gal Yehuda and Amir Yehudayoff, improving their own new preprint.
A list of current blogs on computer theory and related math (\(\mathbb{M}\)), from Gödel’s Lost Letter and P=NP.
The New York Times is a great newspaper: it is also No Fun—Molly Ivins
The Gray Lady, as the New York Times is usually called, believes in her real slogan:
All the news that’s fit to print.
This slogan started about 115 years ago, and is on the masthead of the paper.
Today I thought, for something different, we would supply just news.
That is links to blogs on computer theory and related math. Links that are current, that are interesting, and that are fit to refer to. All the blogs that
Our rule is we include a link provided it is current. It must have a recent post—say at least this year. See all for a complete list. But we decided to emphasis those that are up to date.
What’s new by Terence Tao
It is here. A recent article is “Boosting the van der Corput inequality using the tensor power trick”.
Gower’s Weblog by Timothy Gowers
It is here. A recent article is “Leicester mathematics under threat again”.
Not Even Wrong by Peter Woit
It is here. A recent article is “The Future of Fundamental Physics”.
Shtetl-Optimized by Scott Aaronson
It is here. A recent article is “QC ethics and hype: the call is coming from inside the house”.
Some Plane Truths by Adam Sheffer
It is here. A recent article is “The 2021 Polymath Jr Program”.
Computational Complexity by Lance Fortnow and Bill Gasarch
It is here. A recent article is “The key to my Taylor series problem: Buddy can you spare a penny, nickel, dime, or quarter”.
11011110 by David Eppstein
It is here. A recent article is “More mathematics books by women”.
Decentralized Thoughts by Ittai Abraham, Ittay Eyal, Kartik Nayak, Ling Ren, Alin Tomescu, Avishay Yanai
It is here. A recent article is “Living with Asynchrony: the Gather protocol”.
Foundation of Data Science by dstheory
It is here. A recent article previews a talk this Thu. Apr, 1 (at 1pm ET) by Ingrid Daubechies on “Discovering low-dimensional manifolds in high-dimensional data sets.”
Machine Learning Research Blog by Francis Bach
It is here. A recent article “Going beyond least-squares – II : Self-concordant analysis for logistic regression”.
Combinatorics and more by Gil Kalai
It is here. A recent article “What is mathematics (or at least, how it feels)”.
Kamathematics by Gautam Kamath
It is here. A recent article “Learning Theory Alliance and Mentoring Workshop”.
Process Algebra Diary by Luca Aceto
It is here. A recent article “Article by Sergey Kitaev and Anthony Mendes in Jeff Remmel’s memory”.
in Theory by Luca Trevisan
It is here. A recent article is an announcement for accepted papers STOC 2021.
“Marge, I agree with you — in theory. In theory, communism works. In theory.” — Homer Simpson.
Algorithms, Nature, and Society by Nisheeth Vishnoi
It is here. A recent article is an announcement for FOCS 2022.
Did we leave any out? Did we include any we should have
Shuchi Chawla, Madhur Tulsiani, and I are organizing a new summer school aimed at exposing undergraduate students to research directions in theoretical computer science and its applications. The school will take place from May 31 till June 4, 2021. This first iteration will be online, but we hope it will become a recurring and lasting event. See https://boazbk.github.io/tcs-summerschool/ for the webpage. We already have 5 exciting instructors signed up – Antonio Blanca, Ashia Wlson, Jelani Nelson, Nicole Immorlica, and Yael Kalai. We plan to have both technical tutorials / mini-courses and networking/social events.
This course is aimed at undergraduate students, and in particular students from groups that are currently under-represented in our field. If you are a faculty member in a university, we would be grateful if you can spread the announcement below to both your colleagues, and any mailing lists for undergraduate students, and in particular chapters of affinity groups such as the National Society of Black Engineers, Society of Hispanic Professional Engineers, National Center for Women & Information Technology, etc.
The committee for advancement of theoretical computer science (CATCS) is organizing an online summer course that will take place on May 31 till June 4, 2021. New horizons in theoretical computer science is a week-long online summer school which will expose undergraduates to exciting research areas in the area of theoretical computer science and its applications. The school will contain several mini-courses from top researchers in the field. The course is free of charge,and we welcome applications from undergraduates majoring in computer science or related fields. We particularly encourage applications from students that are members of groups that are currently under-represented in theoretical computer science.
The course is intended for currently enrolled undergraduate students that are majoring in computer science or related fields. Students will be expected to be familiar with the material typically taught in introductory algorithms and discrete mathematics / mathematics for computer science courses. If you are unsure if you are prepared for the course, please write to us at summer-school-admin@boazbarak.org
To apply for the course, please visit https://boazbk.github.io/tcs-summerschool/ and fill in the application form by April 15, 2021.
Course organizers: Boaz Barak (Harvard), Shuchi Chawla (UT Austin), Madhur Tulsiani (TTI-Chicago)
Current list of confirmed instructors: Antonio Blanca (Penn State University), Ashia Wilson (MIT), Jelani Nelson (UC Berkeley), Nicole Immorlica (Microsoft Research), Yael Kalai (Microsoft research).
Please email summer-school-admin@boazbarak.org with any questions.
Here's a neat result I heard about at virtual Dagstuhl last week, a new lower bound on the number of hyperplanes that cuts all the edges of a hypercube.
A n-dimensional hypercube has 2^{n} vertices corresponding to the binary strings of length n. Edges are between two vertices that differ in exactly one bit, for a total of (n/2) 2^{n} edges. Hypercubes played a large role in Hao Huang's recent proof of the sensitivity conjecture.
A hyperplane is just a linear inequality in x_{1},…,x_{n} the bits of the string corresponding to a vertex. An edge is cut if the inequality is true for one vertex and false for the other.
With n hyperplanes you can cut all the edges in two very different ways.
The next Foundations of Data Science virtual talk will take place on Thursday, April 1st at 10:00 AM Pacific Time (13:00 Eastern Time, 19:00 Central European Time, 17:00 UTC). Ingrid Daubechies from Duke Univeristy will speak about “Discovering low-dimensional manifolds in high-dimensional data sets.”
Please register here to join the virtual talk.
Abstract: This talk reviews diffusion methods to identify low-dimensional manifolds underlying high-dimensional datasets, and illustrates that by pinpointing additional mathematical structure, improved results can be obtained. Much of the talk draws on a case study from a collaboration with biological morphologists, who compare different phenotypical structures to study relationships of living or extinct animals with their surroundings and each other. This is typically done from carefully defined anatomical correspondence points (landmarks) on e.g. bones; such landmarking draws on highly specialized knowledge. To make possible more extensive use of large (and growing) databases, algorithms are required for automatic morphological correspondence maps, without any preliminary marking of special features or landmarks by the user.
The series is supported by the NSF HDR TRIPODS Grant 1934846.
The joy of knowing, understanding, and creating is common to all the sciences.—Vera Sós.
Dorit Aharonov is a researcher in quantum computation. She was a Ph.D. student of Michael Ben-Or and Avi Wigderson. Her 1999 thesis, entitled “Noisy Quantum Computation,” included her role with Ben-Or as one of several superposed groups who discovered the quantum fault tolerance threshold theorem.
Today Ken and I wish to send our congratulations to Avi and Laci—László Lovász—on winning the 2021 Abel Prize.
The prize citation is:
For their foundational contributions to theoretical computer science and discrete mathematics, and their leading role in shaping them into central fields of modern mathematics.
You might wonder why we are talking about Aharonov? She was a student of Avi and he was a student of mine—see this for more details. I am proud of Avi for his terrific results and also for his great students. I guess Aharonov is a grand-student of mine.
Lovász and Ken’s advisor, Dominic Welsh of Oxford, partnered with Dominic’s student Magnus Bordewich and Michael Freedman of Microsoft on a 2005 paper titled, “Approximate counting and quantum computation.” The paper translates criteria for approximating the Jones polynomial into simulations of quantum circuits. It references how topological quantum field theory (TQFT) implicitly embodies a quantum algorithm that meets the criteria, hence showing that the Jones approximation problem is complete for in a suitable promise-problem sense.
The genealogy of the Jones polynomial is that it was discovered by Vaughan Jones. Aharonov, with Jones and his student Zeph Landau, wrote a 2009 paper that makes the quantum algorithm for approximating the Jones polynomial explicit. To quote the converse of a statement in their abstract, removing the dependence on TQFT makes the algorithm “accessible to computer scientists.”
Aharonov’s work thus represents the mixing of elements of classical mathematics and computing theory (including quantum, i.e. non-classical, computing) that this year’s Abel Prize recognizes. Mixing is indeed the word, since results on the mixing rates of Markov chains by Lovász and many others undergird some of this work. This is true even more directly of a notable paper by her with Andris Ambainis, Julia Kempe, and Umesh Vazirani on quantum walks.
When we hosted the 2012 debate between Gil Kalai and Aram Harrow on the feasibility of large-scale quantum computing, Gil led off with the following photo of the physicist Robert Alicki, Ben-Or, Aharonov, and himself, which was taken in Jerusalem in 2005.
Aharonov wrote an influential long survey in 1998 titled “Quantum Computation,” which grew out of her thesis and is sometimes confused with it. Ken and I owe a small debt to diagrams she used in the survey, for which she in turn alluded to Richard Feynman’s path integrals and their visualizations:
We expanded on this kind of diagram in our textbook on quantum computing with MIT Press, whose second edition is scheduled to appear next month.
Whereas the Fields Medals are awarded only every four years, the Abel Prize is annual. It commemorates the Norwegian mathematician Niels Abel. Although it has been awarded only since 2003, it was originally proposed in 1899 expressly as a counterpart to the Nobel Prizes—and fell through only because of the 1905 separation of Sweden and Norway. We most recently covered the 2019 prize going to Karen Uhlenbeck.
Abel of course is the mathematician who gave his name to abelian groups and much else. About his short life of twenty-seven years, Charles Hermite said:
Abel has left mathematicians enough to keep them busy for five hundred years.
What we note is that Abel’s most famous result is a lower bound type result. He showed the impossibility of solving the general quintic equation in radicals. Finding clever identities, clever inequities, clever algorithms is hard. But not as hard as showing that something cannot be done.
Abel’s result is perhaps like proving today, in the 21st century, a lower bound on the complexity of some concrete problem. Some of Avi’s and also Laci’s best work has been directed toward this quest: Prove some non-trivial lower bound. Avi has worked on numerous approaches, including the fusion method in 1993. Although its lack of breakthroughs has caused jibes about “cold fusion,” Avi noted the following in a recent talk at the Simons Institute:
I will review (and hope to revive) a collection of old works, which suggest obtaining circuit lower bounds by “fusing” many correct computations (of a circuit too small) into an incorrect computation.
In a couple of models, this viewpoint led to (slight) superlinear lower bounds, which nonetheless have not been improved upon for 25 years.
We hope that the Abel Prize award will promote efforts to achieve not-so-slightly super-linear lower bounds on circuit size, running time, algebraic size, quantum effort, anything like that…
We also tip our hat to others who have reported on the Abel Prize, including some of our fellow bloggers: Gil Kalai here, Scott Aaronson here, Lance Fortnow and Bill Gasarch here, plus Kevin Hartnett for Quanta here.