The Department of Computer Science at the University of Illinois Urbana-Champaign invites applications for full-time tenure-track faculty positions at all levels (Assistant Professor, Associate Professor, Full Professor). We particularly encourage applications in quantum computing, but also welcome applications from exceptional candidates in other areas.

The title of this post came from an opinion piece in the Wall Street Journal yesterday on vaccine distribution. Many attempts to get the vaccines to the right groups first have slowed down distribution and sometime even caused vaccines to go to waste. Rules to help spread vaccines across minority groups often backfire. Often when some rules lead to inequity, we try to fix it with more rules when we need less much less. Attempts to distribute vaccines to multiple medical and pharmacy sites have made it difficult to get appointments even if you are eligible.

Randomness is the simplest way to fairness. The movie Contagion got it right, just choose birthdays by picking balls from a bin to distribute the vaccine. Then people can just show up at a few chosen sites with proof of birthday. No need to sign up.

You could argue to add back conditions like age, medical conditions, jobs but that just leads you down the same problematic path. The fastest way to get past this pandemic is to get vaccines into arms. Trust the randomness.

Krea University, an upcoming liberal arts university located near Chennai, India, is looking for dynamic tenure track faculty across disciplines and experience levels in computer science.

The next TCS+ talk will take place this coming Wednesday, March 3th at 1:00 PM Eastern Time (10:00 AM Pacific Time, 19:00 Central European Time, 18:00 UTC). Steve Hanneke from TTIC will speak about “A Theory of Universal Learning” (abstract below).

You can reserve a spot as an individual or a group to join us live by signing up on the online form. Due to security concerns, registration is required to attend the interactive talk. (The recorded talk will also be posted on our website aftwerwards, so people who did not sign up will still be able to watch the talk.) As usual, for more information about the TCS+ online seminar series and the upcoming talks, or to suggest a possible topic or speaker, please see the website.

Abstract: How quickly can a given class of concepts be learned from examples? It is common to measure the performance of a supervised machine learning algorithm by plotting its “learning curve”, that is, the decay of the error rate as a function of the number of training examples. However, the classical theoretical framework for understanding learnability, the PAC model of Vapnik-Chervonenkis and Valiant, does not explain the behavior of learning curves: the distribution-free PAC model of learning can only bound the upper envelope of the learning curves over all possible data distributions. This does not match the practice of machine learning, where the data source is typically fixed in any given scenario, while the learner may choose the number of training examples on the basis of factors such as computational resources and desired accuracy.

In this work, we study an alternative learning model that better captures such practical aspects of machine learning, but still gives rise to a complete theory of the learnable in the spirit of the PAC model. More precisely, we consider the problem of universal learning, which aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution. The main result of this work is a remarkable trichotomy: there are only three possible rates of universal learning. More precisely, we show that the learning curves of any given concept class decay either at an exponential, linear, or arbitrarily slow rates. Moreover, each of these cases is completely characterized by appropriate combinatorial parameters, and we exhibit optimal learning algorithms that achieve the best possible rate in each case.

Joint work with Olivier Bousquet, Shay Moran, Ramon van Handel, and Amir Yehudayoff.

Authors: Chi-Ning Chou, Alexander Golovnev, Madhu Sudan, Santhoshini Velusamy Download:PDF Abstract: A Boolean constraint satisfaction problem (CSP), Max-CSP$(f)$, is a
maximization problem specified by a constraint $f:\{-1,1\}^k\to\{0,1\}$. An
instance of the problem consists of $m$ constraint applications on $n$ Boolean
variables, where each constraint application applies the constraint to $k$
literals chosen from the $n$ variables and their negations. The goal is to
compute the maximum number of constraints that can be satisfied by a Boolean
assignment to the $n$~variables. In the $(\gamma,\beta)$-approximation version
of the problem for parameters $\gamma \geq \beta \in [0,1]$, the goal is to
distinguish instances where at least $\gamma$ fraction of the constraints can
be satisfied from instances where at most $\beta$ fraction of the constraints
can be satisfied.

In this work we completely characterize the approximability of all Boolean
CSPs in the streaming model. Specifically, given $f$, $\gamma$ and $\beta$ we
show that either (1) the $(\gamma,\beta)$-approximation version of Max-CSP$(f)$
has a probabilistic streaming algorithm using $O(\log n)$ space, or (2) for
every $\varepsilon > 0$ the
$(\gamma-\varepsilon,\beta+\varepsilon)$-approximation version of Max-CSP$(f)$
requires $\Omega(\sqrt{n})$ space for probabilistic streaming algorithms.
Previously such a separation was known only for $k=2$. We stress that for
$k=2$, there are only finitely many distinct problems to consider.

Our positive results show wider applicability of bias-based algorithms used
previously by [Guruswami-Velingker-Velusamy APPROX'17], [Chou-Golovnev-Velusamy
FOCS'20] by giving a systematic way to explore biases. Our negative results
combine the Fourier analytic methods of [Kapralov-Khanna-Sudan SODA'15], which
we extend to a wider class of CSPs, with a rich collection of reductions among
communication complexity problems that lie at the heart of the negative
results.

Authors: Yi Li, Honghao Lin, David P. Woodruff Download:PDF Abstract: Sketching is a dimensionality reduction technique where one compresses a
matrix by linear combinations that are typically chosen at random. A line of
work has shown how to sketch the Hessian to speed up each iteration in a second
order method, but such sketches usually depend only on the matrix at hand, and
in a number of cases are even oblivious to the input matrix. One could instead
hope to learn a distribution on sketching matrices that is optimized for the
specific distribution of input matrices. We show how to design learned sketches
for the Hessian in the context of second order methods, where we learn
potentially different sketches for the different iterations of an optimization
procedure. We show empirically that learned sketches, compared with their
"non-learned" counterparts, improve the approximation accuracy for important
problems, including LASSO, SVM, and matrix estimation with nuclear norm
constraints. Several of our schemes can be proven to perform no worse than
their unlearned counterparts. Additionally, we show that a smaller sketching
dimension of the column space of a tall matrix is possible, assuming an oracle
for predicting rows which have a large leverage score.

Authors: Aditya Desai, Benjamin Coleman, Anshumali Shrivastava Download:PDF Abstract: We introduce Density sketches (DS): a succinct online summary of the data
distribution. DS can accurately estimate point wise probability density.
Interestingly, DS also provides a capability to sample unseen novel data from
the underlying data distribution. Thus, analogous to popular generative models,
DS allows us to succinctly replace the real-data in almost all machine learning
pipelines with synthetic examples drawn from the same distribution as the
original data. However, unlike generative models, which do not have any
statistical guarantees, DS leads to theoretically sound asymptotically
converging consistent estimators of the underlying density function. Density
sketches also have many appealing properties making them ideal for large-scale
distributed applications. DS construction is an online algorithm. The sketches
are additive, i.e., the sum of two sketches is the sketch of the combined data.
These properties allow data to be collected from distributed sources,
compressed into a density sketch, efficiently transmitted in the sketch form to
a central server, merged, and re-sampled into a synthetic database for modeling
applications. Thus, density sketches can potentially revolutionize how we
store, communicate, and distribute data.

Authors: Vitaly Feldman, Kunal Talwar Download:PDF Abstract: Locally Differentially Private (LDP) Reports are commonly used for collection
of statistics and machine learning in the federated setting. In many cases the
best known LDP algorithms require sending prohibitively large messages from the
client device to the server (such as when constructing histograms over large
domain or learning a high-dimensional model). This has led to significant
efforts on reducing the communication cost of LDP algorithms.

At the same time LDP reports are known to have relatively little information
about the user's data due to randomization. Several schemes are known that
exploit this fact to design low-communication versions of LDP algorithm but all
of them do so at the expense of a significant loss in utility. Here we
demonstrate a general approach that, under standard cryptographic assumptions,
compresses every efficient LDP algorithm with negligible loss in privacy and
utility guarantees. The practical implication of our result is that in typical
applications the message can be compressed to the size of the server's
pseudo-random generator seed. More generally, we relate the properties of an
LDP randomizer to the power of a pseudo-random generator that suffices for
compressing the LDP randomizer. From this general approach we derive
low-communication algorithms for the problems of frequency estimation and
high-dimensional mean estimation. Our algorithms are simpler and more accurate
than existing low-communication LDP algorithms for these well-studied problems.

Authors: Wei Yao, Junyi Ye, Renita Murimi, Guiling Wang Download:PDF Abstract: Blockchain is a distributed ledger that is decentralized, immutable, and
transparent, which maintains a continuously growing list of transaction records
ordered into blocks. As the core of blockchain, the consensus algorithm is an
agreement to validate the correctness of blockchain transactions. For example,
Bitcoin is a public blockchain where each node in Bitcoin uses the Proof of
Work (PoW) algorithm to reach a consensus by competing to solve a puzzle.
Unlike a public blockchain, a consortium blockchain is an enterprise-level
blockchain that does not contend with the issues of creating a resource-saving
global consensus protocol. This paper highilights several state-of-the art
solutions in consensus algorithms for enterprise blockchain. For example, the
HyperLedger by Linux Foundation includes implementing Practical Byzantine Fault
Tolerance (PBFT) as the consensus algorithm. PBFT can tolerate a range of
malicious nodes and reach consensus with quadratic complexity. Another
consensus algorithm, HotStuff, implemented by Facebook Libra project, has
achieved linear complexity of the authenticator. This paper presents the
operational mechanisms of these and other consensus protocols, and analyzes and
compares their advantages and drawbacks.

Authors: Josh Alman Download:PDF Abstract: For a matrix $M$ and a positive integer $r$, the rank $r$ rigidity of $M$ is
the smallest number of entries of $M$ which one must change to make its rank at
most $r$. There are many known applications of rigidity lower bounds to a
variety of areas in complexity theory, but fewer known applications of rigidity
upper bounds. In this paper, we use rigidity upper bounds to prove new upper
bounds in a few different models of computation. Our results include:

$\bullet$ For any $d> 1$, and over any field $\mathbb{F}$, the $N \times N$
Walsh-Hadamard transform has a depth-$d$ linear circuit of size $O(d \cdot N^{1
+ 0.96/d})$. This circumvents a known lower bound of $\Omega(d \cdot N^{1 +
1/d})$ for circuits with bounded coefficients over $\mathbb{C}$ by Pudl\'ak
(2000), by using coefficients of magnitude polynomial in $N$. Our construction
also generalizes to linear transformations given by a Kronecker power of any
fixed $2 \times 2$ matrix.

$\bullet$ The $N \times N$ Walsh-Hadamard transform has a linear circuit of
size $\leq (1.81 + o(1)) N \log_2 N$, improving on the bound of $\approx 1.88 N
\log_2 N$ which one obtains from the standard fast Walsh-Hadamard transform.

$\bullet$ A new rigidity upper bound, showing that the following classes of
matrices are not rigid enough to prove circuit lower bounds using Valiant's
approach:

$-$ for any field $\mathbb{F}$ and any function $f : \{0,1\}^n \to
\mathbb{F}$, the matrix $V_f \in \mathbb{F}^{2^n \times 2^n}$ given by, for any
$x,y \in \{0,1\}^n$, $V_f[x,y] = f(x \wedge y)$, and

$-$ for any field $\mathbb{F}$ and any fixed-size matrices $M_1, \ldots, M_n
\in \mathbb{F}^{q \times q}$, the Kronecker product $M_1 \otimes M_2 \otimes
\cdots \otimes M_n$.

This generalizes recent results on non-rigidity, using a simpler approach
which avoids needing the polynomial method.

Authors: Tzanis Anevlavis, Matthew Philippe, Daniel Neider, Paulo Tabuada Download:PDF Abstract: While most approaches in formal methods address system correctness, ensuring
robustness has remained a challenge. In this paper we introduce the logic rLTL
which provides a means to formally reason about both correctness and robustness
in system design. Furthermore, we identify a large fragment of rLTL for which
the verification problem can be efficiently solved, i.e., verification can be
done by using an automaton, recognizing the behaviors described by the rLTL
formula $\varphi$, of size at most $\mathcal{O} \left( 3^{ |\varphi|} \right)$,
where $|\varphi|$ is the length of $\varphi$. This result improves upon the
previously known bound of $\mathcal{O}\left(5^{|\varphi|} \right)$ for rLTL
verification and is closer to the LTL bound of $\mathcal{O}\left( 2^{|\varphi|}
\right)$. The usefulness of this fragment is demonstrated by a number of case
studies showing its practical significance in terms of expressiveness, the
ability to describe robustness, and the fine-grained information that rLTL
brings to the process of system verification. Moreover, these advantages come
at a low computational overhead with respect to LTL verification.

Authors: Chaya Keller Download:PDF Abstract: Consider the hypergraph whose vertex set is a family of $n$ lines in general
position in the plane, and whose hyperedges are induced by intersections with a
finite family of pseudo-discs. We prove that the number of 3-sized hyperedges
is bounded by $O(n^2)$ and that this bound is tight. In addition, we show that
the total number of hyperedges is $O(n^3)$.

Authors: Eric Balkanski, Sharon Qian, Yaron Singer Download:PDF Abstract: For many optimization problems in machine learning, finding an optimal
solution is computationally intractable and we seek algorithms that perform
well in practice. Since computational intractability often results from
pathological instances, we look for methods to benchmark the performance of
algorithms against optimal solutions on real-world instances. The main
challenge is that an optimal solution cannot be efficiently computed for
intractable problems, and we therefore often do not know how far a solution is
from being optimal. A major question is therefore how to measure the
performance of an algorithm in comparison to an optimal solution on instances
we encounter in practice.

In this paper, we address this question in the context of submodular
optimization problems. For the canonical problem of submodular maximization
under a cardinality constraint, it is intractable to compute a solution that is
better than a $1-1/e \approx 0.63$ fraction of the optimum. Algorithms like the
celebrated greedy algorithm are guaranteed to achieve this $1-1/e$ bound on any
instance and are used in practice.

Our main contribution is not a new algorithm for submodular maximization but
an analytical method that measures how close an algorithm for submodular
maximization is to optimal on a given problem instance. We use this method to
show that on a wide variety of real-world datasets and objectives, the
approximation of the solution found by greedy goes well beyond $1-1/e$ and is
often at least 0.95. We develop this method using a novel technique that lower
bounds the objective of a dual minimization problem to obtain an upper bound on
the value of an optimal solution to the primal maximization problem.

In this lecture, we move from the world of supervised learning to unsupervised learning, with a focus on generative models. We will

Introduce unsupervised learning and the relevant notations.

Discuss various approaches for generative models, such as PCA, VAE, Flow Models, and GAN.

Discuss theoretical and practical results we currently have for these approaches.

Setup for Unsupervised Learning

In supervised learning, we have data and we want to understand the distribution . For example,

Probability estimation: Given , can we compute/approximate (the probability that is output under )?

Generation: Can we sample from , or from a “nearby” distribution?

Encoding: Can we find a representation such that for , makes it easy to answer semantic questions on ? And such that corresponds to “semantic similarity” of and ?

Prediction: We would like to be able to predict (for example) the second half of from the first half. More generally, we want to solve the conditional generation task, where given some function (e.g., the projection to the first half) and some value , we can sample from the conditional probability distribution .

Our “dream” is to solve all of those by the following setup:

There is an “encoder” that maps into a representation in the latent space, and then a “decoder” that can transform such a representation back into . We would like it to be the case that:

Generation: For , the induced distribution is “nice” and efficiently sampleable (e.g., the standard normal over ) such that we can (approximately) sample from by sampling and outputting .

Density estimation: We would like to be able to evaluate the probability that . For example, if is the inverse of , and we could do so by computing .

Semantic representation: We would like the latent representation to map into meaningful latent space. Ideally, linear directions in this space will correspond to semantic attributes.

Conditional sampling: We would like to be able to do conditional generation, and in particular for some functions and values , be able to sample from the set of ‘s such that

Ideally, if we could map images to the latent variables used to generate them and vice versa (as in the cartoon from the last lecture), then we could achieve these goals:

At the moment, we do not have a single system that can solve all these problems for a natural domain such as images or language, but we have several approaches that achieve part of the dream.

Digressions. Before discussing concrete models, we make three digressions. One will be non-technical, and the other three technical. The three technical digressions are the following:

If we have multiple objectives, we want a way to interpolate between them.

To measure how good our models are, we have to measure distances between statistical distributions.

Once we come up with generating models, we would metrics for measuring how good they are.

Non-technical digression: Is deep learning a cargo cult science? (spoiler: no)

In an influential essay, Richard Feynman coined the term “cargo cult science” for the activities that have superficial similarities to science but do not follow the scientific method. Some of the tools we use in machine learning look suspiciously close to “cargo cult science.” We use the tools of classical learning, but in a setting in which they were not designed to work in and on which we have no guarantees that they will work. For example, we run (stochastic) gradient descent – an algorithm designed to minimize a convex function – to minimize convex loss. We also write use empirical risk minimization – minimizing loss on our training set – in a setting where we have no guarantee that it will not lead to “overfitting.”

And yet, unlike the original cargo cults, in deep learning, “the planes do land”, or at least they often do. When we use a tool in a situation that it was not designed to work in, it can play out in one (or mixture) of the following scenarios:

Murphy’s Law: “Anything that can go wrong will go wrong.” As computer scientists, we are used to this scenario. The natural state of our systems is that they have bugs and errors. There is a reason why software engineering talks about “contracts”, “invariants”, preconditions” and “postconditions”: typically, if we try to use a component in a situation that it wasn’t designed for, it will not turn out well. This is doubly the case in security and cryptography, where people have learned the hard way time and again that Murphy’s law holds sway.

“Marley’s Law”: “Every little thing gonna be alright”. In machine learning, we sometimes see the opposite phenomenon- we use algorithms outside the conditions under which they have been analysed or designed to work in, but they still produce good results. Part of it could be because ML algorithms are already robust to certain errors in their inputs, and their output was only guaranteed to be approximately correct in the first place.

Murphy’s law does occasionally pop up, even in machine learning. We will see examples of both phenomena in this lecture.

Technical digression 1: Optimization with Multiple Objectives

During machine learning, we often have multiple objectives to optimize. For example, we may want both an efficient encoder and an effective decoder, but there is a tradeoff between them.

Suppose we have 2 loss functions and , but there can be a trade off between them. The pareto curve is the set

If a model is above the curve, it is not optimal. If it is below the curve, the model is infeasible.

When the set is convex, we can reach any point on the curve by minimizing . The proof is by the picture above: for any point on the curve, there is a tangent line at that is strictly below the curve. If is the normal vector for this line, then the global minimum of on the feasible set will be . This motivates the common practice of minimizing two introducing a hyperparameter to aggregate two objectives into one.

When is not convex, it may well be that:

Some points on are not minima of

might have multiple minima

Depending on the path one takes, it is possible to get “stuck” in a point that is not a global minima

The following figure demonstrates all three possibilities

Par for the course, this does not stop people in machine learning from using this approach to minimize different objectives, and often “Marley’s Law” holds, and this works fine. But this is not always the case. A nice blog post by Degrave and Kurshonova discusses this issue and why sometimes we do in fact, see “Murphy’s law” when we combine objectives. They also detail some other approaches for combining objectives, but there is no single way that will work in all cases.

Figure from Degrave-Kurshonova demonstrating where the algorithm could reach in the non-convex case depending on initialization and :

Technical digression 2: Distances between probability measures

Suppose we have two distributions over . There are two common ways of measuring the distances between them.

The Total Variance (TV) (also known as statistical distance) between and is equal to

The second equality can be proved by constructing that outputs 1 on where and vice versa. The definition has a crypto-flavored interpretation: For any adversary , the TV measures the advantage they can have over half of determining whether or .

Second, the Kullback–Leibler (KL) Divergence between and is equal to

(The total variation distance is symmetric, in the sense that , but the KL divergence is not. Both have the property that they are non-negative and equal to zero if and only if .)

Unlike the total variation distance, which is bounded between and , the KL divergence can be arbitrarily large and even infinite (though it can be shown using the concavity of log that it is always non-negative). To interpret the KL divergence, it is helpful to separate between the case that is close to zero and the case where it is a large number. If for some , then we would need about samples to distinguish between samples of and samples of . In particular, suppose that we get and we want to distinguish between the case that we they were independently sampled from and the case that they were independently sampled from . A natural (and as it turns out, optimal) approach is to use a likelihood ratio test where we decide the samples came from if . For example, if we set then this approach will guarantee that our “false positive rate” (announcing that samples came from when they really came from ) will be most . Taking logs and using the fact that the probability of these independent samples is the product of probabilities, this amounts to testing whether . When samples come from , the expectation of the righthand side is , so we see that to ensure is larger than we need the number samples to be at least (and as it turns out, this will do).

When the divergence is a large number , we can think of it as the number of bits of “surprise” in as opposed to . For example, in the common case where is obtained by conditioning on some event , will typically be (some fine print applies). In general, if is obtained from by revealing bits of information (i.e., by conditioning on a random variable whose mutual information with is ) then .

Generalizations: The total variation distance is a special case of metrics of the form . These are known as integral probability metrics and include examples such as the Wasserstein distance, Dudley metric, and Maximum Mean Discrepancy. KL divergence is a special case of divergence measures known as -divergence, which are measures of the form . The KL divergence is obtained by setting . (In fact even the TV distance is a special case of divergence by setting .)

Normal distributions: It is a useful exercise to calculate the TV and KL distances for normal random variables. If and , then since most probability mass in the regime where , (i.e., up to some multiplicative constant). For KL divergence, if we selected from a normal between and then with probability about half we’ll have and with probability about half we will have . By selecting , we increase probability of the former to and the decrease the probability of the latter to . So we have bias towards ‘s where , or . Hence . The above generalizes to higher dimensions. If is a -variate normal, and for , then (for small ) while .

If and is a “narrow normal” of the form then their TV distance is close to while . In the dimensional case, if and for some covariance matrix , then . The two last terms are often less significant. For example if then .

Given a distribution of natural data and a purported generative model , how do we measure the quality of ?

A natural measure is the KL divergence but it can be hard to evaluate, since it involves the term which we cannot evaluate. However, we can rewrite the KL divergence as . The term is equal to where is the entropy of . The term is known as the cross entropy of and . Note that the cross-entropy of and is simply the expectation of the negative log likelihood of for sampled from .

When is fixed, minimizing corresponds to minimizing the cross entropy or equivalently, maximizing the log likelihood. This is useful since often is the case that we can sample elements from (e.g., natural images) but can only evaluate the probability function for . Hence a common metric in such cases is minimizing the cross-entropy / negative log likelihood . For images, a common metric is “bits per pixel” which simply equals where is the length of . Another metric (often used in natural language processing) is perplexity, which interchanges the expectation and the logarithm. The logarithm of the perplexity of is where is the length of (e.g., in tokens). Another way to write this is that log of the perplexity is the average of where is the probability of under conditioned on the first parts of .

Memorization for log-likelihood. The issue of “overfitting” is even more problematic for generative models than for classifiers. Given samples and enough parameters, we can easily come up with a model corresponding to the uniform distribution . This is obviously a useless model that will never generate new examples. However, this model will not only get a large log likelihood value on the training set, in fact, it will get even better log likelihood than the true distribution! For example, any reasonable natural distribution on images would have at least tens of millions, if not billions or trillions of potential images. In contrast, a typical training set might have fewer than 1M samples. Hence, unlike in the classification setting, for generation, the “overfitting” model will not only match but can, in fact, beat the ground truth. (This is reminiscent of the following quote from Peter and Wendy: “Not a sound is to be heard, save when they give vent to a wonderful imitation of the lonely call of the coyote. The cry is answered by other braves; and some of them do it even better than the coyotes, who are not very good at it.”)

If we cannot compute the density function, then benchmarking becomes more difficult. What often happens in practice is an “I know it when I see it” approach. The paper includes a few pictures generated by the model, and if the pictures look realistic, we think it is a good model. However, this can be deceiving. After all, we are feeding in good pictures into the model, so generating a good photo may not be particularly hard (e.g. the model might memorize some good pictures and use those as outputs).

There is another metric called the inception score, which loosely corresponds to how similar the “inception” neural network finds the GAN model to ImageNet (in the sense that inception thinks it covers many of the ImageNet classes and that produces images on which inception has high confidence) but it too has its problems. Ravuri-Vinyalis 2019 used a GAN model with a good inception score used its outputs to train a different model on ImageNet. Despite the high inception score (which should have indicated that the GANs output are as good as ImageNets) the accuracy when training on the GAN output dropped from the original value of to as low as ! (Even in the best case, accuracy dropped by at least 30 points.) Compare this with the 11-14% drop when we train on ImageNet and test on ImageNet v2.

This figure from Goodfellow’s tutorial describes generative models where we know and don’t know how to compute the density function:

Auto Encoder / Decoder

We now shift our attention to the encoder/decoder architecture mentioned above.

Recall that we want to understand , generate new elements , and find a good representation of the elements. Our dream is to solve all of the issues with auto encoder/decoder, whose setup is as follows:

That is, we want , such that

The representation enables us to solve tasks such as generation, classification, etc..

To each the first point, we can aim to minimize . However, we can of course, make this loss zero by letting and be the identity function. Much of the framework of generative models can be considered as placing some restrictions on the “communication channel” that rule out this trivial approach, with the hope that would require the encoder and decoder to “intelligently” correspond to the structure of the natural data.

Auto Encoders: noiseless short

A natural idea is to simply restrict the dimension of the latent space to be small (). In principle, the optimal compression scheme for a probability distribution will require knowing the distribution. Moreover, the optimal compression will maximize the entropy of the latent data . Since the maximum entropy distribution is uniform (in the discrete case), we could easily sample from it. (In the continuous setting, the standard normal distribution plays the role of the uniform distribution.)

For starter, consider the case of picking to be small and minimizing for linear , . Since is a rank matrix, we can write this as finding a rank matrix that minimizes where is our input data. It can be shown that that would minimize this will be the projection to the top eigenvectors of which exactly corresponds to Principal Component Analysis (PCA).

In the nonlinear case, we can obtain better compression. However, we do not achieve our other goals:

It is not the case that we can generate realistic data by sampling uniform/normal and output

It is not the case that semantic similarity between and corresponds to large dot product between and .

It seems that model just rediscovers a compression algorithm like JPEG. We do not expect the JPEG encoding of an image to be semantically informative, and JPEG decoding of a random file will not be a good way to generate realistic images. It turns out that sometimes “Murphy’s law” does hold and if it’s possible to minimize the loss in a not very useful way then that will indeed be the case.

Variational Auto Encoder (VAE)

We now discuss variational auto encoders (VAEs). We can think of these as generalization auto-encoders to the case where the channel has some Gaussian noise. We will describe VAEs in two nearly equivalent ways:

We can think of VAEs as trying to optimize two objectives: both the auto-encoder objective of minimizing and another objective of minimizing the KL divergence between and the standard normal distribution .

We can think of VAEs as trying to maximize a proxy for the log-likelihood. This proxy is a quantity known as the “Evidence Lower Bound (ELBO)” which we can evaluate using and and is always smaller or equal to the log-likelihood.

We start with the first description. One view of VAEs is that we search for a pair of encoder and decoder that are aimed at minimizing the following two objectives:

(standard AE objective)

(distance of latent from the standard normal)

To make the second term a function of , we consider as a probability distribution with respect to a fixed . To ensure this makes sense, we need to make randomized. A randomized Neural network has “sampling neurons” that take no input, have parameters and produce an element . We can train such a network by fixing a random and defining the neuron to simply output .

ELBO derivation: Another view of VAEs is that they aim at maximizing a term known as the evidence lower bound or ELBO. We start by deriving this bound. Let be the standard normal distribution over the latent space. Define to be the distribution of conditioned on decoding to (i.e., , and define be the distribution . Since , we know that

By the definition of , . Hence we can derive that

(since depends only on , given that .)

Rearranging, we see that

or in other words, we have the following theorem:

Theorem (ELBO): For every (possibly randomized) maps and , distribution over and ,

The left-hand side of this inequality is simply the log-likelihood of . The right-hand side (which, as the inequality shows, is always smaller or equal to it) is known as the evidence lower bound or ELBO. We can think of VAEs as trying to maximize the ELBO.

The reason that the two views are roughly equivalent is the follows:

The first term of the ELBO, known as the reconstruction term, is if we assume some normal noise, then the probabiility taht will be proportional to since for , we get that and hence maximizing this term corresponds to minimizing the square distance.

The second term of the ELBO, known as the divergence term, is which is roughly equal to , where is the dimension of the latent space. Hence maximizing this term corresponds to minimizing the KL divergence between and the standard normal distribution.

How well does VAE work? First of all, we can actually generate images using them. We also find that similar inputs will have similar encodings, which is good. However, sometimes VAEs can still “cheat” (as in auto encoders). There is a risk that the learned model will split to two parts of the form . The first part of the data is there to minimize divergence, while the second part is there for reconstruction. Such a model is similarly uninformative.

However, VAEs have found practical success. For example, Hou et. al 2016 used VAE to create an encoding where two dimensions seem to correspond to “sunglasses” and “blondness”, as illustrated below. We do note that “sunglasses” and “blondness” are somewhere between “semantic” and “syntactic” attributes. They do correspond to relatively local changes in “pixel space”.

The picture can be blurry because of the noise we injected to make random. However, recent models have used new techniques (e.g. vector quantized VAE and hierarchical VAE) to resolve the blurriness and significantly improve on state of art.

Flow Models

In a flow model, we flip the order of and and set (so must be invertible). The input to will come from the standard normal distribution . The idea is that we obtain by a composition of simple invertible functions. We use the fact that if we can compute the density function of a distribution over and is invertible and differentiable, then we can compute the density function of (i.e., the distribution obtained by sampling and outputting ). To see why this is the case, consider the setting when and a small rectangle . If is small enough, will be roughly linear and hence will map into a parallelogram . Shifting the coordinate by corresponds to shifting the output of by the vector and shifting the coordinate by corresponds to shifting the output of by the vector . For every , the density of under will be proportional to the density of with the proportionality fector being .

Overall we the density of under will equal times the inverse determinant of the Jacobian of at the point

There are different ways to compose together simple reversible functions to compute a complex one. Indeed, this issue also arises in cryptography and quantum computing (e.g., the Fiestel cipher). Using similar ideas, it is not hard to show that any probability distribution can be approximated by a (sufficiently big) combination of simple reversible functions.

In practice, we have some recent succcessful flow models. A few examples of these models are in the lecture slides.

Giving up on the dream

In section 2, we had a dream of doing both representation and generation at once. So far, we have not been able to find success with these models. What if we do each goal separately?

The tasks of representation becomes self-supervised learning with approaches such SIMCLR. The task of generation can be solved by GANs. Both areas have had recent success.

Open-AI CLIP and DALL-E is a pair of models that perform each part of these tasks well, and suggest an approach to merge them. CLIP does representation for both texts and images where the two encoders are aligned, i.e. is large. DALL-E, given some text, generates an image corresponding to the text. Below are images generated by DALL-E when asked for an armchair in the shape of an avocado.

Contrastive learning

The general approach used in CLIP is called contrastive learning.

Suppose we have some representation function and inputs which represent similar objects. Let , then we want to be large when , but small when . So, let the loss function be How do we create similar ? In SIMCLR, are augmentations of the same image . In CLIP, is an image and a text that describes it.

CLIPs representation space does seem to have nice properties such as correspondence between semantic attributes and linear directions, which enables doing some “semantic linear algebra” on representations: (see this based on Vladimir Hatlakov’s code – in the snippet below tenc maps text to its encoding/representation and get_img finds nearest image to representation in a the unsplash dataset):

GANs

The theory of GANs is currently not well-developed. As an objective, we want images that “look real” (which is not well defined), and we have no posterior distribution. If we just define the distribution based on real images, our GAN might memorize the photos to beat us.

However, we know that Neural Networks are good at discriminating real vs. fake images. So, we add in a discriminator and define the loss function

The generator model and discriminator model form a 2-player game, which are often harder to train and very delicate. We typically train by changing a player’s action to the best response. However, we need to be careful if the two players have very different skill levels. They may be stuck in a setting where no change of strategies will make much difference, since the stronger player always dominates the weaker one. In particular in GANs we need to ensure that the generator is not cheating by using a degenerate distribution that still succeeds with respect to the discriminator.

If a 2-player model makes training more difficult, why do we use it? If we fix the discriminator, then the generator can find a picture that the discriminator thinks is real and only output that one, obtaining low loss. As a result, the discriminator needs to update along with the generator. This example also highlights that the discriminator’s job is often harder. To fix this, we have to somehow require the generator to give us good entropy.

Finally, how good are GANs in practice? Recently, we have had GANs that make great images as well as audios. For example, modern deepfake techniques often use GANs in their architecture. However, it is still unclear how rich the images are.

George Cantor has been featured here and here and here before on GLL. Of course, he invented modern set theory and changed math forever. His birthday is soon, so we thought we would talk about him now—he was born on March 3rd in 1845.

Today we thought it might be fun to have a quiz on math quotes.

Wait. Cantor did not invent quotation marks, nor is he known for many quotes. He does of course have many famous results, and they will live forever. But his results were subject to immediate horrible criticism and therefore memorable quotes.

Leopold Kronecker was a particular source of barbs. For example: “What good is your beautiful proof on the transcendence of ? Why investigate such problems, given that irrational numbers do not even exist?”

As a complexity theorist I must say that Kronecker has a point when he also said:

“Definitions must contain the means of reaching a decision in a finite number of steps, and existence proofs must be conducted so that the quantity in question can be calculated with any degree of accuracy.”

David Hilbert defended Cantor and said: “No one shall expel us from the paradise that Cantor has created.”

On to the quiz. Each quote is followed by two possible authors in alphabetical order. You should pick the one you think is correct. The players are:

1. Douglas Adams 2. Bernard Baruch 3. Eric Temple Bell 4. Raoul Bott
5. Paul Erdős 6. Richard Hamming 7. Godfrey Hardy 8. David Hilbert
9. Admiral Grace Hooper 10. Alan Kay 11. Donald Knuth 12. John von Neumann
13. Alan Perlis 14. Henri Poincaré 15. Srinivasa Ramanujan 16. Marcus du Sautoy
17. Raymond Smullyan 18. Alan Turing 19. Moshe Vardi 20. Andrew Wiles

Those who can imagine anything, can create the impossible.
—Kay Turing

I really didn’t foresee the Internet. But then, neither did the computer industry. Not that that tells us very much of course–the computer industry didn’t even foresee that the century was going to end.
— Adams Knuth

One man’s constant is another man’s variable.
—Perlis du Sautoy

The most damaging phrase in the language is: “It’s always been done that way.”
—-Hopper Perlis

The best way to predict the future is to invent it.
—Kay Turing

The purpose of computing is insight, not numbers.
Adams Hamming

Beware of bugs in the above code; I have only proved it correct, not tried it.
—Knuth Vardi

No, it is a very interesting number, it is the smallest number expressible as a sum of two cubes in two different ways.
—Bell Ramanujan

Beauty is the first test: there is no permanent place in the world for ugly mathematics.
—Erdős Hardy

Mathematics is the art of giving the same name to different things.
—Hooper Poincaré

There’s no sense in being precise when you don’t even know what you’re talking about.
—Bott von Neumann

I hope we’ll be able to solve these problems before we leave.
—Erdős Perlis

Some people are always critical of vague statements. I tend rather to be critical of precise statements; they are the only ones which can correctly be labeled ‘wrong’.
—Knuth Smullyan

Everything that humans can do a machine can do.
—Perlis Vardi

“Obvious” is the most dangerous word in mathematics.
— Bell Hooper

Just because we can’t find a solution, it doesn’t mean there isn’t one.
— Adams Wiles

Mathematics is a place where you can do things which you can’t do in the real world.
— du Sautoy Turing

Millions saw the apple fall, but Newton asked why.
— Baruch Hopper

The definition of a good mathematical problem is the mathematics it generates rather than the problem itself.
— Hilbert Wiles

There are two ways to do great mathematics. The first is to be smarter than everybody else. The second way is to be stupider than everybody else – but persistent.
— Bott Knuth

Open Problems

“I always have a quotation for everything—it saves original thinking.”
—Dorothy Sayers

We give an almost quadratic $n^{2-o(1)}$ lower bound on the space consumption of any $o(\sqrt{\log n})$-pass streaming algorithm solving the (directed) $s$-$t$ reachability problem. This means that any such algorithm must essentially store the entire graph. As corollaries, we obtain almost quadratic space lower bounds for additional fundamental problems, including maximum matching, shortest path, matrix rank, and linear programming.
Our main technical contribution is the definition and construction of set hiding graphs, that may be of independent interest: we give a general way of encoding a set $S \subseteq [k]$ as a directed graph with $n = k^{ 1 + o( 1 ) }$ vertices, such that deciding whether $i \in S$ boils down to deciding if $t_i$ is reachable from $s_i$, for a specific pair of vertices $(s_i,t_i)$ in the graph. Furthermore, we prove that our graph ``hides'' $S$, in the sense that no low-space streaming algorithm with a small number of passes can learn (almost) anything about $S$.

A month ago, UT Austin changed its email policies—banning auto-forwarding from university accounts to Gmail accounts, apparently as a way to force the faculty and other employees to separate their work email from their personal email, and thereby comply with various government regulations. Ever since that change, the email part of my life has been a total, unmitigated disaster. I’ve missed (or been late to see) dozens of important work emails, with the only silver lining being that that’s arguably UT’s problem more than it is mine!

And yes, I’ve already gone to technical support; the only answer I’ve gotten is that (in so many words) there is no answer. Other UT faculty are somehow able to deal with this because they are them; I am unable to deal with it because I am me. As a mere PhD in computer science, I’m utterly unqualified to set up a technical fix for this sort of problem.

So the bottom line is: from now on, if you want me to see an email, send it to scott@scottaaronson.com. Really. If you try sending it to aaronson@cs.utexas.edu, it will land in a separate inbox that I can access only with great inconvenience. And if, God forbid, you try sending it to aaronson@utexas.edu, the email will bounce and I’ll never see it at all. Indeed, a central purpose of this post is just to have a place to point the people who contact me every day, shocked that their emails to me bounced.

This whole episode has given me immense sympathy for Hillary Clinton, and for the factors that led her to set up clintonemail.com from her house. It’s not merely that her private email server was a laughably trivial reason to end the United States’ 240-year run of democratic government. Rather it’s that, even on the narrow question of emails, I now feel certain that Hillary was 100% right. Bureaucracy that impedes communication is a cancer on human civilization.

Update: Thanks so much to commenter Avraham and to my colleague Etienne Vouga, who quickly gave me the crucial information that tech support would not, and thereby let me solve this problem. I can once again easily read emails sent to aaronson@cs.utexas.edu … well, at least for now! I’m now checking about aaronson@utexas.edu. Again, though, scott@scottaaronson.com to be safe.

Promise Constraint Satisfaction Problems (PCSPs) are a generalization of Constraint Satisfaction Problems (CSPs) where each predicate has a strong and a weak form and given a CSP instance, the objective is to distinguish if the strong form can be satisfied vs. even the weak form cannot be satisfied. Since their formal introduction by Austrin, Guruswami, and Håstad, there has been a flurry of works on PCSPs, including recent breakthroughs in approximate graph coloring. The key tool in studying PCSPs is the algebraic framework developed in the context of CSPs where the closure properties of the satisfying solutions known as *polymorphisms* are analyzed.
The polymorphisms of PCSPs are significantly richer than CSPs---this is illustrated by the fact that even in the Boolean case, we still do not know if there exists a dichotomy result for PCSPs analogous to Schaefer's dichotomy result for CSPs. In this paper, we study a special case of Boolean PCSPs, namely Boolean *Ordered* PCSPs where the Boolean PCSPs have the predicate $x \leq y$. In the algebraic framework, this is the special case of Boolean PCSPs when the polymorphisms are *monotone* functions. We prove that Boolean Ordered PCSPs exhibit a computational dichotomy assuming the Rich $2$-to-$1$ Conjecture due to Braverman, Khot, and Minzer, which is a perfect completeness surrogate of the Unique Games Conjecture.
In particular, assuming the Rich $2$-to-$1$ Conjecture, we prove that a Boolean Ordered PCSP can be solved in polynomial time if for every $\epsilon >0$, it has polymorphisms where each coordinate has *Shapley value* at most $\epsilon$, else it is NP-hard. The algorithmic part of our dichotomy result is based on a structural lemma showing that Boolean monotone functions with each coordinate having low Shapley value have arbitrarily large threshold functions as minors. The hardness part proceeds by showing that the Shapley value is consistent under a uniformly random $2$-to-$1$ minor. As a structural result of independent interest, we construct an example to show that the Shapley value can be inconsistent under an adversarial $2$-to-$1$ minor.

In my post on Alex Trebek, see here, I noted that Jeopardy! is not a good name for the game show since it doesn't tell you much about the show. Perhaps Answers and Questions is a better name.

The following game shows have names that tell you something about the game and hence have better names:

Wheel of Fortune, The Price is Right, Lets make a Deal, Beautiful women have suitcases full of money (the original name for Deal-No Deal), Win Ben Stein's Money, Beat the Geeks.

In Math we often name a concept after a person. While this may be a good way to honor someone, the name does not tell us much about the concept and it leads to statements like:

A Calabi-Yau manifold is a compact complex Kahler manifold with a trivial first Chern class.

A Kahler manifold is a Hermitian manifold for which the Hermitian form is closed.

A Hermitian manifold is the complex analog of the Riemann manifold.

(These examples are from an article I will point to later---I do not understand any of these terms, though I once knew what a Riemann manifold was. I heard the term Kahler Manifold in the song Bohemian Gravity. It's at about the 4 minute 30 second place.)

While I am amused by the name Victoria Delfino Problems (probably the only realtor who has problems in math named after her, see my post here) it's not a descriptive way to name open problems in descriptive set theory.

Sometimes a name becomes SO connected to a concept that it IS descriptive, e.g.:

The first proof of VDW's theorem yields ACKERMAN-LIKE bounds.

but you cannot count on that happening AND it is only descriptive to people already somewhat in the field.

What to do? This article makes the ballian point that we should STOP DOING THIS and that the person who first proves the theorem should name it in a way that tells you something about the concept. I would agree. But this can still be hard to really do.

In my book on Muffin Mathematics (see here) I have a sequence of methods called

There was one more method that I didn't quite name, but I used the phrase `Scott Muffin Problem' to honors Scott Huddleton who came up with the method, in my description of it.

All but the last concept were given ballian names. Even so, you would need to read the book to see why the names make sense. Still, that would be easier than trying to figure out what a Calabi-Yau manifold is.

Sergei Voronin was an expert in number theory, who studied the Riemann zeta function, but who sadly died young over twenty years ago. We discussed his amazing 1975 result about the Riemann zeta function here. Others call the result the amazing theorem. I (Dick) am getting old—I almost forgot that we did a post on his theorem again over four years ago.

Today I thought we would recall his theorem, sketch why the theorem is true, and then discuss some extensions.

Starting with Alan Turing we have been interested in universal objects. Turing famously proved that there are universal machines: these can simulate any other machine on any input. Martin Davis has an entire book on this subject.

Universal objects are basic to complexity theory. Besides Turing’s notion, a universal property is key to the definition of NP-complete. A set in NP is NP-complete provided all other sets in NP can be reduced to in polynomial time. Michael Nielsen once began a discussion of universality in this amusing fashion:

Imagine you’re shopping for a new car, and the salesperson says, “Did you know, this car doesn’t just drive on the road.” “Oh?” you reply. “Yeah, you can also use it to do other things. For instance, it folds up to make a pretty good bicycle. And it folds out to make a first-rate airplane. Oh, and when submerged it works as a submarine. And it’s a spaceship too!”

Voronin’s Insight

In 1975 Voronin had the brilliant insight that the Riemann zeta function has an interesting universality property. Roughly speaking, it says that a wide class of analytic functions can be approximated by shifts with real . Recall

for , and it has an analytic extension for all other values but .

The intense interest in the function started in 1859 with Bernhard Riemann’s breakthrough article. This was the first statement of what we call the Riemann Hypothesis (RH).

In over a century of research on RH before Voronin’s theorem, many identities, many results, many theorems were proved about the zeta function. But none saw that the function was universal before Voronin. Given the zeta function’s importance in understanding the structure of prime numbers this seems to be surprising.

Before we define the universal property I thought it might be useful to state a related property that the function has:

Theorem 1 Suppose that is a polynomial so that for all ,

Then is identically zero.

Since is a single variable, this says that and its derivatives and do not satisfy any polynomial relationship. This means intuitively that must be hypertranscendental. Let’s now make this formal.

Theorem 2 Let . Let be an analytic function that never is zero for . Then for any there is a real so that

See the paper “Zeroes of the Riemann zeta-function and its universality,” by Ramunas Garunkstis, Antanas Laurincikas, and Renata Macaitiene, for a detailed modern discussion of his theorem.

Note that the theorem is not constructive. However, the values of that work have a positive density—there are lots of them. Also note the restriction that is never zero is critical. Otherwise one would be able to show that the Riemann Hypothesis is false. In 2003, Garunkstis et al. did prove a constructive version, in a paper titled, “Effective Uniform Approximation By The Riemann Zeta-Function.”

Voronin’s Proof

The key insight is to combine two properties of the zeta function: The usual definition with the Euler product. Recall the Riemann zeta-function has an Euler product expression

where runs over prime numbers. This is valid only in the region , but it makes sense in a approximate sense in the critical strip:

Then take logarithms and since are linearly independent over , we can apply the Kronecker approximation theorem to obtain that any target function can be approximated by the above finite truncation. This is the basic structure of the proof.

Open Problems

Voronin’s insight was immediately interesting to number theorists. Many found new methods for proving universality and for extending it to other functions. Some methods work for all zeta-functions defined by Euler products. See this survey by Kohji Matsumoto and a recent paper
by Hafedh Herichi and Michel Lapidus, the latter titled “Quantized Number Theory, Fractal Strings and the Riemann Hypothesis: From Spectral Operators to Phase Transitions and Universality.”

Perhaps the most interesting question is:

Can universality be used to finally unravel the RH?

See Paul Gauthier’s 2014 IAS talk, “Universality and the Riemann Hypothesis,” for some ideas.

Sergey Kitaev just shared with me an article he wrote with Anthony Mendes in Jeff Remmel's memory. Jeff Remmel was a distinguished mathematician with a very successful career in both logic and combinatorics.

The short biography at the start of the article paints a vivid picture of Jeff Remmel's personality, and will be of interest and inspiration to many readers. His hiring as "an Assistant Professor in the Department of Mathematics at UC San Diego at age 25, without officially finishing his Ph.D. and without having published a single paper" was, in Jeff Remmel's own words, a "fluke that will never happen again."

I had the pleasure of making Jeff Remmel's acquaintance when he visited Sergey in Reykjavik and thoroughly enjoyed talking to him about a variety of subjects. He was truly a larger-than-life academic.

An $(n,r,h,a,q)$-Local Reconstruction Code is a linear code over $\mathbb{F}_q$ of length $n$, whose codeword symbols are partitioned into $n/r$ local groups each of size $r$. Each local group satisfies `$a$' local parity checks to recover from `$a$' erasures in that local group and there are further $h$ global parity checks to provide fault tolerance from more global erasure patterns. Such an LRC is Maximally Recoverable (MR), if it offers the best blend of locality and global erasure resilience---namely it can correct all erasure patterns whose recovery is information-theoretically feasible given the locality structure (these are precisely patterns with up to `$a$' erasures in each local group and an additional $h$ erasures anywhere in the codeword).
Random constructions can easily show the existence of MR LRCs over very large fields, but a major algebraic challenge is to construct MR LRCs, or even show their existence, over smaller fields, as well as understand inherent lower bounds on their field size. We give an explicit construction of $(n,r,h,a,q)$-MR LRCs with field size $q$ bounded by $\left(O\left(\max\{r,n/r\}\right)\right)^{\min\{h,r-a\}}$. This improves upon known constructions in many relevant parameter ranges. Moreover, it matches the lower bound from Gopi et al. (2020) in an interesting range of parameters where $r=\Theta(\sqrt{n})$, $r-a=\Theta(\sqrt{n})$ and $h$ is a fixed constant with $h\le a+2$, achieving the optimal field size of $\Theta_{h}(n^{h/2}).$
Our construction is based on the theory of skew polynomials. We believe skew polynomials should have further applications in coding and complexity theory; as a small illustration we show how to capture algebraic results underlying list decoding folded Reed-Solomon and multiplicity codes in a unified way within this theory.

We show that computing the majority of $n$ copies of a boolean function $g$ has randomised query complexity $\mathrm{R}(\mathrm{Maj} \circ g^n) = \Theta(n\cdot \bar{\mathrm{R}}_{1/n}(g))$. In fact, we show that to obtain a similar result for any composed function $f\circ g^n$, it suffices to prove a sufficiently strong form of the result only in the special case $g=\mathrm{GapOr}$.

Proving circuit lower bounds has been an important but extremely hard problem for decades. Although one may show that almost every function $f:\mathbb{F}_2^n\to\mathbb{F}_2$ requires circuit of size $\Omega(2^n/n)$ by a simple counting argument, it remains unknown whether there is an explicit function (for example, a function in $NP$) not computable by circuits of size $10n$. In fact, a $3n-o(n)$ explicit lower bound by Blum (TCS, 1984) was unbeaten for over 30 years until a recent breakthrough by Find et al. (FOCS, 2016), which proved a $(3+\frac{1}{86})n-o(n)$ lower bound for affine dispersers, a class of functions known to be constructible in $P$.
In this paper, we prove a stronger lower bound $3.1n - o(n)$ for affine dispersers. To get this result, we strengthen the gate elimination approach for $(3+\frac{1}{86})n$ lower bound, by a more sophisticated case analysis that significantly decreases the number of bottleneck structures introduced during the elimination procedure. Intuitively, our improvement relies on three observations: adjacent bottleneck structures becomes less troubled; the gates eliminated are usually connected; and the hardest cases during gate elimination have nice local properties to prevent the introduction of new bottleneck structures.

We prove logarithmic depth lower bounds in Stabbing Planes for the classes of combinatorial principles known as the Pigeonhole principle and the Tseitin contradictions. The depth lower bounds are new, obtained by giving almost linear length lower bounds which do not depend on the bit-size of the inequalities and in the case of the Pigeonhole principle are tight.
The technique known so far to prove depth lower bounds for Stabbing Planes is a generalization of that used for the Cutting Planes proof system. In this work we introduce two new approaches to prove length/depth lower bounds in Stabbing Planes: one relying on Sperner's Theorem which works for the Pigeonhole principle and Tseitin contradictions over the complete graph; a second proving the lower bound for Tseitin contradictions over a grid graph, which uses a result on essential coverings of the boolean cube by linear polynomials, which in turn relies on Alon's combinatorial Nullenstellensatz

We study the complexity of proving that a sparse random regular graph on an odd number of vertices does not have a perfect matching, and related problems involving each vertex being matched some pre-specified number of times. We show that this requires proofs of degree $\Omega(n/\log n)$ in the Polynomial Calculus (over fields of characteristic $\ne 2$) and Sum-of-Squares proof systems, and exponential size in the bounded-depth Frege proof system. This resolves a question by Razborov asking whether the Lovász-Schrijver proof system requires $n^\delta$ rounds to refute these formulas for some $\delta > 0$. The results are obtained by a worst-case to average-case reduction of these formulas relying on a topological embedding theorem which may be of independent interest.

Weighted pseudorandom generators (WPRGs), introduced by Braverman, Cohen and Garg [BCG20], is a generalization of pseudorandom generators (PRGs) in which arbitrary real weights are considered rather than a probability mass. Braverman et al. constructed WPRGs against read once branching programs (ROBPs) with near-optimal dependence on the error parameter. Chattopadhyay and Liao [CL20] somewhat simplified the technically involved BCG construction, also obtaining some improvement in parameters.
In this work we devise an error reduction procedure for PRGs against ROBPs. More precisely, our procedure transforms any PRG against length n width w ROBP with error 1/poly(n) having seed length s to a WPRG with seed length s + O(log(w/?)loglog(1/?)). By instantiating our procedure with Nisan’s PRG [Nis92] we obtain a WPRG with seed length O(log(n)log(nw) + log(w/?)loglog(1/?)). This improves upon [BCG20] and is incomparable with [CL20].
Our construction is significantly simpler on the technical side and is conceptually cleaner. Another advantage of our construction is its low space complexity O(log nw)+ poly(loglog(1/?)) which is logarithmic in n for interesting values of the error parameter ?. Previous constructions (like [BCG20, CL20]) specify the seed length but not the space complexity, though it is plausible they can also achieve such (or close) space complexity.

A recent paper of Braverman, Cohen, and Garg (STOC 2018) introduced the concept of a pseudorandom pseudodistribution generator (PRPG), which amounts to a pseudorandom generator (PRG) whose outputs are accompanied with real coefficients that scale the acceptance probabilities of any potential distinguisher. They gave an explicit construction of PRPGs for ordered branching programs whose seed length has a better dependence on the error parameter $\epsilon$ than the classic PRG construction of Nisan (STOC 1990 and Combinatorica 1992).
In this work, we give an explicit construction of PRPGs that achieve parameters that are impossible to achieve by a PRG. In particular, we construct a PRPG for ordered permutation branching programs of unbounded width with a single accept state that has seed length $\tilde{O}(\log^{3/2} n)$ for error parameter $\epsilon=1/\text{poly}(n)$, where $n$ is the input length. In contrast, recent work of Hoza et al. (ITCS 2021) shows that any PRG for this model requires seed length $\Omega(\log^2 n)$ to achieve error $\epsilon=1/\text{poly}(n)$.
As a corollary, we obtain explicit PRPGs with seed length $\tilde{O}(\log^{3/2} n)$ and error $\epsilon=1/\text{poly}(n)$ for ordered permutation branching programs of width $w=\text{poly}(n)$ with an arbitrary number of accept states. Previously, seed length $o(\log^2 n)$ was only known when both the width and the reciprocal of the error are subpolynomial, i.e. $w=n^{o(1)}$ and $\epsilon=1/n^{o(1)}$ (Braverman, Rao, Raz, Yehudayoff, FOCS 2010 and SICOMP 2014).
The starting point for our results are the recent space-efficient algorithms for estimating random-walk probabilities in directed graphs by Ahmadenijad, Kelner, Murtagh, Peebles, Sidford, and Vadhan (FOCS 2020), which are based on spectral graph theory and space-efficient Laplacian solvers. We interpret these algorithms as giving PRPGs with large seed length, which we then derandomize to obtain our results. We also note that this approach gives a simpler proof of the original result of Braverman, Cohen, and Garg, as independently discovered by Cohen, Doron, Renard, Sberlo, and Ta-Shma (personal communication, January 2021).

We study monotone branching programs, wherein the states at each time step can be ordered so that edges with the same labels never cross each other. Equivalently, for each fixed input, the transition functions are a monotone function of the state.
We prove that constant-width monotone branching programs of polynomial size are equivalent in power to $AC^{0}$ circuits. This complements the celebrated theorem of Barrington, which states that constant-width branching programs, without the monotonicity constraint, are equivalent in power to $NC^{1}$ circuits.
Next we turn to read-once monotone branching programs of constant width, which we note are strictly more powerful than read-once $AC^0$. Our main result is an explicit pseudorandom generator that $\varepsilon$-fools length $n$ programs with seed length $\widetilde{O}(\log(n/\varepsilon))$. This extends the families of constant-width read-once branching programs for which we have an explicit pseudorandom generator with near-logarithmic seed length.
Our pseudorandom generator construction follows Ajtai and Wigderson's approach of iterated pseudorandom restrictions [AW89,GMRTV12]. We give a randomness-efficient width-reduction process which allows us to simplify the branching program after only $O(\log\log n)$ independent applications of the Forbes--Kelley pseudorandom restrictions [FK18].

A student in my graph algorithms class asked how self-loops in undirected graphs affect the vertex degrees and matchings of a graph. The standard answer is that a self-loop adds two to the degree (because each edge has two endpoints) and that they are useless in matching because matchings should have at most one incidence to each vertex, not two. But that’s just a convention; one could reasonably declare that the contribution of a self-loop to the degree is one, and I’m pretty sure I’ve seen sources that do just that. With that alternative convention, it should be possible to include a self-loop in a matching, and use it to match only a single vertex.

However, this turns out not to make much difference to many matching problems, because the following simple transformation turns a problem with self-loops (allowed in matchings in this way) into a problem with no self-loops (so it doesn’t matter whether they are allowed or not). Simply form a double cover\(^*\) of the given graph (let’s call it the “loopless double cover”) by making two copies of the graph and replacing all corresponding pairs of loops by simple edges from one copy to the other. In weighted matching problems, give the replacement edges for the loops the sum of the weights of the two loops they replace; all other edges keep their original weights.

Then (unlike the bipartite double cover, which also eliminates loops) the cardinality or optimal weight of a matching in the loopy graph can be read off from the corresponding solution in its loopless double cover. Any matching of the original loopy graph can be translated into a matching of the loopless cover by applying the same loopless cover translation to the matching instead of to the whole graph; this doubles the total weight of the matching and the total number of matched vertices. And among matchings on the loopless cover, when trying to optimize weight or matched vertices, it is never helpful to match the two copies differently, so there is an optimal solution that can be translated back to the original graph without changing its optimality.

This doesn’t quite work for the problem of finding a matching that maximizes the total number of matched edges, rather than the total number of matched vertices. These two problems are the same in simple graphs, but different in loopy graphs. However, in a loopy graph, if you are trying to maximize matched edges, you might as well include all loops in the matching, and then search for a maximum matching of the simple graph induced by the remaining unmatched vertices. Again, in this case, you don’t get a problem that requires any new algorithms to solve it.

In the case of my student, I only provided the conventional answer, because really all they wanted to know was whether these issues affected how they answered one of the homework questions, and the answer was that the question didn’t involve and didn’t need loops. However it seems that the more-complicated answer is that even if you allow loops to count only one unit towards degree, and to be included in matchings, they don’t change the matching problem much.

\(^*\) This is only actually a covering graph under the convention that the degree of a loop is one. For the usual degree-2 convention for loops, you would need to replace each loop by a pair of parallel edges, forming a multigraph, to preserve the degrees of the vertices.

In 1971/1972 academic year, I was an undergraduate student at the Hebrew University of Jerusalem and toward the end of the year I wrote a paper about Abel’s sums. I sent it to John Riordan the author of the books “Combinatorial Identities” and “Combinatorial Analysis”.

I received this letter shortly after my 17th birthday, and I was surely very happy to read the sentence “I think you have had a splendid idea”. Here is part of Riordan’s remarks. The full report is here.

It took me some time to revise the paper and get it printed. And here is the report for the second version.

And here is part of Riordan’s second round of remarks. The full report is here.

I was certainly happy to read the following sentence: “I would remark that the result for p = -1 is new and perhaps the simplest derivation of Abel’s result.”

In 1978 I actually visited John Riordan in his office at Rockefeller University, NYC. I remember him as very cheerful and he told me that when his first book appeared he was working at Bell Labs and his managers wanted to talk to him. He was a bit worried that they would not approve of him spending time and effort to write a book in pure mathematics. But actually, they gave him a salary raise!

(If you have a picture of John Riordan, please send me.)

Joe Biden is the 46th president of the USA. Note is called a centered triangular number. These numbers obey the formula:

and start with The previous one, the , was Herbert Hoover, hmmm. Biden has promised to make controlling the Covid-19 pandemic one of his top priorities.

Today I thought we would discuss how he might use computer technology to help get the virus under control.

First, we thank the drug companies since we now have vaccines that work against the virus. Without these we would have little chance to bring the pandemic under control at all.

Second, we must state that we are worried that the virus is mutating and this may render the current vaccines less useful, if not useless. We hope this is not happening, or that the drug companies will be able to respond with vaccine boosters. Today there seems to be good news and bad news.

Results will fluctuate, but in any case, vaccines will definitely play a key role in defeating the pandemic. We want to ask the same about computing technology.

Computing’s Role—I

There are many web sites that discuss how computing technology can play a role in defeating the pandemic. Here are some of the main points:

Tracking People: Many places are interested in tracking who are sick. Tracking can by itself help stop the spreading of the virus, and thus help save lives. For example, IEEE says:

“We believe software can help combat this global pandemic, and that’s why we’re launching the Code Against COVID-19 initiative…,” said Weiting Liu, founder and CEO of Arc. “From tracking outbreaks and reducing the spread to scaling testing and supporting healthcare, teams around the world are using software to flatten the curve. The eMask app (real-time mask inventory in Taiwan) and TraceTogether (contact tracing in Singapore) are just two of the many examples.”

Changing Behavior: A powerful idea is to avoid human to human contact and thus stop the spread of the virus. For example, here are examples from a longer list of ideas:

Robot Deliveries;

Digital and Contactless Payments;

Remote Work and Remote Learning and more.

Changing Health Delivery: An important idea is how can we reduce the risk of health delivery. A paradox is that health care may need to be avoided, since traditional delivery requires human contact. There are many examples of ways to make health care online, and therefore safer. Shwetak Patel won the 2018 ACM Prize in Computing for contributions to creative and practical sensing systems for sustainability and health. He outlined here CCC blog how health care could be made more online.

Computing’s Role—II

The above ideas are fine but I believe the real role for computing is simple:

Make signing up and obtaining an appointment for a vaccine easier, fairer, and sooner.

In the US each state is in charge of running web sites that allow people to try and get an appointment for a vaccine shot. Try is the key word. Almost all sites require an appointment to get a shot—walk-ins are mostly not allowed.

I cannot speak for all states and all web sites, but my direct experience is that the sites are terrible. Signing up for a vaccination shot is a disaster. The web sites that I have seen are poorly written, clumsy, and difficult to use. They are some of the worst sites I have ever needed to use, for anything. Some of the top issues:

The sites require you to sign in each time from scratch.

The sites require you to sign in each time from scratch.

The sites require you to sign in each time from scratch.

The sites rules are confusing and unclear.

You may need to search for particular vaccine locations, rather than for any locations.

And more

Repeating (1,2,3) is a poor joke, but one that reflects reality.

Open Problems

If Amazon, Google, Apple had sites that worked this way, they would be out of business quickly. Perhaps this is the key: Can our top companies help build the state sites? Is it too late to help? See here for a New York Times article on this issue:

When you start to pull your hair out because you can’t register for a vaccine on a local website, remember that it’s not (only) the fault of a bad tech company or misguided choices by government leaders today. It’s a systematic failure years in the making.

Also is the issue of algorithmic fairness relevant here? We know that it is unfortunately easy to have web sites that are unfair—that assign vaccine sign up dates unfairly, that favor one class of people over another.

This past week, I spent so much mental energy worrying about the fate of Scott Alexander that I almost forgot that right here in Texas, I’m surrounded by historic scenes of Third-World-style devastation: snowstorms and sub-freezing temperatures for which our infrastructure was completely unprepared; impassable roads; burst gas and water pipes; millions without electricity or heat or clean water; the UT campus a short walk from me converted into a giant refugee camp.

For all those who asked: my family and I are fine. While many we know were without power for days (or are still without power), we lucked out by living close to a hospital, which means that they can’t shut off the electricity to our block. We are now on a boil-water notice, like all of Austin, and we can’t take deliveries or easily go anywhere, and the university and schools and daycares are all closed (even for remote learning). Which means: we’re simply holed up in our house, eating through our stockpiled food, the kids running around being crazy, Dana and I watching them with one eye and our laptops with the other. Could be worse.

In some sense, it’s not surprising that the Texas infrastructure would buckle under weather stresses outside the envelope of anything it was designed for or saw for decades. The central problem is that our elected leaders have shown zero indication of understanding the urgent need, for Texas’ economic viability, to do whatever it takes to make sure nothing like this ever happens again. Ted Cruz, as everyone now knows, left for Cancun; the mayor of Colorado City angrily told everyone to fend for themselves (and then resigned); and Governor Abbott has been blaming frozen wind turbines, a tiny percentage of the problem (frozen gas pipes are a much bigger issue) but one that plays with the base. The bare minimum of a sane response might be, I dunno,

acknowledging the reality that climate change means that “once-per-century” weather events will be every couple years from now on,

building spare capacity (nuclear would be ideal … well, I can dream),

winterizing what we have now, and

connecting the Texas grid to the rest of the US.

If I were a Texas Democrat, I’d consider making Republican incompetence on infrastructure, utilities, and public health my only campaign issues.

Alright, now back to watching the Mars lander, which is apparently easier to build and deploy than a reliable electric grid.

Yesterday was Ash Wednesday, the beginning of Lent, the 40-day period that precedes Easter and that is observed by Catholics and other Christians as a period of reflection. It is, often, a period in which the faithful choose to give something up as a penance, such as giving up eating meat.

The period that immediately precedes Lent is known as Carnival, and, perhaps incongruously, it is a time for having fun, playing pranks, and eating special sweets, often deep-fried ones. Traditionally kids, and also grownups, dress up in costumes and attend costume parties. The idea being, let’s have fun and eat now, because soon we are “entirely voluntarily” going to fast and to reflect on sin and death, and stuff like that. The day before Ash Wednesday, indeed, is called “Fat Tuesday”.

In Milan, however, the tradition is to power through Ash Wednesday and to continue the Carnival festivities until the following Sunday. There are a number of legends that explain this unique tradition, that is apparently ancient. One such legend is that a plague epidemic had been ravaging Milan in the IV century around the time that should have been Carnival, and life was beginning to go back to normal right around Ash Wednesday. So people rebelled against Lent, and were like, haven’t we suffered enough, what more penance do we need, and celebrated Carnival later.

It has now been nearly a year since the first lockdown, and we still cannot travel between regions (for example, we cannot travel from Milan to Bologna, or to Venice), cannot eat dinner in a restaurant, cannot go see a movie, a play or a sporting event, cannot ski, and so on.

My proposal is that when (if?) we go back to a normal life, we shorten Lent to three days (start with “Ash Thursday” the day before Good Friday), and that we make Carnival start on Easter Monday and last for 361 days. Not because we have had it worse than a IV century plague epidemic: indeed, even in the best of times, IV century people in Milan did not usually eat in restaurants, travel to Venice, see movies, or ski. We, however, are spoiled XXI century people, we are not used to inconveniences, and when (if?) this is over we will need a lot of self-care, especially the eating-deep-fried-sweets-and-partying kind of self-care.

The Complexity group of Iddo Tzameret at Imperial College London invites expressions of interest for a postdoctoral position funded by the ERC. The position is for two years with a possible one-year extension. The start date is flexible, and the salary is generous and includes funding for equipment and travel. This position will be based at the South Kensington campus at the heart of London.

Imperial’s Computing is seeking up to two highly motivated PhD students interested in computational complexity. The positions are based at the South Kensington campus at the heart of London, and include a generous stipend, as well as funding for equipment and travel. The successful candidate will join the complexity group at Imperial College led by Iddo Tzameret.

In this lecture, we talk about what neural networks end up learning (in terms of their weights) and when, during training, they learn it.

In particular, we’re going to discuss

Simplicity bias: how networks favor “simple” features first.

Learning dynamics: what is learned early in training.

Different layers: do the different layers learn the same features?

The type of results we will discuss are:

Gradient-based deep learning algorithms have a bias toward learning simple classifiers. In particular this often holds when the optimization problem they are trying to solve is “underconstrained/overparameterized”, in the sense that there are exponentially many different models that fit the data.

Simplicity also affects the timing of learning. Deep learning algorithms tend to learn simple (but still predictive!) features first.

Such “simple predictive features” tend to be in lower (closer to input) levels of the network. Hence deep learning also tends to learn lower levels earlier.

On the other side, the above means that distributions that do not have “simple predictive features” pose significant challenges for deep learning. Even if there is a small neural network that works very well for the distribution, gradient-based algorithms will not “get off the ground” in such cases. We will see a lower bound for learning parities that makes this intuition formal.

What do neural networks learn, and when do they learn it?

As a first example to showcase what is learned by neural networks, we’ll consider the following data distribution where we sample points , with ( corresponding to orange points and corresponding to blue points).

If we train a neural network to fit this distribution, we can see below that the neurons that are closest to the input data end up learning features that are highly correlated with the input (mostly linear subspaces at 45-degree angle, which correspond to one of the stripes). In the subsequent layers, the features learned are more sophisticated and have increased complexity.

Neural networks have simpler but useful features in lower layers

Some people have spent a lot of time trying to understand what is learned by different layers. In a recent work, Olah et al. dig deep into a particular architecture for computer vision, trying to interpret the features learned by neurons at different layers.

They found that earlier layers learn features that resemble edge detectors.

However, as we go deeper, the neurons at those layers start learning more convoluted (for example, these features from layer resemble heads).

SGD learns simple (but still predictive) features earlier.

There is evidence that SGD learns simpler classifiers first. The following figure tracks how much of a learned classifier’s performance can be accounted for by a linear classifier. We see that up to a certain point in training all of the performance of the neural network learned by SGD (measured as mutual information with the label or as accuracy) can be ascribed to the linear classifier. They diverge only very near the point where the linear classifier “saturates,” in the sense that the classifier reachers the best possible accuracy for linear models. (We use the quantity – the mutual information of and conditioned on the prediction of the linear classifier – to measure how much of ‘s performance cannot be accounted for by .)

The benefits and pitfalls of simplicity bias

In general, simplicity bias is a very good thing. For example, the most “complex” function is a random function. However, if given some observed data , SFD were to find a random function that perfectly fits it, then it would never generalize (since for every fresh , the value of would be random).

At the same time, simplicity bias means that our algorithms might focus too much on simple solutions and miss more complex ones. Sometimes the complex solutions actually do perform better. In the following cartoon a person could go to the low-hanging fruit tree on the right-hand side and miss the bigger rewards on the left-hand side.

This can actually happen in neural networks. We also saw a simple example in class:

The two datasets are equally easy to represent, but on the righthand side, there is a very strong “simple classifier” (the 45-degree halfspace) that SGD will “latch onto.” Once it gets stuck with that classifier, it is hard for SGD to get “unstuck.” As a result, SGD has a much harder time learning the righthand dataset than the lefthand dataset.

Analysing SGD for over-parameterized linear regression

So, what can we prove about the dynamics of gradient descent? Often we can gain insights by studying linear regression.

Formally, given with we would like to find a vector such that .

In this setting, we can prove that running SGD (from zero or tiny initialization) on the loss will converge to solution of minimum norm. To see whym note that SGD performs updates of the form . However note that is a scalar. Therefore all of the updates keep the updated vector within . This implies that the converging solution will also lie in .

Geometrically this translates into being the projection of onto the the subspace which results in the least norm solution.

Analyzing the dynamics of descent, we can write the distance between consecutive weight updates and the converging solution as . We see that we are applying the linear operator at every step we take. As long as this operator is contractive, we will continue to progress and converge to . Formally, to make progress, we require . This directly translates into and then the progress we make is approximately , where is the condition number of .

What happens now if the matrix is random? Then, results from random matrix theory (specifically the Marchenko-Pastur distribution) state that

if , then the matrix has and the eigenvalues are bounded away from . This means that the matrix is well conditioned.

if , then the spectrum of starts shifting towards , with some eigenvalues being equal to zero, resulting in an ill-conditioned matrix.

if , then the spectrum has some zero eigenvalues, but is otherwise bounded away from zero. If we restrict to the subspace of positive eigenvalues, we achieve again a good condition number.

Deep linear networks

We now want to go beyond linear regression and talk about deep networks. As deep networks are very hard to understand, we will first start analyzing a depth network. We will also consider a linear network and omit the nonlinearity. This might seem strange, as we could consider the corresponding linear model, which has exactly the same expressiveness. However, note that these two models have a different parameter space. This means that gradient-based algorithms will travel on different paths when optimizing these two models.

Specifically, we can see that the minimum loss attained by the two models will coincide, i.e., , but the SGD path and the solution will be different.

We will analyze the gradient flow on these two networks (which is gradient descent with the learning rate ). We will make the simplifying assumption that and symmetric. Then, we can see that . We will try and compare the gradient flow of two different loss functions: (doing gradient flow on a linear model) and (doing gradient flow on a depth linear model).

Gradient flow on the linear model simply gives , whereas for the deep linear network we have (using the chain rule)

since is symmetric.

For simplicity, let’s denote and . We then have .

Another way to view the comparison between the models of interest, and is as follows: let , then . We can view this as follows: when we multiply the gradient with we end up making the “big bigger and the small smaller”. Basically, this accenuates the differences between the eigenvalues and is biasing to become a low-rank matrix.

To see why, you can think of a low rank matrix has one that has few large eigenvalues and the others small. If is already close low rank, then replacing a gradient by encourages the gradient steps to mostly happen in the top eigenspace of . This result generalizes to networks of greater depth, and the gradient evolves as , with .

This means that we end up doing gradient flow on a Riemannian manifold. An interesting result is that the flow induced by the operator is provably not equivalent to a regularized minimization problem for any .

What is learned at different layers?

Finally, let’s discuss what is learned by the different layers in a neural network. Some intuition people have is that learning proceeds roughly like the following cartoon:

We can think of our data as being “built up” as a sequence of choices from higher level to lower level features. For example, the data is generated by first deciding that it would be a photo of a dog, then that it would be on the beach, and finally low-level details such as the type of fur and light. This is also how a human would describe this photo. In contrast, a neural network builds up the features in the opposite direction. It starts from the simplest (lowest-level) features in the image (edges, textures, etc.) and gradually builds up complexity until it finally classifies the image.

How neural networks learn features?

To build a bit of intuition, consider an example of combining different simple features. We can see that if we try to combine two good edge detectors with different orientations, the end result will hardly be an edge detector.

So the intuition is that there is competitive/evolutionary pressure on neurons to “specialize” and recognize useful features. Initially, all the neurons are random features, which can be thought of as random linear combination of the various detectors. However, after training, the symmetry will break between the neurons, and they will specialize (in this simple example, they will either become vertical or horizontal edge detectors).

Raghu, Gilmer, Yosinski, and Sohl-Dickstein tracked the speed at which features learned by different layers reach their final learned state. In the figure below the diagonal elements denote the similarity of the current state of a layer to its final one, where lighter color means that the state is more similar. We can see that earlier layer (more to the left) reach their final state earlier (with th exception of the 2 layers closest to the output that also converge very early).

The “symmetry breaking” intuition is explored by a recent work of Frankle, Dziugaite, Roy, and Carbin. Intuitively, because the average of two good features is generally not a good feature, averaging the weights of two neural networks with small loss will likely result in a network with large loss. That is, if we start from two random initializations , and train two networks until we reach weights and with small loss, then we expect the average of and to result in a network with poor loss:

In contrast, Frankle et al showed that sometimes, when we start from the same initialization (especially after pruning) and use random SGD noise (obtained by randomly shuffling the training set) then we reach a “linear plateu” of the loss function in which averaging two networks yields a network with similar loss:

The contrapositive of simplicity: lower bounds for learning parities

If we believe that networks learn simple features first, and learn them in the early layers, then this has an interesting consequence. If the data has the form that simple features (e.g. linear or low degree) are completely uninformative (have no correlation with the label) then we may expect that learning cannot “get off the ground”. That is, even if there exists a small neural network that can learn the class, gradient based algorithms such as SGD will never find it. (In fact, it is possible that no efficient algorithm could find it.) There are some settings where we can prove such conjectures. (For gradient-based algorithms that is; proving this for all efficient algorithms would require settling the P vs NP question.)

We discuss one of the canonical “hard” examples for neural networks: parities. Formally, for , the distribution is the distribution over defined as follows: and . The “learning parity” problem is as follows: given samples drawn from , either recover or do the weaker task of finding a predictor such that with high probability over future samples .

It turns out that if we don’t restrict ourselves to deep learning, given samples we can recover . Consider the transformations and . If we let if and otherwise, we can write . Basically, we transformed the problem of parity to a problem of counting if we have an odd or an even number of . In this setting, we can think of every sample as providing a linear equation moudlo 2 over the unknown variables . When , these linear equations will be very likely to be of full rank, and hence we can use Gaussian elimination to find and hence .

Switching to the learning setting, we can express parities by using few ReLUs. In particular, we’ve shown that we can create a step function using ReLUs. Therefore for every , there is a combination of four ReLUs that computes the function such that outputs for , and outputs if . We can then write the parity function (for example for ) as . This will be a linear combination of at most ReLUs.

Parities are an example of a case where simple feature are uninformative. For example, if then for every linear function ,

in other words, there is no correlation between the linear function and the label. To see why this is true, write . By linearity of expectation, it suffices to show that $latex \mathbb{E}{(x,y) \sim D_I}[ L_ix_i y] = L_i \mathbb{E}{(x,y) \sim D_I}[ x_i y] = 0&bg=ffffff$. Both and are just values in . To evaluate the expectation we simply need to know the marginal distribution that induces on when we restrict it to these two coordinates. This distribution is just the uniform distribution. To see why this is the case, consider a coordinate and let’s condition on the values of all coordinates other than and . After conditioning on these values, for some and are chosen uniformly and independently from . For every choice of , if we flip then that would flip the value of , and hence the marginal distribution on and will be uniform.

This lack of correlation turns out to be a real obstacle for gradient-based algorithms. While small neural networks for parities exist, and Gaussian elimination can find them, it turns out that gradient-based algorithms such as SGD will fail to do so. Parities are hard to learn, and even if the capacity of the network is such that it can memorize the input, it will still perform poorly in a test set. Indeed, we can prove that for every neural network architecture , running SGD on will require steps. (Note that if we add noise to parities, then Gaussian elimination will fail and it is believed that no efficient algorithm can learn the distribution in this case. This is known as the learning parity with noise problem, which is also related to the learning with errors problem that is the foundation of modern lattice-based cryptography.)

We now sketch the proof that gradient-based algorithms require exponentially many steps to learn parities, following Theorem 1 of Shalev-Shwartz,Shamir and Shammah. We think of an idealized setting where we have an unlimited number of samples and only use a sample only once (this should only make learning easier). We will show that we make very little progress in learning , by showing that for any given , the expected gradient over will be exponentially small, and hence we make very little progress toward learning . Specifically, using the notation , for any ,

The term is independent of and so does not contribute toward learning . Hence intuitively to show we make exponentially small progress, it suffices to show that typically for every , will be exponentially small. (That is, even if for a fixed we make a large step, these all cancel out and give us exponentially small progress toward actually learning .)

Formally, we will prove the following lemma:

Lemma: For every ,

Proof: Let us fix and define . The quantity can be written as with respect to the inner product . However, is an orhtonormal basis with respect to this inner product. To see this note that since , for every , and for , where is the symmetric difference of and . The reason is that for all and so elements that appear in both and “cancel out”. Since the coordinates of are distributed independently and uniformly, the expectation of the product is the product of expectations. This means that as long as is not empty (i.e., ) this will be a product of one or more terms of the form . Since is uniform over , and so we get that if , .

Given the above

which means that (since there are subsets of ) on average . In other words, is typically exponentially small which is what we wanted to prove.

Elvira Mayordomo, Mitsu Ogihara, and Atri Rudra are going to be the editors of a special issue of the journal Theory of Computing Systems dedicated to Alan Selman. Alan passed away this January 2021.

Today we circulate their request for contributions.

The details of the call say: This special issue celebrates Alan’s life and commemorate his extraordinary contributions to the field. The topics of interest include but are not limited to:

average-case complexity

circuit complexity

comparison of reducibilities

complexity theoretic characterizations of models

function complexity

hierarchy theorems

parameterized complexity

promise problems and disjoint NP-pairs

public-key cryptography

relativization

semi-feasible algorithms

sparse sets

structure of complete sets

Open Problems

Please look at this for details—the deadline for submission is 31st July 2021. You have 164 days to write your paper. Which is 3936 hours or 236160 minutes.

June 21-25, 2021 Online http://acm-stoc.org/stoc2021/callforworkshops.html Submission deadline: March 15, 2021 STOC 2021 will hold workshops during the conference week, June 21–25, 2021. We invite groups of interested researchers to submit workshop proposals. The due date for proposals is March 15.

A lot of topology is finding ways to prove things that are really obvious but where explaining why they’re obvious can be difficult. So I want to do this for a discrete analogue of ropelength, the length of the shortest lattice representation, for the Borromean rings. You can find several pretty lattice (and non-lattice) representations of the Borromean rings in a paper by Verhoeff & Verhoeff, “Three families of mitered Borromean ring sculptures” [Bridges, 2015]; the one in the middle of their figure 2, thinned down to use only lattice edges and not thick solid components, is the one I have in mind. It is formed by three \(2\times 4\) rectangles, shown below next to Jessen’s icosahedron which has the same vertex coordinates. (You can do the same thing with a regular icosahedron but then you get non-lattice golden rectangles.)

Each of the three rectangles has perimeter \(12\), so the total length of the whole link is \(36\). Why should this be the minimum possible? One could plausibly run a brute force search over all small-enough realizations, but this would be tedious and some effort would be needed to prune the search enough to get it to run at all. Instead, I found an argument based on the lengths of the individual components of the link, allowing me to analyze them (mostly) separately.

Each component is unknotted, so it can be the boundary of a disk in space. Importantly, for the Borromean rings, every disk spanned by one of the components must be crossed at least twice by other components. If we could find a disk spanned by one component that was not crossed at all by other components, then we could shrink the first component topologically within its disk down to a size so small that it could easily be pulled apart from the other two components, something that is not possible with the Borromean rings. And if we could find a disk that was only crossed once by another component, then the linking number of the two components would be one, something that doesn’t happen for the Borromean rings.

If you travel in some consistent direction around a cycle in a 3d lattice, every step in one direction along a coordinate axis must be cancelled by a step in the opposite direction elsewhere along the ring. So if a lattice cycle has length \(\ell\), there must be \(\ell/2\) pairs of opposite steps, partitioned somehow among the three dimensions. If the bounding box of the cycle has size \(a\times b\times c\), then we must have \(a+b+c\le\ell/2\), and we can classify the possible shapes of lattice cycles of length \(\ell\) by the possible shapes of their bounding boxes. This gives us the following cases:

A lattice cycle of length \(\ell=4\) can only be a square, with bounding box dimensions \(1\times 1\times 0\) (the zero means that it lies in a single plane in 3d, not that it doesn’t exist at all). The square itself is a disk not crossed by any other lattice path, unusable as a component of the Borromean rings.

A lattice cycle of length \(\ell=6\) can be a rectangle with bounding box \(2\times 1\times 0\), or fully 3-dimensional with bounding box \(1\times 1\times 1\). There are two fully 3-dimensional cases, one that avoids two opposite vertices of the bounding box and one that avoids two adjacent vertices. The rectangle can be its own spanning disk, and in the 3-dimensional cases we can use a spanning disk connecting the center of the bounding cube by a line segment to each point along the ring. Neither of these types of disks is crossed by any other lattice path.

A lattice cycle of length \(\ell=8\) can be a rectangle with bounding box \(3\times 1\times 0\), a square with bounding box \(2\times 2\times 0\) or fully 3-dimensional with bounding box \(2\times 1\times 1\). It can also double back on itself and cover all vertices of a cube, with bounding box \(1\times 1\times 1\). All cases except the \(2\times 2\times 0\) square can be handled as in the length \(6\) cases; for instance, for the \(2\times 1\times 1\) bounding box we form a disk at the center of the box, connected by a line segment to all points of the ring. These cases cannot be crossed by any other lattice path. The \(2\times 2\times 0\) square can be crossed by a lattice path, through its center point, but only by one path. We can see from this that the shortest lattice representation of the Hopf link (two linked circles) is the obvious one formed from two length-\(8\) squares. However, these squares are still too small to be used in the Borromean rings.

A lattice cycle of length \(\ell=10\) can be a rectangle with bounding box \(4\times 1\times 0\) or \(3\times 2\times 0\), or fully 3-dimensional with bounding box \(3\times 1\times 1\) or \(2\times 2\times 1\). The \(4\times 1\times 0\) rectangle and the center-point spanning disk of the \(3\times 1\times 1\) box cannot be crossed by any other lattice path, and the center-point spanning disk of the \(2\times 2\times 1\) can be crossed only by one, through its center edge. Using even-smaller bounding boxes doesn’t help.

That leaves only one problematic case, the \(3\times 2\times 0\) rectangle, of perimeter \(10\), which is shorter than the rectangles of the optimal representation but can nevertheless be crossed by two other lattice paths. In fact, this rectangle can be used as a component in a representation of the Borromean rings. It is even possible to use two of them! (I’ll leave this as an exercise.) So we need some other argument to prove that, when we use one or two of these short rectangles, we have to make up for it elsewhere by making something else extra-long.

If a \(3\times 2\times 0\) rectangle is a component of the Borromean rings, it must be twice by one of the other components, because if its two crossings were from different components it would have nonzero linking number with both of them, different from what happens in the Borromean rings. And the crossings must happen at the two interior lattice points of the rectangle, through paths that (to avoid each other and the boundary of the rectangle) must pass straight across the rectangle, at least for one unit on each side. The component that crosses the rectangle in this way consists of two loops connecting the pairs of ends of these two straight paths; any other connection pattern would lead to linking number \(2\), not zero. We can think of these two loops as being separate cycles, shortcut by the lattice edges between the endpoints of the two straight paths. And any disk that spans either of these two loops must itself be crossed by another component of the Borromean rings, because if one of the loops had an uncrossed spanning disk then we could wrap a spanning disk for the rectangle around it (like a glove around a hand) and create an uncrossed spanning disk for the rectangle as well.

By the analysis above, in order to be crossed by something else, both of the two shortcut loops of the component that crosses the \(3\times 2\times 0\) rectangle must have length at least \(8\). Adding in the two straight paths (and removing the two shortcut edges) shows that the component itself must have length at least \(18\). And if we have one component of length \(18\) and two more of length \(10\), we get total length at least \(38\), more than the length of the minimal representation. Since all representations that use components of length less than \(12\) are too long, the representation in which all three component lengths are exactly \(12\) must be the optimal one, QED.

Researching all this led me to an interesting paper by Dai, Ernst, Por, and Ziegler, “The ropelengths of knots are almost linear in terms of their crossing numbers” [J. Knot Theory and its Ramifications, 2019]. Ropelength is the minimum length of a 3d representation that can be thickened to a radius-1 tube without self-intersections. (Some sources use diameter in place of radius; this changes the numeric values by a factor of two but does not change the optimizing representations.) Doubling the dimensions of a lattice representation gives you such a representation, and on the other hand one can find short lattice representations by following the thickened tubes of a ropelength representation, so ropelength and lattice length are within constant factors of each other. Dai et al. use this to show that knots that can be drawn in the plane with few crossings also have small ropelength. It doesn’t work to use the plane embedding directly, adding a third coordinate to handle the crossings, because some planar graphs (like the nested triangles graph) have nonlinear total edge length in any planar lattice drawing. Instead, Dai et al show how to crumple up a planar drawing of any degree-four planar graph into a 3d integer lattice embedding of the graph, with near-linear total edge length, so that the faces of the drawing can also be embedded as disks that are not crossed by each other or the graph edges. One can then modify the lifted drawing to turn the degree-four vertices into crossings in the lifted topologically-planar surface formed by these faces, giving a grid representation of the original knot with near-linear total length.

The ropelength of the Borromean rings has also been the subject of some study. Doubling the grid rectangles and rounding off their corners produces three stadia with total perimeter \(12\pi+24\approx 61.7\). The same argument as above shows that each curve must be at least long enough for all its spanning disks to be crossable by two disjoint radius-1 tubes. Intuitively the smallest curve that can surround two tubes is a smaller stadium corresponding to the \(2\times 3\) rectangle, with length \(4\pi+4\). If so, this would give a lower bound of \(12\pi+12\approx 49.7\) for the total ropelength of the Borromean rings. The conjectured-optimal configuration, used for the logo of the International Mathematical Union, uses three copies of a complicated two-lobed planar curve in roughly the same positions as the three rectangles or stadia; it is described carefully by Cantarella, Fu, Kusner, Sullivan, and Wrinkle, “Criticality for the Gehring link problem” [Geometry & Topology 2006] (section 10), and has length \(\approx 58.006\). The intuition that the \(2\times 3\) stadium is the shortest curve that can surround two others also appears to be stated as proven in this paper, in section 7.1. But they state that the best lower bound for the Borromean ropelength is \(12\pi\) so maybe the \(12\pi+12\) argument above is new?

Update, February 17: In email, John Sullivan pointed me to a paper by Uberti, Janse van Rensburg, Orlandinit, Tesi, and Whittington, “Minimal links in the cubic lattice” [Topology and Geometry in Polymer Science, 1998; see table 2, p. 97], which does the tedious computer search and comes up with the same result, that the shortest length for a lattice representation of the Borromean rings is 36. (I had searched for papers on lattice representations of the Borromean rings but didn’t find this one, probably failing because it identifies the Borromean rings only by the Alexander–Briggs notation \(6_2^3\), which is hard to search for.) John also tells me that the proof of ropelength-minimality of the \(2\times 3\) stadium is only for links in which it is linked with two other components, different enough from the situation here in which every spanning disk is crossed twice that the same proof doesn’t apply. So the question of whether this stadium really is the ropelength minimizer for components satisfying this crossed-twice condition seems to fall into the category of obvious topological facts that are difficult to prove, rather than being already known.