Theory of Computing Blog Aggregator

We introduced state machines and state machine replication in an earlier post. In this post, we elaborate on the exact notion of safety and liveness that can be obtained by an ideal state machine when there are multiple clients. First, these definitions highlight the challenges of serving multiple clients. Second,...

at October 16, 2021 07:45 PM UTC

  • I was interested to see a familiar-looking graph drawing as one of the answers to the prompt “multiplicity” for the first entry in Mathober 2021 (\(\mathbb{M}\)). It’s a multigraph formed by a triangle with tripled edges, and looks a lot like the drawing I made for the Wikipedia Shannon multigraph article, prettied up by making an infinitely recessing sequence of these drawings rather than just one. Good choice for multiplicity.

  • Non-concentration of the chromatic number of random graphs (\(\mathbb{M}\)). Uniformly random graphs, \(G(n,1/2)\) in the Erdős–Rényi–Gilbert model, turn out to have chromatic numbers that, for infinitely many \(n\), are spread over roughly \(\sqrt{n}\) values. But there are weird fluctuations so that, conjecturally, for some \(n\) the concentration is much tighter, more like \(n^{1/4}\).

  • University System of Georgia moves to gut tenure (\(\mathbb{M}\)). The proposed new policy includes procedures for removal of tenure under certain enumerated grounds, including failure to perform their jobs (this is pretty normal) but then adds a massive escape clause in which the board of regents can remove tenured faculty at any time as long as their reason for doing so is not one of the enumerated ones.

  • The first physical models of the hyperbolic plane, made in 1868 by Beltrami (\(\mathbb{M}\), via), blog post by Daina Taimiņa from 2010. Maybe you could make something like this by wrapping and stretching a disk of wet paper in a roll around a pseudosphere. The rolled-up photo of Beltrami’s model suggests that he did that. The via link shows this as a tangent to a story about triangulated polygons, frieze patterns, and the Farey tessellation.

  • Why do bees make rhombic dodecahedrons (\(\mathbb{M}\))? Nice video from Matt Parker (Stand-up Maths) on why bees usually end the cells of their honeycombs with rhombic dodecahedron faces, why this isn’t the optimal solution to fitting two layers of cells together (in terms of minimum wax usage), and why it isn’t reasonable to expect bees to find exact optima for this problem. If I have to quibble with something, though, it’s his plural. It’s not wrong, but see Google ngrams.

  • Mathematicians prove melting ice stays smooth (\(\mathbb{M}\), see also). The headline is a little overstated: you’re probably familiar with thin necks of ice melting to sharp points at the instant they separate. But these singularities are instantaneous: mathematical models of ice stay smooth for all but a measure-zero set of times. Original paper: “The singular set in the Stefan problem”, Alessio Figalli, Xavier Ros-Oton, and Joaquim Serra.

  • Discussion of the recent meme telling programmers and mathematicians that summation notation and for loops are the same thing. They’re not quite the same, though: summations don’t have an order of evaluation. But which is easier for people who don’t already know what they are to search and find out about? And why do programmers get angry at non-programming notational conventions?

  • Mathematics, morally (\(\mathbb{M}\), via), Eugenia Cheng, 2004. Somehow I hadn’t run across this before. It argues that much philosophy of mathematics is irrelevant to practice (“You can’t tell from somebody’s mathematics if they are a fictionalist, a rationalist, a platonist, a realist, an operationalist, a logicist, a formalist, structuralist, nominalist, intuitionist.”) and instead considerations of the right way of handling certain topics are more central.

  • The SIGACT Committee for the Advancement of Theoretical Computer Science is collecting information on women in theoretical computer science (\(\mathbb{M}\)). If this is you, please see their announcement for details of how to be counted.

  • Cynthia Rudin wins major award with silly name (\(\mathbb{M}\)), for her work on machine learning systems that learn to predict behavior using simple, interpretable, and transparent formulas.

  • According to the SODA web site, SIAM has decided that their conferences will be hybrid through July (\(\mathbb{M}\)). So if (like me) you wanted to participate in SODA/SOSA/ALENEX/APOCS, but were worried about planning a trip to Virginia with another predicted winter wave of Covid, now you can stay home and conference safely. Or, if you feel hybrid conferences are problematic and organizers should do one or the other but not both, now you have another reason to be annoyed.

  • Rarely is the question asked: Are math papers getting longer? (\(\mathbb{M}\)). Following earlier work by Nick Trefethen, Edward Dunne provides some data suggesting that (for certain journals, at least, and not the ones with page limits) the answer is yes. I’m not convinced by the suggested explanation that it’s because they are taking more effort to explain “connections with other work”, though: is that really a big enough fraction of most papers?

  • I haven’t been using my office desktop Mac much because I haven’t been into my office much, so it took me a while to pay attention to the fact that much of its networking had recently broken. Here’s why (\(\mathbb{M}\)). It was still running OS X El Capitan (10.11.6) and a crucial top-level certificate expired. The machine is too old (late 2009) for the latest OS X but it looks like I can and should upgrade to High Sierra, 10.13. So much for getting anything else accomplished today…

by David Eppstein at October 15, 2021 03:43 PM UTC

When László Babai first announced his graph isomorphism in quasipolynomial time result, I wrote

We think of theory as a young person's game, most of the big breakthroughs coming from researchers early in their careers. Babai is 65, having just won the Knuth Prize for his lifetime work on interactive proofs, group algorithms and communication complexity. Babai uses his extensive knowledge of combinatorics and group theory to get his algorithm. No young researcher could have had the knowledge base or maturity to be able to put the pieces together the way that Babai did.

Babai's proof is an exceptional story, but it is the exception. Most CS theorists have done their best work early in their career. I got myself into a Twitter discussion on the topic. For me, I'm proud of the research I did through my forties, but I'll always be best known, research-wise, for my work on interactive proofs around 1990. It would be hard to run a scientific study to determine cause and effect, but here are some reasons, based on my own experiences, why we don't see research dominated by the senior people in theory.

  • The field changes - Computational complexity has moved from a computation-based discipline to one now dominated by combinatorics, algebra, and analysis. I'm not complaining; a field should evolve over time, but it plays less to my strengths. It's hard to teach this old dog new tricks.
  • The fruit hung lower - There were important problems with easier proofs available then that are not available now.
  • Responsibilities - You have fewer as a PhD student, postdoc, or assistant professor.
  • Family - It becomes more of a focus.
  • Taking on new jobs - Many academics, though not all, take on administrative roles at their university, or leave academia completely.
  • The young people have the new ideas - And older people get settled in their ways.
  • The thrill is gone, or at least decays - Your first theorem, your first talk, your first conference paper give you a level of excitement that's hard to match.
  • Existentialism - The realization that while computing has no doubt changed the world, my research, for the most part, hasn't.
  • Cognitive decline - Probably the most controversial, but I find it hard to focus on problems like I used to. Back in the day I prided myself on knowing all the proofs of my theorems; now I can't even remember the theorems.

Honestly, there is just nothing wrong with taking on new roles; writing books, surveys, and blogs; focusing on teaching, mentorship, and service; and leaving the great research to the next generation.

by Lance Fortnow at October 15, 2021 01:32 PM UTC

York University in Toronto, Canada is inviting applications for five tenured or tenure-track positions in the theory of computing or data science (both broadly interpreted). The review of applications will begin on November 15, and full applications are due by November 30.


by shacharlovett at October 15, 2021 01:01 PM UTC

In this post, we will cover Nakamoto’s Consensus protocol presented in the Bitcoin whitepaper. There are a lot of posts and videos that explain this seminal and simple protocol. Our goal in this post will be to intuitively piece out the need for different aspects of the protocol, such as...

at October 15, 2021 04:00 AM UTC

The NSF Quantum Leap Challenge Institute for Robust Quantum Simulation seeks exceptional candidates for the RQS Postdoctoral Fellowship. Applications should be submitted through AcademicJobsOnline.


by shacharlovett at October 15, 2021 12:45 AM UTC

Authors: Steven Delong, Alireza Farhadi, Rad Niazadeh, Balasubramanian Sivan
Download: PDF
Abstract: We study the classic online bipartite matching problem with a twist: offline nodes are reusable any number of times. Every offline node $i$ becomes available $d$ steps after it was assigned to. Nothing better than a $0.5$-approximation, obtained by the trivial deterministic greedy algorithm, was known for this problem. We give the first approximation factor beating $0.5$, namely a $0.505$ approximation, by suitably adapting and interpreting the powerful technique of Online Correlated Selection.

at October 15, 2021 10:39 PM UTC

Authors: Daniel Grier, Daniel J. Brod, Juan Miguel Arrazola, Marcos Benicio de Andrade Alonso, Nicolás Quesada
Download: PDF
Abstract: Gaussian boson sampling is a model of photonic quantum computing that has attracted attention as a platform for building quantum devices capable of performing tasks that are out of reach for classical devices. There is therefore significant interest, from the perspective of computational complexity theory, in solidifying the mathematical foundation for the hardness of simulating these devices. We show that, under the standard Anti-Concentration and Permanent-of-Gaussians conjectures, there is no efficient classical algorithm to sample from ideal Gaussian boson sampling distributions (even approximately) unless the polynomial hierarchy collapses. The hardness proof holds in the regime where the number of modes scales quadratically with the number of photons, a setting in which hardness was widely believed to hold but that nevertheless had no definitive proof.

Crucial to the proof is a new method for programming a Gaussian boson sampling device so that the output probabilities are proportional to the permanents of submatrices of an arbitrary matrix. This technique is a generalization of Scattershot BosonSampling that we call BipartiteGBS. We also make progress towards the goal of proving hardness in the regime where there are fewer than quadratically more modes than photons (i.e., the high-collision regime) by showing that the ability to approximate permanents of matrices with repeated rows/columns confers the ability to approximate permanents of matrices with no repetitions. The reduction suffices to prove that GBS is hard in the constant-collision regime.

at October 15, 2021 10:37 PM UTC

Authors: Yu Tong, Victor V. Albert, Jarrod R. McClean, John Preskill, Yuan Su
Download: PDF
Abstract: Quantum many-body systems involving bosonic modes or gauge fields have infinite-dimensional local Hilbert spaces which must be truncated to perform simulations of real-time dynamics on classical or quantum computers. To analyze the truncation error, we develop methods for bounding the rate of growth of local quantum numbers such as the occupation number of a mode at a lattice site, or the electric field at a lattice link. Our approach applies to various models of bosons interacting with spins or fermions, and also to both abelian and non-abelian gauge theories. We show that if states in these models are truncated by imposing an upper limit $\Lambda$ on each local quantum number, and if the initial state has low local quantum numbers, then an error at most $\epsilon$ can be achieved by choosing $\Lambda$ to scale polylogarithmically with $\epsilon^{-1}$, an exponential improvement over previous bounds based on energy conservation. For the Hubbard-Holstein model, we numerically compute a bound on $\Lambda$ that achieves accuracy $\epsilon$, obtaining significantly improved estimates in various parameter regimes. We also establish a criterion for truncating the Hamiltonian with a provable guarantee on the accuracy of time evolution. Building on that result, we formulate quantum algorithms for dynamical simulation of lattice gauge theories and of models with bosonic modes; the gate complexity depends almost linearly on spacetime volume in the former case, and almost quadratically on time in the latter case. We establish a lower bound showing that there are systems involving bosons for which this quadratic scaling with time cannot be improved. By applying our result on the truncation error in time evolution, we also prove that spectrally isolated energy eigenstates can be approximated with accuracy $\epsilon$ by truncating local quantum numbers at $\Lambda=\textrm{polylog}(\epsilon^{-1})$.

at October 15, 2021 10:40 PM UTC

The Department of Computer Science at CSUEB invites applications for 2 tenure-track appointments as Assistant Professor in Computer Science (considering all areas of computer science, capable of teaching in emerging areas) starting Fall 2022.


by shacharlovett at October 14, 2021 05:42 PM UTC


In a series of two blog posts, we dive into how to build practical certified defenses against adversarial patches. In Part I below, we give an overview of smoothing-based defenses, a popular class of certified defenses, and walk through a specific example of using de-randomized smoothing to defend against adversarial patch attacks. In Part II, we discuss our latest work that demonstrates how using vision transformers with de-randomized smoothing (smoothed ViTs) can significantly improve these defenses. Our approach not only provides better certified defenses against adversarial patches but also maintains standard accuracy and inference times at levels comparable to that of regular (non-robust) models. Thanks to these improvements, smoothed ViTs become a truly viable alternative to regular models.

Adversarial examples are small modifications of inputs that can induce machine learning models to misclassify those inputs. These attacks can be roughly categorized into two broad classes: small, synthetic perturbations that are visually indistinguishable from the original input, and physically realizable attacks that can break deployed machine learning systems in the wild. One popular attack in the latter category, known as adversarial patches, maliciously perturbs a bounded region in the image (typically a square).

An example of an adversarial patch attack. A square patch is added to the image in order to cause the classifier to misclassify the image as something else, such as a boat instead of a duck.

Adversarial patches can be used to undermine a variety of vision-based tasks; in particular, they have been used to deceive image classifiers, manipulate object detectors, and disrupt optical flow estimation. Such attacks tend to be relatively easy to implement, too, as they just require, e.g., printing adversarial stickers, creating adversarial pieces of clothing, or manufacturing adversarial glasses.

While several heuristic defenses have been proposed to protect against adversarial patches, some of these defenses were shown to not be fully effective. This is unfortunately a fairly common theme in adversarial examples research.

Certified defenses attempt to alleviate such empirical evaluation problems by providing robustness guarantees that are provable. However, certified guarantees tend to be modest and come at a steep cost. For instance, recently proposed certified defenses against adversarial patches can guarantee only 14% robust accuracy on ImageNet. Furthermore, certified models are often substantially slower (by even 2-4 orders of magnitude) and significantly less accurate (in terms of standard accuracy) than non-robust models, which severely limits their practicality.

Smoothing models for provable robustness

We begin our discussion of certified defenses with an overview of smoothing, a general strategy for creating models that are provably robust to adversarial attacks. Smoothing approaches revolve around the idea of deriving a (robust) prediction by aggregating many variations of the original input (e.g., ones that correspond to adding Gaussian noise, masking parts of the input, or augmenting it with generative models). The underlying intuition is that while it might be easy for an adversary to attack a single input, it is much harder for them to simultaneously attack all of the input variations. Thus, instead of directly classifying the original input, we classify each such variation of the input and then output the most frequently predicted class: if the number of variations that agree with this class is high enough, then the prediction is guaranteed to be robust.

Let’s dive into the details! To create a smoothed classifier, we change the typical forward pass of our classifier (often referred to as the base classifier) into a three-step procedure:

  1. Construct variations of the input.
  2. Classify all input variations individually.
  3. Output the most frequently predicted class.
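The three steps above can be sketched in a few lines. This is a toy illustration, not the defense's actual implementation: `smoothed_predict`, the stand-in `base_classifier`, and the list of "variations" are all assumptions made for the sake of the example.

```python
from collections import Counter

def smoothed_predict(base_classifier, variations):
    """Sketch of the three-step smoothed forward pass. `base_classifier`
    and the variation list are stand-ins, not a real model API."""
    # Step 2: classify each input variation individually.
    votes = Counter(base_classifier(v) for v in variations)
    # Step 3: output the most frequent class, plus its winning margin
    # (the margin is what certification reasons about).
    (top, n1), (_, n2) = (votes.most_common(2) + [(None, 0)])[:2]
    return top, n1 - n2

# Step 1 (constructing variations) is faked here: each "variation" is
# just a label, and the toy classifier reads it back.
variations = ["duck", "duck", "duck", "boat", "duck"]
label, margin = smoothed_predict(lambda v: v, variations)
print(label, margin)  # duck 3
```

The returned margin is exactly the quantity the certification argument below counts against.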

Now, once we have such a smoothed classifier, how can we certify its robustness? The key assumption for certification is that an adversary can only affect a limited number of input variations. If this is true, certification amounts to a simple counting problem: if the difference between the first- and second-most frequently predicted classes is larger than the number of votes an adversary could ever change via their manipulation, then the prediction is guaranteed to be robust! Importantly, for this approach to work well, the variations must remain informative of the correct class while being hard for the adversary to perturb in large numbers simultaneously. This gives rise to a natural trade-off between robustness and accuracy.

By choosing different types of input variations, smoothing techniques can defend against different types of adversarial attacks. For example, smoothing classifiers with random additive noise can provide certified defenses against adversarial L2 and L1 attacks. On the other hand, smoothing over “pixel flows” can provide certified defenses against Wasserstein perturbations. One can even “smooth” the entire pipeline of both training and inference (treating it as a single module), to get robustness to data poisoning attacks. In general, if we can construct a set of input variations so that the adversary can only affect some fraction of them, we can use smoothing to defend against the corresponding attacks!

Derandomized Smoothing: A Smoothing Defense against Patch Attacks

How can we use smoothing to defend against adversarial patches specifically? Recall that to be able to create a smoothing defense, we need to come up with a way to generate variations of an image that limit the effect of an adversarial patch and still be indicative of the correct class label. Here, we exploit a key property of adversarial patches: they are limited to a small, contiguous region of the input image.

Specifically, for each variation, we mask out all of the image except for a small region. In most of these variations, the adversarial patch will be fully masked, and thus cannot affect the model’s prediction by construction. This technique was originally introduced in a certified defense known as derandomized smoothing (Levine & Feizi, 2020). Typically, we leave a column of the image unmasked as shown in the figure below, otherwise known as a column ablation (check out the original paper for some other ablation types).
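As a concrete sketch, column ablations can be generated as follows. The ablation width of 19 and the wrap-around behavior follow the usual derandomized-smoothing setup, but this is an illustration rather than the defense's actual code.

```python
import numpy as np

def column_ablations(image, b):
    """Yield every column ablation of width b: all pixels are masked to
    zero except a b-pixel-wide column. Columns wrap around the image
    edge, so an H x W x C image yields W ablations."""
    h, w, c = image.shape
    for start in range(w):
        cols = [(start + j) % w for j in range(b)]
        ablated = np.zeros_like(image)
        ablated[:, cols, :] = image[:, cols, :]
        yield ablated

image = np.random.rand(224, 224, 3)
ablations = list(column_ablations(image, b=19))
print(len(ablations))  # 224: one ablation per starting column
```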

Example of column ablations for an input image. Most of the image is masked out except for a column of the image. An adversarial patch can only affect a column ablation if the patch itself is located within the ablation, thus limiting the effect of an adversary.

Certified guarantee against patch attacks

Now that we have our smoothed classifier, how can we prove that the prediction is robust to patch attacks? Since a patch is of limited size, an adversary can only affect the few image variations that physically overlap with the patch. Consequently, if the margin of the winning prediction is large enough, then we can guarantee robustness to adversarial patches.

A simplified demonstration of the derandomized smoothing defense for patch robustness (the figure shows the classified column ablations under no patch, a 32x32 patch, and a 64x64 patch). Here, an image of a duck is split into column ablations (our choice of input variation), each of which is classified individually. Since 6 out of 8 column ablations are classified as duck, the smoothed model predicts that this is a duck. A 32x32 patch is only large enough to affect at most 2 column ablations, which is not enough to change the prediction. Thus, this prediction is certifiably robust to 32x32 patches. In contrast, a 64x64 patch can affect up to 3 column ablations and can potentially flip the most frequent class to boat, so the model is not certifiably robust to 64x64 patches.

To make this more precise, let $\Delta$ be the number of ablations that a patch could simultaneously affect (in the worst case). If the winning margin is at least $2\Delta$, then the smoothing defense guarantees that the predicted class is robust to this patch size. The fraction of examples which meet this certification threshold is typically referred to as certified accuracy.

For adversarial patch attacks of size $m \times m$ and column ablation width $b$, we have $\Delta = m + b - 1$.
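In code, the certification check is just this counting argument. The sketch below follows the threshold stated in the text; real implementations additionally need a fixed tie-breaking rule between classes when the margin equals the threshold exactly.

```python
def max_affected(m, b):
    """Worst-case number of width-b column ablations an m x m patch can
    intersect: the patch spans m columns, and an ablation's window
    overlaps them for at most m + b - 1 of its starting positions."""
    return m + b - 1

def is_certified(margin, delta):
    """Certified when the winning margin is at least 2*delta: flipping
    delta votes removes delta from the top class and adds delta to the
    runner-up, changing the margin by at most 2*delta."""
    return margin >= 2 * delta

# The figure's toy example: 8 ablations, 6 vote duck, 2 vote boat.
margin = 6 - 2
print(is_certified(margin, delta=2))  # True: flipping 2 votes is not enough
print(is_certified(margin, delta=3))  # False: 3 flipped votes could overturn it

# ImageNet-scale numbers: a 32x32 patch, column ablations of width b = 19.
print(max_affected(32, 19))  # 50
```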

Challenges for Certified Patch Defenses

Although certified patch defenses can provide guarantees against adversarial attacks, like many other certified defenses, they face several major challenges that limit their practicality:

  1. They provide relatively modest guarantees, and can only guarantee robustness for a relatively small fraction of inputs and/or only a small patch size.
  2. The standard accuracy is typically much lower than that of a standard, non-robust model, forcing an unfavorable trade-off between robustness and accuracy on practitioners.
  3. Inference times tend to be several orders of magnitude larger than that of a standard, non-robust model, making certified defenses difficult to deploy in real-time settings.

In the next post, we discuss our latest work on making certified defenses much more practical. Specifically, we show how leveraging vision transformers (ViT) enables better handling of input variations and significant improvement of the margins for certification. With some straightforward modifications to the ViT architecture, we are also able to develop certifiably robust models that not only outperform previous certified defenses but also have practical inference times and standard accuracies.

at October 14, 2021 12:00 AM UTC


In our previous post, we gave an overview of smoothing-based approaches to certified defenses, and described a certified patch defense known as de-randomized smoothing. In our latest work, we show how to leverage vision transformers (ViTs) to significantly improve such certified patch defenses along all possible axes: standard accuracy, certified accuracy, and speed. In fact, with some straightforward modifications, our certified ViTs not only enable significantly better certified robustness but they also maintain standard accuracy and inference times that are comparable to that of regular (non-robust) models. This progress marks a substantial step forward for certified defenses as a practical alternative to regular models.

An overview of the Smoothed ViT.

The main appeal of certified defenses is their ironclad guarantees: a certified result leaves no room for doubt and is not prone to the blind spots of empirical evaluation. However, practitioners care about more than certified performance: the accuracy and the inference speed of the model are equally important factors. Unfortunately, nearly all certified defenses come at a severe price: (1) the standard accuracy plummets in comparison to a standard model, and/or (2) inference time is orders of magnitude slower than that of a standard model.

A standard ResNet-50 can achieve about 76% accuracy on the ImageNet benchmark and takes less than 1 second to make a prediction on a typical GPU. In contrast, a top-performing certified defense such as that of Levine & Feizi (2020) reports 44% standard accuracy and takes 150 seconds to make a prediction on similar hardware.

This trade-off between robustness and performance can be untenable even for safety-critical applications. For example, object detection systems deployed in autonomous vehicles need to recognize obstacles and signs quickly in real time. Consequently, in addition to being robust, these systems also need to be fast and accurate.

To approach this tradeoff between robustness and accuracy, we take a closer look at certified defenses from an architectural perspective. Specifically, smoothing defenses use a backbone architecture that is typically implemented using classic convolutional networks, the default choice across much of computer vision. But are these architectures really the best tool for the job?

In our work, we show that when moving from standard models to certified defenses, it may be worth it to rethink the core architecture behind them. Specifically, we find that smoothing-based patch defenses can drastically improve when using vision transformers (ViTs) instead of convolutional networks as their backbone. With some additional adjustments to the transformer architecture, we not only achieve superior certified accuracy to prior work but also reach standard accuracies and inference times comparable to a standard, non-robust ResNet!

Rethinking smoothing backbones

Recall from the previous post that smoothing defenses use a base classifier to make predictions on many variations of the original input. For patch defenses, this amounts to using the base classifier to classify column ablations (i.e., masked images that only reveal a small column of the original image). Consequently, the accuracy of the base classifier at classifying columns of an image directly translates into the performance of the certified defense.

In light of this observation, it is crucial to realize that convolutional networks typically process complete (i.e., not masked) images. So this type of architecture might not actually be best suited for classifying the highly masked images used in smoothing defenses.

Why might we expect convolutions to be suboptimal for masked images? CNNs slide filters across the image and grow the receptive field across layers of the network. This large receptive field, while usually being a strength of CNNs for standard settings, forces the network to process a large amount of masked pixels when processing ablated images. Since these pixels carry no information, the network must learn to detect and ignore these masked regions (in addition to performing unnecessary computation).

Vision transformers for ablated images

Unlike CNNs, vision transformers (ViTs) use attention modules instead of convolutions as their main building block. Crucially, attention modules can ignore masked regions instead of processing them.

To test out this intuition, we measured the accuracy of various convolutional and ViT architectures at classifying column ablations. It turns out that ViTs are significantly better at this task—for example, on ImageNet, a ViT-S has 12% higher accuracy at classifying column ablations than a similarly sized ResNet-50.

Ablation accuracy of ViTs vs. ResNets: A comparison of the accuracy of ViTs with similarly sized ResNets at classifying image ablations.

These improvements in ablation accuracy directly boost both the certified and standard accuracies of the smoothed model. For example, the increase in ablation accuracy improves the certified accuracy of a smoothed ViT-S by 13% (as compared to a similarly sized ResNet-50) and the standard accuracy by 12%.

With some additional adjustments to the certification procedure described in our paper, our largest certifiably robust vision transformer can reach 73% standard accuracy—only 3% off from the standard accuracy of a regular, non-robust ResNet-50, and almost 30% higher than previous versions of de-randomized smoothing!

Certified accuracy of ViTs vs. ResNets: A comparison of the certified accuracy of smoothed ViTs/ResNets when using different ablation sizes.
Standard accuracy of ViTs vs. ResNets: A comparison of the standard accuracy of smoothed ViTs/ResNets when using different ablation sizes.

Modifying ViTs to Speed up Inference Time

Also, recall that a smoothed classifier tends to be significantly slower than a standard model because it needs to classify a large number of input variations for each input image. Indeed, traditional randomized smoothing approaches have used upwards of 100,000 variations, resulting in a forward pass that is 4-5 orders of magnitude slower than a standard model. These costs are infeasible in scenarios where decisions need to be made in a fraction of a second, such as autonomous driving. As Levine & Feizi have shown, for patch defenses, a move towards deterministic smoothing (as opposed to randomized) can result in a great speed up (by reducing the number of variations needed to be classified). However, the resulting solution is still two orders of magnitude slower than standard models.

In order to speed up inference (much) further, we leverage the token-based nature of ViTs to gracefully handle image ablations. Specifically, note that ViTs process images as tokens, where each token represents a contiguous patch of the input image. Since the runtime of a ViT is proportional to the number of tokens, a natural approach to speed up inference is to simply drop the masked out tokens, as shown in the following figure:

Vision transformers split images into patches, which are processed by the transformer architecture. When processing column ablated images, we can simply pass only the tokens in the ablation to significantly reduce the computational complexity of smoothing.
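The token-dropping idea can be sketched as follows. This is a toy illustration on a single row of token embeddings; the function name, the 16-pixel token size, the embedding dimension of 384, and the ablation width of 19 are assumptions made for the example, and real ViT code would operate on a [batch, tokens, dim] tensor before the transformer blocks.

```python
import numpy as np

def keep_ablation_tokens(tokens, token_cols, ablation_cols):
    """Drop fully masked tokens before the transformer blocks.
    `token_cols` gives the image-column range each token covers; a token
    is kept only if it overlaps the unmasked ablation columns."""
    keep = [i for i, (lo, hi) in enumerate(token_cols)
            if any(lo <= c < hi for c in ablation_cols)]
    return tokens[keep]

# Toy setup: a 224-pixel-wide image with 16-pixel patch tokens gives
# 14 token columns; we look at one row of token embeddings.
tokens = np.random.rand(14, 384)
token_cols = [(16 * i, 16 * (i + 1)) for i in range(14)]
ablation_cols = range(30, 49)  # a width-19 column ablation
kept = keep_ablation_tokens(tokens, token_cols, ablation_cols)
print(kept.shape)  # (3, 384): only 3 of 14 tokens survive
```

Since a ViT's runtime grows with the number of tokens, keeping only the 3 overlapping tokens instead of all 14 is where the speed-up comes from.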

We modify the ViT architecture to drop fully masked tokens in such a manner. As expected, this significantly speeds up the inference time of the smoothed model! Indeed, with this optimization, our smoothed ViT-S is faster than a smoothed ResNet-50 by an order of magnitude. In our paper, we discuss how additional improvements to the certification procedure can make our ViT architectures another order of magnitude faster. In total, our fastest smoothed ViT is actually comparable in terms of inference speed to a standard, non-robust ResNet.


Smoothed ViTs present a simple but effective way to achieve state-of-the-art certified accuracy. These models are also realistically deployable, as they maintain accuracy and inference speeds similar to those of non-robust models. The key here is the vision transformer's inherent ability to quickly and accurately process image ablations. These advancements are thus a step towards certified defenses that can be practically deployed in real-world scenarios.

at October 14, 2021 12:00 AM UTC

I have a new preprint, “Finding Relevant Points for Nearest-Neighbor Classification”, arXiv:2110.06163, to appear in January at the SIAM Symposium on Simplicity in Algorithms (SOSA22). It’s about points in Euclidean spaces of dimension three or more, but I thought it would make a good warm-up to discuss here the one-dimensional version of the same problem, solved (together with the 2d version) by Bremner, Demaine, Erickson, Iacono, Langerman, Morin, and Toussaint in their paper “Output-sensitive algorithms for computing nearest-neighbour decision boundaries”, Discrete Comput. Geom. 2005.

So in this problem, you have a collection of real-valued data points with known discrete classifications (say, a finite set of colors), and you want to guess the color of new points whose color is not already given. Nearest neighbor classification means simply that, to guess the color of \(x\), you find the closest known point \(y\) and guess that \(x\) has the same color as \(y\). One easy way to do this would be to store the known points in a sorted list and use binary search. There’s lots more to say about this (for instance its use in combination with random projections for high-dimensional approximate nearest neighbors) but for today I want to focus on the size of this sorted list. We can store a list that is potentially much smaller, but always produces the same results, by keeping only points that have a neighbor with a different classification.
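For instance, a minimal sketch of the one-dimensional lookup (the function name and setup are mine, for illustration):

```python
import bisect

def nearest_neighbor_label(points, labels, query):
    # points must be sorted; labels[i] is the color of points[i].
    # Binary search finds the insertion position, and the nearest
    # neighbor is one of the (at most) two points bracketing it.
    i = bisect.bisect_left(points, query)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(points)]
    best = min(candidates, key=lambda j: abs(points[j] - query))
    return labels[best]
```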

One-dimensional nearest neighbor classification on a full data set and the data set trimmed to its relevant points

These points with differently-classified neighbors are the “relevant points” of the preprint title. Another way of describing them is that a point is relevant if deleting it would change the classification of some other (unknown) points in space that might later be queried. Among the set of decision boundaries, the ones separating parts of space with different classifications are the ones defined by relevant points. So if we store a nearest-neighbor data structure with only relevant points, we will get the same answer as if we store all known points. But because we’re storing fewer points, it will take less memory and less time (especially if the reduced memory allows it to fit into cache).

If you have the points given to you in sorted order, then it’s easy enough to scan through them in that order, keeping track of which pairs have different classifications, and produce a filtered sequence of the relevant ones. Here it is in Python (with apologies for the low-contrast syntax coloring):

def relevant(seq,classify):
    """Filter points whose classification differs from a neighbor.
    Arguments are a sequence of points, and a function to classify
    each point. The return value is an iterator for the filtered sequence."""
    prevpoint = prevclass = prevlisted = None
    for x in seq:
        xclass = classify(x)
        if prevclass is not None and xclass != prevclass:
            if not prevlisted:
                yield prevpoint
            yield x
            prevlisted = True
        else:
            prevlisted = False
        prevpoint = x
        prevclass = xclass

However, as Bremner et al observed, sorting an input to make it usable by this scan does more work than necessary, because we don’t need the sorted ordering of all the other points between the relevant ones. Instead, we can use an idea resembling quickselect, where we modify the quicksort algorithm to stop recursing in subproblems where sorting is unnecessary. For finding relevant points, these subproblems are the homogeneous ones, in which all points have the same classification as each other. Bremner et al combined this idea with a version of quicksort that always partitions its subproblems at the exact median, in order to achieve a good worst-case time bound, but if we’re happy with expected analysis we can use the same random-pivoting idea as the more usual form of quicksort:

import random

def quickrelevant(seq,classify):
    """Same output as relevant(sorted(list(seq)),classify).
    We assume the input sequence is sortable and has no repeat values."""

    def supersequence():
        """Generate sorted supersequence of relevant points by quicksort-like
        recursive subdivision, stopping at homogeneous subproblems.
        We include the endpoints of each subproblem, even though some might
        not be relevant, so the results should be cleaned up using relevant().
        We use an explicit stack to handle the recursion, avoiding the need
        to pass yielded outputs back through a call stack."""

        liststack = [list(seq)]
        while liststack:
            L = liststack.pop()
            # Base cases of recursion: lists of zero or one item
            if not L:
                continue
            if len(L) == 1:
                yield L[0]
                continue
            # Test whether L is homogeneous
            # and if so only generate its extreme values
            homogeneous = True
            firstclass = classify(L[0])
            for i in range(1,len(L)):
                if firstclass != classify(L[i]):
                    homogeneous = False
                    break
            if homogeneous:
                yield min(L)
                yield max(L)
                continue
            # Divide and conquer with random pivot
            pivot = L[random.randrange(0,len(L))]
            low = []
            high = []
            for x in L:
                if x < pivot:
                    low.append(x)
                elif x > pivot:
                    high.append(x)
            # The stack is last-in first-out, so push high, then the
            # pivot, then low, to generate the output in sorted order
            liststack.append(high)
            liststack.append([pivot])
            liststack.append(low)

    return relevant(supersequence(),classify)

The time complexity of this algorithm can be analyzed much like the analysis of quickselect, by observing that the time is proportional to the number of comparisons with pivots, computing the probability that each possible comparison happens, and summing. In quicksort, two distinct elements at distance \(d\) from each other in the final sorted output are compared whenever one of them is the first to be chosen as a pivot in the interval between them, which happens with probability exactly \(2/(d+1)\). In quickrelevant, this same probability holds for pairs that are separated by one of the decision boundaries. But pairs of elements that are within the same homogeneous block are less likely to be compared, because of the possibility that the recursion will stop before it separates or compares them.

If a pair of elements lies within a single block, has distance \(d\) separating them, and is \(e\) units from both ends of the block, then it will only be compared if one of the two elements is first to be chosen in at least one of the two intervals of length \(d+e\) extending from the two elements towards an end of their block. This happens with probability at most \(4/(d+e)\), because there are two ways of choosing which element to pick first as the pivot and two ways of choosing which extended interval it is first in.

If we sum up these probabilities, for pairs involving a single element that is \(e\) units from its nearest block boundary among a set of \(n\) elements, we get \(O\bigl(\log(n/e)\bigr)\) as the expected contribution to the total time for that one element. If we sum the contributions from the elements within a block, for a block of length \(\ell_i\), we get a total expected contribution from that block of \(O\bigl(\ell_i\log(n/\ell_i)\bigr)\). And if we have \(k\) relevant points and \(O(k)\) blocks, and we sum over all blocks, the total time is maximized when all the blocks are of equal size, \(\Theta(n/k)\), for which we get total time \(O(n\log k)\).

For quicksort and quickselect, it’s possible to be more careful in the analysis, derive exact formulas for the probability of making each comparison, and from them get an analysis of the expected number of comparisons that does not use \(O\)-notation for its leading term; see my linked post on the quickselect analysis. Probably it’s possible here too but it looks messier. Maybe it would make a good undergraduate research project. One thing to be careful of is that the comparisons are not over in the homogeneous case; finding the min and max simultaneously, in a block of length \(\ell_i\), takes roughly \(3\ell_i/2\) comparisons. But that should only be a lower-order term compared to the \(O(n\log k)\) leading term of the analysis.
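The simultaneous min–max computation mentioned here is the classical pairing trick; a quick sketch (not code from the preprint):

```python
def minmax(L):
    # Find the min and max together with about 3*len(L)/2 comparisons:
    # compare elements in pairs, then compare the smaller of each pair
    # against the running min and the larger against the running max.
    it = iter(L)
    lo = hi = next(it)
    for a in it:
        b = next(it, None)
        if b is None:
            # Odd leftover element: compare it against both extremes.
            lo, hi = min(lo, a), max(hi, a)
            break
        if a > b:
            a, b = b, a
        lo = min(lo, a)
        hi = max(hi, b)
    return lo, hi
```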

(Discuss on Mastodon)

by David Eppstein at October 13, 2021 11:05 PM UTC

We are looking for exceptional junior scientists to work collaboratively with a group of faculty from Computer Science (Boaz Barak), Statistics (Lucas Janson), Electrical Engineering (Demba Ba) and Applied Mathematics (Cengiz Pehlevan) towards a theory of representations in artificial and natural systems.


by shacharlovett at October 13, 2021 10:38 PM UTC

We are looking for junior scientists in theoretical computer science, broadly construed.
The normal duration of the Rabin Fellowship is two years. Rabin Fellows will receive a generous salary as well as an allocation for research and travel expenses.
While interaction with Harvard faculty, students, and visitors is encouraged, Rabin Fellows are free to pursue their own interests.


by shacharlovett at October 13, 2021 10:36 PM UTC

We construct an explicit $\varepsilon$-hitting set generator (HSG) for regular ordered branching programs of length $n$ and $\textit{unbounded width}$ with a single accept state that has seed length \[O(\log(n)(\log\log(n)+\log(1/\varepsilon))).\] Previously, the best known seed length for regular branching programs of width $w$ with a single accept state was by Braverman, Rao, Raz, and Yehudayoff (FOCS 2010, SICOMP 2014) and Hoza, Pyne, and Vadhan (ITCS 2021), which gave \[O(\log(n)(\log\log(n)+\min\{\log(w),\log(n)\}+\log(1/\varepsilon))).\] We also give a simple co-HSG for the model with optimal seed length $O(\log n)$. For the more restricted model of $\textit{permutation}$ branching programs, Hoza, Pyne, and Vadhan (ITCS 2021) constructed a PRG with seed length matching our HSG, and then Pyne and Vadhan (CCC 2021) developed an error-reduction procedure that gave an HSG (in fact a ``weighted PRG'') with seed length $\widetilde{O}(\log(n)\sqrt{\log(n/\varepsilon)}+\log(1/\varepsilon)).$ We show that if an analogous error-reduction result could be obtained for our HSG, there is an explicit HSG for general ordered branching programs of width $w=n$ with seed length $\widetilde{O}(\log^{3/2}n)$, improving on the $O(\log^2n)$ seed length of Nisan (Combinatorica 1992).

at October 13, 2021 03:27 PM UTC

October 19, 2021 (virtual). Registration deadline: October 18, 2021. Today’s data pose unprecedented challenges to statisticians. It may be incomplete, corrupted or exposed to some unknown source of contamination or adversarial attack. Robustness is one of the revived concepts in statistics and machine learning that can accommodate such complexity and glean useful information from … Continue reading IDEAL mini-workshop on “Statistical and Computational Aspects of Robustness in High-dimensional Estimation”

by shacharlovett at October 13, 2021 03:05 AM UTC

Many times in my career, I’ve been told by respected statisticians that machine learning is nothing more than nonparametric statistics. The longer I work in this field, the more I think this view is both misleading and unhelpful. Not only can I never get a consistent definition of what “nonparametric” means, but the jump from statistics to machine learning is considerably larger than most expect. Statistics is an important tool for understanding machine learning and randomness is valuable for machine learning algorithm design, but there is considerably more to machine learning than what we learn in elementary statistics.

Machine learning at its core is the art and science of prediction. By prediction, I mean the general problem of leveraging regularity of natural processes to guess the outcome of yet unseen events. As before, we can formalize the prediction problem by assuming a population of $N$ individuals with a variety of attributes. Suppose each individual has associated variables $X$ and $Y$. The goal of prediction is to guess the value of $Y$ from $X$ in a way that minimizes some error metric.

A classic prediction problem aims to find a function that makes the fewest number of incorrect predictions across the population. Think of this function like a computer program that takes $X$ as input and outputs a prediction of $Y$. For a fixed prediction function, we can sum up all of the errors made on the population. If we divide by the size of the population, this is the mean error rate of the function.

A particularly important prediction problem is classification. In classification, the attribute $Y$ takes only two values. For example, the input $X$ could be some demographic details about a person, and $Y$ could be whether or not that person is taller than 6 feet. The input $X$ could be an image, and $Y$ could be whether or not the image contains a cat. Or the input could be a set of laboratory results about a patient, and $Y$ could be whether or not the patient is afflicted by a disease. Classification is the simplest and most common prediction problem, one that forms the basis of most contemporary machine learning systems.

For classification problems, it is relatively straightforward to compute the best error rate achievable. First, for every possible value of the attribute $X$, collect the subgroup of individuals of the population with that value. Then, the best assignment for the prediction function is the one that correctly labels the majority of this subgroup. For example, in our height example, we could take all women aged 30 who were born in the United States and reside in California. Then the optimal label for this group would be decided based on whether there are more people in the group who are taller than 6 feet or not. (Answer: no).
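A sketch of this majority rule on a toy finite population (illustrative code, not from the post):

```python
from collections import Counter, defaultdict

def minimum_error_rule(population):
    # population is a list of (x, y) pairs covering the whole population.
    # For each observed value of x, predict the majority label among
    # individuals sharing that x; no prediction function makes fewer errors.
    groups = defaultdict(Counter)
    for x, y in population:
        groups[x][y] += 1
    return {x: counts.most_common(1)[0][0] for x, counts in groups.items()}
```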

This minimum error rule is intuitive and simple, but computing the rule exactly requires examining the entire population. What can we do if we work from a subsample? Just as was the case in experiment design, we’d like to be able to design good prediction functions from a small sample of the population so we don’t have to inspect all individuals. For a fixed function, we could use the same law-of-large-numbers approximations to estimate the best decision. That is, if we decide in advance upon a prediction function, we could estimate the percentage of mistakes on the population by gathering a random sample and computing the proportion of mistakes on this subset. Then we could apply a standard confidence interval analysis to extrapolate to the population.
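That estimation step can be sketched as follows (an illustrative sketch; the 1.96 factor gives the usual 95% normal-approximation interval):

```python
import math
import random

def estimate_error_rate(population_errors, sample_size, seed=0):
    # population_errors[i] is 1 if the fixed predictor errs on individual i,
    # else 0. Estimate the population error rate from a random sample,
    # together with a ~95% confidence half-width.
    rng = random.Random(seed)
    sample = rng.sample(population_errors, sample_size)
    p_hat = sum(sample) / sample_size
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, half_width
```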

However, what if we’d like to find a good predictor on the population using only a set of examples sampled from the population. We immediately run into an issue: to find the best prediction function, we needed to observe all possible values of $X$. What if we’d like to make predictions about an individual with a set of attributes that was not observed in our sample?

How can we build accurate population-level predictors from small subsamples? In order to solve this problem, we must make some assumptions about the relationship between predictions at related, but different values of $X$. We can restrict our attention to a set of functions that respect regularity properties that we think the predictions should have. Then, with a subsample from the population, we find the function that minimizes the error on the sample and obeys the prescribed regularity properties.

This optimization procedure is called “empirical risk minimization” and is the core predictive algorithm of machine learning. Indeed, for all of the talk about neuromorphic deep networks with fancy widgets, most of what machine learning does is try to find computer programs that make good predictions on the data we have collected and that respect some sort of rudimentary knowledge that we have about the broader population.
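As a toy instance of empirical risk minimization, here is the search over the simplest imaginable function class, threshold rules on a scalar feature (my illustrative example, not code from the post):

```python
def erm_threshold(xs, ys):
    # Search the class of threshold classifiers h_t(x) = [x >= t]
    # for the one minimizing the number of mistakes on the sample.
    best_t, best_err = None, float("inf")
    for t in sorted(set(xs)):
        err = sum(int(x >= t) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```

Restricting to thresholds is exactly the kind of "regularity" assumption described above: it lets a small sample pin down a rule that applies to values of $x$ never seen in the sample.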

The flexibility in defining what “knowledge” or “regularity” means complicates the solution of such empirical risk minimization problems. What does the right set of functions look like? There are three immediate concerns:

  1. What is the right representation? The set needs to contain enough functions to well approximate the true population prediction function. There are a variety of ways to express complex functions, and each expression has its own benefits and drawbacks.

  2. How do we search the set efficiently? We cannot afford to evaluate every function in our set, as this would be too time consuming. Efficient search for high-quality solutions is called optimization.

  3. How will the predictor generalize to the broader population? The functions cannot be too complex or else they will fail to capture the regularity and smoothness of the prediction problem (estimating functions of too high complexity is colloquially called “overfitting”).

Balancing representation, optimization, and generalization gets complicated quickly, and this is why we have a gigantic academic and industrial field devoted to the problem.

I’m repeating myself at this point, but I again want to pound my fist on the table and reiterate that nothing in our development here requires that the relationship between the variables $X$ and $Y$ be probabilistic. Statistical models are often the starting point of discussion in machine learning, but such models are just a convenient way to describe populations and their proportions. Prediction can be analyzed in terms of a deterministic population, and, just as we discussed in the case of randomized experiments, randomness can be introduced as a means of sampling the population to determine trends. Even generalization, which is usually studied as a statistical phenomenon, can be analyzed in terms of the randomness of the sampling procedure with no probabilistic modeling of the population.

On the other hand, some sort of knowledge about the population is necessary. The more we know about how prediction varies based on changes in the covariates, the better a predictor we can build. Engineering such prior knowledge into appropriate function classes and optimization algorithms form the art and science of contemporary machine learning.

This discussion highlights that while we can view prediction through the lens of statistical sampling, pigeonholing it as simply “nonparametric statistics” does not do the subject justice. While the jump from mean estimation to causal RCTs is small, the jump from mean estimation to prediction is much less immediate. And in machine learning practice, the intuitions from statistics often don’t apply. For example, conventional wisdom from statistics tells us that evaluating multiple models on the same data set amounts to multiple hypothesis testing, and will lead to overfitting on the test set. However, there is more and more evidence that reusing a train-test split does not lead to overfitting. Instead, the phenomenon we see is that dataset benchmarks can remain useful for decades. Another common refrain from statistics is that model complexity must be explicitly constrained in order to extrapolate to new data, but this also does not seem to apply at all to machine learning practice.

Prediction predates probability and statistics by centuries. As Moritz and I chronicle in the introduction to Patterns, Predictions, and Actions, astronomers were using pattern matching to predict celestial motions, and the astronomer Edmund Halley realized that similar techniques could be used to predict life expectancy when pricing annuities. Moreover, even though modern machine learning embraced contemporary developments in statistics by Neyman, Pearson, and Wald, the tools quickly grew more sophisticated and separate from core statistical practice. In the next post, I’ll discuss an early example of this divergence between machine learning and statistics, describing some of the theoretical understanding of the Perceptron in the 1960s and how its analysis was decidedly different from the theory advanced by statisticians.

at October 13, 2021 12:00 AM UTC

Today I would like to discuss the Khot-Naor approximation algorithm for the 3-XOR problem, and an open question related to it.

In 3-XOR, we have a system of linear equations modulo 2, with three variables per equation, that might look something like

\displaystyle  \begin{array}{ll} x_1 + x_2 + x_4 & \equiv 0 \pmod 2\\ x_1 + x_5 + x_6 & \equiv 1 \pmod 2\\ x_2 + x_3 + x_4 & \equiv 1 \pmod 2\\ x_5 + x_3 + x_6 & \equiv 1 \pmod 2 \end{array}

The above system is not satisfiable (if we add up the left-hand sides we get 0, if we add up the right-hand sides we get 1), but it is possible to satisfy {3/4} of the equations, for example by setting all the variables to 1. In the Max 3-XOR problem (which we will simply refer to as “3-XOR” from now on), given a system of equations mod 2 with three variables per equation, we want to find an assignment that satisfies as many equations as possible.
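To make this concrete, here is a brute-force check of the four-equation example above (exhaustive search is feasible only because there are just six variables):

```python
from itertools import product

# The four equations above: (variable indices, right-hand side mod 2).
equations = [((1, 2, 4), 0), ((1, 5, 6), 1), ((2, 3, 4), 1), ((5, 3, 6), 1)]

def num_satisfied(assignment, eqs):
    # Count equations whose left-hand side sums to the right-hand side mod 2.
    return sum(sum(assignment[v] for v in vs) % 2 == rhs for vs, rhs in eqs)

best = max(
    num_satisfied(dict(zip(range(1, 7), bits)), equations)
    for bits in product([0, 1], repeat=6)
)
# best == 3: no assignment satisfies all four equations,
# but 3/4 of them are satisfied by setting every variable to 1.
```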

One of the two assignments, all variables set to zero or all variables set to one, will satisfy at least half of the equations, and the interesting question is how much better than 1/2 it is possible to do on a given instance. Khot and Naor provide an algorithm that, given an instance in which it is possible to satisfy a {1/2 + \epsilon} fraction of equations, finds a solution that satisfies at least a {1/2 + \epsilon \cdot O \left( \sqrt{\frac {\log n} n} \right)} fraction of equations, where {n} is the number of variables. The algorithm is randomized, it runs in polynomial time, and it succeeds with high probability. I believe that it is still the state of the art in terms of worst-case approximation guarantee.

Like the approximation algorithm for sparsest cut in Abelian Cayley graphs implied by the result of Bauer et al. that was the subject of the last two posts, the result of Khot and Naor does not prove a bound on the integrality gap of a relaxation of the problem.

I will describe the Khot-Naor algorithm and describe how it manages to use convex optimization to provide an approximation algorithm, but without establishing an integrality gap bound. I thank my student Lucas Pesenti for explaining the algorithm to me and for thinking together about this problem.

If our 3-XOR instance has {m} equations and {n} variables, then the problem of maximizing the number of satisfied equations can be rewritten as

\displaystyle  \frac m2 + \max_{x \in \{ -1 , +1 \}^n} \ \sum_{i,j,k} c_{i,j,k} x_i x_j x_k

so that our goal is to approximate the combinatorial optimization problem

\displaystyle  \max_{x \in \{ -1 , +1 \}^n} \ \sum_{i,j,k} c_{i,j,k} x_i x_j x_k

Up to a constant factor loss in the approximation guarantee, Khot and Naor show that the above is equivalent to

\displaystyle   \max_{x,y,z \in \{ -1 , +1 \}^n} \ \sum_{i,j,k} T_{i,j,k} x_i y_j z_k \ \ \ \ \ (1)

where {T_{i,j,k}} is a symmetric 3-tensor with entries in {\{ -1,0,1\}} and with {O(m)} non-zero entries.

Before continuing, let us recall that if {M} is an {n \times n} matrix, then its {\ell_\infty}-to-{\ell_1} operator norm has the characterization

\displaystyle  || M ||_{\infty \rightarrow 1 } = \max_{x,y \in [-1,1]^n} x^T M y = \max_{x,y \in \{ -1,1 \}^n} x^T M y \ \ \ \ \ (2)
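For intuition, on tiny matrices the characterization (2) can be evaluated by brute force (exponential in {n}, so purely illustrative):

```python
from itertools import product

def inf_to_one_norm(M):
    # max over x, y in {-1,1}^n of x^T M y, by exhaustive search.
    n = len(M)
    return max(
        sum(M[i][j] * x[i] * y[j] for i in range(n) for j in range(n))
        for x in product([-1, 1], repeat=n)
        for y in product([-1, 1], repeat=n)
    )
```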

We could also define the “Grothendieck norm” {|| M ||_{Grot}} of a matrix as the following natural semidefinite programming relaxation of the {\ell_\infty}-to-{\ell_1} norm:

\displaystyle  \begin{array}{lll} \max & \sum_{i,j} M_{i,j} \langle x_i , y_j\rangle \\ {\rm s.t.}\\ & || x_i ||^2 = 1 & i = 1,\ldots,n\\ & ||y_j ||^2 = 1 & j = 1,\ldots,n \end{array}

where the {x_i} and {y_j} are arbitrary vectors. The Grothendieck inequality is

\displaystyle  || M||_{\infty \rightarrow 1 } \leq || M ||_{Grot} \leq O(1) \cdot || M ||_{\infty \rightarrow 1 }

where the {O(1)} is an absolute constant, known to be less than {1.8}. Furthermore, the above inequality has a constructive proof, and it leads to a polynomial time constant factor approximation for the problem of finding values {x_i} and {y_i} in {\pm 1} that maximize (2).

Basically, we can see problem (1) as the natural generalization of (2) to tensors, and one would like to see a semidefinite programming relaxation of (1) achieving something resembling the Grothendieck inequality, but with a loss of something like {\tilde O(\sqrt n)}. As I mentioned above, this remains an open question, as far as I know.

The idea of Khot and Naor is the following. Suppose that we are given an instance {T} of problem (1), and suppose that {x^*,y^*,z^*} is an optimal solution, and let us call

\displaystyle  \epsilon = \sum_{i,j,k} T_{i,j,k} x^*_i y^*_j z^*_k

the value of the optimum (the algorithm will not need to know or guess {\epsilon}).

The key step is now to see that if we pick a random {x \in \{-1,1\}^n}, there is at least a {1/n^{O(1)}} probability that

\displaystyle  \sum_{i,j,k} T_{i,j,k} x_i y^*_j z^*_k \geq \epsilon \cdot \Omega \left(\sqrt{\frac{\log n}{n}} \right)

This is a bit difficult, but it is really easy to see that with {1/n^{O(1)}} probability we have

\displaystyle   \sum_{i,j,k} T_{i,j,k} x_i y^*_j z^*_k \geq \epsilon \cdot \Omega \left(\sqrt{\frac{1}{n}} \right) \ \ \ \ \ (3)

and we can do that by defining a vector {w} with {w_i = \sum_{j,k} T_{i,j,k} y^*_j z^*_k}, so that

\displaystyle  \langle x, w \rangle = \sum_{i,j,k} T_{i,j,k} x_i y^*_j z^*_k

So we have

\displaystyle  \langle x^*,w \rangle = \epsilon

which, using Cauchy-Schwarz, gives

\displaystyle  || w||^2 \geq \epsilon^2 /n

Now, for a random {x\sim \{ -1,1\}^n}, we have

\displaystyle  \mathop{\mathbb E} \langle x,w \rangle^2 = || w||^2 \geq \epsilon^2 /n

and

\displaystyle  \mathop{\mathbb E} \langle x,w \rangle^4 \leq n^{O(1)}

so by Paley–Zygmund we have, let’s say

\displaystyle  \mathop{\mathbb P} \left[ \langle x,w \rangle^2 \geq \frac {\epsilon^2}{2n} \right] \geq \frac 1 {n^{O(1)}}

which, together with the definition of {w} and the fact that the distribution of {\langle x, w\rangle} is symmetric around zero, gives us the claim (3).
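For completeness, the Paley–Zygmund step can be unpacked as follows, with {Z = \langle x,w \rangle^2} and threshold {1/2} (note that {\epsilon \geq 1} here, since the optimum of (1) is a nonnegative integer and is nonzero whenever {T} is nonzero):

\displaystyle  \mathop{\mathbb P} \left[ Z \geq \frac 12 \mathop{\mathbb E} Z \right] \geq \frac 14 \cdot \frac{\left(\mathop{\mathbb E} Z\right)^2}{\mathop{\mathbb E} Z^2} \geq \frac 14 \cdot \frac{(\epsilon^2/n)^2}{n^{O(1)}} = \frac 1 {n^{O(1)}}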

Now suppose that we have been lucky and that we have found such an {x}. We define the matrix

\displaystyle  X_{j,k} = \sum_i T_{i,j,k} x_i

and we see that our claim can be written as

\displaystyle  y^{*T} X z^* \geq \epsilon \cdot \Omega \left(\sqrt{\frac{1}{n}} \right)

At this point we just apply the algorithm implied by the Grothendieck inequality to the matrix {X}, and we find {y} and {z} in {\{ -1,1 \}^n} such that

\displaystyle  y^T X z \geq \epsilon \cdot \Omega \left(\sqrt{\frac{1}{n}} \right)

meaning that

\displaystyle  \sum_{i,j,k} T_{i,j,k} x_i y_j z_k \geq \epsilon \cdot \Omega \left(\sqrt{\frac{1}{n}}\right)

Summarizing, our algorithm is to pick a random vector {x\sim \{-1,1\}^n} and to find a constant-factor approximation for the problem

\displaystyle   \max_{y,z \in \{-1,1\}^n} \ \sum_{i,j,k} T_{i,j,k} x_iy_jz_k \ \ \ \ \ (4)

using semidefinite programming. We do that {n^{O(1)}} times, and take the best solution.

The analysis can be turned into an upper bound certificate in the following way. For the (suboptimal) analysis using Paley-Zygmund, we only need the entries of the random {x} to be 4-wise independent, and there are distributions on {\{-1,1\}^n} where the entries are unbiased and 4-wise independent, and such that the sample space is of size polynomial in {n}. Thus, one could write an SDP relaxation of (4) for each {x} in the support of such a distribution, and then take the maximum of these SDPs, multiply it by {O(\sqrt n)}, and it would be a certified upper bound. Such an upper bound, however, would not come from a relaxation of the 3-XOR problem, and I find it really strange that it is not clear how to turn these ideas into a proof that, say, the standard degree-4 sum-of-squares semidefinite programming relaxation of 3-XOR has an integrality gap at most {\tilde O(\sqrt n)}.

by luca at October 12, 2021 12:55 PM UTC

The School of Mathematical Sciences at Queen Mary University of London is seeking to appoint a permanent faculty member in the area of optimisation, with a focus on applying rigorous research methods to address emerging problems in the general field of combinatorial optimisation. We especially welcome applicants who would complement existing expertise in the School’s Combinatorics group.


by shacharlovett at October 11, 2021 04:34 PM UTC

The Department of Computer Science at Williams College invites applications for two faculty positions beginning July 1,2022. One is a tenure-track position at the rank of assistant professor with a three-year initial term. The other is an open rank position with a preference for more advanced candidates. That position will have terms commensurate with prior experience.


by shacharlovett at October 11, 2021 02:51 PM UTC

Lance: How come you haven't blogged on your muffin book? You've blogged about two books by Harry Lewis (see here and here) one book by the lesswrong community (see here), and you even did a mashup of a post by two different Scott A's (see here),  but not on your own work.

Bill: I thought I did a post on my muffin book.

Lance: No. You have blogged about the muffin problem, and sometimes you mention either the book or the problem in passing, but you haven't had a post that says

HEY, I wrote a book!

And this is all the more strange since you asked me to have the book on our blog page. 

Bill: (Searches blog with keyword muffin and finds no ref to muffin book.) Well pierce my ears and call me drafty! I have not posted on the muffin book! Do you recall my thoughts on when to tell people you are working on a book?

Lance: No

Bill:  I had a college roommate who was an aspiring science fiction writer who told me there are two kinds of people: Those who talk about writing a book, and those who write a book. I have adapted this to:

Do not tell people you are writing a book until you are picking out the cover art.

Lance: I posted about my book when I hadn't even decided on the title. But your cover art is picked out (see here). And, by the way, it's very nice, though it makes me hungry. So I think you can begin talking about the book.

Bill: Indeed! I will!


Hey I have a book! (See here to buy it on amazon.) 

Title: Mathematical Muffin Morsels: Nobody Wants a Small Piece

by Gasarch, Metz, Prinz, Smolyak

(The other authors were undergraduates when we wrote the book. Prinz and Smolyak are now grad students in CS, Metz is in Finance.) 


Martin Gardner wrote a recreational mathematics column for Scientific American for many years, starting in 1956 and ending in the early 1980s. For many STEM people of my generation (using my fake birthday of Oct 1, 1960, I am 62 years old) Martin Gardner's columns were both an inspiration and an early exposure to mathematics. His columns also made the line between recreational mathematics and so-called serious mathematics thin or nonexistent. (See here for a review of Martin Gardner in the 21st century, a book about the kind of math Gardner wrote of. The book makes a mockery of the distinction between recreational and serious mathematics.) He passed away in 2010 at the age of 95.

There is a gathering in his honor, held roughly every 2 years, called Gathering For Gardner. (It was cancelled in Spring 2020 and Spring 2021 because of COVID, though it's in Atlanta, where the CDC is, so they could have held it as an experiment and told the CDC the results.) You have to be invited to go to it. I got an invite for 2016 from my contact at World Scientific, who published my previous book, Problems with a Point: Exploring Math and Computer Science, co-authored with Clyde Kruskal (I had two blogs on it, here and here, and you can buy it on amazon here). I did three posts on G4G-2016 (here, here, and here).

Aside from seeing some great talks that I understood and liked, I also picked up a pamphlet titled:

The Julia Robinson Math Festival

A Sample of Mathematical Puzzles

Compiled By Nancy Blackman

One of the problems, credited to Alan Frank, was

How can you divide and distribute 5 muffins for 3 students so that everyone gets 5/3 and the smallest piece is as big as possible?

They had some other values for muffins and students as well. 

I solved the (5,3) problem and the other ones as well. That was fun. 
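For concreteness, here is a quick sanity check (in Python, not from the pamphlet) of one optimal division for the (5,3) problem; the optimal smallest piece for 5 muffins and 3 students turns out to be 5/12.

```python
from fractions import Fraction as F

# Cut four muffins into (5/12, 7/12) and one muffin into (1/2, 1/2).
pieces = [F(5, 12), F(7, 12)] * 4 + [F(1, 2), F(1, 2)]

# Three students, each receiving a total of 5/3:
shares = [
    [F(1, 2), F(7, 12), F(7, 12)],  # 6/12 + 7/12 + 7/12 = 20/12
    [F(1, 2), F(7, 12), F(7, 12)],  # 6/12 + 7/12 + 7/12 = 20/12
    [F(5, 12)] * 4,                 # 4 * 5/12 = 20/12
]

assert all(sum(share) == F(5, 3) for share in shares)   # everyone gets 5/3
assert sorted(p for share in shares for p in share) == sorted(pieces)
assert min(pieces) == F(5, 12)                          # smallest piece is 5/12
```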

When I got home I began looking at the problem for m muffins and s students. I let f(m,s) be the biggest smallest piece possible when giving out m muffins to s students. I proved a general theorem, called the Floor-Ceiling theorem, that always gives an upper bound, FC(m,s), on f(m,s). I worked out formulas for 

f(m,1) (trivial), 

f(m,2) (trivial), 

f(m,3) (it's always FC(m,3)),

f(m,4) (it's always FC(m,4)).

While working on f(m,5) I found that  f(m,5) was always FC(m,5) EXCEPT for m=11. So what's up with f(11,5)?  

By the Floor-Ceiling theorem f(11,5) \le 11/25. We (at that point several ugrads and HS students had joined the project) were unable to find a protocol that would show f(11,5)\ge 11/25. Personally I thought there WAS such a protocol, but perhaps it was more complicated than the ones we had found (we were finding them by hand using some easy linear algebra); perhaps a computer program was needed. We did find a protocol showing f(11,5)\ge 13/30, which we figured was surely not optimal. 

While on an Amtrak I began working out the following train of thought: a protocol showing f(11,5)\ge 11/25 MUST have 

(1) every muffin cut into two pieces,

(2) 3 students get 4 pieces, 

(3) 2 students get 5 pieces. 

While working on getting a protocol for f(11,5)\ge 11/25 with these properties I found that... there could be no such protocol! Then by reworking what I did I found that f(11,5)\le 13/30. So it was done! And we had a new technique, which we call The Half Method. To see the full proof see my slides here.

The story above is typical: we get f(m,k) for all 1\le k\le SOMETHING, we get stuck, and then we find ANOTHER technique to show upper bounds (which in this case are limits on how well we can do). This happened about 8 times, depending on how you count. After a while we realized that this could not just be an article; this was a book! World Scientific agreed to publish it, and it's out now.

Misc Notes

1) I got a conference paper out of it, in the Fun with Algorithms conference, with some of the co-authors on the book, and some other people. Here is the conference paper.

2) Early on we realized that f(m,s) = (m/s)f(s,m) so we only had to look at the m>s case.

3) The fact that f(m,s) exists and is rational is not obvious, but is true. In fact, f(m,s) can be found by a mixed-integer program. 

4) Late in the process I found that there was a by-invite-only math newsgroup that had discussed the problem, and in fact was where Alan Frank first posted it. I obtained their materials and found that they had already shown f(m,s)=(m/s)f(s,m) and also that the answer always exists and is rational. Aside from that our results did not overlap.

5) Even later in the process Scott Huddleston emailed me (out of the blue) that he had a program that solved the muffin problem quickly. I was skeptical at first, but he did indeed have a whole new way to look at the problem and his code was very fast (I had Jacob Prinz, one of the co-authors on the book, recode it). Later Richard Chatwin (see here) seems to have proven that Scott's method always works. The approach of Scott and Richard is where to go if you want to do serious further research on Muffins. My book is where you want to go if you want to learn some easy and fun math (a HS student could read it). 

6) I co-authored a column with Scott H, Erik Metz, Jacob Prinz on Muffins, featuring his technique, in Lane's complexity column, here.

7) I had an REU student, Stephanie Warman, write a muffin package based on the book.

8) I gave an invited talk on The Muffin Problem at a Joint AMS-MAA meeting. 

9) I gave a talk at Gathering for Gardner 2018 on The Muffin Problem. 

10) I often give talks on it to groups of High School students.

11) When I teach Discrete Math Honors I talk about it and assign problems on it; it really is part of the course. As such it's a good way to reinforce the pigeonhole principle. 

12) I contacted Alan Frank about my work. We arranged to meet at an MIT combinatorics seminar where I was to give a talk on muffins. He brought 11 muffins, with 1 cut (1/2,1/2), 2 cut (14/30,16/30), and 8 cut (13/30,17/30), so that the 5 of us could each get 11/5 with smallest piece 13/30. 

13) Coda: 

Why did I keep working on this problem? I kept working on it because I kept hitting barriers and (with co-authors) breaking them with new techniques that were interesting. If early on a barrier was not breakable then I would have stopped. If (say) Floor-Ceiling solved everything then I might have gotten a paper out of this, but surely not a book.

Lesson for all of us: look around you! It's not clear what is going to inspire a project!

Lasting effect: I am reluctant to throw out old math magazines and pamphlets since you never know when one will lead to a book.

by gasarch ( at October 11, 2021 01:18 AM UTC

In my last post, I wrote (among other things) about an ongoing scientific debate between the group of Chaoyang Lu at USTC in China, which over the past year has been doing experiments that seek to demonstrate quantum supremacy via Gaussian BosonSampling; and the group of Sergio Boixo at Google, which had a recent paper on a polynomial-time classical algorithm to sample approximately from the same distributions.  I reported the facts as I understood them at the time.  Since then, though, a long call with the Google team gave me a new and different understanding, and I feel duty-bound to share that here.

A week ago, I considered it obvious that if, using a classical spoofer, you could beat the USTC experiment on a metric like total variation distance from the ideal distribution, then you would’ve completely destroyed USTC’s claim of quantum supremacy.  The reason I believed that, in turn, is a proposition that I hadn’t given a name but needs one, so let me call it Hypothesis H:

The only way a classical algorithm to spoof BosonSampling can possibly do well in total variation distance, is by correctly reproducing the high-order correlations (correlations among the occupation numbers of large numbers of modes) — because that’s where the complexity of BosonSampling lies (if it lies anywhere).

Hypothesis H had important downstream consequences.  Google’s algorithm, by the Google team’s own admission, does not reproduce the high-order correlations.  Furthermore, because of limitations on both samples and classical computation time, Google’s paper calculates the total variation distance from the ideal distribution only on the marginal distribution on roughly 14 out of 144 modes.  On that marginal distribution, Google’s algorithm does do better than the experiment in total variation distance.  Google presents a claimed extrapolation to the full 144 modes, but eyeballing the graphs, it was far from clear to me what would happen: like, maybe the spoofing algorithm would continue to win, but maybe the experiment would turn around and win; who knows?

Chaoyang, meanwhile, made a clear prediction that the experiment would turn around and win, because of

  1. the experiment’s success in reproducing the high-order correlations,
  2. the admitted failure of Google’s algorithm in reproducing the high-order correlations, and
  3. the seeming impossibility of doing well on BosonSampling without reproducing the high-order correlations (Hypothesis H).

Given everything my experience told me about the central importance of high-order correlations for BosonSampling, I was inclined to agree with Chaoyang.

Now for the kicker: it seems that Hypothesis H is false.  A classical spoofer could beat a BosonSampling experiment on total variation distance from the ideal distribution, without even bothering to reproduce the high-order correlations correctly.

This is true because of a combination of two facts about the existing noisy BosonSampling experiments.  The first fact is that the contribution from the order-k correlations falls off like 1/exp(k).  The second fact is that, due to calibration errors and the like, the experiments already show significant deviations from the ideal distribution on the order-1 and order-2 correlations.

Put these facts together and what do you find?  Well, suppose your classical spoofing algorithm takes care to get the low-order contributions to the distribution exactly right.  Just for that reason alone, it could already win over a noisy BosonSampling experiment, as judged by benchmarks like total variation distance from the ideal distribution, or for that matter linear cross-entropy.  Yes, the experiment will beat the classical simulation on the higher-order correlations.  But because those higher-order correlations are exponentially attenuated anyway, they won’t be enough to make up the difference.  The experiment’s lack of perfection on the low-order correlations will swamp everything else.
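As a toy illustration of the shape of this argument (with entirely made-up numbers, not figures from either group's paper): suppose the order-k correlations contribute on the order of 2^-k to the variation distance. Then a spoofer that gets orders 1 and 2 exactly right, but misses everything above, can still beat an experiment carrying a 30% error at every order.

```python
# Toy model only: the 2**-k weights and the 30% error rate are
# invented for illustration, not taken from either paper.
orders = range(1, 21)
weight = {k: 2.0 ** -k for k in orders}                   # exponential attenuation

spoofer_err = sum(weight[k] for k in orders if k >= 3)    # perfect on orders 1-2,
                                                          # misses all k >= 3
experiment_err = sum(0.30 * weight[k] for k in orders)    # 30% error at every order

assert spoofer_err < experiment_err                       # the spoofer wins anyway
```

The point is purely arithmetic: since the order-k weights decay geometrically, the tail the spoofer gives up entirely is smaller than a modest constant-fraction error spread over all orders.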

Granted, I still don’t know for sure that this is what happens — that depends on whether I believe Sergio or Chaoyang about the extrapolation of the variation distance to the full 144 modes (my own eyeballs having failed to render a verdict!).  But I now see that it’s logically possible, maybe even plausible.

So, let’s imagine for the sake of argument that Google’s simulation wins on variation distance, even though the experiment wins on the high-order correlations.  In that case, what would be our verdict: would USTC have achieved quantum supremacy via BosonSampling, or not?

It’s clear what each side could say.

Google could say: by a metric that Scott Aaronson, the coinventor of BosonSampling, thought was perfectly adequate as late as last week — namely, total variation distance from the ideal distribution — we won.  We achieved lower variation distance than USTC’s experiment, and we did it using a fast classical algorithm.  End of discussion.  No moving the goalposts after the fact.

Google could even add: BosonSampling is a sampling task; it’s right there in the name!  The only purpose of any benchmark — whether Linear XEB or high-order correlation — is to give evidence about whether you are or aren’t sampling from a distribution close to the ideal one.  But that means that, if you accept that we are doing the latter better than the experiment, then there’s nothing more to argue about.

USTC could respond: even if Scott Aaronson is the coinventor of BosonSampling, he’s extremely far from an infallible oracle.  In the case at hand, his lack of appreciation for the sources of error in realistic experiments caused him to fixate inappropriately on variation distance as the success criterion.  If you want to see the quantum advantage in our system, you have to deliberately subtract off the low-order correlations and look at the high-order correlations.

USTC could add: from the very beginning, the whole point of quantum supremacy experiments was to demonstrate a clear speedup on some benchmark — we never particularly cared which one!  That horse is out of the barn as soon as we’re talking about quantum supremacy at all — something the Google group, which itself reported the first quantum supremacy experiment in Fall 2019, again for a completely artificial benchmark — knows as well as anyone else.  (The Google team even has experience with adjusting benchmarks: when, for example, Pan and Zhang pointed out that Linear XEB as originally specified is pretty easy to spoof for random 2D circuits, the most cogent rejoinder was: OK, fine then, add an extra check that the returned samples are sufficiently different from one another, which kills Pan and Zhang’s spoofing strategy.) In that case, then, why isn’t a benchmark tailored to the high-order correlations as good as variation distance or linear cross-entropy or any other benchmark?

Both positions are reasonable and have merit — though I confess to somewhat greater sympathy for the one that appeals to my doofosity rather than my supposed infallibility!

OK, but suppose, again for the sake of argument, that we accepted the second position, and we said that USTC gets to declare quantum supremacy as long as its experiment does better than any known classical simulation at reproducing the high-order correlations.  We’d still face the question: does the USTC experiment, in fact, do better on that metric?  It would be awkward if, having won the right to change the rules in its favor, USTC still lost even under the new rules.

Sergio tells me that USTC directly reported experimental data only for up to order-7 correlations, and at least individually, the order-7 correlations are easy to reproduce on a laptop (although sampling in a way that reproduces the order-7 correlations might still be hard—a point that Chaoyang confirms, and where further research would be great). OK, but USTC also reported that their experiment seems to reproduce up to order-19 correlations. And order-19 correlations, the Google team agrees, are hard to sample consistently with on a classical computer by any currently known algorithm.

So then, why don’t we have direct data for the order-19 correlations?  The trouble is simply that it would’ve taken USTC an astronomical amount of computation time.  So instead, they relied on a statistical extrapolation from the observed strength of the lower-order correlations — there we go again with the extrapolations!  Of course, if we’re going to let Google rest its case on an extrapolation, then maybe it’s only sporting to let USTC do the same.

You might wonder: why didn’t we have to worry about any of this stuff with the other path to quantum supremacy, the one via random circuit sampling with superconducting qubits?  The reason is that, with random circuit sampling, all the correlations except the highest-order ones are completely trivial — or, to say it another way, the reduced state of any small number of output qubits is exponentially close to the maximally mixed state.  This is a real difference between BosonSampling and random circuit sampling—and even 5-6 years ago, we knew that this represented an advantage for random circuit sampling, although I now have a deeper appreciation for just how great of an advantage it is.  For it means that, with random circuit sampling, it’s easier to place a “sword in the stone”: to say, for example, here is the Linear XEB score achieved by the trivial classical algorithm that outputs random bits, and lo, our experiment achieves a higher score, and lo, we challenge anyone to invent a fast classical spoofing method that achieves a similarly high score.

With BosonSampling, by contrast, we have various metrics with which to judge performance, but so far, for none of those metrics do we have a plausible hypothesis that says “here’s the best that any polynomial-time classical algorithm can possibly hope to do, and it’s completely plausible that even a noisy current or planned BosonSampling experiment can do better than that.”

In the end, then, I come back to the exact same three goals I would’ve recommended a week ago for the future of quantum supremacy experiments, but with all of them now even more acutely important than before:

  1. Experimentally, to increase the fidelity of the devices (with BosonSampling, for example, to observe a larger contribution from the high-order correlations) — a much more urgent goal, from the standpoint of evading classical spoofing algorithms, than further increasing the dimensionality of the Hilbert space.
  2. Theoretically, to design better ways to verify the results of sampling-based quantum supremacy experiments classically — ideally, even ways that could be applied via polynomial-time tests.
  3. For Gaussian BosonSampling in particular, to get a better understanding of the plausible limits of classical spoofing algorithms, and exactly how good a noisy device needs to be before it exceeds those limits.

Thanks so much to Sergio Boixo and Ben Villalonga for the conversation, and to Chaoyang Lu and Jelmer Renema for comments on this post. Needless to say, any remaining errors are my own.

by Scott at October 10, 2021 06:13 PM UTC

The CATCS is compiling a list of Women in Theoretical Computer Science (WinTCS), hoping to facilitate engagement and opportunities for Women in TCS in the future. 

We cordially invite women (broadly defined as anyone who self-identifies as a woman) as well as gender non-conforming TCS researchers to participate in the following simple Google Form survey to provide your information. 

Please submit your information before Oct. 31, 2021. After that, information can still be provided through the CATCS website, and will be added to the list monthly. 
**Please feel free to share this solicitation broadly within your networks.**

If you have any questions regarding this survey, please feel free to contact us at or  

– Kira Goldner and Yusu Wang (on behalf of the SIGACT CATCS)

by shuchic at October 08, 2021 09:56 PM UTC

Potbelly, a local sandwich chain, made me an offer I couldn't refuse: change my password and earn a free (and quite tasty) oatmeal chocolate chip cookie. A free cookie is a great motivator, so after checking that this wasn't some clever phishing attack, I changed my password and got my cookie. Not sure why Potbelly wanted me to change my password, but happy to take their cookie.

Potbelly likely didn't make this offer to everyone so what if you want a cookie?

  1. Use an app to get a cookie delivered.
  2. Visit a specialty cookie store.
  3. Go to your local supermarket and pick up a package of Chip's Ahoy.
  4. Buy some pre-made cookie dough and put it in the oven.
  5. Buy some cookie mix, add ingredients and bake.
  6. Find a cookie recipe, buy the ingredients and get cooking.
  7. Get fresh ingredients direct from a farm stand.
  8. Grow and gather your own ingredients, à la Pancakes, Pancakes!
In machine learning we seem to be heading into a similar set of choices:
  1. Not even realize you are using machine learning, such as recommendations on Netflix or Facebook.
  2. Using ML implicitly, like talking to Alexa
  3. Using pre-trained ML through an app, like Google Translate
  4. Using pre-trained ML through an API
  5. Using a model like GPT-3 with an appropriate prompt
  6. Use an easily trained model like Amazon Fraud Detector
  7. An integrated machine learning environment like Sagemaker
  8. Use pre-built ML tools like TensorFlow or PyTorch
  9. Code up your own ML algorithms in C++
  10. Build your own hardware and software
and I'm probably missing a few options.
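To make the roll-your-own end of the spectrum concrete, here is roughly what option 9 looks like in miniature (in Python rather than C++, on a toy problem; just an illustrative sketch, not a recommendation):

```python
# Roll-your-own machine learning in miniature: fit y = w*x + b to
# synthetic data by batch gradient descent on squared error.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]  # ground truth: y = 2x + 1

w, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    # Gradients of mean squared error with respect to w and b.
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    gb = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    w, b = w - lr * gw, b - lr * gb

# After training, (w, b) is close to the true (2, 1).
```

A dozen lines of code versus one API call: the same trade-off as baking from scratch versus buying the package.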

When you want cookies or learning, do you buy it prepackaged or do you roll your own? And when people offer it to you for free, how wary should you be?

by Lance Fortnow ( at October 08, 2021 03:10 PM UTC

Continuing from the previous post, we are going to prove the following result: let {G} be a {d}-regular Cayley graph of an Abelian group, {\phi(G)} be the normalized edge expansion of {G}, {ARV(G)} be the value of the ARV semidefinite programming relaxation of sparsest cut on {G} (we will define it below), and {\lambda_2(G)} be the second smallest normalized Laplacian eigenvalue of {G}. Then we have

\displaystyle   \lambda_2 (G) \leq O(d) \cdot (ARV (G))^2 \ \ \ \ \ (1)

which, together with the fact that {ARV(G) \leq 2 \phi(G)} and {\phi(G) \leq \sqrt{2 \lambda_2}}, implies the Buser inequality

\displaystyle   \lambda_2 (G) \leq O(d) \cdot \phi^2 (G) \ \ \ \ \ (2)

and the approximation bound

\displaystyle   \phi(G) \leq O(\sqrt d) \cdot ARV(G) \ \ \ \ \ (3)

The proof of (1), due to Shayan Oveis Gharan and myself, is very similar to the proof by Bauer et al. of (2).

1. Ideas

For a positive integer parameter {t}, call {G^t} the multigraph whose adjacency matrix is {A^t}, where {A} is the adjacency matrix of {G}. So {G^t} is a {d^t}-regular graph, and each edge in {G^t} corresponds to a length-{t} walk in {G}. Our proof boils down to showing

\displaystyle   ARV(G^t) \leq O(\sqrt{dt}) \cdot ARV(G) \ \ \ \ \ (4)

which gives (1) after we note that

\displaystyle  \lambda_2 (G^t) = 1 - (1 - \lambda_2(G))^t


\displaystyle  ARV(G^t) \geq \lambda_2 (G^t)

and we combine the above inequalities with {t := 1/\lambda_2 (G)}:

\displaystyle  \Omega(1) \leq 1 - (1 - \lambda_2(G))^t = \lambda_2 (G^t) \leq ARV(G^t) \leq O(\sqrt{dt}) \cdot ARV(G)
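The identity {\lambda_2(G^t) = 1 - (1-\lambda_2(G))^t} used above is easy to sanity-check numerically. Here is a quick check (an illustrative sketch) on the n-cycle, a 2-regular Abelian Cayley graph whose normalized adjacency eigenvalues are cos(2*pi*j/n); odd t is used so that taking t-th powers preserves the ordering of the eigenvalues.

```python
import math

# Eigenvalues of A/d for the n-cycle are cos(2*pi*j/n), j = 0..n-1;
# the eigenvalues of A^t / d^t are their t-th powers.
n, t = 12, 5
mu = [math.cos(2 * math.pi * j / n) for j in range(n)]

lam2 = 1 - sorted(mu, reverse=True)[1]                        # lambda_2(G)
lam2_t = 1 - sorted((m ** t for m in mu), reverse=True)[1]    # lambda_2(G^t)

assert abs(lam2_t - (1 - (1 - lam2) ** t)) < 1e-12
```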

The reader will see that our argument could also prove (roughly as done by Bauer et al.) that {\phi(G^t) \leq O(\sqrt d) \cdot \phi(G)}, simply by reasoning about distributions of cuts instead of reasoning about ARV solutions, which would give (2) more directly. By reasoning about ARV solutions, however, we are also able to establish (3), which we think is independently interesting.

It remains to prove (4). I will provide a completely self-contained proof, including the definition of Cayley graphs and of the ARV relaxation, but, first, here is a summary for the reader already familiar with this material: we take an optimal solution of ARV for {G}, and bound how well it does for {G^t}. We need to understand how much bigger is the fraction of edges cut by the solution in {G^t} compared to the fraction of edges cut in {G}; a random edge in {G^t} is obtained by randomly sampling {t} generators and adding them together, which is roughly like sampling {t/d} times each generator, each time with a random sign. Because of cancellations, we expect that the sum of these {t} random generators can be obtained by summing roughly {\sqrt{t/d}} copies of each of the {d} generators, or {\sqrt{td}} generators in total. So a random edge of {G^t} corresponds roughly to {\sqrt{td}} edges of {G}, and this is by how much, at most, the fraction of cut edges can grow.

2. Definitions

Now we present more details and definitions. Recall that if {\Gamma} is a group, for which we use additive notation, and {S} is a set or multiset of group elements, then the Cayley graph {Cay(\Gamma,S)} is the graph that has a vertex for every group element and an edge {(x,x+s)} for every group element {x\in \Gamma} and every element {s\in S}. We restrict ourselves to undirected graphs, so we will always assume that if {s} is an element of {S}, then {-s} is also an element of {S} (with the same multiplicity, if {S} is a multiset). Note that the resulting undirected graph is {|S|}-regular.

Several families of graphs can be seen to be Cayley graphs, including cycles, cliques, balanced complete bipartite graphs, hypercubes, toruses, and so on. All the above examples are actually Cayley graphs of Abelian groups. Several interesting families of graphs, for example several families of expanders, are Cayley graphs of non-Abelian groups, but the result of this post will apply only to Abelian groups.

To define the ARV relaxation, let us take it slowly and start from the definition of the sparsest cut problem. The edge expansion problem {\phi(G)} is closely related to the sparsest cut problem, which can be defined as

\displaystyle  \sigma(G) := \min_{x \in \{0,1\}^V} \ \ \frac{x^T L_G x}{x^T L_K x }

where {L_G = I - A/d} is the normalized Laplacian of {G} and {L_K = I - J/n} is the normalized Laplacian of the clique on {n} vertices. We wrote it this way to emphasize the similarity with the computation of the second smallest normalized Laplacian eigenvalue, which can be written as

\displaystyle  \lambda_2 (G) = \min_{x \in R^V : \ x \perp {\bf 1}} \ \ \frac {x^T L_G x}{x^Tx } = \min_{x\in R^V} \frac {x^T L_G x}{x^T L_K x}

where the second equality is perhaps not obvious but is easily proved (all the solutions {x\perp {\bf 1}} have the same cost function on both sides, and the last expression is shift invariant, so there is no loss in optimizing over all {{\mathbb R}^V} or just in the space {\perp {\bf 1}}). We see that the computation of {\lambda_2} is just a relaxation of the sparsest cut problem, and so we have

\displaystyle  \lambda_2(G) \leq \sigma(G)

We can write the sparsest cut problem in a less algebraic version as

\displaystyle  \sigma(G) = \min_{S\subseteq V} \ \ \frac{ cut(S)}{\frac dn \cdot |S| \cdot |V-S| }

and recall that

\displaystyle  \phi(G) = \min_{S\subseteq V: |S| \leq \frac {|V|} 2} \ \ \frac{ cut(S)}{d |S|}

The cost function for {\sigma(G)} does not change if we switch {S} with {V-S}, so {\sigma(G)} could be defined equivalently as an optimization problem over subsets {S\subseteq V: |S| \leq |V|/2}, and at this point {\sigma(G)} and {\phi(G)} are the same problem except for an extra factor of {|V-S|/n} in the denominator of {\sigma(G)}. Such a factor is always between {1/2} and {1}, so we have:

\displaystyle  \phi(G) \leq \sigma(G)\leq 2 \phi(G)

(Note that all these definitions have given us a particularly convoluted proof that {\lambda_2(G) \leq 2 \phi(G)}.)
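These relations are easy to verify by brute force on a small example; here is a quick enumeration over all cuts of the 6-cycle (an illustrative sketch, using the combinatorial definitions of {\phi} and {\sigma} above):

```python
from itertools import combinations

# Brute-force check of phi(G) <= sigma(G) <= 2*phi(G) on the 6-cycle (d = 2).
n, d = 6, 2
edges = [(i, (i + 1) % n) for i in range(n)]

def cut(S):
    """Number of edges with exactly one endpoint in S."""
    return sum((u in S) != (v in S) for u, v in edges)

# phi: minimize cut(S) / (d |S|) over subsets with |S| <= n/2.
phi = min(cut(S) / (d * len(S))
          for k in range(1, n // 2 + 1)
          for S in map(set, combinations(range(n), k)))

# sigma: minimize cut(S) / ((d/n) |S| |V - S|) over all nontrivial subsets.
sigma = min(cut(S) / ((d / n) * len(S) * (n - len(S)))
            for k in range(1, n)
            for S in map(set, combinations(range(n), k)))

assert phi <= sigma <= 2 * phi
```

On the 6-cycle both minima are achieved by a contiguous half: phi = 1/3 and sigma = 2/3, so the upper inequality is tight here.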

Yet another way to characterize {\lambda_2} is as

\displaystyle  \lambda_2 (G) = \min_{x\in R^V} \frac {x^T L_G x}{x^T L_K x} = \min_{X \succeq {\bf 0}} \frac{L_G \bullet X}{L_K\bullet X}

Where {A\bullet B = \sum_{i,j} A_{i,j} B_{i,j}} is the Frobenius inner product between matrices. This is also something that is not obvious but not difficult to prove; the main point is that if we write a PSD matrix {X} as {\sum_i x_ix_i^T}, then

\displaystyle  \frac{L_G \bullet X}{L_K\bullet X} = \frac{\sum_i x_i^T L_G x_i}{\sum_i x_i^T L_Kx_i} \geq \min_i \frac{ x_i^T L_G x_i}{x_i^T L_Kx_i}

and so there is no loss in passing from an optimization over all PSD matrices versus all rank-1 PSD matrices.

We can rewrite {\lambda_2 = \min_{X \succeq {\bf 0}} \frac{L_G \bullet X}{L_K\bullet X} } in terms of the Cholesky decomposition of {X} as

\displaystyle  \begin{array}{llr} \min & \frac {\sum_{(u,v)\in E} || x_u - x_v||^2 }{\frac dn \sum_{u < v} ||x_u - x_v ||^2 } \\ \mbox{s.t}\\ & x_v \in {\mathbb R}^m & \ \ \forall v\in V\\ & m\geq 1 \end{array}

where the correspondence between PSD matrices {X} and vectors {\{ x_v \}_{v\in V}} is that we have {X_{u,v} = \langle x_u,x_v \rangle} (that is, {\{ x_v \}_{v\in V}} is the Cholesky decomposition of {X} and {X} is the Gram matrix of the {\{ x_v \}_{v\in V}}). An integral solution of the sparsest cut problem corresponds to choosing rank {m=1} and solutions in which each 1-dimensional {x_v} is either 1 or 0, corresponding to whether {v} is in the set {S} or not. The ARV relaxation is

\displaystyle  \begin{array}{llr} \min & \frac {\sum_{(u,v)\in E} || x_u - x_v||^2 }{\frac dn \sum_{u < v} ||x_u - x_v ||^2 } \\ \mbox{s.t}\\ & ||x_u - x_v||^2 \leq ||x_u - x_z||^2 + ||x_z - x_v||^2 &\ \ \forall u,v,z \in V\\ & x_v \in {\mathbb R}^m & \forall v\in V\\ & m\geq 1 \end{array}

which is a relaxation of sparsest cut because the “triangle inequality” constraints that we introduced are satisfied by 1-dimensional 0/1 solutions. Thus we have

\displaystyle  \lambda_2 (G) \leq ARV(G) \leq \sigma(G) \leq 2 \phi(G)

3. The Argument

Let us take any solution for ARV, and let us symmetrize it so that the symmetrized solution {x} satisfies

\displaystyle  || x_u - x_{u + s} ||^2 = || x_v - x_{v+s} ||^2 \ \ \forall s\in S \ \forall u,v\in V

That is, make sure that the contribution of each edge to the numerator of the cost function depends only on the generator that defines the edge, and not on the pair of endpoints.

It is easier to see that this symmetrization is possible if we view our solution as a PSD matrix {X}. In this case, for every group element {g}, we can let {X_g} be the solution (with the same cost) obtained by permuting rows and columns according to the mapping {v \rightarrow g+ v}; then we can consider the solution {\frac 1n \sum_{g\in V} X_g}, which satisfies the required symmetry condition.

Because of this condition, the cost function applied to {\{ x_v \}_{v\in V}} in {G} is

\displaystyle  \frac { \frac n2 \cdot \sum_{s\in S} || x_s - x_0 ||^2}{ \frac dn \sum_{u < v} ||x_u - x_v ||^2 }

and the cost function applied to {\{ x_v \}_{v\in V}} in {G^t} is

\displaystyle  \frac { \frac n2 \cdot \sum_{s_1,\ldots,s_t\in S^t} || x_{s_1 + \cdots + s_t} - x_0 ||^2}{ \frac {d^t} n \sum_{u < v} ||x_u - x_v ||^2 }

meaning that our goal is now simply to prove

\displaystyle  \frac 1{d^t} \sum_{(s_1,\ldots,s_t)\in S^t} || x_{s_1 + \cdots + s_t} - x_0 ||^2 \leq O(\sqrt {dt}) \cdot \frac 1d \sum_{s\in S} || x_s - x_0 ||^2

or, if we take a probabilistic view

\displaystyle  E_{(s_1,\ldots,s_t) \sim S^t} || x_{s_1 + \cdots + s_t} - x_0 ||^2 \leq O(\sqrt {dt}) \cdot \mathop{\mathbb E}_{s\sim S} || x_s - x_0 ||^2 \

If we let {c_s} be the number of times that generator {s} appears in the sum {s_1 + \ldots + s_t}, counting cancellations (so that, if {s} appears 4 times and {-s} appears 6 times we let {c_s = 0} and {c_{-s} = 2}) we have

\displaystyle  s_1 + \ldots + s_ t = \sum_{s\in S} c_s \cdot s

where multiplying an integer by a generator means adding the generator to itself that many times. Using the triangle inequality and the symmetrization we have

\displaystyle  || x_{s_1 + \ldots + s_t} - x_0 ||^2 \leq \sum_{s\in S} c_s \cdot || x_s - x_0 ||^2

The next observation is that, for every {s},

\displaystyle  \mathop{\mathbb E} c_s \leq O(\sqrt {t/d})

where the expectation is over the choice of {s_1,\ldots,s_t}. This is because {c_s} can be seen as {\max \{ 0 , X_1 + \ldots + X_t \}}, where we define {X_i = +1} if {s_i = s}, {X_i = -1} if {s_i = -s} and {X_i = 0} otherwise. We have

\displaystyle  \mathop{\mathbb E} c_s = \mathop{\mathbb E} \max \{ 0 , X_1 + \ldots + X_t \} \leq \mathop{\mathbb E} | X_1 + \ldots + X_t | \leq \sqrt {\mathop{\mathbb E} (X_1 + \ldots + X_t)^2} = \sqrt{2t/d}
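This bound on {\mathop{\mathbb E} c_s} can be checked exactly for small parameters by convolving the step distribution (each step is +1 or -1 with probability 1/d each, and 0 otherwise) and taking the expected positive part of the resulting sum (an illustrative sketch):

```python
from fractions import Fraction as F
import math

# Exact check that E[max(0, X_1 + ... + X_t)] <= sqrt(2t/d), by dynamic
# programming over the distribution of the partial sums.
d, t = 4, 6
step = {1: F(1, d), -1: F(1, d), 0: F(1) - F(2, d)}

dist = {0: F(1)}                      # distribution of the running sum
for _ in range(t):
    new = {}
    for v, p in dist.items():
        for s, q in step.items():
            new[v + s] = new.get(v + s, F(0)) + p * q
    dist = new

e_pos = sum(p * v for v, p in dist.items() if v > 0)   # E[max(0, S_t)]
assert float(e_pos) <= math.sqrt(2 * t / d)
```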

Combining everything, we have

\displaystyle  E_{(s_1,\ldots,s_t) \sim S^t} || x_{s_1 + \cdots + s_t} - x_0 ||^2

\displaystyle  \leq \sum_{s\in S} \mathop{\mathbb E} c_s || x_s - x_0 ||^2

\displaystyle  \leq O\left( \sqrt {\frac td} \right) \cdot \sum_{s\in S} || x_s - x_0 ||^2

\displaystyle  = O(\sqrt{dt}) \mathop{\mathbb E}_{s\sim S} || x_s - x_0 ||^2

So every solution of ARV for {G} is also a solution for {G^t} with a cost that increases at most by a {O(\sqrt {dt})} factor, which proves (4) as promised.

4. A Couple of Interesting Questions

We showed that ARV has integrality gap at most {O(\sqrt d)} for every {d}-regular Abelian Cayley graph, but we did not demonstrate a rounding algorithm able to achieve an {O(\sqrt d)} approximation ratio.

If we follow the argument, starting from an ARV solution of cost {\epsilon}, choosing {t} of the order of {1/d\epsilon^2} we see that {G^t} has an ARV solution (the same as before) of cost at most, say, {1/2}, and so {\lambda_2(G^t) \leq 1/2}, implying that {\lambda_2(G) \leq O(1/t)}, and so Fiedler's algorithm, applied to an eigenvector of the Laplacian, finds a cut of sparsity at most {O(\sqrt{1/t}) = O(\sqrt d \cdot \epsilon )}.

We can also see that if {X} is the matrix corresponding to an ARV solution of value {\epsilon} for {G}, then one of the eigenvectors {x} of {X} must be a test vector of Rayleigh quotient at most {1/2} for the Laplacian of {G^t}, where {t} is of order {1/d\epsilon^2}. However it is not clear how to get, from such a vector, a test vector of Rayleigh quotient at most {O(\sqrt {1/t} )} for the Laplacian of {G}, though one such vector should be {A^k x} for a properly chosen {k} in the range {1,\ldots, t}.

If this actually works, then the following, or something like it, would be a rounding algorithm: given a PSD solution {X} of ARV of cost {\epsilon}, consider PSD matrices of the form {A^t X A^t}, which do not necessarily satisfy the triangle inequalities any more, for {t = 1,\ldots, 1/d\epsilon^2}, and try to round them using a random hyperplane. Would this make sense for other classes of graphs?

It is plausible that sparsest cut in Abelian Cayley graphs actually has a constant-factor approximation in polynomial or quasi-polynomial time, and maybe even that ARV achieves a constant-factor approximation.

by luca at October 08, 2021 02:58 PM UTC

If I could explain it to the average person, I wouldn’t have been worth the Nobel Prize— Richard Feynman


Giorgio Parisi has just been awarded the 2021 Physics Nobel Prize for his work on the disorder in systems of all kinds. At the same time Syukuro Manabe and Klaus Hasselmann won for their joint work on the physical modeling of the Earth’s climate and reliably predicting global warming.

Today Ken and I look at some aspects of this year’s prizes.

Small issue: I do believe in human-caused global warming, but shouldn’t the citation read: for reliably predicting the future temperature of the planet? Currently it reads a bit like for reliably predicting the rise in the global stock market, as if the outcome, not the model, were primary.

Oh well. Every year there are Nobel prizes awarded for things that have at least some computational aspect. This year’s prize is certainly for something related to computation. One aspect that we are hoping for is Aspect—Alain Aspect. Ken was among those tipping him, John Clauser, and Anton Zeilinger for this year’s prize—for their work demonstrating quantum entanglement and non-classical experimental outcomes. This work spans substantial areas of quantum computing.

General Comments

The Nobel committee’s technical review of the prizewinning trio’s accomplishments ends with an expansive comment:

Clearly this year’s Laureates have made groundbreaking contributions to our understanding of complex physical systems in their broadest sense, from the microscopic to the global. They show that without a proper accounting of disorder, noise and variability, determinism is just an illusion. Indeed, the work recognized here reflects in part the comment ascribed to Richard Feynman (Nobel Laureate 1965), that he “Believed in the primacy of doubt, not as a blemish on our ability to know, but as the essence of knowing.”

Recognizing the work of this troika reflects the importance of understanding that no single prediction of anything can be taken as inviolable truth, and that without soberly probing the origins of variability we cannot understand the behavior of any system. Therefore, only after having considered these origins do we understand that global warming is real and attributable to our own activities, that a vast array of the phenomena we observe in nature emerge from an underlying disorder, and that embracing the noise and uncertainty is an essential step on the road towards predictability.

Ken says about the global-warming part of the last sentence that all you need to do is ask a wine grower, of which there are many in the Niagara region. The latitudes at which certain grape strains thrive have been shifting recently and steadily north.

For some general comments by Parisi, see his interview. He says lots of neat stuff including:

Well, things that {\dots} Well, my mentor Nicola Cabibbo was usually saying that we should work on a problem only if working on the problem is fun. So, I mean, fun is not very clear what it means, but it’s something that we find deeply interesting, and that we strongly believe that it is {\dots} I mean you won’t find fun in unclear because one gets a new idea of something unexpected and so on. So I tried to work on something that was interesting and which I believed that had some capacity to add something.

His Trick

The Nobel citation says: “One of the many theoretical tools Professor Parisi has used to establish his theory is the so-called ‘replica trick’—a mathematical method which takes a disordered system, replicates it multiple times, and compares how different replicas of the system behave. You can do this, for instance, by compressing marbles in a box, which will form a different configuration each time you make the compression. Over many repetitions, Parisi knew, telling patterns might emerge.” They point to a paper from talks by Parisi in 2013 also involving Flaviano Morone, Francesco Caltagirone, and Elizabeth Harrison.

The trick has a Wikipedia page, which says that its crux “is that while the disorder averaging is done assuming {n} to be an integer, to recover the disorder-averaged logarithm one must send {n} continuously to zero. This apparent contradiction at the heart of the replica trick has never been formally resolved, however in all cases where the replica method can be compared with other exact solutions, the methods lead to the same results.”

The mathematical identity underlying the replica trick is

\displaystyle  \ln Z = \lim_{n\rightarrow 0}\frac{Z^n - 1}{n}.

The {Z} is the partition function of a physical system or some related thermodynamical measure. The power {Z^n} represents {n} independent copies of the system—if {n} is an integer, that is. But instead we treat {n} as a real number going to zero. The formal justification is best seen via a Taylor series expansion:

\displaystyle  \begin{array}{rcl}  \lim_{n \rightarrow 0} \frac{Z^n - 1}{n} &=& \lim_{n \rightarrow 0} \frac{-1 + e^{n \ln Z}}{n}\\ &=& \lim_{n \rightarrow 0} \frac{-1 + 1 + n \ln Z + \frac{1}{2!} (n \ln Z)^2 + \frac{1}{3!} (n \ln Z)^3 + \dots}{n}\\ &=& \ln Z + \lim_{n \rightarrow 0} \frac{\frac{1}{2!} (n \ln Z)^2 + \frac{1}{3!} (n \ln Z)^3 + \dots}{n}\\ &=& \ln Z. \end{array}
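Since the identity drives the whole derivation, it is easy to check numerically. A minimal sketch (the value of {Z} is an arbitrary positive stand-in for a partition function):

```python
import math

# Numerical check of the replica-trick identity  ln Z = lim_{n -> 0} (Z^n - 1)/n.
# Z is an arbitrary positive stand-in for a partition function.
Z = 7.3

for n in [1.0, 0.1, 0.01, 0.001]:
    print(n, (Z**n - 1) / n, math.log(Z))

# The error shrinks linearly in n, matching the Taylor expansion:
# (Z^n - 1)/n = ln Z + (n/2)(ln Z)^2 + ...
```

The convergence is linear in {n}, exactly as the expansion above predicts.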

But on the whole, this strikes me as a silly idea. Suppose that you have a nasty function {f(n)} where {n=1,2,3,\dots}. How do you try to understand the behavior of {f(n)}? Several ideas come to mind:

  1. Try to see how {f(n)} grows as {n \rightarrow \infty}?

  2. Try to get an approximate formula for {f(n)} as {n} grows?

  3. Try to understand {f(0)}?

Wait, this is nuts. How can the value {f(0)} help us understand the limit of

\displaystyle  f(1),f(2),\dots,f(1000000),\dots

Indeed. Another form is that

\displaystyle  \frac{\partial Z^n}{\partial n} = \frac{\partial e^{n\ln Z}}{\partial n} = (\ln Z)Z^n \to \ln Z \text { as } n \to 0.

The trick here is that instead of letting {n} go to infinity they set {n} to zero. This is crazy. But it is so crazy that it yields a ton of insight into the behavior for large {n}. I wish I could understand this better. Perhaps it could help us with our problems like {\mathsf{P = NP}}?

Indeed, one of Parisi’s main applications is to spin glass models, which are related to the Ising model and likewise have associated problems that are {\mathsf{NP}}-hard. This applies even to a spin-glass model for which Parisi showed that exact solutions are computable.

Manabe and Hasselmann

The prize for Manabe is well summed by the title to this Italian news article, which translates as, “The scientist who put the climate into computers.” The article’s subhead makes a point that deserves further reflection: “Without his models it would have been impossible to do experiments on the climate.” Stated positively and picking up “reliable” from the Nobel citation, the point is:

Having reliable climate models has made it possible to do experiments on the climate.

Now we cannot actually do experiments on the climate—short of setting off a nuclear winter or erecting a Dyson sphere, maybe. We can observe changes caused by El Niño events and shifts in the jet stream, for instance, and try to build the framework of a controlled experiment around them. But that’s all.

So what is meant is that the models are robust enough, and have proven themselves on predictions at smaller or broader scales, that we can confidently regard computational experiments with them as indicative of real-world outcomes. The article mentions being able to tell what would happen if we made mountains disappear or shuffled the continents around. But in line with what we said in the intro, the logic sounds circular or self-confirming unless its reasonableness is explained more. The Washington Post article on the prizes stops short of saying that Manabe’s modeling of the effect of atmospheric carbon dioxide is a truly confirmed prediction, but it does say that his 50 years of modeling choices have had an almost 1.000 batting average.

Hasselmann is hailed for supplying a different piece that promotes confidence and accounting of causality. This is to demonstrate consistent effects of short-term local weather events on longer-term global climate. To those like us not versed in the background, this might again sound circular: didn’t the global configuration cause the local event? One particular effect Hasselmann traced is of weather events on ocean currents. The chain from atmosphere to storm to ocean sub-surface is non-circular.

Perhaps all this can be put as pithily as John Wheeler’s non-circular explanation of general relativity: “Matter tells space-time how to curve, and curved space-time tells matter how to move.” But until then, we feel a need to hallow a more fundamental story of how computational modeling works and why it is effective.

Open Problems

One day we will see a more computationally-oriented Nobel Prize, but how soon? Until then, best to all working on theory. You are Laureates in our eyes.

[added equation with derivatives to “Trick” section]

by RJLipton+KWRegan at October 08, 2021 04:34 AM UTC

Place of work: Warsaw, Poland

The newly established IDEAS NCBR institute is looking to fill a Research Team Leader position dealing with formal modeling and proving the security of cryptographic protocols used in blockchain technology. The research will be carried out in cooperation with the cryptography and blockchain laboratory headed by Prof. Stefan Dziembowski at the University of Warsaw.


by shacharlovett at October 07, 2021 01:43 PM UTC

As life is tentatively returning to normal, I would like to once again post technical material here. Before returning to online optimization, I would like to start with something from 2015 that we never wrote up properly, that has to do with graph curvature and with Buser inequalities in graphs.

The starting point is the Cheeger inequalities on graphs.

If {G= (V,E)} is a {d}-regular graph, and {A} is its adjacency matrix, then we define the normalized Laplacian matrix of {G} as {L: = I - A/d}, and we call {\lambda_2} the second smallest (counting multiplicities) eigenvalue of {L}. It is important in the spectral theory of graphs that this eigenvalue has a variational characterization as the solution of the following optimization problem:

\displaystyle  \lambda_2 = \min_{x \in {\mathbb R}^V : \langle x, {\bf 1} \rangle=0} \ \frac { x^T L x}{||x||^2} = \min_{x \in {\mathbb R}^V : \langle x,{\bf 1} \rangle=0} \ \frac { \sum_{(u,v) \in E} (x_u - x_v)^2}{d\sum_v x_v^2 }

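To make the characterization concrete, here is a small self-contained check (plain Python, no libraries) on the {n}-cycle, where {\lambda_2 = 1 - \cos(2\pi/n)} is known in closed form:

```python
import math

# Sketch: verify the variational characterization of lambda_2 on the n-cycle
# (2-regular), where lambda_2 = 1 - cos(2*pi/n) in closed form.
n, d = 20, 2
edges = [(v, (v + 1) % n) for v in range(n)]  # each undirected edge once

def rayleigh(x):
    num = sum((x[u] - x[v]) ** 2 for (u, v) in edges)
    den = d * sum(xv ** 2 for xv in x)
    return num / den

# The exact second eigenvector of the cycle's normalized Laplacian:
x_star = [math.cos(2 * math.pi * v / n) for v in range(n)]
lam2 = 1 - math.cos(2 * math.pi / n)
print(rayleigh(x_star), lam2)  # these agree

# Any other mean-zero vector gives a Rayleigh quotient >= lambda_2:
y = [1.0] * (n // 2) + [-1.0] * (n // 2)
print(rayleigh(y) >= lam2)  # True
```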
The normalized edge expansion of {G} is defined as

\displaystyle  \phi(G) = \min_{S : |S| \leq |V|/2} \ \frac{ cut(S) }{d|S|}

where {cut(S)} denotes the number of edges with one endpoint in {S} and one endpoint outside {S}. We have talked about these quantities several times, so we will just jump to the fact that the following Cheeger inequalities hold:

\displaystyle  \frac {\lambda_2 }2 \leq \phi(G) \leq \sqrt {2 \lambda_2}

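Both sides of these inequalities can be computed by brute force on a small graph. A sketch on the {n}-cycle, whose {\lambda_2} is known exactly:

```python
import math
from itertools import combinations

# Brute-force check of the Cheeger inequalities
#   lambda_2 / 2  <=  phi(G)  <=  sqrt(2 * lambda_2)
# on the n-cycle, whose lambda_2 = 1 - cos(2*pi/n) is known exactly.
n, d = 10, 2
edges = [(v, (v + 1) % n) for v in range(n)]

def cut(S):
    return sum(1 for (u, v) in edges if (u in S) != (v in S))

phi = min(cut(set(S)) / (d * len(S))
          for k in range(1, n // 2 + 1)
          for S in combinations(range(n), k))
lam2 = 1 - math.cos(2 * math.pi / n)

print(phi, lam2)  # best cut is an arc of n/2 vertices: phi = 2/n = 0.2
print(lam2 / 2 <= phi <= math.sqrt(2 * lam2))  # True
```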
These inequalities are called Cheeger inequalities because the upper bound is the discrete analog of a result of Cheeger concerning Riemannian manifolds. There were previous posts about this here and here, and for our current purposes it will be enough to recall that, roughly speaking, a Riemannian manifold defines a space on which we can define real-valued and complex-valued functions, and such functions can be integrated and differentiated. Subsets {S} of a Riemannian manifold have, if they are measurable, a well defined volume, and their boundary {\partial S} a well defined lower-dimensional volume; it is possible to define a Cheeger constant of a manifold in a way that is syntactically analogous to the definition of edge expansion:

\displaystyle  \phi (M) = \min_{S : vol(S) \leq vol(M)/2} \ \ \ \frac{vol(\partial S)}{vol(S)}

It is also possible to define a Laplacian operator {L} on smooth functions {f: M \rightarrow {\mathbb C}}, and if {\lambda_2} is the second smallest eigenvalue of {L} we have the characterization

\displaystyle  \lambda_2 = \min_{f : M \rightarrow {\mathbb R} \ \ {\rm smooth}, \int_M f = 0} \ \ \frac{ \langle Lf , f\rangle}{\langle f,f\rangle} = \min_{f : M \rightarrow {\mathbb R} \ \ {\rm smooth}, \int_M f = 0} \ \ \frac{ \int_M ||\nabla f||^2}{\int_M f^2}

and Cheeger proved

\displaystyle  \phi(M) \leq 2 \sqrt{\lambda_2 }

which is syntactically almost the same inequality and, actually, has a syntactically very similar proof (the extra {\sqrt 2} factor comes from the fact that things are normalized slightly differently).

The “dictionary” between finite graphs and compact manifolds is that vertices correspond to points, degree corresponds to dimensionality, adjacent vertices {v} and {w} correspond to “infinitesimally close” points along one dimension, the set of edges in the cut {(S,V-S)} corresponds to the boundary {\partial S} of a set {S}, “volume” corresponds to number of edges (both when we think of the volume of the boundary and the volume of a set), vectors {x\in {\mathbb R}^V} correspond to smooth functions {f: M \rightarrow {\mathbb R}}, and the collection of values of {x} at the neighbors of a vertex {u}, that is {\{ x_v : (u,v) \in E \}}, corresponds to the gradient {\nabla f(u)} of a function at the point {u}.

Having made this long premise, the point of this post is that the inequality

\displaystyle  \phi(G) \geq \frac {\lambda_2}2 \ ,

which is the easy direction to show in the graph case, does not hold for manifolds.

The easy proof for graphs is that if {(S,V-S)} is the cut that realizes the minimum edge expansion, we can take {x = {\bf 1}_S}, or rather the projection of the above vector on the space orthogonal to {{\bf 1}}, and a two-line calculation gives the inequality.

If, however, {S} is a subset of points of the manifold that realizes the Cheeger constant, we cannot define {f(\cdot)} to be the indicator of {S}, because the function would not be smooth and its gradient would be undefined.

We could think of rescuing the proof by taking a smoothed version of the indicator of {S}, something that goes to 1 on one side and to 0 on the other side of the cut, as a function of the distance from the boundary. The “quadratic form” of such a function, however, would depend not just on the area of the boundary of {S}, but also on how quickly the volume of the subset of the manifold at distance {\delta} from the boundary of {S} grows as a function of {\delta}.

If the Ricci curvature of the manifold is negative, this volume grows very quickly, and the quadratic form of the smoothed function is very bad. The graph analog of this would be a graph made of two expanders joined by a bridge edge, and the problem is that the analogy between graphs and manifolds breaks down because “infinitesimal distance” in the manifold corresponds to any {o(\log n)} distance on the graph, and although the bridge edge is a sparse cut in the graph, it is crossed by a lot of paths of length {o(\log n)}, or at least this is the best intuition that I have been able to build about what breaks down here in the analogy between graphs and manifolds.

If the Ricci curvature of the manifold is non-negative and the dimension is bounded, there is however an inequality in manifolds that goes in the direction of the “easy graph Cheeger inequality,” and it has the form

\displaystyle  \lambda_2 \leq 10 \phi^2 (M)

This is the Buser inequality for manifolds of non-negative Ricci curvature, and note how strong it is: it says that {\phi(M)} and {\sqrt{\lambda_2}} approximate each other up to the constant factor {2\sqrt{10}}.

If the Ricci curvature {R} is negative, and {n} is the dimension of the manifold, then one has the inequality

\displaystyle  \lambda_2 \leq 2 \sqrt{|R| \cdot (n-1)} \cdot \phi(M) + 10 \phi^2 (M)

Given the importance that curvature has in the study of manifolds, there has been considerable interest in defining a notion of curvature for graphs. For example, curvature relates to how quickly balls around a point grow with the radius of the ball, a concept that is of great importance in graphs as well; curvature is a locally defined concept that has global consequences, and in graph property testing one is precisely interested in understanding global properties based on local ones; and a Buser-type inequality in graphs would be very interesting because it would provide a class of graphs for which Fiedler’s algorithm provides a constant-factor approximation for sparsest cut.

There have been multiple attempts at defining notions of curvature for graphs, the main ones being Ollivier curvature and Bakry-Emery curvature. Each captures some but not all of the useful properties of Ricci curvature in manifolds. My (admittedly, poorly informed) intuition is that curvature in manifolds is defined “locally” but, as we mentioned above, “local” in graphs can have multiple meanings depending on the distance scale, and it is difficult to come up with a clean and usable definition that talks about multiple distance scales.

Specifically to the point of the Buser inequality, Bauer et al. and Klartag et al. prove Buser inequalities for graphs with respect to the Bakry-Emery definition. Because Cayley graphs of Abelian groups happen to have curvature 0 according to this definition, their work gives the following result:

Theorem 1 (Buser inequality for Cayley graphs of Abelian groups) If {G=(V,E)} is a {d}-regular Cayley graph of an Abelian group, {\lambda_2} is the second smallest normalized Laplacian eigenvalue of {G} and {\phi(G)} is the normalized edge expansion of {G}, we have

\displaystyle  \lambda_2 \leq O(d) \cdot \phi^2 (G)

(Klartag et al. state the result as above; Bauer et al. state it only for certain groups, but I think that their proof applies to all Abelian groups)

In particular, the above statement implies that if we have a {d}-regular Abelian Cayley graph then {\sqrt {\lambda_2}} provides an approximation of the sparsest cut up to a multiplicative error {O(\sqrt d)} and that Fiedler’s algorithm is an {O(\sqrt d)} approximate algorithm for sparsest cut in Abelian Cayley graphs.
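A quick numerical illustration on the simplest Abelian Cayley graph, the {n}-cycle (the Cayley graph of {{\mathbb Z}_n} with generators {\pm 1}), using only the closed-form formulas for the cycle:

```python
import math

# Checking lambda_2 <= O(d) * phi(G)^2 on the n-cycle, i.e. Cay(Z_n, {+1,-1}).
# Closed forms for the cycle (n even):
#   lambda_2 = 1 - cos(2*pi/n)  ~  2*pi^2 / n^2
#   phi(G)   = 2 / n            (cut an arc of n/2 vertices: 2 edges, d|S| = n)
d = 2
ratios = []
for n in [10, 50, 200, 1000]:
    lam2 = 1 - math.cos(2 * math.pi / n)
    phi = 2 / n
    ratios.append(lam2 / (d * phi ** 2))
    print(n, ratios[-1])  # approaches pi^2/4 ~ 2.47

# So lambda_2 <= 3 * d * phi(G)^2 on cycles: the hidden constant is modest here.
```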

At some point in 2015 (or maybe 2014, during the Simons program on spectral graph theory?), I read the paper of Bauer et al. and wondered about a few questions. First of all, is there a way to prove the above theorem without using any notion of curvature?

Secondly, the theorem implies that Fiedler’s algorithm has a good approximation, but it does so in a very unusual way: we have {\lambda_2} that is a continuous relaxation of the sparsest cut problem, and we have Fiedler’s algorithm that, as analyzed by Cheeger’s inequality, provides a somewhat bad rounding, and finally the Buser inequality is telling us that the relaxation has very poor integrality gap on all Abelian Cayley graphs, and so even though the rounding seems bad it is actually good compared to the integral optimum. There should be an underlying reason for all this, meaning some other relaxation that has an actual integrality gap for sparsest cut that is at most {O(\sqrt d)}.

I mentioned these questions to Shayan Oveis Gharan, and we were able to prove that if {ARV(G)} is the value of the Arora-Rao-Vazirani (or Goemans-Linial) semidefinite programming relaxation of sparsest cut, then, for all {d}-regular Abelian Cayley graphs, we have

\displaystyle  \lambda_2 \leq O(d) \cdot (ARV(G))^2 \leq O(d) \cdot \phi^2 (G)

which proves the above theorem and, together with the Cheeger inequality {\lambda_2 \geq \phi^2(G)/2}, also shows

\displaystyle  \phi (G) \leq O(\sqrt d) \cdot ARV(G)

that is, the ARV relaxation has integrality gap at most {O(\sqrt d)} on Abelian Cayley graphs. I will discuss the proof (which is a sort of “sum-of-squares version” of the proof of Bauer et al.) in the next post.

by luca at October 07, 2021 10:38 AM UTC

We have 6 tenure/tenure-track positions, out of which one is focused on theory. We are looking for folks in both algorithms and complexity. (The Nov 30 deadline is a suggestion; we will be considering applications a bit after that as well.)


by shacharlovett at October 07, 2021 02:10 AM UTC

I’m looking for a postdoctoral researcher from Jan 22 to work on an EPSRC-funded project on password-hashing algorithms and idealized models of computation at Durham University. Applicants with backgrounds in Cryptography, Algorithms, and Complexity are very welcome. Durham is one of the top universities in the UK, and the CS dept. hosts one of the strongest theory groups across the ACiD and NESTiD groups.


by shacharlovett at October 06, 2021 04:26 PM UTC

A nice array of papers for this month, ranging over distribution testing, dense graph property testing, sublinear graph algorithms, and various mixtures of them. Let us now sample our spread! (Adding another semistreaming paper that achieved results similar to another posted paper. -Ed)

A Lower Bound on the Complexity of Testing Grained Distributions by Oded Goldreich and Dana Ron (ECCC). A discrete distribution is called \(m\)-grained if all probabilities are integer multiples of \(1/m\). This paper studies the complexity of testing this property of distributions. For simplicity, consider the property of being \(n/2\)-grained, where the support size is \(n\). The classic lower bound for testing uniformity shows that \(\Omega(\sqrt{n})\) samples are required to distinguish the uniform distribution from a distribution uniform on \(n/2\) elements. Thus, we get a lower bound of \(\Omega(\sqrt{n})\) for testing \(n/2\)-grainedness (if I am permitted to use that word). This paper proves a lower bound of \(\Omega(n^c)\), for all constant \(c < 1\). It is conjectured that the lower bound is actually \(\Omega(n/\log n)\), which would match the upper bound (for any label-invariant property).

Testing Distributions of Huge Objects by Oded Goldreich and Dana Ron (ECCC). This paper introduced a new model that marries distribution testing with property testing on strings. The “object” of interest is a distribution \(\mathcal{D}\) over strings of length \(n\). We wish to test if \(\mathcal{D}\) possesses some property. The tester can get a random string \(x\) from the distribution, and can query any desired index of \(x\). The distance between distributions is defined using the earthmover distance (where we use the Hamming distance between strings). This model is called the DoHO (Distributions of Huge Objects) model. There are many questions posed and connections drawn to classical property testing and distribution testing. What I find interesting is a compelling application: the distribution \(\mathcal{D}\) may represent noisy or perturbed versions of a single object. The DoHO model gives a natural generalization of standard property testing to noisy objects. This paper considers problems such as testing if \(\mathcal{D}\) is: a random perturbation of a string, or a random cyclic shift, or a random isomorphism of a graph.

Sublinear Time and Space Algorithms for Correlation Clustering via Sparse-Dense Decompositions by Sepehr Assadi and Chen Wang (arXiv). Correlation clustering is a classic problem where edges in a graph are labeled ‘+’ or ‘-‘, denoting whether these edges should be uncut or cut. The aim is to cluster the graph minimizing the total “disagreements” (cut ‘+’ edges or uncut ‘-‘ edges). This paper gives an \(O(1)\)-approximation algorithm that runs in \(O(n\log^2n)\) time; this is the first sublinear time approximation algorithm for this problem. Correlation clustering has seen results for the property testing/sublinear algorithms community, first by Bonchi, Garcia Soriano, and Kutzkov. But previous results were essentially on the dense graph model, giving \(O(\epsilon n^2)\) error assuming adjacency matrix input. This paper considers access to the adjacency list of ‘+’ edges. Interestingly (from a technical standpoint), the key tool is a new sparse-dense decomposition. Such decompositions emerged from the seminal work of Assadi-Chen-Khanna for sublinear \((\Delta+1)\)-colorings, and it is great to see applications beyond coloring.

Sublinear-Time Computation in the Presence of Online Erasures by Iden Kalemaj, Sofya Raskhodnikova, and Nithin Varma (arXiv). Can property testing be done when portions of the input are hidden? This question was first raised by Dixit-Raskhodnikova-Thakurta-Varma, who gave a model of erasure-resilient testing. There is an adversary who hides (erases) part of the input function; queries to those parts just yield a dummy symbol. This paper defines an online version of this model. There is an erasure parameter \(t\). On each query by the property tester, the adversary can erase \(t\) values of the function. Consider the classic property of linearity of functions \(f:\{0,1\}^d \rightarrow \{0,1\}\). The BLR tester queries triples \((x,y, x \oplus y)\). Observe how this tester is easily defeated by our adversary, by erasing the value \(f(x\oplus y)\). One of the main results of this paper is an \(O(1/\varepsilon)\)-query tester for linearity that works for any constant erasure parameter \(t\). Note that this matches the bound for the standard setting. There are a number of results for other classic properties, such as monotonicity (sortedness) and Lipschitz.
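For readers unfamiliar with the test being attacked, here is a minimal sketch of the classic BLR linearity test itself (no erasures); the online adversary above defeats exactly its third query. The functions and parameters are made up for the example:

```python
import random

# A minimal sketch of the BLR linearity test over GF(2)^dim (no erasures).
# The online adversary described above defeats it by erasing f(x ^ y)
# after seeing the queries x and y.
def blr_test(f, dim, trials=200, seed=0):
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.getrandbits(dim)
        y = rng.getrandbits(dim)
        if f(x) ^ f(y) != f(x ^ y):
            return False  # found a violated triple: f is not linear
    return True

def linear(x):      # parity of the bits selected by a fixed mask: a linear function
    return bin(x & 0b1011).count("1") % 2

def nonlinear(x):   # corrupting the value at 0 already breaks linearity
    return 1 if x == 0 else linear(x)

print(blr_test(linear, dim=4))     # True: linear functions always pass
print(blr_test(nonlinear, dim=4))  # False: any trial with x == y catches f(0)
```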

Sublinear Time Eigenvalue Approximation via Random Sampling by Rajarshi Bhattacharjee, Cameron Musco, and Archan Ray (arXiv). Consider the problem of estimating all the eigenvalues of a real, symmetric \(n \times n\) matrix \(M\) with bounded entries, in sublinear time. The main result shows that the eigenvalues of a uniform random \(O(\epsilon^{-4}\log n)\) principal submatrix can be used to approximate all eigenvalues of \(M\) up to additive error \(\epsilon n\). One can think of this as a sort of concentration inequality for eigenvalues. This result follows (and builds upon) work of Bakshi-Chepurko-Jayaram on property testing semidefiniteness. The key idea is that eigenvectors corresponding to large eigenvalues have small infinity norm: intuitively, since all entries in \(M\) are bounded, such an eigenvector must have its mass spread out among many coordinates. Hence, we can get information about it by randomly sampling a few coordinates. The paper also shows that the approach of taking principal submatrices requires \(\Omega(\epsilon^{-2})\) columns/rows.
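As an illustration of the main idea (not the paper's actual algorithm), here is a dependency-free sketch on a planted rank-one instance: the top eigenvalue of a random principal submatrix, rescaled by \(n/s\), recovers the top eigenvalue of the full matrix. Power iteration stands in for an eigensolver, and the matrix, sizes, and helper names are invented for the example:

```python
import random

# Sketch: the top eigenvalue of a bounded-entry symmetric matrix M, rescaled by
# n/s, is approximated by the top eigenvalue of a random s x s principal
# submatrix.  Power iteration stands in for a full eigensolver.
def top_eigenvalue(M, iters=30):
    n = len(M)
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = max(abs(x) for x in w) or 1.0
        v = [x / norm for x in w]
    # Rayleigh quotient at the converged vector
    Mv = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * Mv[i] for i in range(n)) / sum(x * x for x in v)

rng = random.Random(0)
n, s = 120, 40
# Planted example with entries in {-1, +1}: M = sign * sign^T has top eigenvalue n.
sign = [rng.choice([-1, 1]) for _ in range(n)]
M = [[sign[i] * sign[j] for j in range(n)] for i in range(n)]

rows = rng.sample(range(n), s)
sub = [[M[i][j] for j in rows] for i in rows]
print(top_eigenvalue(M), (n / s) * top_eigenvalue(sub))  # both are ~ n = 120
```

On this rank-one instance the submatrix estimate is exact; for generic bounded matrices the paper's \(\epsilon n\) additive error is the right guarantee.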

Deterministic Graph Coloring in the Streaming Model by Sepehr Assadi, Andrew Chen, and Glenn Sun (arXiv). This is technically not a sublinear algorithms paper (well ok, streaming is sublinear, but we tend not to cover the streaming literature. Maybe we should? – Ed.) But, I think that the connections are of interest to our community of readers. The main tool of sublinear \((\Delta+1)\)-coloring algorithm of Assadi-Chen-Khanna is the palette sparsification lemma (\(\Delta\) is the maximum degree). This lemma shows that vertices can randomly shorten their ”palette” of colors, after which all colorings from these palettes lead to very few monochromatic edges. This is an immensely powerful tool, since one can get immediately sublinear complexity algorithms in many models: adjacency list, streaming, distributed. Is the randomness necessary? Note that these algorithms run in \(\Omega(n)\) time/space, so it is conceivable that deterministic sublinear algorithms exists. This paper shows that randomization is necessary in the semi-streaming model (space \(O(n poly(\log n))\)). Indeed, there exist no deterministic semi-streaming algorithms that can achieve even \(\exp(\Delta^{o(1)})\) colorings.

Adversarially Robust Coloring for Graph Stream by Amit Chakrabarti, Prantar Ghosh, and Manuel Stoeckl (arXiv). This paper studies the same problem as the above, but presents the results in a different way. In a randomized algorithm, we normally think of an adversary that fixes a (hard) input, and the algorithm then makes its random decisions. An adaptive adversary is one that changes the input (stream) based on the decisions of an algorithm. In this definition, a robust algorithm is one that can give correct answers, even on adversarially generated input. A deterministic algorithm is automatically robust. This paper shows that there do not exist robust semi-streaming algorithms that can achieve \((\Delta+1)\)-colorings. The quantitative lower bound is weaker (\(\Omega(\Delta^2)\) colors), but it is against a stronger adversary.

by Seshadhri at October 06, 2021 04:31 AM UTC

The next TCS+ talk will take place next Wednesday, October 13th at 1:00 PM Eastern Time (10:00 AM Pacific Time, 19:00 Central European Time, 17:00 UTC: check your time zone!). Nutan Limaye from IT University of Copenhagen will speak about “Superpolynomial Lower Bounds Against Low-Depth Algebraic Circuits” (abstract below).

You can reserve a spot as an individual or a group to join us live by signing up on the online form. Registration is not required to attend the interactive talk, and the link will be posted on the website the day prior to the talk; however, by registering in the form, you will receive a reminder, along with the link. (The recorded talk will also be posted on our website afterwards) As usual, for more information about the TCS+ online seminar series and the upcoming talks, or to suggest a possible topic or speaker, please see the website.

Abstract: Every multivariate polynomial P(X) can be written as a sum of monomials, i.e. a sum of products of variables and field constants. In general, the size of such an expression is the number of monomials that have a non-zero coefficient in P.

What happens if we add another layer of complexity, and consider sums of products of sums (of variables and field constants) expressions? Now, it becomes unclear how to prove that a given polynomial P(X) does not have small expressions. In this result, we solve exactly this problem.

More precisely, we prove that certain explicit polynomials have no polynomial-sized “Sigma-Pi-Sigma” (sums of products of sums) representations. We can also show similar results for Sigma-Pi-Sigma-Pi, Sigma-Pi-Sigma-Pi-Sigma and so on for all “constant-depth” expressions.

The talk is based on a joint work of Nutan Limaye, Srikanth Srinivasan, and Sébastien Tavenas.

by plustcs at October 05, 2021 09:34 PM UTC


Today the Italian academic community, along with lots of other people, was delighted to hear that Giorgio Parisi is one of the three recipients of the 2021 Nobel Prize for Physics.

Parisi has been a giant in the area of understanding “complex” and “disordered” systems. Perhaps, his most influential contribution has been his “replica method” for the analysis of the Sherrington-Kirkpatrick model. His ideas have led to several breakthroughs in statistical physics by Parisi and his collaborators, and they have also found applications in computer science: to tight analyses on a number of questions about combinatorial optimization on random graphs, to results on random constraint satisfaction problems (including the famous connection with random k-SAT analyzed by Mezard, Parisi and Zecchina) and random error correcting codes, and to understanding the solution landscape in optimization problems arising from machine learning. Furthermore these ideas have also led to the development and analysis of algorithms.

The news was particularly well received at Bocconi, where most of the faculty of the future CS department has done work that involved the replica method. (Not to be left out, even I have recently used replica methods.)

Mezard and Montanari have written a book-length treatment on the interplay between ideas from statistical physics, algorithms, optimization, information theory and coding theory that arise from this tradition. Readers of in theory looking for a shorter exposition aimed at theoretical computer scientists will enjoy these notes posted by Boaz Barak, or this even shorter post by Boaz.

In this post, I will try to give a sense to the reader of what the replica method for the Sherrington-Kirkpatrick model looks like when applied to the average-case analysis of optimization problems, stripped of all the physics. Of course, without the physics, nothing makes any sense, and the interested reader should look at Boaz’s posts (and to references that he provides) for an introduction to the context. I did not have time to check too carefully what I wrote, so be aware that several details could be wrong.

What is the typical value of the max cut in a {G_{n,\frac 12}} random graph with {n} vertices?

Working out an upper bound using the union bound and Chernoff bounds, and a lower bound by thinking about a greedy algorithm, we can quickly convince ourselves that the answer is {\frac {n^2}{4} + \Theta(n^{1.5})}. Great, but what is the constant in front of the {n^{1.5}}? This question is answered by the Parisi formula, though this fact was not rigorously established by Parisi himself. (Guerra proved that the formula gives an upper bound, and Talagrand proved that it gives a tight bound.)

Some manipulations can reduce the question above to the following one: suppose that I pick a random {n\times n} symmetric matrix {M}, say with zero diagonal, whose entries (up to the symmetry requirement) are mutually independent, each equally likely to be {+1} or {-1}, or perhaps each distributed according to a standard normal distribution (the two versions can be proved to be equivalent). What is the typical value of

\displaystyle  \max _{x \in \{+1,-1\}^n } \ \ \frac 1{n^{1.5}} x^T M x

up to {o_n(1)} additive terms?
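To get a feel for the quantity in question, here is a brute-force experiment for a tiny {n} (an illustrative sketch of my own; the function name and parameters are made up, and such a small {n} only loosely reflects the asymptotics):

```python
import itertools
import random

def max_quadratic_form(M, n):
    # Brute-force maximum of x^T M x over x in {+1,-1}^n.
    best = float("-inf")
    for x in itertools.product((1, -1), repeat=n):
        val = sum(M[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        best = max(best, val)
    return best

random.seed(0)
n, trials = 10, 10
total = 0.0
for _ in range(trials):
    # Random symmetric +-1 matrix with zero diagonal.
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            M[i][j] = M[j][i] = random.choice((1, -1))
    total += max_quadratic_form(M, n) / n ** 1.5
# An order-one constant; its large-n limit is what the Parisi formula computes.
print(total / trials)
```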

As a first step, we could replace the maximum with a “soft-max,” and note that, for every choice of {\beta>0}, we have

\displaystyle  \max _{x \in \{+1,-1\}^n } \ \ x^T M x \leq \frac 1 \beta \log \sum_{x \in \{+1,-1\}^n } e^{\beta x^T Mx}

The above upper bound gets tighter and tighter for larger {\beta}, so if we were able to estimate

\displaystyle  \mathop{\mathbb E} \log \sum_{x \in \{+1,-1\}^n } e^{\beta x^T Mx}

for every {\beta} (where the expectation is over the randomness of {M}) then we would be in good shape.

We could definitely use convexity and write

\displaystyle  \mathop{\mathbb E} \max _{x \in \{+1,-1\}^n } \ \ x^T M x \leq \frac 1 \beta \mathop{\mathbb E} \log \sum_{x \in \{+1,-1\}^n } e^{\beta x^T Mx} \leq \frac 1 \beta \log \mathop{\mathbb E} \sum_{x \in \{+1,-1\}^n } e^{\beta x^T Mx}

and then use linearity of expectation and independence of the entries of {M} to get to

\displaystyle  \leq \frac 1 \beta \log \sum_{x \in \{+1,-1\}^n } \prod_{1\leq i < j\leq n} \mathop{\mathbb E} e^{2\beta M_{i,j} x_i x_j }

Now things simplify quite a bit, because for all {i<j} the expression {M_{i,j} x_i x_j}, in the Rademacher setting, is equally likely to be {+1} or {-1}, so that, for {\beta = o(1)}, we have

\displaystyle  \mathop{\mathbb E} e^{2\beta M_{i,j} x_i x_j } = \cosh (2\beta) \leq 1 + O(\beta^2) \leq e^{O(\beta^2)}

and hence
\displaystyle  \sum_{x \in \{+1,-1\}^n } \prod_{1\leq i < j\leq n} \mathop{\mathbb E} e^{2\beta M_{i,j} x_i x_j } \leq 2^n \cdot e^{O(\beta^2 n^2)}

so that

\displaystyle  \frac 1 \beta \log \sum_{x \in \{+1,-1\}^n } \prod_{1\leq i < j\leq n} \mathop{\mathbb E} e^{2\beta M_{i,j} x_i x_j } \leq \frac {O(n)}{\beta} + O(\beta n^2)

which, choosing {\beta = 1/\sqrt n}, gives an {O(n^{1.5})} upper bound, which is in the right ballpark. Note that this is exactly the same calculation as the one coming out of a Chernoff bound and union bound. Even if we optimize the choice of {\beta}, we unfortunately do not get the right constant in front of {n^{1.5}}.
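For completeness, the elementary bound behind the {\cosh} step above holds with no smallness assumption on {\beta}: comparing series term by term, since {(2m)! \geq 2^m m!}, we have

\displaystyle  \cosh t = \sum_{m \geq 0} \frac{t^{2m}}{(2m)!} \leq \sum_{m \geq 0} \frac{1}{m!} \left( \frac{t^2}{2} \right)^m = e^{t^2/2}

and in particular {\cosh(2\beta) \leq e^{2\beta^2}}.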

So, if we call

\displaystyle  F := \sum_{x \in \{+1,-1\}^n } e^{\beta x^T Mx}

we see that we lose too much if we do the step

\displaystyle  \mathop{\mathbb E} \log F \leq \log \mathop{\mathbb E} F

But what else can we do to get rid of the logarithm and to reduce to an expression in which we take expectations of products of independent quantities (if we are not able to exploit the assumption that {M} has mutually independent entries, we will not be able to make progress)?

The idea is that if {k>0} is a small enough quantity (something much smaller than {1/\log F}), then {F^k} is close to 1 and we have the approximation

\displaystyle  \log F^k \approx F^k-1

and we obviously have

\displaystyle  \log F^k = k \log F

so we can use the approximation

\displaystyle  \log F \approx \frac 1k (F^k - 1)
and thus, taking expectations,

\displaystyle  \mathop{\mathbb E} \log F \approx \frac 1k (\mathop{\mathbb E} F^k - 1)
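Numerically (a quick sanity check of my own, not from the original post), the approximation is good once {k \ll 1/\log F}:

```python
import math

F = 1234.5  # any fixed positive value
for k in (0.1, 0.01, 0.001):
    # (F^k - 1)/k -> log F as k -> 0
    print(k, (F ** k - 1) / k, math.log(F))
```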

Let’s forget for a moment that we want {k} to be a very small parameter. If {k} was an integer, we would have

\displaystyle  \mathop{\mathbb E} F^k = \mathop{\mathbb E} \left( \sum_{x \in \{+1,1\}^n } e^{\beta x^T Mx} \right)^k = \sum_{x^{(1)},\ldots x^{(k)} \in \{+1,-1\}^n} \mathop{\mathbb E} e^{\beta \cdot ( x^{(1) T} M x^{(1)} + \cdots + x^{(k)T} M x^{(k)}) }

\displaystyle  = \sum_{x^{(1)},\ldots x^{(k)} \in \{+1,-1\}^n} \ \ \prod_{i< j} \ \mathop{\mathbb E} e^{2\beta M_{i,j} \cdot ( x^{(1)}_i x^{(1)}_j + \cdots + x^{(k)}_i x^{(k)}_j )}

Note that the above expression involves choices of {k}-tuples of feasible solutions of our maximization problem. These are the “replicas” in “replica method.”
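For a toy instance, one can check numerically that the replica expansion above is an exact identity (an illustrative script of mine, with {n=3}, {k=2}, and a made-up {\beta}):

```python
import itertools
import math

n, k, beta = 3, 2, 0.3
xs = list(itertools.product((1, -1), repeat=n))
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]

def F(signs):
    # signs[(i,j)] = M_ij in {+1,-1}; x^T M x = 2 * sum_{i<j} M_ij x_i x_j
    return sum(
        math.exp(beta * sum(2 * signs[p] * x[p[0]] * x[p[1]] for p in pairs))
        for x in xs
    )

# Left side: E[F^k], averaging over all sign patterns of M.
lhs = sum(
    F(dict(zip(pairs, pattern))) ** k
    for pattern in itertools.product((1, -1), repeat=len(pairs))
) / 2 ** len(pairs)

# Right side: the replica sum over k-tuples of solutions,
# using E[e^{2 beta M_ij s}] = cosh(2 beta s) for M_ij = +-1.
rhs = 0.0
for reps in itertools.product(xs, repeat=k):
    term = 1.0
    for (i, j) in pairs:
        s = sum(x[i] * x[j] for x in reps)
        term *= math.cosh(2 * beta * s)
    rhs += term

print(lhs, rhs)  # the two agree up to floating-point error
```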

The above expression does not look too bad, and note how we were fully able to use the independence assumption and “simplify” the expression. Unfortunately, it is actually still very bad. At this point it is preferable to assume the {M_{i,j}} to be Gaussian, write the expectation as an integral, and, after a change of variable and some tricks, reduce to computing the maximum of a certain function, let’s call it {G(z)}, where the input {z} is a {k \times k} matrix; we then have to guess an input of maximum value for this function. If we are lucky, the maximum is achieved by a {z} in which all entries are identical, the replica symmetric solution. In the Sherrington-Kirkpatrick model we don’t have such luck, and the next guess is that the optimal {z} is a block-diagonal matrix, a replica symmetry-breaking solution. For large {k}, and a large number of blocks, we can approximate the choice of such matrices by writing down a system of differential equations, the Parisi equations; we are going to assume that these equations do indeed describe an optimal {z}, and so a solution to the integral, and so they give us a computation of {(\mathop{\mathbb E} F^k - 1)/k}.

After all this, we get an expression for {(\mathop{\mathbb E} F^k - 1)/k} for every sufficiently large integer {k}, as a function of {k} up to lower order errors. What next? Remember how we wanted {k} to be a tiny real number and not a sufficiently large integer? Well, we take the expression, we forget about the error terms, and we set {k=0\ldots}

by luca at October 05, 2021 08:53 PM UTC

1. Huge congratulations to the winners of this year’s Nobel Prize in Physics: Syukuro Manabe and Klaus Hasselmann for climate modelling, and separately, Giorgio Parisi for statistical physics. While I don’t know the others, I had the great honor to get to know Parisi three years ago, when he was chair of the committee that awarded me the Tomassoni-Chisesi Prize in Physics, and when I visited Parisi’s department at Sapienza University of Rome to give the prize lecture and collect the award. I remember Parisi’s kindness, a lot of good food, and a lot of discussion of the interplay between theoretical computer science and physics. Note that, while much of Parisi’s work is beyond my competence to comment on, in computer science he’s very well-known for applying statistical physics methods to the analysis of survey propagation—an algorithm that revolutionized the study of random 3SAT when it was introduced two decades ago.

2. Two weeks ago, a group at Google put out a paper with a new efficient classical algorithm to simulate the recent Gaussian BosonSampling experiments from USTC in China. They argued that this algorithm called into question USTC’s claim of BosonSampling-based quantum supremacy. Since then, I’ve been in contact with Sergio Boixo from Google, Chaoyang Lu from USTC, and Jelmer Renema, a Dutch BosonSampling expert and friend of the blog, to try to get to the bottom of this. Very briefly, the situation seems to be that Google’s new algorithm outperforms the USTC experiment on one particular metric: namely, total variation distance from the ideal marginal distribution, if (crucially) you look at only a subset of the optical modes, say 14 modes out of 144 total. Meanwhile, though, if you look at the kth-order correlations for large values of k, then the USTC experiment continues to win. With the experiment, the correlations fall off exponentially with k but still have a meaningful, detectable signal even for (say) k=19, whereas with Google’s spoofing algorithm, you choose the k that you want to spoof (say, 2 or 3), and then the correlations become nonsense for larger k.

Now, given that you were only ever supposed to see a quantum advantage from BosonSampling if you looked at the kth-order correlations for large values of k, and given that we already knew, from the work of Leonid Gurvits, that very small marginals in BosonSampling experiments would be easy to reproduce on a classical computer, my inclination is to say that USTC’s claim of BosonSampling-based quantum supremacy still stands. On the other hand, it’s true that, with BosonSampling especially, more so than with qubit-based random circuit sampling, we currently lack an adequate theoretical understanding of what the target should be. That is, which numerical metric should an experiment aim to maximize, and how well does it have to score on that metric before it’s plausibly outperforming any fast classical algorithm? One thing I feel confident about is that, whichever metric is chosen—Linear Cross-Entropy or whatever else—it needs to capture the kth-order correlations for large values of k. No metric that’s insensitive to those correlations is good enough.

3. Like many others, I was outraged and depressed that MIT uninvited Dorian Abbot (see also here), a geophysicist at the University of Chicago, who was slated to give the Carlson Lecture in the Department of Earth, Atmospheric, and Planetary Sciences about the atmospheres of extrasolar planets. The reason for the cancellation was that, totally unrelatedly to his scheduled lecture, Abbot had argued in Newsweek and elsewhere that Diversity, Equity, and Inclusion initiatives should aim for equality of opportunity rather than equality of outcomes, a Twitter-mob decided to go after him in retaliation, and they succeeded. It should go without saying that it’s perfectly reasonable to disagree with Abbot’s stance, to counterargue—if those very concepts haven’t gone the way of floppy disks. It should also go without saying that the MIT EAPS department chair is free to bow to social-media pressure, as he did, rather than standing on principle … just like I’m free to criticize him for it. To my mind, though, cancelling a scientific talk because of the speaker’s centrist (!) political views completely, 100% validates the right’s narrative about academia, that it’s become a fanatically intolerant echo chamber. To my fellow progressive academics, I beseech thee in the bowels of Bertrand Russell: why would you commit such an unforced error?

Yes, one can imagine views (e.g., open Nazism) so hateful that they might justify the cancellation of unrelated scientific lectures by people who hold those views, as many physicists after WWII refused to speak to Werner Heisenberg. But it seems obvious to me—as it would’ve been obvious to everyone else not long ago—that no matter where a reasonable person draws the line, Abbot’s views as he expressed them in Newsweek don’t come within a hundred miles of it. To be more explicit still: if Abbot’s views justify deplatforming him as a planetary scientist, then all my quantum computing and theoretical computer science lectures deserve to be cancelled too, for the many attempts I’ve made on this blog over the past 16 years to share my honest thoughts and life experiences, to write like a vulnerable human being rather than like a university press office. While I’m sure some sneerers gleefully embrace that implication, I ask everyone else to consider how deeply they believe in the idea of academic freedom at all—keeping in mind that such a commitment only ever gets tested when there’s a chance someone might denounce you for it.

Update: Princeton’s James Madison Program has volunteered to host Abbot’s Zoom talk in place of MIT. The talk is entitled “Climate and the Potential for Life on Other Planets.” Like probably hundreds of others who heard about this only because of the attempted cancellation, I plan to attend!

Unrelated Bonus Update: Here’s a neat YouTube video put together by the ACM about me as well as David Silver of AlphaGo and AlphaZero, on the occasion of our ACM Prizes in Computing.

by Scott at October 05, 2021 07:23 PM UTC

On behalf of the FOCS 2021 PC, I am delighted to announce the Best Paper Awards.

Best paper:

Nutan Limaye, Srikanth Srinivasan and Sébastien Tavenas. Superpolynomial Lower Bounds Against Low-Depth Algebraic Circuits

This paper makes a fundamental advance by proving super-polynomial lower bounds against algebraic circuits of arbitrary constant depth.


Machtey Award for Best Student Paper:

Xiao Mao. Breaking the Cubic Barrier for (Unweighted) Tree Edit Distance

This paper shows how, unlike the weighted case, the unweighted tree edit distance problem has a sub-cubic time algorithm.


Congratulations to the winners and see you all in Denver from Feb 7-10, 2022!

by nisheethvishnoi at October 04, 2021 06:58 PM UTC

The classic FLM lower bound says that in Synchrony, Byzantine Agreement is impossible when $n \leq 3f$. We discussed this important bound in a previous post. In this post we strengthen the FLM lower bound in two important ways: Maybe randomization allows circumventing the FLM lower bound? No! Even allowing...

at October 04, 2021 02:46 PM UTC

 (Disclosure - Harry Lewis was my PhD advisor.)

It seems like just a few weeks ago I blogged about a book of Harry Lewis's that had recently become available (see here).  And now I am blogging about another one. Writing two books in two years seems hard! I can only think of one other computer scientist who has done that recently (see here and here).

In 2008 Abelson, Ledeen, and Lewis wrote 

Blown to Bits: Your Life, Liberty, and Happiness after the Digital Explosion

which I reviewed in SIGACT News; see here. 

Both computers and society have changed since 2008. Hence an update was needed. 

In 2021 Abelson, Ledeen, Lewis, and Seltzer wrote a second edition.

Should you buy the new version if you bought the old version? 

1) Not my problem- I got them both for free since I reviewed them. 

2) Not your problem- The second edition is available free-on-line here. Is that a link to some dark corner of the dark web? No, it's the formal webpage about the book. So the book is available free-on-line legally, if you care (and even if you don't care). 

3) If you like paper, the book is on amazon. (If you don't like paper, the book is still on amazon). 

I reviewed it in SIGACT News. A non-paywalled link: here (is that link legal? I have no idea.) 

In this post I'll just mention two things that changed since the last book.

1) Shared music and pirating were an issue back in 2008.  They do not seem to be one anymore, since there is now a variety of services that make pirating not worth it: iTunes, streaming services, and some bands who give their music away for free and ask you to pay what it's worth. Movies are still struggling with this issue. 

2) AI systems that reinforce existing bias is a new problem.

by gasarch ( at October 04, 2021 04:19 AM UTC

I just added to Wikipedia two articles on the Jordan–Pólya numbers and fibbinary numbers, two integer sequences used in my recent paper on Egyptian fractions. Jordan–Pólya numbers are the products of factorials, while the fibbinary numbers are the ones with binary representations having no two consecutive 1’s. The OEIS page on the fibbinary numbers, A003714, lists many ways of generating this sequence algorithmically, of which most are boring or slow (generate all binary numbers and test which ones belong to the sequence; you can test if a variable x is fibbinary by checking that x&(x>>1) is zero). I thought it might be interesting to highlight two of those methods that are a little more clever and generate these numbers in small numbers of operations.
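The bit test mentioned above is a one-liner; here is a quick illustration (my own, not from the OEIS page):

```python
def is_fibbinary(x):
    # A nonnegative integer is fibbinary iff its binary
    # representation has no two adjacent 1 bits.
    return x >= 0 and x & (x >> 1) == 0

print([x for x in range(20) if is_fibbinary(x)])
# [0, 1, 2, 4, 5, 8, 9, 10, 16, 17, 18]
```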

Some functional languages, and in part Python even though it’s mostly not functional, have a notion of a stream, a potentially infinite sequence of values generated by a coroutine. In Python, you can program these using simple generators and the yield keyword. I wrote here long ago about methods for using generators recursively: a generator can call itself, manipulate the resulting sequence of values, and pass them on to its output. It’s actually a very old idea, used for instance to generate regular numbers by Dijkstra in his 1976 book A Discipline of Programming. Reinhard Zumkeller used the same idea to generate the fibbinary numbers in Haskell, based on the observation that the sequence of positive fibbinary numbers can be generated, starting from the number 1, by two operations, doubling smaller values or replacing a smaller value \(x\) with \(4x+1\). Here it is, translated into Python:

from heapq import merge

def affine(stream, a, b):
    # Map each element x of the stream to a*x + b.
    for x in stream:
        yield x*a + b

def fibbinary():
    yield 1
    # Every later element is 2x or 4x+1 for a smaller fibbinary x;
    # merge the two (sorted) recursive substreams.
    for x in merge(affine(fibbinary(), 2, 0), affine(fibbinary(), 4, 1)):
        yield x

It’s elegant, but has a couple of minor flaws. First, it omits the number \(0\), and while it can be modified to include \(0\), the modifications make the code messier. But second, it takes more than a constant amount of time to generate each sequence element. A fibbinary number \(x\) has to be generated from a sequence of smaller elements by repeated doubling and quadrupling, and that takes \(\log x\) steps per element. Even if we assume those steps to take constant time each, generating the first \(n\) elements in this way takes time \(\Theta(n\log n)\). It’s better than the \(\Theta(n^{\log_\varphi 2})\approx n^{1.44}\) that you would get from generate-and-test, but still not as good as we might hope for. One way to fix this would be to memoize the generator, so that the recursive calls look at a stored copy of the sequence generated by the outer call rather than generating the same sequence redundantly, but this again makes the code messier and also takes more storage than necessary.
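One way to carry out that memoization, sketched here as a hypothetical variant (not from the original post), is the classic two-pointer technique Dijkstra used for regular numbers: keep the generated prefix in a list and track how far each of the two producing operations has advanced. Since \(2x\) is always even and \(4x+1\) always odd, the two substreams never collide.

```python
def fibbinary_memoized():
    seen = [1]      # prefix of the sequence generated so far
    yield 1
    i = j = 0       # next positions to double / to map through 4x+1
    while True:
        a, b = 2 * seen[i], 4 * seen[j] + 1
        if a < b:
            x, i = a, i + 1
        else:
            x, j = b, j + 1
        seen.append(x)
        yield x
```

This produces each element in a constant number of arithmetic operations, at the cost of storing the whole sequence.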

Instead, Jörg Arndt observed that you can generate each fibbinary number directly from the previous one by a process closely resembling binary addition. Adding one to a binary number sets the first available bit from zero to one, and zeros out all the smaller bits; here, a bit is available if it is already zero. Finding the next fibbinary number does the same thing, but with a different definition of availability: a bit is available if both it and the next larger bit are zero. We can find the available bit using binary addition on a modified word that fills in bits whose neighbor is nonzero. Using this idea, we can generate the fibbinary numbers in a constant number of bitwise binary word-level operations per number. Here it is again in Python, translated from Arndt’s C++ and simplified based on a comment by efroach76:

def fibbinary():
    x = 0
    while True:
        yield x
        # x - y equals x + (x >> 1) + 1: the addition carries past
        # every position blocked by a neighboring 1 bit, and the
        # mask y then zeroes out the bits below the newly set one.
        y = ~(x >> 1)
        x = (x - y) & y

It’s even possible to use the same idea to generate the fibbinary numbers in a constant amortized number of bit-level operations per number, although this ends up being a little less efficient in practice because high-level languages end up translating all these bit operations into word operations anyway.

def fibbinary():
    x = 0
    while True:
        yield x
        # Scan for the lowest position y whose bit and upper
        # neighbor are both clear, zeroing lower bits on the way.
        y = 1
        while x & (y | (y << 1)) != 0:
            x &= ~y
            y <<= 1
        x |= y

The inner loop ends immediately at fibbinary numbers whose successor is odd (at positions given by the ones of the Fibonacci word), whose fraction of the total is \(1-1/\varphi\approx 0.382\), where \(\varphi\) is the golden ratio. It ends in two steps for the remaining values when their next bit is odd, in the same proportion, and so on. So the average number of steps for the inner loop adds in a geometric series to \(O(1)\).
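As a quick check (my addition), here is the word-level generator again together with a short usage example reproducing the start of OEIS A003714:

```python
from itertools import islice

def fibbinary():
    x = 0
    while True:
        yield x
        # Advance to the next integer with no two adjacent 1 bits.
        y = ~(x >> 1)
        x = (x - y) & y

print(list(islice(fibbinary(), 13)))
# [0, 1, 2, 4, 5, 8, 9, 10, 16, 17, 18, 20, 21]
```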

(Discuss on Mastodon)

by David Eppstein at October 02, 2021 01:53 PM UTC

In this note, we show the mixing of three-term progressions $(x, xg, xg^2)$ in every finite quasirandom group, fully answering a question of Gowers. More precisely, we show that for any $D$-quasirandom group $G$ and any three sets $A_1, A_2, A_3 \subset G$, we have \[ \left|\Pr_{x,y\sim G}\left[ x \in A_1, xy \in A_2, xy^2 \in A_3\right] - \prod_{i=1}^3 \Pr_{x\sim G}\left[x \in A_i\right] \right| \leq \left(\frac{2}{\sqrt{D}}\right)^{\frac14}.\] Prior to this, Tao answered this question when the underlying quasirandom group is $\mathrm{SL}_{d}(\mathbb{F}_q)$. Subsequently, Peluse extended the result to all nonabelian finite simple groups. In this work, we show that a slight modification of Peluse's argument is sufficient to fully resolve Gowers' quasirandom conjecture for 3-term progressions. Surprisingly, unlike the proofs of Tao and Peluse, our proof is elementary and only uses basic facts from nonabelian Fourier analysis.

at October 01, 2021 04:38 PM UTC