# Theory of Computing Blog Aggregator

### The Ideal State Machine Model: Multiple Clients and Linearizability

We introduced state machines and state machine replication in an earlier post. In this post, we elaborate on the exact notion of safety and liveness that can be obtained by an ideal state machine when there are multiple clients. First, these definitions highlight the challenges of serving multiple clients. Second,...

from David Eppstein

• I was interested to see a familiar-looking graph drawing as one of the answers to the prompt “multiplicity” for the first entry in Mathober 2021 ($\mathbb{M}$). It’s a multigraph formed by a triangle with tripled edges, and looks a lot like the drawing I made for the Wikipedia Shannon multigraph article, prettied up by making an infinitely recessing sequence of these drawings rather than just one. Good choice for multiplicity.

• Non-concentration of the chromatic number of random graphs ($\mathbb{M}$). Uniformly random graphs, $G(n,1/2)$ in the Erdős–Rényi–Gilbert model, turn out to have chromatic numbers that, for infinitely many $n$, are spread over roughly $\sqrt{n}$ values. But there are weird fluctuations so that, conjecturally, for some $n$ the concentration is much tighter, more like $n^{1/4}$.

• University System of Georgia moves to gut tenure ($\mathbb{M}$). The proposed new policy includes procedures for removal of tenure under certain enumerated grounds, including failure to perform their jobs (this is pretty normal) but then adds a massive escape clause in which the board of regents can remove tenured faculty at any time as long as their reason for doing so is not one of the enumerated ones.

• The first physical models of the hyperbolic plane, made in 1868 by Beltrami ($\mathbb{M}$, via), blog post by Daina Taimiņa from 2010. Maybe you could make something like this by wrapping and stretching a disk of wet paper in a roll around a pseudosphere (https://en.wikipedia.org/wiki/Pseudosphere)? The rolled-up photo of Beltrami’s model suggests that he did that. The via link shows this as a tangent to a story about triangulated polygons, frieze patterns, and the Farey tessellation.

• Why do bees make rhombic dodecahedrons ($\mathbb{M}$)? Nice video from Matt Parker (Stand-up Maths) on why bees usually end the cells of their honeycombs with rhombic dodecahedron faces, why this isn’t the optimal solution to fitting two layers of cells together (in terms of minimum wax usage), and why it isn’t reasonable to expect bees to find exact optima for this problem. If I have to quibble with something, though, it’s his plural. It’s not wrong, but see Google ngrams.

• Mathematicians prove melting ice stays smooth ($\mathbb{M}$, see also). The headline is a little overstated: you’re probably familiar with thin necks of ice melting to sharp points at the instant they separate. But these singularities are instantaneous: mathematical models of ice stay smooth for all but a measure-zero set of times. Original paper: “The singular set in the Stefan problem”, Alessio Figalli, Xavier Ros-Oton, and Joaquim Serra.

• Discussion of the recent meme telling programmers and mathematicians that summation notation and for loops are the same thing. They’re not quite the same, though: summations don’t have an order of evaluation. But which is easier for people who don’t already know what they are to search and find out about? And why do programmers get angry at non-programming notational conventions?

• Mathematics, morally ($\mathbb{M}$, via), Eugenia Cheng, 2004. Somehow I hadn’t run across this before. It argues that much philosophy of mathematics is irrelevant to practice (“You can’t tell from somebody’s mathematics if they are a fictionalist, a rationalist, a platonist, a realist, an operationalist, a logicist, a formalist, structuralist, nominalist, intuitionist.”) and instead considerations of the right way of handling certain topics are more central.

• The SIGACT Committee for the Advancement of Theoretical Computer Science is collecting information on women in theoretical computer science ($\mathbb{M}$). If this is you, please see their announcement for details of how to be counted.

• Cynthia Rudin wins major award with silly name ($\mathbb{M}$), for her work on machine learning systems that learn to predict behavior using simple, interpretable, and transparent formulas.

• According to the SODA web site, SIAM has decided that their conferences will be hybrid through July ($\mathbb{M}$). So if (like me) you wanted to participate in SODA/SOSA/ALENEX/APOCS, but were worried about planning a trip to Virginia with another predicted winter wave of Covid, now you can stay home and conference safely. Or, if you feel hybrid conferences are problematic and organizers should do one or the other but not both, now you have another reason to be annoyed.

• Rarely is the question asked: Are math papers getting longer? ($\mathbb{M}$). Following earlier work by Nick Trefethen, Edward Dunne provides some data suggesting that (for certain journals, at least, and not the ones with page limits) the answer is yes. I’m not convinced by the suggested explanation that it’s because they are taking more effort to explain “connections with other work”, though: is that really a big enough fraction of most papers?

• I haven’t been using my office desktop Mac much because I haven’t been into my office much, so it took me a while to pay attention to the fact that much of its networking had recently broken. Here’s why ($\mathbb{M}$). It was still running OS X El Capitan (10.11.6) and a crucial top-level certificate expired. The machine is too old (late 2009) for the latest OS X but it looks like I can and should upgrade to High Sierra, 10.13. So much for getting anything else accomplished today…

by David Eppstein at October 15, 2021 03:43 PM UTC

### A Young Person's Game?

When László Babai first announced his graph isomorphism in quasipolynomial time result, I wrote

We think of theory as a young person's game, most of the big breakthroughs coming from researchers early in their careers. Babai is 65, having just won the Knuth Prize for his lifetime work on interactive proofs, group algorithms and communication complexity. Babai uses his extensive knowledge of combinatorics and group theory to get his algorithm. No young researcher could have had the knowledge base or maturity to be able to put the pieces together the way that Babai did.

Babai's proof is an exceptional story, but it is just that: an exception. Most CS theorists have done their best work early in their careers. I got myself into a Twitter discussion on the topic. For me, I'm proud of the research I did through my forties, but I'll always be best known, research-wise, for my work on interactive proofs around 1990. It would be hard to run a scientific study to determine cause and effect, but here are some reasons, based on my own experiences, why we don't see research in theory dominated by the senior people.

The field changes - Computational complexity has moved from a computation-based discipline to one now dominated by combinatorics, algebra and analysis. I'm not complaining; a field should evolve over time, but it plays less to my strengths. It's hard to teach this old dog new tricks.
The fruit hung lower - Important problems with easier proofs were available then that aren't available now.
Responsibilities - You have fewer as a PhD student, postdoc or assistant professor.
Family - becomes more of a focus.
Taking on new jobs - Many academics, though not all, take on administrative roles at their university, or leave academia completely.
The young people have the new ideas - And older people get settled in their ways.
The thrill is gone or at least decays - Your first theorem, your first talk, your first conference paper give you a level of excitement that's hard to match.
Existentialism - The realization that while computing has no doubt changed the world, my research, for the most part, hasn't.
Cognitive Decline - Probably the most controversial, but I find it hard to focus on problems like I used to. Back in the day I prided myself on knowing all the proofs of my theorems; now I can't even remember the theorems.

Honestly, there is just nothing wrong with taking on new roles, writing books, surveys and blogs, focusing on teaching, mentorship and service, and leaving the great research to the next generation.

by Lance Fortnow (noreply@blogger.com) at October 15, 2021 01:32 PM UTC

### Multiple faculty positions in theory of computing at York University (apply by November 30, 2021)

from CCI: jobs

York University in Toronto, Canada is inviting applications for five tenured or tenure-track positions in the theory of computing or data science (both broadly interpreted). The review of applications will begin on November 15, and full applications are due by November 30.

Email: eecsjoin@yorku.ca

by shacharlovett at October 15, 2021 01:01 PM UTC

### Nakamoto's Longest-Chain Wins Protocol

In this post, we will cover Nakamoto’s Consensus protocol presented in the Bitcoin whitepaper. There are a lot of posts and videos that explain this seminal and simple protocol. Our goal in this post will be to intuitively piece together the need for different aspects of the protocol, such as...

### RQS Postdoctoral Fellowship at University of Maryland (apply by December 1, 2021)

from CCI: jobs

The NSF Quantum Leap Challenge Institute for Robust Quantum Simulation (https://rqs.umd.edu) seeks exceptional candidates for the RQS Postdoctoral Fellowship. Applications should be submitted through AcademicJobsOnline at https://academicjobsonline.org/ajo/jobs/19910.

Email: quics-coordinator@umiacs.umd.edu

by shacharlovett at October 15, 2021 12:45 AM UTC

### Online Bipartite Matching with Reusable Resources

Abstract: We study the classic online bipartite matching problem with a twist: offline nodes are reusable any number of times. Every offline node $i$ becomes available $d$ steps after it was assigned to. Nothing better than a $0.5$-approximation, obtained by the trivial deterministic greedy algorithm, was known for this problem. We give the first approximation factor beating $0.5$, namely a $0.505$ approximation, by suitably adapting and interpreting the powerful technique of Online Correlated Selection.

### The Complexity of Bipartite Gaussian Boson Sampling

Authors: Daniel Grier, Daniel J. Brod, Juan Miguel Arrazola, Marcos Benicio de Andrade Alonso, Nicolás Quesada
Abstract: Gaussian boson sampling is a model of photonic quantum computing that has attracted attention as a platform for building quantum devices capable of performing tasks that are out of reach for classical devices. There is therefore significant interest, from the perspective of computational complexity theory, in solidifying the mathematical foundation for the hardness of simulating these devices. We show that, under the standard Anti-Concentration and Permanent-of-Gaussians conjectures, there is no efficient classical algorithm to sample from ideal Gaussian boson sampling distributions (even approximately) unless the polynomial hierarchy collapses. The hardness proof holds in the regime where the number of modes scales quadratically with the number of photons, a setting in which hardness was widely believed to hold but that nevertheless had no definitive proof.

Crucial to the proof is a new method for programming a Gaussian boson sampling device so that the output probabilities are proportional to the permanents of submatrices of an arbitrary matrix. This technique is a generalization of Scattershot BosonSampling that we call BipartiteGBS. We also make progress towards the goal of proving hardness in the regime where there are fewer than quadratically more modes than photons (i.e., the high-collision regime) by showing that the ability to approximate permanents of matrices with repeated rows/columns confers the ability to approximate permanents of matrices with no repetitions. The reduction suffices to prove that GBS is hard in the constant-collision regime.

### Provably accurate simulation of gauge theories and bosonic systems

Authors: Yu Tong, Victor V. Albert, Jarrod R. McClean, John Preskill, Yuan Su
Abstract: Quantum many-body systems involving bosonic modes or gauge fields have infinite-dimensional local Hilbert spaces which must be truncated to perform simulations of real-time dynamics on classical or quantum computers. To analyze the truncation error, we develop methods for bounding the rate of growth of local quantum numbers such as the occupation number of a mode at a lattice site, or the electric field at a lattice link. Our approach applies to various models of bosons interacting with spins or fermions, and also to both abelian and non-abelian gauge theories. We show that if states in these models are truncated by imposing an upper limit $\Lambda$ on each local quantum number, and if the initial state has low local quantum numbers, then an error at most $\epsilon$ can be achieved by choosing $\Lambda$ to scale polylogarithmically with $\epsilon^{-1}$, an exponential improvement over previous bounds based on energy conservation. For the Hubbard-Holstein model, we numerically compute a bound on $\Lambda$ that achieves accuracy $\epsilon$, obtaining significantly improved estimates in various parameter regimes. We also establish a criterion for truncating the Hamiltonian with a provable guarantee on the accuracy of time evolution. Building on that result, we formulate quantum algorithms for dynamical simulation of lattice gauge theories and of models with bosonic modes; the gate complexity depends almost linearly on spacetime volume in the former case, and almost quadratically on time in the latter case. We establish a lower bound showing that there are systems involving bosons for which this quadratic scaling with time cannot be improved. By applying our result on the truncation error in time evolution, we also prove that spectrally isolated energy eigenstates can be approximated with accuracy $\epsilon$ by truncating local quantum numbers at $\Lambda=\textrm{polylog}(\epsilon^{-1})$.

### Assistant Professor of Computer Science at California State University East Bay (apply by October 31, 2021)

from CCI: jobs

The Department of Computer Science at CSUEB invites applications for 2 tenure-track appointments as Assistant Professor in Computer Science (considering all areas of computer science, capable of teaching in emerging areas) starting Fall 2022.

Website: http://www.csueastbay.edu/oaa/jobs/csueb.html
Email: cs@csueastbay.edu

by shacharlovett at October 14, 2021 05:42 PM UTC

### Certified Patch Robustness Via Smoothed Vision Transformers (Part 1)

In a series of two blog posts, we dive into how to build practical certified defenses against adversarial patches. In Part I below, we give an overview of smoothing-based defenses, a popular class of certified defenses, and walk through a specific example of using de-randomized smoothing to defend against adversarial patch attacks. In Part II, we discuss our latest work that demonstrates how using vision transformers with de-randomized smoothing (smoothed ViTs) can significantly improve these defenses. Our approach not only provides better certified defenses against adversarial patches but also maintains standard accuracy and inference times at levels comparable to that of regular (non-robust) models. Thanks to these improvements, smoothed ViTs become a truly viable alternative to regular models.

Adversarial examples are small modifications of inputs that can induce machine learning models to misclassify those inputs. These attacks can be roughly categorized into two broad classes: small, synthetic perturbations that are visually indistinguishable from the original input, and physically realizable attacks that can break deployed machine learning systems in the wild. One popular attack in the latter category, known as adversarial patches, maliciously perturbs a bounded region in the image (typically a square).

An example of an adversarial patch attack. A square patch is added to the clean image (predicted class: Duck) in order to cause the classifier to misclassify it as something else (predicted class: Boat).

Adversarial patches can be used to undermine a variety of vision-based tasks—in particular, they have been used to deceive image classifiers, manipulate object detectors, and disrupt optical flow estimation. Such attacks also tend to be relatively easy to implement, as they just require, e.g., printing adversarial stickers, creating adversarial pieces of clothing, or manufacturing adversarial glasses.

While several heuristic defenses have been proposed to protect against adversarial patches, some of these defenses were shown to not be fully effective. This is unfortunately a fairly common theme in adversarial examples research.

Certified defenses attempt to alleviate such empirical evaluation problems by providing robustness guarantees that are provable. However, certified guarantees tend to be modest and come at a steep cost. For instance, recently proposed certified defenses against adversarial patches can guarantee only 14% robust accuracy on ImageNet. Furthermore, certified models are often substantially slower (by even 2-4 orders of magnitude) and significantly less accurate (in terms of standard accuracy) than non-robust models, which severely limits their practicality.

## Smoothing models for provable robustness

We begin our discussion of certified defenses with an overview of smoothing, a general strategy for creating models that are provably robust to adversarial attacks. Smoothing approaches revolve around the idea of deriving a (robust) prediction by aggregating many variations of the original input (e.g., ones that correspond to adding Gaussian noise, masking parts of the input, or augmenting it with generative models). The underlying intuition is that while it might be easy for an adversary to attack a single input, it is much harder for them to simultaneously attack all of the input variations. Thus, instead of directly classifying the original input, we classify each such variation and then output the most frequently predicted class—if the number of variations that agree with this class is high enough, the prediction is guaranteed to be robust.

Let’s dive into the details! To create a smoothed classifier, we change the typical forward pass of our classifier (often referred to as the base classifier) into a three-step procedure:

1. Construct variations of the input.
2. Classify all input variations individually.
3. Output the most frequently predicted class.
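To make this concrete, here is a minimal sketch of that three-step forward pass, assuming a `base_classifier` that maps one input variation to a predicted label and a `make_variations` function for whatever variation scheme is in use (both hypothetical names):

```python
from collections import Counter

def smoothed_predict(x, base_classifier, make_variations):
    """Three-step smoothed forward pass."""
    variations = make_variations(x)                          # step 1: construct variations
    predictions = [base_classifier(v) for v in variations]   # step 2: classify each one
    return Counter(predictions).most_common(1)[0][0]         # step 3: majority vote
```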

Now, once we have such a smoothed classifier, how can we certify its robustness? The key assumption for certification is that an adversary can only affect a limited number of input variations. If this is true, certification amounts to a simple counting problem: if the difference between the counts of the first- and second-most frequently predicted classes is larger than what an adversary could ever change via their manipulation, then the prediction is guaranteed to be robust! Importantly, for this approach to work well, it is key that these variations remain informative of the correct class while it remains hard for the adversary to perturb many of them simultaneously. This gives rise to a natural trade-off between robustness and accuracy.

By choosing different types of input variations, smoothing techniques can defend against different types of adversarial attacks. For example, smoothing classifiers with random additive noise can provide certified defenses against adversarial L2 and L1 attacks. On the other hand, smoothing over “pixel flows” can provide certified defenses against Wasserstein perturbations. One can even “smooth” the entire pipeline of both training and inference (treating it as a single module), to get robustness to data poisoning attacks. In general, if we can construct a set of input variations so that the adversary can only affect some fraction of them, we can use smoothing to defend against the corresponding attacks!

## Derandomized Smoothing: A Smoothing Defense against Patch Attacks

How can we use smoothing to defend against adversarial patches specifically? Recall that to be able to create a smoothing defense, we need to come up with a way to generate variations of an image that limit the effect of an adversarial patch while still being indicative of the correct class label. Here, we exploit a key property of adversarial patches: they are limited to a small, contiguous region of the input image.

Specifically, for each variation, we mask out all of the image except for a small region. In most of these variations, the adversarial patch will be fully masked, and thus cannot affect the model’s prediction by construction. This technique was originally introduced in a certified defense known as derandomized smoothing (Levine & Feizi, 2020). Typically, we leave a column of the image unmasked as shown in the figure below, a variation known as a column ablation (check out the original paper for some other ablation types).

Example of column ablations for an input image. Most of the image is masked out except for a column of the image. An adversarial patch can only affect a column ablation if the patch itself is located within the ablation, thus limiting the effect of an adversary.
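As a rough illustration, column ablations could be generated as follows for an image tensor in (C, H, W) layout; the extra mask channel marking the visible pixels and the wrap-around at the image edge are assumptions of this sketch, not details spelled out above:

```python
import torch

def column_ablations(image, width):
    """All stride-1 column ablations of an image of shape (C, H, W): each
    variation keeps a `width`-pixel-wide column (wrapping around the right
    edge) and masks out everything else."""
    C, H, W = image.shape
    variations = []
    for start in range(W):
        mask = torch.zeros(1, H, W)
        cols = [(start + k) % W for k in range(width)]
        mask[..., cols] = 1.0
        variations.append(torch.cat([image * mask, mask], dim=0))  # (C+1, H, W)
    return variations
```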

## Certified guarantee against patch attacks

Now that we have our smoothed classifier, how can we prove that the prediction is robust to patch attacks? Since a patch is of limited size, an adversary can only affect the few image variations that physically overlap with the patch. Consequently, if the margin of the winning prediction is large enough, then we can guarantee robustness to adversarial patches.

Classified column ablations: Duck, Duck, Duck, Duck, Duck, Duck, Boat, Finch.
A simplified demonstration of the derandomized smoothing defense for patch robustness. Here, an image of a duck is split into column ablations (our choice of input variation), each of which is classified individually. Since 6 out of 8 column ablations are classified as duck, the smoothed model predicts that this is a duck. A 32x32 patch is only large enough to affect at most 2 column ablations, which is not enough to change the prediction. Thus, this prediction is certifiably robust to 32x32 patches. In contrast, a 64x64 patch can affect up to 3 column ablations and can potentially flip the most frequent class to boat, and so the model is not certifiably robust to 64x64 patches.

To make this more precise, let $\Delta$ be the number of ablations that a patch could simultaneously affect (in the worst case). If the winning margin is at least $2\Delta$, then the smoothing defense guarantees that the predicted class is robust to this patch size. The fraction of examples which meet this certification threshold is typically referred to as certified accuracy.

For adversarial patch attacks of size $m \times m$ and column ablation width $b$, we have $\Delta = m + b - 1$.
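Putting the pieces together, certification is just a counting check on the per-ablation predictions; a minimal sketch using the margin condition and the formula for $\Delta$ above (function and argument names are illustrative):

```python
from collections import Counter

def certify(ablation_predictions, patch_size, ablation_width):
    """Return the smoothed prediction and whether it is certifiably robust to
    any patch_size x patch_size patch, given one label per column ablation."""
    counts = Counter(ablation_predictions).most_common()
    top_label, top_count = counts[0]
    runner_up = counts[1][1] if len(counts) > 1 else 0
    delta = patch_size + ablation_width - 1   # ablations a patch can touch
    return top_label, (top_count - runner_up) >= 2 * delta
```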

## Challenges for Certified Patch Defenses

Although certified patch defenses can provide guarantees against adversarial attacks, like many other certified defenses, they face several major challenges that limit their practicality:

1. They provide relatively modest guarantees, and can only guarantee robustness for a relatively small fraction of inputs and/or only a small patch size.
2. The standard accuracy is typically much lower than that of a standard, non-robust model, forcing practitioners into an unfavorable trade-off between robustness and accuracy.
3. Inference times tend to be several orders of magnitude larger than that of a standard, non-robust model, making certified defenses difficult to deploy in real-time settings.

In the next post, we discuss our latest work on making certified defenses much more practical. Specifically, we show how leveraging vision transformers (ViT) enables better handling of input variations and significant improvement of the margins for certification. With some straightforward modifications to the ViT architecture, we are also able to develop certifiably robust models that not only outperform previous certified defenses but also have practical inference times and standard accuracies.

### Certified Patch Robustness Via Smoothed Vision Transformers (Part 2)

In our previous post, we gave an overview of smoothing-based approaches to certified defenses, and described a certified patch defense known as de-randomized smoothing. In our latest work, we show how to leverage vision transformers (ViTs) to significantly improve such certified patch defenses along all possible axes: standard accuracy, certified accuracy, and speed. In fact, with some straightforward modifications, our certified ViTs not only enable significantly better certified robustness but they also maintain standard accuracy and inference times that are comparable to that of regular (non-robust) models. This progress marks a substantial step forward for certified defenses as a practical alternative to regular models.

An overview of the Smoothed ViT.

The main appeal of certified defenses is their ironclad guarantees: a certified result leaves no room for doubt and is not prone to the blind spots of empirical evaluation. However, practitioners care not only about certified performance: the accuracy and the inference speed of the model are equally important factors too. Unfortunately, nearly all certified defenses come at a severe price: (1) the standard accuracy plummets in comparison to a standard model and/or (2) inference time is orders of magnitude slower than that of a standard model.

A standard ResNet-50 can achieve about 76% accuracy on the ImageNet benchmark and takes less than 1 second to make a prediction on a typical GPU. In contrast, top-performing certified defenses such as that of Levine & Feizi (2020) report 44% standard accuracy and take 150 seconds to make a prediction on similar hardware.

This trade-off between robustness and performance can be untenable even for safety-critical applications. For example, object detection systems deployed in autonomous vehicles need to recognize obstacles and signs quickly in real time. Consequently, in addition to being robust, these systems also need to be fast and accurate.

To approach this tradeoff between robustness and accuracy, we take a closer look at certified defenses from an architectural perspective. Specifically, smoothing defenses use a backbone architecture that is typically implemented using classic convolutional networks, the default choice across much of computer vision. But are these architectures really the best tool for the job?

In our work, we show that when moving from standard models to certified defenses, it may be worth it to rethink the core architecture behind them. Specifically, we find that smoothing-based patch defenses can drastically improve when using vision transformers (ViTs) instead of convolutional networks as their backbone. With some additional adjustments to the transformer architecture, we not only achieve superior certified accuracy to prior work but also reach standard accuracies and inference times comparable to a standard, non-robust ResNet!

## Rethinking smoothing backbones

Recall from the previous post that smoothing defenses use a base classifier to make predictions on many variations of the original input. For patch defenses, this amounts to using the base classifier to classify column ablations (i.e., masked images that only reveal a small column of the original image). Consequently, the accuracy of the base classifier at classifying columns of an image directly translates into the performance of the certified defense.

In light of this observation, it is crucial to realize that convolutional networks typically process complete (i.e., not masked) images. So, this type of architecture might not actually be best suited for classifying the highly masked images used in smoothing defenses.

Why might we expect convolutions to be suboptimal for masked images? CNNs slide filters across the image and grow the receptive field across layers of the network. This large receptive field, while usually a strength of CNNs in standard settings, forces the network to process a large number of masked pixels when processing ablated images. Since these pixels carry no information, the network must learn to detect and ignore these masked regions (in addition to performing unnecessary computation).

## Vision transformers for ablated images

Unlike CNNs, vision transformers (ViTs) use attention modules instead of convolutions as their main building block. Crucially, attention modules can ignore masked regions instead of processing them.

To test out this intuition, we measured the accuracy of various convolutional and ViT architectures at classifying column ablations. It turns out that ViTs are significantly better at this task—for example, on ImageNet, a ViT-S has 12% higher accuracy at classifying column ablations than a similarly sized ResNet-50.

Ablation accuracy of ViTs vs. ResNets: A comparison of the accuracy of ViTs with similarly sized ResNets at classifying image ablations.

These improvements in ablation accuracy directly boost both the certified and standard accuracies of the smoothed model. For example, the increase in ablation accuracy improves the certified accuracy of a smoothed ViT-S by 13% (as compared to a similarly sized ResNet-50) and the standard accuracy by 12%.

With some additional adjustments to the certification procedure described in our paper, our largest certifiably robust vision transformer can reach 73% standard accuracy—only 3% off from the standard accuracy of a regular, non-robust ResNet-50, and almost 30% higher than previous versions of de-randomized smoothing!

Certified accuracy of ViTs vs. ResNets: A comparison of the certified accuracy of smoothed ViTs/ResNets when using different ablation sizes.
Standard accuracy of ViTs vs. ResNets: A comparison of the standard accuracy of smoothed ViTs/ResNets when using different ablation sizes.

## Modifying ViTs to Speed up Inference Time

Also, recall that a smoothed classifier tends to be significantly slower than a standard model because it needs to classify a large number of input variations for each input image. Indeed, traditional randomized smoothing approaches have used upwards of 100,000 variations, resulting in a forward pass that is 4-5 orders of magnitude slower than a standard model. These costs are infeasible in scenarios where decisions need to be made in a fraction of a second, such as autonomous driving. As Levine & Feizi have shown, for patch defenses, a move towards deterministic smoothing (as opposed to randomized) can result in a great speedup (by reducing the number of variations that need to be classified). However, the resulting solution is still two orders of magnitude slower than standard models.

In order to speed up inference (much) further, we leverage the token-based nature of ViTs to gracefully handle image ablations. Specifically, note that ViTs process images as tokens, where each token represents a contiguous patch of the input image. Since the runtime of a ViT is proportional to the number of tokens, a natural approach to speed up inference is to simply drop the masked out tokens, as shown in the following figure:

Vision transformers split images into patches, which are processed by the transformer architecture. When processing column ablated images, we can simply pass only the tokens in the ablation to significantly reduce the computational complexity of smoothing.
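A rough sketch of such a forward pass, assuming a ViT object exposing `patch_embed`, `pos_embed`, `blocks`, and `head` in the style of common ViT implementations (hypothetical attribute names), with mean pooling in place of a class token for simplicity:

```python
import torch

def ablated_vit_forward(vit, ablated_image, kept_token_indices):
    """Forward pass on a column-ablated image that skips fully masked tokens.
    Positional embeddings are added before dropping, so each surviving token
    keeps its original position information."""
    tokens = vit.patch_embed(ablated_image) + vit.pos_embed   # (B, N, D)
    tokens = tokens[:, kept_token_indices, :]                 # drop masked tokens
    for block in vit.blocks:
        tokens = block(tokens)
    return vit.head(tokens.mean(dim=1))                       # pool and classify
```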

We modify the ViT architecture to drop fully masked tokens in such a manner. As expected, this significantly speeds up the inference time of the smoothed model! Indeed, with this optimization, our smoothed ViT-S is faster than a smoothed ResNet-50 by an order of magnitude. In our paper, we discuss how additional improvements to the certification procedure can make our ViT architectures another order of magnitude faster. In total, our fastest smoothed ViT is actually comparable in terms of inference speed to a standard, non-robust ResNet.

## Conclusion

Smoothed ViTs present a simple but effective way to achieve state-of-the-art certified accuracy. These models are also realistically deployable, as they maintain accuracy and inference speeds similar to those of non-robust models. The key here is the vision transformer’s inherent ability to quickly and accurately process image ablations. These advancements are thus a step towards certified defenses that can be practically deployed in real-world scenarios.

### Relevant neighbors

from David Eppstein

I have a new preprint, “Finding Relevant Points for Nearest-Neighbor Classification”, arXiv:2110.06163, to appear in January at the SIAM Symposium on Simplicity in Algorithms (SOSA22). It’s about points in Euclidean spaces of dimension three or more, but I thought it would make a good warm-up to discuss here the one-dimensional version of the same problem, solved (together with the 2d version) by Bremner, Demaine, Erickson, Iacono, Langerman, Morin, and Toussaint in their paper “Output-sensitive algorithms for computing nearest-neighbour decision boundaries”, Discrete Comput. Geom. 2005.

So in this problem, you have a collection of real-valued data points with known discrete classifications (say, a finite set of colors), and you want to guess the color of new points whose color is not already given. Nearest neighbor classification means simply that, to guess the color of $x$, you find the closest known point $y$ and guess that $x$ has the same color as $y$. One easy way to do this would be to store the known points in a sorted list and use binary search. There’s lots more to say about this (for instance its use in combination with random projections for high-dimensional approximate nearest neighbors) but for today I want to focus on the size of this sorted list. We can store a list that is potentially much smaller, but always produces the same results, by keeping only points that have a neighbor with a different classification.

These points with differently-classified neighbors are the “relevant points” of the preprint title. Another way of describing them is that a point is relevant if deleting it would change the classification of some other (unknown) points in space that might later be queried. Among the set of decision boundaries, the ones separating parts of space with different classifications are the ones defined by relevant points. So if we store a nearest-neighbor data structure with only relevant points, we will get the same answer as if we store all known points. But because we’re storing fewer points, it will take less memory and less time (especially if the reduced memory allows it to fit into cache).

If you have the points given to you in sorted order, then it’s easy enough to scan through them in that order, keeping track of which pairs have different classifications, and produce a filtered sequence of the relevant ones. Here it is in Python (with apologies for the low-contrast syntax coloring):
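A minimal sketch of that scan, assuming the input is a list of (value, color) pairs already sorted by value (the function name is an assumption of this sketch):

```python
def relevant_points(points):
    """Keep the points whose predecessor or successor in sorted order has a
    different color; in one dimension these are exactly the relevant points."""
    relevant = []
    for i, (value, color) in enumerate(points):
        prev_color = points[i - 1][1] if i > 0 else color
        next_color = points[i + 1][1] if i + 1 < len(points) else color
        if prev_color != color or next_color != color:
            relevant.append((value, color))
    return relevant
```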

However, as Bremner et al observed, sorting an input to make it usable by this scan does more work than necessary, because we don’t need the sorted ordering of all the other points between the relevant ones. Instead, we can use an idea resembling quickselect, where we modify the quicksort algorithm to stop recursing in subproblems where sorting is unnecessary. For finding relevant points, these subproblems are the homogeneous ones, in which all points have the same classification as each other. Bremner et al combined this idea with a version of quicksort that always partitions its subproblems at the exact median, in order to achieve a good worst-case time bound, but if we’re happy with expected analysis we can use the same random-pivoting idea as the more usual form of quicksort:
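A sketch of this quickrelevant idea, reusing the `relevant_points` scan above and assuming distinct values: homogeneous subproblems contribute only their minimum and maximum, the recursion returns its candidates in sorted order, and a final linear pass filters them.

```python
import random

def quick_relevant(points):
    """Relevant points of an unsorted list of (value, color) pairs, recursing
    as in quicksort but stopping in homogeneous subproblems."""
    def recurse(block):
        if not block:
            return []
        if len({color for _, color in block}) == 1:
            # homogeneous: only its extreme points can still be relevant
            return [min(block)] if len(block) == 1 else [min(block), max(block)]
        pivot = random.choice(block)
        left = [p for p in block if p[0] < pivot[0]]
        right = [p for p in block if p[0] > pivot[0]]
        return recurse(left) + [pivot] + recurse(right)

    return relevant_points(recurse(points))  # final linear filtering pass
```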

The time complexity of this algorithm can be analyzed much like the analysis of quickselect, by observing that the time is proportional to the number of comparisons with pivots, computing the probability that each possible comparison happens, and summing. In quicksort, two distinct elements at distance $d$ from each other in the final sorted output are compared whenever one of them is the first to be chosen as a pivot in the interval between them, which happens with probability exactly $2/(d+1)$. In quickrelevant, this same probability holds for pairs that are separated by one of the decision boundaries. But pairs of elements that are within the same homogeneous block are less likely to be compared, because of the possibility that the recursion will stop before it separates or compares them.

If a pair of elements lies within a single block, has distance $d$ separating them, and is $e$ units from both ends of the block, then it will only be compared if one of the two elements is first to be chosen in at least one of the two intervals of length $d+e$ extending from the two elements towards an end of their block. This happens with probability at most $4/(d+e)$, because there are two ways of choosing which element to pick first as the pivot and two ways of choosing which extended interval it is first in.

If we sum up these probabilities, for pairs involving a single element that is $e$ units from its nearest block boundary among a set of $n$ elements, we get $O\bigl(\log(n/e)\bigr)$ as the expected contribution to the total time for that one element. If we sum the contributions from the elements within a block, for a block of length $\ell_i$, we get a total expected contribution from that block of $O\bigl(\ell_i\log(n/\ell_i)\bigr)$. And if we have $k$ relevant points and $O(k)$ blocks, and we sum over all blocks, the total time is maximized when all the blocks are of equal size, $\Theta(n/k)$, for which we get total time $O(n\log k)$.
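In symbols, a back-of-the-envelope version of the calculation just described (not the careful worst-case analysis of Bremner et al.):

$\displaystyle \sum_{e=1}^{\ell_i} O\left(\log\frac{n}{e}\right) = O\left(\log \frac{n^{\ell_i}}{\ell_i!}\right) = O\left(\ell_i\log\frac{n}{\ell_i}\right), \qquad \sum_{i=1}^{O(k)} O\left(\ell_i\log\frac{n}{\ell_i}\right) = O(n\log k) \text{ when each } \ell_i = \Theta(n/k).$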

For quicksort and quickselect, it’s possible to be more careful in the analysis, derive exact formulas for the probability of making each comparison, and from them get an analysis of the expected number of comparisons that does not use $O$-notation for its leading term; see my linked post on the quickselect analysis. Probably it’s possible here too but it looks messier. Maybe it would make a good undergraduate research project. One thing to be careful of is that the comparisons are not over in the homogeneous case; finding the min and max simultaneously, in a block of length $\ell_i$, takes roughly $3\ell_i/2$ comparisons. But that should only be a lower-order term compared to the $O(n\log k)$ leading term of the analysis.

by David Eppstein at October 13, 2021 11:05 PM UTC

### Postdoc in Theory of Machine Learning at Harvard University (apply by December 1, 2021)

from CCI: jobs

We are looking for exceptional junior scientists to work collaboratively with a group of faculty from Computer Science (Boaz Barak), Statistics (Lucas Janson), Electrical Engineering (Demba Ba) and Applied Mathematics (Cengiz Pehlevan) towards a theory of representations in artificial and natural systems.

Email: theory-postdoc-apply@seas.harvard.edu

by shacharlovett at October 13, 2021 10:38 PM UTC

### Rabin Postdoc at Harvard University (apply by December 1, 2021)

from CCI: jobs

We are looking for junior scientists in theoretical computer science, broadly construed.
The normal duration of the Rabin Fellowship is two years. Rabin Fellows will receive a generous salary as well as an allocation for research and travel expenses.
While interaction with Harvard faculty, students, and visitors is encouraged, Rabin Fellows are free to pursue their own interests.

Email: theory-postdoc-apply@seas.harvard.edu

by shacharlovett at October 13, 2021 10:36 PM UTC

### TR21-143 | Hitting Sets For Regular Branching Programs | Edward Pyne

from ECCC papers

We construct an explicit $\varepsilon$-hitting set generator (HSG) for regular ordered branching programs of length $n$ and $\textit{unbounded width}$ with a single accept state that has seed length $O(\log(n)(\log\log(n)+\log(1/\varepsilon))).$ Previously, the best known seed length for regular branching programs of width $w$ with a single accept state was by Braverman, Rao, Raz and Yehudayoff (FOCS 2010, SICOMP 2014) and Hoza, Pyne and Vadhan (ITCS 2021), which gave $O(\log(n)(\log\log(n)+\min\{\log(w),\log(n)\}+\log(1/\varepsilon))).$ We also give a simple co-HSG for the model with optimal seed length $O(\log n)$. For the more restricted model of $\textit{permutation}$ branching programs, Hoza, Pyne and Vadhan (ITCS 2021) constructed a PRG with seed length matching our HSG, and then Pyne and Vadhan (CCC 2021) developed an error-reduction procedure that gave an HSG (in fact a “weighted PRG”) with seed length $\widetilde{O}(\log(n)\sqrt{\log(n/\varepsilon)}+\log(1/\varepsilon)).$ We show that if an analogous error-reduction result could be obtained for our HSG, there is an explicit HSG for general ordered branching programs of width $w=n$ with seed length $\widetilde{O}(\log^{3/2}n)$, improving on the $O(\log^2n)$ seed length of Nisan (Combinatorica 1992).

### IDEAL mini-workshop on “Statistical and Computational Aspects of Robustness in High-dimensional Estimation”

from CS Theory Events

October 19, 2021. Virtual. https://www.ideal.northwestern.edu/events/mini-workshop-on-statistical-and-computational-aspects-of-robustness-in-high-dimensional-estimation/ Registration deadline: October 18, 2021. Today’s data pose unprecedented challenges to statisticians. They may be incomplete, corrupted or exposed to some unknown source of contamination or adversarial attack. Robustness is one of the revived concepts in statistics and machine learning that can accommodate such complexity and glean useful information from …

by shacharlovett at October 13, 2021 03:05 AM UTC

### Machine learning is not nonparametric statistics.

from Ben Recht

Many times in my career, I’ve been told by respected statisticians that machine learning is nothing more than nonparametric statistics. The longer I work in this field, the more I think this view is both misleading and unhelpful. Not only can I never get a consistent definition of what “nonparametric” means, but the jump from statistics to machine learning is considerably larger than most expect. Statistics is an important tool for understanding machine learning and randomness is valuable for machine learning algorithm design, but there is considerably more to machine learning than what we learn in elementary statistics.

Machine learning at its core is the art and science of prediction. By prediction, I mean the general problem of leveraging regularity of natural processes to guess the outcome of yet unseen events. As before, we can formalize the prediction problem by assuming a population of $N$ individuals with a variety of attributes. Suppose each individual has associated variables $X$ and $Y$. The goal of prediction is to guess the value of $Y$ from $X$ in a way that minimizes some error metric.

A classic prediction problem aims to find a function that makes the fewest number of incorrect predictions across the population. Think of this function like a computer program that takes $X$ as input and outputs a prediction of $Y$. For a fixed prediction function, we can sum up all of the errors made on the population. If we divide by the size of the population, this is the mean error rate of the function.

A particularly important prediction problem is classification, in which the attribute $Y$ takes only two values. The input $X$ could be some demographic details about a person, and $Y$ would be whether or not that person is taller than 6 feet. The input $X$ could be an image, and $Y$ could be whether or not the image contains a cat. Or the input could be a set of laboratory results about a patient, and $Y$ could be whether or not the patient is afflicted by a disease. Classification is the simplest and most common prediction problem, one that forms the basis of most contemporary machine learning systems.

For classification problems, it is relatively straightforward to compute the best error rate achievable. First, for every possible value of the attribute $X$, collect the subgroup of individuals of the population with that value. Then, the best assignment for the prediction function is the one that correctly labels the majority of this subgroup. For example, in our height example, we could take all women aged 30 who were born in the United States and reside in California. Then the optimal label for this group would be decided based on whether or not more people in the group are taller than 6 feet. (Answer: no).
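A toy sketch of this minimum-error rule on a finite population, with `population` as a list of (x, y) pairs (the names are illustrative, not from the text):

```python
from collections import Counter, defaultdict

def minimum_error_rule(population):
    """For every observed value of X, predict the majority value of Y; also
    report the resulting mean error rate over the population."""
    groups = defaultdict(list)
    for x, y in population:
        groups[x].append(y)
    rule = {x: Counter(ys).most_common(1)[0][0] for x, ys in groups.items()}
    error_rate = sum(rule[x] != y for x, y in population) / len(population)
    return rule, error_rate
```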

This minimum error rule is intuitive and simple, but computing the rule exactly requires examining the entire population. What can we do if we work from a subsample? Just as was the case in experiment design, we’d like to be able to design good prediction functions from a small sample of the population so we don’t have to inspect all individuals. For a fixed function, we could use the same law-of-large-numbers approximations to estimate its error rate. That is, if we decide in advance upon a prediction function, we could estimate the percentage of mistakes on the population by gathering a random sample and computing the proportion of mistakes on this subset. Then we could apply a standard confidence interval analysis to extrapolate to the population.

However, what if we’d like to find a good predictor on the population using only a set of examples sampled from it? We immediately run into an issue: to find the best prediction function, we needed to observe all possible values of $X$. What if we’d like to make predictions about an individual with a set of attributes that was not observed in our sample?

How can we build accurate population-level predictors from small subsamples? In order to solve this problem, we must make some assumptions about the relationship between predictions at related, but different values of $X$. We can restrict our attention to a set of functions that respect regularity properties that we think the predictions should have. Then, with a subsample from the population, we find the function that minimizes the error on the sample and obeys the prescribed regularity properties.

This optimization procedure is called “empirical risk minimization” and is the core predictive algorithm of machine learning. Indeed, for all of the talk about neuromorphic deep networks with fancy widgets, most of what machine learning does is try to find computer programs that make good predictions on the data we have collected and that respect some sort of rudimentary knowledge that we have about the broader population.
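In its barest form, empirical risk minimization is a search over a restricted set of candidate functions for the one with the fewest mistakes on the sample; a toy sketch (the threshold-classifier example is an illustration of the "regularity" restriction, not something prescribed above):

```python
def empirical_risk_minimization(sample, candidate_functions):
    """Pick the candidate with the lowest error on the sampled (x, y) pairs."""
    def empirical_error(f):
        return sum(f(x) != y for x, y in sample) / len(sample)
    return min(candidate_functions, key=empirical_error)

# Restricting to simple threshold rules is one crude way to encode regularity:
# thresholds = [lambda x, t=t: int(x > t) for t in range(0, 101, 5)]
# best_rule = empirical_risk_minimization(sample, thresholds)
```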

The flexibility in defining what “knowledge” or “regularity” means complicates the solution of such empirical risk minimization problems. What does the right set of functions look like? There are three immediate concerns:

1. What is the right representation? The set needs to contain enough functions to well approximate the true population prediction function. There are a variety of ways to express complex functions, and each expression has its own benefits and drawbacks.

2. How do we find a good function in the set? The set of functions needs to be simple to search over, so we don’t have to evaluate every function in our set, as this would be too time-consuming. Efficient search for high-quality solutions is called optimization.

3. How will the predictor generalize to the broader population? The functions cannot be too complex or else they will fail to capture the regularity and smoothness of the prediction problem (estimating functions of too high complexity is colloquially called “overfitting”).

Balancing representation, optimization, and generalization gets complicated quickly, and this is why we have a gigantic academic and industrial field devoted to the problem.

I’m repeating myself at this point, but I again want to pound my fist on the table and reiterate that nothing in our development here requires that the relationship between the variables $X$ and $Y$ be probabilistic. Statistical models are often the starting point of discussion in machine learning, but such models are just a convenient way to describe populations and their proportions. Prediction can be analyzed in terms of a deterministic population, and, just as we discussed in the case of randomized experiments, randomness can be introduced as a means of sampling the population to determine trends. Even generalization, which is usually studied as a statistical phenomenon, can be analyzed in terms of the randomness of the sampling procedure with no probabilistic modeling of the population.

On the other hand, some sort of knowledge about the population is necessary. The more we know about how prediction varies based on changes in the covariates, the better a predictor we can build. Engineering such prior knowledge into appropriate function classes and optimization algorithms form the art and science of contemporary machine learning.

This discussion highlights that while we can view prediction through the lens of statistical sampling, pigeonholing it as simply “nonparametric statistics” does not do the subject justice. While the jump from mean estimation to causal RCTs is small, the jump from mean estimation to prediction is much less immediate. And in machine learning practice, the intuitions from statistics often don’t apply. For example, conventional wisdom from statistics tells us that evaluating multiple models on the same data set amounts to multiple hypothesis testing, and will lead to overfitting on the test set. However, there is more and more evidence that using a train-test split does not lead to overfitting. Instead, the phenomenon we see is that dataset benchmarks can remain useful for decades. Another common refrain from statistics is that model complexity must be explicitly constrained in order to extrapolate to new data, but this also does not seem to apply at all to machine learning practice.

Prediction predates probability and statistics by centuries. As Moritz and I chronicle in the introduction to Patterns, Predictions, and Actions, astronomers were using pattern matching to predict celestial motions, and the astronomer Edmond Halley realized that similar techniques could be used to predict life expectancy when pricing annuities. Moreover, even though modern machine learning embraced contemporary developments in statistics by Neyman, Pearson, and Wald, the tools quickly grew more sophisticated and separate from core statistical practice. In the next post, I’ll discuss an early example of this divergence between machine learning and statistics, describing some of the theoretical understanding of the Perceptron in the 1960s and how its analysis was decidedly different from the theory advanced by statisticians.

### The Khot-Naor Approximation Algorithm for 3-XOR

from Luca Trevisan

Today I would like to discuss the Khot-Naor approximation algorithm for the 3-XOR problem, and an open question related to it.

In 3-XOR, we have a system of linear equations modulo 2, with three variables per equation, that might look something like

$\displaystyle \begin{array}{ll} x_1 + x_2 + x_4 & \equiv 0 \pmod 2\\ x_1 + x_5 + x_6 & \equiv 1 \pmod 2\\ x_2 + x_3 + x_4 & \equiv 1 \pmod 2\\ x_5 + x_3 + x_6 & \equiv 1 \pmod 2 \end{array}$

The above system is not satisfiable (if we add up the left-hand sides we get 0, if we add up the right-hand sides we get 1), but it is possible to satisfy ${3/4}$ of the equations, for example by setting all the variables to 1. In the Max 3-XOR problem (which we will simply refer to as “3-XOR” from now on), given a system of equations mod 2 with three variables per equation, we want to find an assignment that satisfies as many equations as possible.

Either the assignment that sets all the variables to zero or the one that sets all the variables to one will satisfy at least half of the equations, and the interesting question is how much better than 1/2 it is possible to do on a given instance. Khot and Naor provide an algorithm that, given an instance in which it is possible to satisfy a ${1/2 + \epsilon}$ fraction of equations, finds a solution that satisfies at least a ${1/2 + \epsilon \cdot \Omega \left( \sqrt{\frac {\log n} n} \right)}$ fraction of equations, where ${n}$ is the number of variables. The algorithm is randomized, it runs in polynomial time, and it succeeds with high probability. I believe that it is still the state of the art in terms of worst-case approximation guarantee.

Like the approximation algorithm for sparsest cut in Abelian Cayley graphs implied by the result of Bauer et al. that was the subject of the last two posts, the result of Khot and Naor does not prove a bound on the integrality gap of a relaxation of the problem.

I will describe the Khot-Naor algorithm and describe how it manages to use convex optimization to provide an approximation algorithm, but without establishing an integrality gap bound. I thank my student Lucas Pesenti for explaining the algorithm to me and for thinking together about this problem.

If our 3-XOR instance has ${m}$ equations and ${n}$ variables, then the problem of maximizing the number of satisfied equations can be rewritten as

$\displaystyle \frac m2 + \max_{x \in \{ -1 , +1 \}^n} \ \sum_{i,j,k} c_{i,j,k} x_i x_j x_k$

so that our goal is to approximate the combinatorial optimization problem

$\displaystyle \max_{x \in \{ -1 , +1 \}^n} \ \sum_{i,j,k} c_{i,j,k} x_i x_j x_k$

Up to a constant factor loss in the approximation guarantee, Khot and Naor show that the above is equivalent to

$\displaystyle \max_{x,y,z \in \{ -1 , +1 \}^n} \ \sum_{i,j,k} T_{i,j,k} x_i y_j z_k \ \ \ \ \ (1)$

where ${T_{i,j,k}}$ is a symmetric 3-tensor with entries in ${\{ -1,0,1\}}$ and with ${O(m)}$ non-zero entries.

Before continuing, let us recall that if ${M}$ is an ${n \times n}$ matrix, then its ${\ell_\infty}$-to-${\ell_1}$ operator norm has the characterization

$\displaystyle || M ||_{\infty \rightarrow 1 } = \max_{x,y \in [-1,1]^n} x^T M y = \max_{x,y \in \{ -1,1 \}^n} x^T M y \ \ \ \ \ (2)$

We could also define the “Grothendieck norm” ${|| M ||_{Grot}}$ of a matrix as the following natural semidefinite programming relaxation of the ${\ell_\infty}$-to-${\ell_1}$ norm:

$\displaystyle \begin{array}{lll} \max & \sum_{i,j} M_{i,j} \langle x_i , y_j\rangle \\ {\rm s.t.}\\ & || x_i ||^2 = 1 & i = 1,\ldots,n\\ & ||y_j ||^2 = 1 & j = 1,\ldots,n \end{array}$

where the ${x_i}$ and ${y_j}$ are arbitrary vectors. The Grothendieck inequality is

$\displaystyle || M||_{\infty \rightarrow 1 } \leq || M ||_{Grot} \leq O(1) \cdot || M ||_{\infty \rightarrow 1 }$

where the ${O(1)}$ is an absolute constant, known to be less than ${1.8}$. Furthermore, the above inequality has a constructive proof, and it leads to a polynomial-time constant-factor approximation for the problem of finding values ${x_i}$ and ${y_j}$ in ${\pm 1}$ that maximize (2).

Basically, we can see problem (1) as the natural generalization of (2) to tensors, and one would like to see a semidefinite programming relaxation of (1) achieving something resembling the Grothendieck inequality, but with a loss of something like ${\tilde O(\sqrt n)}$. As I mentioned above, this remains an open question, as far as I know.

The idea of Khot and Naor is the following. Suppose that we are given an instance ${T}$ of problem (1), and suppose that ${x^*,y^*,z^*}$ is an optimal solution, and let us call

$\displaystyle \epsilon = \sum_{i,j,k} T_{i,j,k} x^*_i y^*_j z^*_k$

the value of the optimum (the algorithm will not need to know or guess ${\epsilon}$).

The key step is now to see that if we pick a random ${x \in \{-1,1\}^n}$, there is at least a ${1/n^{O(1)}}$ probability that

$\displaystyle \sum_{i,j,k} T_{i,j,k} x_i y^*_j z^*_k \geq \epsilon \cdot \Omega \left(\sqrt{\frac{\log n}{n}} \right)$

This is a bit difficult, but it is really easy to see that with ${1/n^{O(1)}}$ probability we have

$\displaystyle \sum_{i,j,k} T_{i,j,k} x_i y^*_j z^*_k \geq \epsilon \cdot \Omega \left(\sqrt{\frac{1}{n}} \right) \ \ \ \ \ (3)$

and we can do that by defining a vector ${w}$ such that ${w_i = \sum_{j,k} T_{i,j,k} y^*_j z^*_k}$, so that

$\displaystyle \langle x, w \rangle = \sum_{i,j,k} T_{i,j,k} x_i y^*_j z^*_k$

So we have

$\displaystyle \langle x^*,w \rangle = \epsilon$

which, using Cauchy-Schwarz (${\epsilon = \langle x^*,w \rangle \leq ||x^*|| \cdot ||w|| = \sqrt n \cdot ||w||}$), gives

$\displaystyle || w||^2 \geq \epsilon^2 /n$

Now, for a random ${x\sim \{ -1,1\}^n}$, we have

$\displaystyle \mathop{\mathbb E} \langle x,w \rangle^2 = || w||^2 \geq \epsilon^2 /n$

and

$\displaystyle \mathop{\mathbb E} \langle x,w \rangle^4 \leq n^{O(1)}$

so by Paley–Zygmund we have, let’s say

$\displaystyle \mathop{\mathbb P} \left[ \langle x,w \rangle^2 \geq \frac {\epsilon^2}{2n} \right] \geq \frac 1 {n^{O(1)}}$

which, together with the definition of ${w}$ and the fact that the distribution of ${\langle x, w\rangle}$ is symmetric around zero, gives us the claim (3).
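
As a quick sanity check (my own, not part of the argument), one can verify numerically that for a fixed ${w}$ and random ${x \in \{-1,1\}^n}$, the event ${\langle x,w\rangle^2 \geq ||w||^2/2}$ is not rare: the ${1/n^{O(1)}}$ bound is all the argument needs, but empirically the probability looks like a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
w = rng.standard_normal(n)                      # any fixed vector plays the role of w
threshold = 0.5 * np.dot(w, w)                  # ||w||^2 / 2
xs = rng.choice([-1.0, 1.0], size=(100_000, n)) # many random sign vectors x
print("empirical P[<x,w>^2 >= ||w||^2/2] =", np.mean((xs @ w) ** 2 >= threshold))
```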

Now suppose that we have been lucky and that we have found such an ${x}$. We define the matrix

$\displaystyle X_{j,k} = \sum_i T_{i,j,k} x_i$

and we see that our claim can be written as

$\displaystyle y^{*T} X z^* \geq \epsilon \cdot \Omega \left(\sqrt{\frac{1}{n}} \right)$

At this point we just apply the algorithm implied by the Grothendieck inequality to the matrix ${X}$, and we find ${y}$ and ${z}$ in ${\{ -1,1 \}^n}$ such that

$\displaystyle y^T X z \geq \epsilon \cdot \Omega \left(\sqrt{\frac{1}{n}} \right)$

meaning that

$\displaystyle \sum_{i,j,k} T_{i,j,k} x_i y_j z_k \geq \epsilon \cdot \Omega \left(\sqrt{\frac{1}{n}}\right)$

Summarizing, our algorithm is to pick a random vector ${x\sim \{-1,1\}^n}$ and to find a constant-factor approximation for the problem

$\displaystyle \max_{y,z \in \{-1,1\}^n} \ \sum_{i,j,k} T_{i,j,k} x_iy_jz_k \ \ \ \ \ (4)$

using semidefinite programming. We do that ${n^{O(1)}}$ times, and take the best solution.
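
In code, the outer loop is just a few lines on top of the Grothendieck sketch above (again, this is only my illustrative reconstruction of the scheme, with `grothendieck_sdp_round` as defined earlier and a small number of rounds standing in for the ${n^{O(1)}}$ repetitions):

```python
import numpy as np
# uses grothendieck_sdp_round from the sketch above

def khot_naor_sketch(T, rounds=20, rng=None):
    """T: an n x n x n numpy array standing in for the tensor T_{i,j,k}.
    Returns the best value found and the corresponding (x, y, z) in {-1,1}^n."""
    rng = np.random.default_rng() if rng is None else rng
    n = T.shape[0]
    best_val, best = -np.inf, None
    for _ in range(rounds):
        x = rng.choice([-1.0, 1.0], size=n)      # random first block of variables
        X = np.einsum('ijk,i->jk', T, x)         # X_{j,k} = sum_i T_{i,j,k} x_i
        val, (y, z) = grothendieck_sdp_round(X, rng=rng)
        if val > best_val:
            best_val, best = val, (x, y, z)
    return best_val, best
```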

The analysis can be turned into an upper bound certificate in the following way. For the (suboptimal) analysis using Paley-Zygmund, we only need the entries of the random ${x}$ to be 4-wise independent, and there are distributions on ${\{-1,1\}^n}$ where the entries are unbiased and 4-wise independent, and such that the sample space is of size polynomial in ${n}$. Thus, one could write an SDP relaxation of (4) for each ${x}$ in the support of such a distribution, and then take the maximum of these SDPs, multiply it by ${O(\sqrt n)}$, and it would be a certified upper bound. Such an upper bound, however, would not come from a relaxation of the 3-XOR problem, and I find it really strange that it is not clear how to turn these ideas into a proof that, say, the standard degree-4 sum-of-squares semidefinite programming relaxation of 3-XOR has an integrality gap at most ${\tilde O(\sqrt n)}$.

### Lecturer in Optimisation at Queen Mary University of London (apply by November 3, 2021)

from CCI: jobs

The School of Mathematical Sciences at Queen Mary University of London is seeking to appoint a permanent faculty member in the area of optimisation, with a focus on applying rigorous research methods to address emerging problems in the general field of combinatorial optimisation. We especially welcome applicants who would complement existing expertise in the School’s Combinatorics group.

Website: https://webapps2.is.qmul.ac.uk/jobs/job.action?jobID=5954
Email: r.johnson@qmul.ac.uk

by shacharlovett at October 11, 2021 04:34 PM UTC

### Faculty at Williams College (apply by November 15, 2021)

from CCI: jobs

The Department of Computer Science at Williams College invites applications for two faculty positions beginning July 1, 2022. One is a tenure-track position at the rank of assistant professor with a three-year initial term. The other is an open rank position with a preference for more advanced candidates. That position will have terms commensurate with prior experience.

Website: https://apply.interfolio.com/91229
Email: hiring@cs.williams.edu

by shacharlovett at October 11, 2021 02:51 PM UTC

### I have a book out on muffins (you prob already know that)

Lance: How come you haven't blogged on your muffin book? You've blogged about two books by Harry Lewis (see here and here), one book by the lesswrong community (see here), and you even did a mashup of a post by two different Scott A's (see here), but not on your own work.

Bill: I thought I did a post on my muffin book.

Lance: No. You have blogged about the muffin problem, and sometimes you mention either the book or the problem in passing, but you haven't had a post that says

HEY, I wrote a book!

And this is all the more strange since you asked me to have the book on our blog page.

Bill: (Searches blog with keyword muffin and finds no ref to muffin book). Well pierce my ears and call me drafty! I have not posted on the muffin book! Do you recall my thoughts on when to tell people you are working on a book?

Lance: No

Bill:  I had a college roommate who was an aspiring science fiction writer who told me there are two kinds of people: Those who talk about writing a book, and those who write a book. I have adapted this to:

Do not tell people you are writing a book until you are picking out the cover art.

Lance: I posted about my book when I hadn't even decided on the title. But your cover art is picked out (see here). And, by the way, it's very nice, though it makes me hungry. So I think you can begin talking about the book.

Bill: Indeed! I will!

------------------------------------------------------------------------------------

Hey I have a book! (See here to buy it on amazon.)

Title: Mathematical Muffin Morsels: Nobody Wants a Small Piece

by Gasarch, Metz, Prinz, Smolyak

(The other authors were undergraduates when we wrote the book. Prinz and Smolyak are now grad students in CS, Metz is in Finance.)

Origin:

Martin Gardner wrote a recreational mathematics column for Scientific American for many years, starting in 1956 and ending in the early 1980s. For many STEM people of my generation (using my fake birthday of Oct 1, 1960, I am 62 years old) Martin Gardner's columns were both an inspiration and an early exposure to mathematics. His columns also made the line between mathematical recreation and so-called serious mathematics thin or nonexistent. (See here for a review of Martin Gardner in the 21st century, a book about the kind of math Gardner wrote about. The book makes a mockery of the distinction between recreational and serious mathematics.) He passed away in 2010 at the age of 95.

There is a gathering in his honor that is held roughly every 2 years, called Gathering For Gardner. (It was cancelled in Spring 2020 and Spring 2021 because of COVID, though it's in Atlanta where the CDC is, so they could have had it as an experiment and told the CDC the results.) You have to be invited to go to it. I got an invite for 2016 from my contact at World Scientific, who published my previous book, Problems with a Point: Exploring Math and Computer Science, co-authored with Clyde Kruskal (I had two blog posts on it, here and here, and you can buy it on amazon here). I did three posts on G4G-2016 (here, here, and here).

Aside from seeing some great talks that I understood and liked, I also picked up a pamphlet titled:

The Julia Robinson Math Festival

A Sample of Mathematical Puzzles

Compiled By Nancy Blackman

One of the problems, credited to Alan Frank, was

How can you divide and distribute 5 muffins for 3 students so that everyone gets 5/3 and the smallest piece is as big as possible?

They had some other values for muffins and students as well.

I solved the (5,3) problem and the other ones as well. That was fun.

When I got home I began looking at the problem for m muffins and s students. I let f(m,s) be the biggest smallest piece possible for giving out m muffins to s students. I proved a general theorem, called the Floor-Ceiling theorem, that always gives an upper bound, FC(m,s) on f(m,s). I worked out formulas for

f(m,1) (trivial),

f(m,2) (trivial),

f(m,3) (it's always FC(m,3)),

f(m,4) (it's always FC(m,4)).

While working on f(m,5) I found that  f(m,5) was always FC(m,5) EXCEPT for m=11. So what's up with f(11,5)?

By the Floor-Ceiling theorem f(11,5) ≤ 11/25. We (at that point several ugrads and HS students had joined the project) were unable to find a protocol that would show f(11,5) ≥ 11/25. Personally I thought there WAS such a protocol but perhaps it was more complicated than the ones we had found (we were finding them by hand using some easy linear algebra). Perhaps a computer program was needed. We did find a protocol for f(11,5) ≥ 13/30, which surely was not optimal.

While on an Amtrak I began working out the following train of thought: a protocol showing f(11,5) ≥ 11/25 MUST have

(1) every muffin cut into two pieces,

(2) 3 students get 4 pieces,

(3) 2 students get 5 pieces.

While working on getting a protocol showing f(11,5) ≥ 11/25 with these properties I found that... there could be no such protocol! Then by reworking what I did I found that f(11,5) ≤ 13/30. So it was done! And we had a new technique, which we call The Half Method. To see the full proof see my slides here.

The story above is typical: We get f(m,k) for all 1 ≤ k ≤ SOMETHING, we get stuck, and then we find ANOTHER technique to show upper bounds (which in this case are limits on how well we can do). This happened about 8 times depending on how you count. After a while we realized that this could not just be an article, this was a book! World Scientific agreed to publish it, and it's out now.

Misc Notes

1) I got a conference paper out of it, in the Fun with Algorithms Conference, with some of the co-authors on the book, and some other people. Here is the conference paper.

2) Early on we realized that f(m,s) = (m/s)f(s,m) so we only had to look at the m>s case.

3) The fact that f(m,s) exists and is rational is not obvious, but is true. In fact, f(m,s) can be found by a mixed integer program (a toy sketch of such a program appears after these notes).

4) Late in the process I found that there was a by-invite-only math newsgroup that had discussed the problem, and in fact was where Alan Frank first posted it. I obtained their materials and found that they had already shown f(m,s) = (m/s)f(s,m) and also that the answer always exists and is rational. Aside from that our results did not overlap.

5) Even later in the process Scott Huddleston emailed me (out of the blue) that he had a program that solved the muffin problem quickly. I was skeptical at first, but he did indeed have a whole new way to look at the problem and his code was very fast (I had Jacob Prinz, one of the co-authors on the book, recode it). Later Richard Chatwin (see here) seems to have proven that Scott's method always works. The approach of Scott and Richard is where to go if you want to do serious further research on Muffins. My book is where you want to go if you want to learn some easy and fun math (a HS student could read it).

6) I co-authored a column with Scott H, Erik Metz, Jacob Prinz on Muffins, featuring his technique, in Lane's complexity column, here.

7) I had an REU student, Stephanie Warman, write a muffin package based on the book.

8) I gave an invited talk on The Muffin Problem at a Joint AMS-MAA meeting.

9) I gave a talk at Gathering for Gardner 2018 on The Muffin Problem.

10) I often give talks on it to groups of High School students.

11) When I teach Discrete Math Honors I talk about it and assign problems on it; it really is part of the course. As such it's a good way to reinforce the pigeonhole principle.

12) I contacted Alan Frank about my work. We arranged to meet at an MIT combinatorics seminar where I was to give a talk on muffins. He brought 11 muffins, with 1 cut (1/2,1/2), 2 cut (14/30,16/30), and 8 cut (13/30,17/30), so that the 5 of us could each get 11/5 with smallest piece 13/30.

13) Coda:

Why did I keep working on this problem?  I kept working on it because I kept hitting barriers and (with co-authors) breaking them with new techniques that were interesting.  If early on a barrier was not breakable then I would have stopped. If (say) Floor-Ceiling solved everything then I might have gotten a paper out of this, but surely not a book.

Lesson for all of us: look around you! It's not clear what is going to inspire a project!

Lasting effect: I am reluctant to throw out old math magazines and pamphlets since you never know when one will lead to a book.
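
To make note 3 above concrete, here is a toy mixed integer program for f(m,s) in Python with pulp. This is my own small formulation for illustration, not the formulation or code from the book: p[i,j] is the fraction of muffin i given to student j, a binary y[i,j] marks the nonzero pieces, and we maximize the smallest nonzero piece z.

```python
from pulp import LpProblem, LpVariable, LpMaximize, lpSum, LpBinary, value

def muffin_mip(m, s):
    """Toy MIP for f(m,s): maximize the smallest nonzero piece when m muffins
    are divided among s students so that each student gets m/s."""
    prob = LpProblem("muffin", LpMaximize)
    p = {(i, j): LpVariable(f"p_{i}_{j}", 0, 1) for i in range(m) for j in range(s)}
    y = {(i, j): LpVariable(f"y_{i}_{j}", cat=LpBinary) for i in range(m) for j in range(s)}
    z = LpVariable("smallest_piece", 0, 1)
    prob += z                                               # objective: maximize z
    for i in range(m):
        prob += lpSum(p[i, j] for j in range(s)) == 1       # each muffin fully used
    for j in range(s):
        prob += lpSum(p[i, j] for i in range(m)) == m / s   # each student gets m/s
    for i in range(m):
        for j in range(s):
            prob += p[i, j] <= y[i, j]                      # y marks nonzero pieces
            prob += p[i, j] >= z - (1 - y[i, j])            # nonzero pieces are >= z
    prob.solve()
    return value(z)

# e.g. muffin_mip(5, 3) recovers the best smallest piece for 5 muffins and 3 students
```

For tiny cases a standard MIP solver handles this directly (up to floating-point rounding of the exact rational answer); it of course says nothing about general formulas, which is what the book's techniques are for.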

by gasarch (noreply@blogger.com) at October 11, 2021 01:18 AM UTC

### Gaussian BosonSampling, higher-order correlations, and spoofing: An update

from Scott Aaronson

In my last post, I wrote (among other things) about an ongoing scientific debate between the group of Chaoyang Lu at USTC in China, which over the past year has been doing experiments that seek to demonstrate quantum supremacy via Gaussian BosonSampling; and the group of Sergio Boixo at Google, which had a recent paper on a polynomial-time classical algorithm to sample approximately from the same distributions.  I reported the facts as I understood them at the time.  Since then, though, a long call with the Google team gave me a new and different understanding, and I feel duty-bound to share that here.

A week ago, I considered it obvious that if, using a classical spoofer, you could beat the USTC experiment on a metric like total variation distance from the ideal distribution, then you would’ve completely destroyed USTC’s claim of quantum supremacy.  The reason I believed that, in turn, is a proposition that I hadn’t given a name but needs one, so let me call it Hypothesis H:

The only way a classical algorithm to spoof BosonSampling can possibly do well in total variation distance, is by correctly reproducing the high-order correlations (correlations among the occupation numbers of large numbers of modes) — because that’s where the complexity of BosonSampling lies (if it lies anywhere).

Hypothesis H had important downstream consequences.  Google’s algorithm, by the Google team’s own admission, does not reproduce the high-order correlations.  Furthermore, because of limitations on both samples and classical computation time, Google’s paper calculates the total variation distance from the ideal distribution only on the marginal distribution on roughly 14 out of 144 modes.  On that marginal distribution, Google’s algorithm does do better than the experiment in total variation distance.  Google presents a claimed extrapolation to the full 144 modes, but eyeballing the graphs, it was far from clear to me what would happen: like, maybe the spoofing algorithm would continue to win, but maybe the experiment would turn around and win; who knows?

Chaoyang, meanwhile, made a clear prediction that the experiment would turn around and win, because of

1. the experiment’s success in reproducing the high-order correlations,
2. the admitted failure of Google’s algorithm in reproducing the high-order correlations, and
3. the seeming impossibility of doing well on BosonSampling without reproducing the high-order correlations (Hypothesis H).

Given everything my experience told me about the central importance of high-order correlations for BosonSampling, I was inclined to agree with Chaoyang.

Now for the kicker: it seems that Hypothesis H is false.  A classical spoofer could beat a BosonSampling experiment on total variation distance from the ideal distribution, without even bothering to reproduce the high-order correlations correctly.

This is true because of a combination of two facts about the existing noisy BosonSampling experiments.  The first fact is that the contribution from the order-k correlations falls off like 1/exp(k).  The second fact is that, due to calibration errors and the like, the experiments already show significant deviations from the ideal distribution on the order-1 and order-2 correlations.

Put these facts together and what do you find?  Well, suppose your classical spoofing algorithm takes care to get the low-order contributions to the distribution exactly right.  Just for that reason alone, it could already win over a noisy BosonSampling experiment, as judged by benchmarks like total variation distance from the ideal distribution, or for that matter linear cross-entropy.  Yes, the experiment will beat the classical simulation on the higher-order correlations.  But because those higher-order correlations are exponentially attenuated anyway, they won’t be enough to make up the difference.  The experiment’s lack of perfection on the low-order correlations will swamp everything else.
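
Just to make that arithmetic vivid, with purely made-up illustrative numbers (not from either group's data): suppose the order-k correlations carry weight roughly 4^(-k) in the relevant distance, the experiment is off by 20% on the order-1 and order-2 correlations and nearly exact above that, while the spoofer gets orders 1 and 2 exactly right and everything above order 2 completely wrong. Then the experiment contributes about 0.2×(1/4 + 1/16) ≈ 0.06 to the distance, while the spoofer contributes at most 1/64 + 1/256 + … ≈ 0.02. The spoofer wins, despite ignoring every high-order correlation.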

Granted, I still don’t know for sure that this is what happens — that depends on whether I believe Sergio or Chaoyang about the extrapolation of the variation distance to the full 144 modes (my own eyeballs having failed to render a verdict!).  But I now see that it’s logically possible, maybe even plausible.

So, let’s imagine for the sake of argument that Google’s simulation wins on variation distance, even though the experiment wins on the high-order correlations.  In that case, what would be our verdict: would USTC have achieved quantum supremacy via BosonSampling, or not?

It’s clear what each side could say.

Google could say: by a metric that Scott Aaronson, the coinventor of BosonSampling, thought was perfectly adequate as late as last week — namely, total variation distance from the ideal distribution — we won.  We achieved lower variation distance than USTC’s experiment, and we did it using a fast classical algorithm.  End of discussion.  No moving the goalposts after the fact.

Google could even add: BosonSampling is a sampling task; it’s right there in the name!  The only purpose of any benchmark — whether Linear XEB or high-order correlation — is to give evidence about whether you are or aren’t sampling from a distribution close to the ideal one.  But that means that, if you accept that we are doing the latter better than the experiment, then there’s nothing more to argue about.

USTC could respond: even if Scott Aaronson is the coinventor of BosonSampling, he’s extremely far from an infallible oracle.  In the case at hand, his lack of appreciation for the sources of error in realistic experiments caused him to fixate inappropriately on variation distance as the success criterion.  If you want to see the quantum advantage in our system, you have to deliberately subtract off the low-order correlations and look at the high-order correlations.

USTC could add: from the very beginning, the whole point of quantum supremacy experiments was to demonstrate a clear speedup on some benchmark — we never particularly cared which one!  That horse is out of the barn as soon as we’re talking about quantum supremacy at all — something the Google group, which itself reported the first quantum supremacy experiment in Fall 2019, again for a completely artificial benchmark — knows as well as anyone else.  (The Google team even has experience with adjusting benchmarks: when, for example, Pan and Zhang pointed out that Linear XEB as originally specified is pretty easy to spoof for random 2D circuits, the most cogent rejoinder was: OK, fine then, add an extra check that the returned samples are sufficiently different from one another, which kills Pan and Zhang’s spoofing strategy.) In that case, then, why isn’t a benchmark tailored to the high-order correlations as good as variation distance or linear cross-entropy or any other benchmark?

Both positions are reasonable and have merit — though I confess to somewhat greater sympathy for the one that appeals to my doofosity rather than my supposed infallibility!

OK, but suppose, again for the sake of argument, that we accepted the second position, and we said that USTC gets to declare quantum supremacy as long as its experiment does better than any known classical simulation at reproducing the high-order correlations.  We’d still face the question: does the USTC experiment, in fact, do better on that metric?  It would be awkward if, having won the right to change the rules in its favor, USTC still lost even under the new rules.

Sergio tells me that USTC directly reported experimental data only for up to order-7 correlations, and at least individually, the order-7 correlations are easy to reproduce on a laptop (although sampling in a way that reproduces the order-7 correlations might still be hard—a point that Chaoyang confirms, and where further research would be great). OK, but USTC also reported that their experiment seems to reproduce up to order-19 correlations. And order-19 correlations, the Google team agrees, are hard to sample consistently with on a classical computer by any currently known algorithm.

So then, why don’t we have direct data for the order-19 correlations?  The trouble is simply that it would’ve taken USTC an astronomical amount of computation time.  So instead, they relied on a statistical extrapolation from the observed strength of the lower-order correlations — there we go again with the extrapolations!  Of course, if we’re going to let Google rest its case on an extrapolation, then maybe it’s only sporting to let USTC do the same.

You might wonder: why didn’t we have to worry about any of this stuff with the other path to quantum supremacy, the one via random circuit sampling with superconducting qubits?  The reason is that, with random circuit sampling, all the correlations except the highest-order ones are completely trivial — or, to say it another way, the reduced state of any small number of output qubits is exponentially close to the maximally mixed state.  This is a real difference between BosonSampling and random circuit sampling—and even 5-6 years ago, we knew that this represented an advantage for random circuit sampling, although I now have a deeper appreciation for just how great of an advantage it is.  For it means that, with random circuit sampling, it’s easier to place a “sword in the stone”: to say, for example, here is the Linear XEB score achieved by the trivial classical algorithm that outputs random bits, and lo, our experiment achieves a higher score, and lo, we challenge anyone to invent a fast classical spoofing method that achieves a similarly high score.

With BosonSampling, by contrast, we have various metrics with which to judge performance, but so far, for none of those metrics do we have a plausible hypothesis that says “here’s the best that any polynomial-time classical algorithm can possibly hope to do, and it’s completely plausible that even a noisy current or planned BosonSampling experiment can do better than that.”

In the end, then, I come back to the exact same three goals I would’ve recommended a week ago for the future of quantum supremacy experiments, but with all of them now even more acutely important than before:

1. Experimentally, to increase the fidelity of the devices (with BosonSampling, for example, to observe a larger contribution from the high-order correlations) — a much more urgent goal, from the standpoint of evading classical spoofing algorithms, than further increasing the dimensionality of the Hilbert space.
2. Theoretically, to design better ways to verify the results of sampling-based quantum supremacy experiments classically — ideally, even ways that could be applied via polynomial-time tests.
3. For Gaussian BosonSampling in particular, to get a better understanding of the plausible limits of classical spoofing algorithms, and exactly how good a noisy device needs to be before it exceeds those limits.

Thanks so much to Sergio Boixo and Ben Villalonga for the conversation, and to Chaoyang Lu and Jelmer Renema for comments on this post. Needless to say, any remaining errors are my own.

by Scott at October 10, 2021 06:13 PM UTC

### Soliciting information about Women in TCS

from Theory Matters

The CATCS is compiling a list of Women in Theoretical Computer Science (WinTCS), hoping to facilitate engagement and opportunities for Women in TCS in the future.

We cordially invite women (broadly defined as anyone who self-identifies as a woman) as well as gender non-conforming TCS researchers to participate in the following simple Google Form survey to provide your information.

Please submit your information before Oct. 31, 2021. After that, information can still be provided through the CATCS website, and will be added to the list monthly.

If you have any questions regarding this survey, please feel free to contact us at goldner@bu.edu or yusuwang@ucsd.edu.

– Kira Goldner and Yusu Wang (on behalf of the SIGACT CATCS)

by shuchic at October 08, 2021 09:56 PM UTC

### C++ is for Cookie and That's Good Enough for Me

Potbelly, a local sandwich chain, made me an offer I couldn't refuse: change my password and earn a free (and quite tasty) oatmeal chocolate chip cookie. A free cookie is a great motivator, so after checking that this wasn't some clever phishing attack, I changed my password and got my cookie. Not sure why Potbelly wanted me to change my password, but I'm happy to take their cookie.

Potbelly likely didn't make this offer to everyone, so what if you want a cookie?

1. Use an app to get a cookie delivered.
2. Visit a specialty cookie store.
3. Go to your local supermarket and pick up a package of Chips Ahoy!
7. Get fresh ingredients direct from a farm stand
8. Grow and gather your own ingredients, a la Pancakes, Pancakes!
In machine learning we seem to be heading into a similar set of choices:
1. Not even realize you are using machine learning, such as recommendations on Netflix or Facebook.
2. Using ML implicitly, like talking to Alexa
3. Using pre-trained ML through an app, like Google Translate
4. Using pre-trained ML through an API
5. Using a model like GPT-3 with an appropriate prompt
6. Use an easily trained model like Amazon Fraud Detector
7. An integrated machine learning environment like Sagemaker
8. Use pre-built ML tools like TensorFlow or PyTorch
9. Code up your own ML algorithms in C++
10. Build your own hardware and software
and I'm probably missing a few options.

When you want cookies or learning, do you buy it prepackaged or do you roll your own? And when people offer it to you for free, how wary should you be?

by Lance Fortnow (noreply@blogger.com) at October 08, 2021 03:10 PM UTC

### ARV on Abelian Cayley Graphs

from Luca Trevisan

Continuing from the previous post, we are going to prove the following result: let ${G}$ be a ${d}$-regular Cayley graph of an Abelian group, ${\phi(G)}$ be the normalized edge expansion of ${G}$, ${ARV(G)}$ be the value of the ARV semidefinite programming relaxation of sparsest cut on ${G}$ (we will define it below), and ${\lambda_2(G)}$ be the second smallest normalized Laplacian eigenvalue of ${G}$. Then we have

$\displaystyle \lambda_2 (G) \leq O(d) \cdot (ARV (G))^2 \ \ \ \ \ (1)$

which, together with the fact that ${ARV(G) \leq 2 \phi(G)}$ and ${\phi(G) \leq \sqrt{2 \lambda_2}}$, implies the Buser inequality

$\displaystyle \lambda_2 (G) \leq O(d) \cdot \phi^2 (G) \ \ \ \ \ (2)$

and the approximation bound

$\displaystyle \phi(G) \leq O(\sqrt d) \cdot ARV(G) \ \ \ \ \ (3)$

The proof of (1), due to Shayan Oveis Gharan and myself, is very similar to the proof by Bauer et al. of (2).

1. Ideas

For a positive integer parameter ${t}$, call ${G^t}$ the multigraph whose adjacency matrix is ${A^t}$, where ${A}$ is the adjacency matrix of ${G}$. So ${G^t}$ is a ${d^t}$-regular graph, and each edge in ${G^t}$ corresponds to a length-${t}$ walk in ${G}$. Our proof boils down to showing

$\displaystyle ARV(G^t) \leq O(\sqrt{dt}) \cdot ARV(G) \ \ \ \ \ (4)$

which gives (1) after we note that

$\displaystyle \lambda_2 (G^t) = 1 - (1 - \lambda_2(G))^t$

and

$\displaystyle ARV(G^t) \geq \lambda_2 (G^t)$

and we combine the above inequalities with ${t := 1/\lambda_2 (G)}$ (so that ${(1-\lambda_2(G))^t \leq 1/e}$, which is where the ${\Omega(1)}$ below comes from):

$\displaystyle \Omega(1) \leq 1 - (1 - \lambda_2(G))^t = \lambda_2 (G^t) \leq ARV(G^t) \leq O(\sqrt{dt}) \cdot ARV(G)$

The reader will see that our argument could also prove (roughly as done by Bauer et al.) that ${\phi(G^t) \leq O(\sqrt {dt}) \cdot \phi(G)}$, simply by reasoning about distributions of cuts instead of reasoning about ARV solutions, which would give (2) more directly. By reasoning about ARV solutions, however, we are also able to establish (3), which we think is independently interesting.

It remains to prove (4). I will provide a completely self-contained proof, including the definition of Cayley graphs and of the ARV relaxation, but, first, here is a summary for the reader already familiar with this material: we take an optimal solution of ARV for ${G}$, and bound how well it does for ${G^t}$. We need to understand how much bigger is the fraction of edges cut by the solution in ${G^t}$ compared to the fraction of edges cut in ${G}$; a random edge in ${G^t}$ is obtained by randomly sampling ${t}$ generators and adding them together, which is roughly like sampling ${t/d}$ times each generator, each time with a random sign. Because of cancellations, we expect that the sum of these ${t}$ random generators can be obtained by summing roughly ${\sqrt{t/d}}$ copies of each of the ${d}$ generators, or ${\sqrt{td}}$ generators in total. So a random edge of ${G^t}$ corresponds roughly to ${\sqrt{td}}$ edges of ${G}$, and this is by how much, at most, the fraction of cut edges can grow.
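
Here is a small numerical illustration (mine, not part of the proof) of this cancellation heuristic: summing ${t}$ random generators of an Abelian group can typically be rewritten, after cancellations, as a sum of about ${\sqrt{td}}$ generators, because the net signed count of each of the ${d}$ generators behaves like a random walk of length about ${t/d}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, trials = 10, 1000, 2000            # d generator pairs {+g_i, -g_i}, walks of length t

net_totals = []
for _ in range(trials):
    gens = rng.integers(0, d, size=t)    # which generator is used at each step
    signs = rng.choice([-1, 1], size=t)  # with which sign
    counts = np.zeros(d, dtype=int)
    np.add.at(counts, gens, signs)       # net signed count of each generator
    net_totals.append(np.abs(counts).sum())   # generators still needed after cancellation

print("average generators needed:", np.mean(net_totals))
print("sqrt(t*d):                ", np.sqrt(t * d))
```

The two numbers agree up to a small constant factor, which is all the heuristic asks for.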

2. Definitions

Now we present more details and definitions. Recall that if ${\Gamma}$ is a group, for which we use additive notation, and ${S}$ is a set or multiset of group elements, then the Cayley graph ${Cay(\Gamma,S)}$ is the graph that has a vertex for every group element and an edge ${(x,x+s)}$ for every group element ${x\in \Gamma}$ and every element ${s\in S}$. We restrict ourselves to undirected graphs, so we will always assume that if ${s}$ is an element of ${S}$, then ${-s}$ is also an element of ${S}$ (with the same multiplicity, if ${S}$ is a multiset). Note that the resulting undirected graph is ${|S|}$-regular.

Several families of graphs can be seen to be Cayley graphs, including cycles, cliques, balanced complete bipartite graphs, hypercubes, toruses, and so on. All the above examples are actually Cayley graphs of Abelian groups. Several interesting families of graphs, for example several families of expanders, are Cayley graphs of non-Abelian groups, but the result of this post will apply only to Abelian groups.

To define the ARV relaxation, let us take it slowly and start from the definition of the sparsest cut problem. The edge expansion problem ${\phi(G)}$ is closely related to the sparsest cut problem, which can be defined as

$\displaystyle \sigma(G) := \min_{x \in \{0,1\}^V} \ \ \frac{x^T L_G x}{x^T L_K x }$

where ${L_G = I - A/d}$ is the normalized Laplacian of ${G}$ and ${L_K = I - J/n}$ is the normalized Laplacian of the clique on ${n}$ vertices. We wrote it this way to emphasize the similarity with the computation of the second smallest normalized Laplacian eigenvalue, which can be written as

$\displaystyle \lambda_2 (G) = \min_{x \in R^V : \ x \perp {\bf 1}} \ \ \frac {x^T L_G x}{x^Tx } = \min_{x\in R^V} \frac {x^T L_G x}{x^T L_K x}$

where the second equality is perhaps not obvious but is easily proved (all the solutions ${x\perp {\bf 1}}$ have the same cost function on both sides, and the last expression is shift invariant, so there is no loss in optimizing over all ${{\mathbb R}^V}$ or just in the space ${\perp {\bf 1}}$). We see that the computation of ${\lambda_2}$ is just a relaxation of the sparsest cut problem, and so we have

$\displaystyle \lambda_2(G) \leq \sigma(G)$
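
(As an aside, for a concrete ${d}$-regular graph the quantity ${\lambda_2(G)}$ defined above is just an eigenvalue computation; a minimal numpy sketch of my own, purely for illustration:)

```python
import numpy as np

def lambda_2(adj):
    """Second smallest eigenvalue of the normalized Laplacian I - A/d
    of a d-regular graph given by its 0/1 adjacency matrix."""
    n = adj.shape[0]
    d = adj[0].sum()
    L = np.eye(n) - adj / d
    return np.sort(np.linalg.eigvalsh(L))[1]
```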

We can write the sparsest cut problem in a less algebraic version as

$\displaystyle \sigma(G) = \min_{S\subseteq V} \ \ \frac{ cut(S)}{\frac dn \cdot |S| \cdot |V-S| }$

and recall that

$\displaystyle \phi(G) = \min_{S\subseteq V: |S| \leq \frac {|V|} 2} \ \ \frac{ cut(S)}{d |S|}$

The cost function for ${\sigma(G)}$ does not change if we switch ${S}$ with ${V-S}$, so ${\sigma(G)}$ could be defined equivalently as an optimization problem over subsets ${S\subseteq V: |S| \leq |V|/2}$, and at this point ${\sigma(G)}$ and ${\phi(G)}$ are the same problem except for an extra factor of ${|V-S|/n}$ in the denominator of ${\sigma(G)}$. Such a factor is always between ${1/2}$ and ${1}$, so we have:

$\displaystyle \phi(G) \leq \sigma(G)\leq 2 \phi(G)$

(Note that all these definitions have given us a particularly convoluted proof that ${\lambda_2(G) \leq 2 \phi(G)}$.)

Yet another way to characterize ${\lambda_2}$ is as

$\displaystyle \lambda_2 (G) = \min_{x\in R^V} \frac {x^T L_G x}{x^T L_K x} = \min_{X \succeq {\bf 0}} \frac{L_G \bullet X}{L_K\bullet X}$

where ${A\bullet B = \sum_{i,j} A_{i,j} B_{i,j}}$ is the Frobenius inner product between matrices. This is also not obvious, but it is not difficult to prove; the main point is that if we write a PSD matrix ${X}$ as ${\sum_i x_ix_i^T}$, then

$\displaystyle \frac{L_G \bullet X}{L_K\bullet X} = \frac{\sum_i x_i^T L_G x_i}{\sum_i x_i^T L_Kx_i} \geq \min_i \frac{ x_i^T L_G x_i}{x_i^T L_Kx_i}$

and so there is no loss in passing from an optimization over all PSD matrices versus all rank-1 PSD matrices.

We can rewrite ${\lambda_2 = \min_{X \succeq {\bf 0}} \frac{L_G \bullet X}{L_K\bullet X} }$ in terms of the Cholesky decomposition of ${X}$ as

$\displaystyle \begin{array}{llr} \min & \frac {\sum_{(u,v)\in E} || x_u - x_v||^2 }{\frac dn \sum_{u < v} ||x_u - x_v ||^2 } \\ \mbox{s.t}\\ & x_v \in {\mathbb R}^m & \ \ \forall v\in V\\ & m\geq 1 \end{array}$

where the correspondence between PSD matrices ${X}$ and vectors ${\{ x_v \}_{v\in V}}$ is that we have ${X_{u,v} = \langle x_u,x_v \rangle}$ (that is, ${\{ x_v \}_{v\in V}}$ is the Cholesky decomposition of ${X}$ and ${X}$ is the Gram matrix of the ${\{ x_v \}_{v\in V}}$). An integral solution of the sparsest cut problem corresponds to choosing rank ${m=1}$ and solutions in which each 1-dimensional ${x_v}$ is either 1 or 0, according to whether ${v}$ is in the set ${S}$ or not. The ARV relaxation is

$\displaystyle \begin{array}{llr} \min & \frac {\sum_{(u,v)\in E} || x_u - x_v||^2 }{\frac dn \sum_{u < v} ||x_u - x_v ||^2 } \\ \mbox{s.t}\\ & ||x_u - x_v||^2 \leq ||x_u - x_z||^2 + ||x_z - x_v||^2 &\ \ \forall u,v,z \in V\\ & x_v \in {\mathbb R}^m & \forall v\in V\\ & m\geq 1 \end{array}$

which is a relaxation of sparsest cut because the “triangle inequality” constraints that we introduced are satisfied by 1-dimensional 0/1 solutions. Thus we have

$\displaystyle \lambda_2 (G) \leq ARV(G) \leq \sigma(G) \leq 2 \phi(G)$
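
For small graphs one can solve the ARV relaxation directly. Here is a minimal cvxpy sketch (my own, purely illustrative, and usable only for tiny ${n}$ since it writes down all ${O(n^3)}$ triangle inequalities explicitly), working with the Gram matrix of the vectors ${x_v}$ and fixing the scale-invariant denominator to ${1}$:

```python
import itertools
import numpy as np
import cvxpy as cp

def arv_value(adj):
    """ARV relaxation value for a d-regular graph given by a 0/1 adjacency matrix."""
    n = adj.shape[0]
    d = int(adj[0].sum())
    X = cp.Variable((n, n), PSD=True)                       # Gram matrix of the x_v
    dist = lambda u, v: X[u, u] + X[v, v] - 2 * X[u, v]     # ||x_u - x_v||^2
    pairs = list(itertools.combinations(range(n), 2))
    constraints = [(d / n) * sum(dist(u, v) for u, v in pairs) == 1]   # fix denominator
    for u, v, z in itertools.permutations(range(n), 3):                # triangle inequalities
        constraints.append(dist(u, v) <= dist(u, z) + dist(z, v))
    numerator = sum(dist(u, v) for u, v in pairs if adj[u, v])
    prob = cp.Problem(cp.Minimize(numerator), constraints)
    prob.solve()
    return prob.value

# e.g. the 6-cycle:
# adj = np.roll(np.eye(6, dtype=int), 1, axis=1) + np.roll(np.eye(6, dtype=int), -1, axis=1)
# arv_value(adj)
```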

3. The Argument

Let us take any solution for ARV, and let us symmetrize it so that the symmetrized solution ${x}$ satisfies

$\displaystyle || x_u - x_{u + s} ||^2 = || x_v - x_{v+s} ||^2 \ \ \forall s\in S \ \forall u,v\in V$

That is, make sure that the contribution of each edge to the numerator of the cost function depends only on the generator that defines the edge, and not on the pair of endpoints.

It is easier to see that this symmetrization is possible if we view our solution as a PSD matrix ${X}$. In this case, for every group element ${g}$, we can let ${X_g}$ be the solution (with the same cost) obtained by permuting rows and columns according to the mapping ${v \rightarrow g+ v}$; then we can consider the solution ${\frac 1n \sum_{g\in V} X_g}$, which satisfies the required symmetry condition.
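
For example, for a cyclic group ${{\mathbb Z}_n}$ the symmetrization is just an average of the Gram matrix over all cyclic relabelings of the vertices; a tiny numpy sketch of my own (averaging preserves positive semidefiniteness and the triangle inequalities, so the result is still a feasible ARV solution):

```python
import numpy as np

def symmetrize_cyclic(X):
    """Average the Gram matrix X over the maps v -> v + g for all g in Z_n."""
    n = X.shape[0]
    acc = np.zeros_like(X, dtype=float)
    for g in range(n):
        perm = (np.arange(n) + g) % n           # the relabeling v -> v + g
        acc += X[np.ix_(perm, perm)]
    return acc / n
```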

Because of this condition, the cost function applied to ${\{ x_v \}_{v\in V}}$ in ${G}$ is

$\displaystyle \frac { \frac n2 \cdot \sum_{s\in S} || x_s - x_0 ||^2}{ \frac dn \sum_{u < v} ||x_u - x_v ||^2 }$

and the cost function applied to ${\{ x_v \}_{v\in V}}$ in ${G^t}$ is

$\displaystyle \frac { \frac n2 \cdot \sum_{s_1,\ldots,s_t\in S^t} || x_{s_1 + \cdots + s_t} - x_0 ||^2}{ \frac {d^t} n \sum_{u < v} ||x_u - x_v ||^2 }$

meaning that our goal is now simply to prove

$\displaystyle \frac 1{d^t} \sum_{(s_1,\ldots,s_t)\in S^t} || x_{s_1 + \cdots + s_t} - x_0 ||^2 \leq O(\sqrt {dt}) \cdot \frac 1d \sum_{s\in S} || x_s - x_0 ||^2$

or, if we take a probabilistic view