at October 16, 2021 07:45 PM UTC
Theory of Computing Blog Aggregator

I was interested to see a familiarlooking graph drawing as one of the answers to the prompt “multiplicity” for the first entry in Mathober 2021 (\(\mathbb{M}\)). It’s a multigraph formed by a triangle with tripled edges, and looks a lot like the drawing I made for the Wikipedia Shannon multigraph article, prettied up by making an infinitely recessing sequence of these drawings rather than just one. Good choice for multiplicity.

Nonconcentration of the chromatic number of random graphs (\(\mathbb{M}\)). Uniformly random graphs, \(G(n,1/2)\) in the Erdős–Rényi–Gilbert model, turn out to have chromatic numbers that, for infinitely many \(n\), are spread over roughly \(\sqrt{n}\) values. But there are weird fluctuations so that, conjecturally, for some \(n\) the concentration is much tighter, more like \(n^{1/4}\).

University System of Georgia moves to gut tenure (\(\mathbb{M}\)). The proposed new policy includes procedures for removal of tenure under certain enumerated grounds, including failure to perform their jobs (this is pretty normal) but then adds a massive escape clause in which the board of regents can remove tenured faculty at any time as long as their reason for doing so is not one of the enumerated ones.

The first physical models of the hyperbolic plane, made in 1868 by Beltrami (\(\mathbb{M}\), via), blog post by Daina Taimiņa from 2010. Maybe you could make something like this by wrapping and stretching a disk of wet paper in a roll around a pseudosphere (https://en.wikipedia.org/wiki/Pseudosphere)? The rolledup photo of Beltrami’s model suggests that he did that. The via link shows this as a tangent to a story about triangulated polygons, frieze patterns, and the Farey tessellation.

Why do bees make rhombic dodecahedrons (\(\mathbb{M}\))? Nice video from Matt Parker (Standup Maths) on why bees usually end the cells of their honeycombs with rhombic dodecahedron faces, why this isn’t the optimal solution to fitting two layers of cells together (in terms of minimum wax usage), and why it isn’t reasonable to expect bees to find exact optima for this problem. If I have to quibble with something, though, it’s his plural. It’s not wrong, but see Google ngrams.

Mathematicians prove melting ice stays smooth (\(\mathbb{M}\), see also). The headline is a little overstated: you’re probably familiar with thin necks of ice melting to sharp points at the instant they separate. But these singularities are instantaneous: mathematical models of ice stay smooth for all but a measurezero set of times. Original paper: “The singular set in the Stefan problem”, Alessio Figalli, Xavier RosOton, and Joaquim Serra.

Discussion of the recent meme telling programmers and mathematicians that summation notation and for loops are the same thing. They’re not quite the same, though: summations don’t have an order of evaluation. But which is easier for people who don’t already know what they are to search and find out about? And why do programmers get angry at nonprogramming notational conventions?

Mathematics, morally (\(\mathbb{M}\), via), Eugenia Cheng, 2004. Somehow I hadn’t run across this before. It argues that much philosophy of mathematics is irrelevant to practice (“You can’t tell from somebody’s mathematics if they are a fictionalist, a rationalist, a platonist, a realist, an operationalist, a logicist, a formalist, structuralist, nominalist, intuitionist.”) and instead considerations of the right way of handling certain topics are more central.

The SIGACT Committee for the Advancement of Theoretical Computer Science is collecting information on women in theoretical computer science (\(\mathbb{M}\)). If this is you, please see their announcement for details of how to be counted.

Cynthia Rudin wins major award with silly name (\(\mathbb{M}\)), for her work on machine learning systems that learn to predict behavior using simple, interpretable, and transparent formulas.

According to the SODA web site, SIAM has decided that their conferences will be hybrid through July (\(\mathbb{M}\)). So if (like me) you wanted to participate in SODA/SOSA/ALENEX/APOCS, but were worried about planning a trip to Virginia with another predicted winter wave of Covid, now you can stay home and conference safely. Or, if you feel hybrid conferences are problematic and organizers should do one or the other but not both, now you have another reason to be annoyed.

Rarely is the question asked: Are math papers getting longer? (\(\mathbb{M}\)). Following earlier work by Nick Trefethen, Edward Dunne provides some data suggesting that (for certain journals, at least, and not the ones with page limits) the answer is yes. I’m not convinced by the suggested explanation that it’s because they are taking more effort to explain “connections with other work”, though: is that really a big enough fraction of most papers?

I haven’t been using my office desktop Mac much because I haven’t been into my office much, so it took me a while to pay attention to the fact that much of its networking had recently broken. Here’s why (\(\mathbb{M}\)). It was still running OS X El Capitan (10.11.6) and a crucial toplevel certificate expired. The machine is too old (late 2009) for the latest OS X but it looks like I can and should upgrade to High Sierra, 10.13. So much for getting anything else accomplished today…
by David Eppstein at October 15, 2021 03:43 PM UTC
When László Babai first announced his graph isomorphism in quasipolynomial time result, I wrote
We think of theory as a young person's game, most of the big breakthroughs coming from researchers early in their careers. Babai is 65, having just won the Knuth Prize for his lifetime work on interactive proofs, group algorithms and communication complexity. Babai uses his extensive knowledge of combinatorics and group theory to get his algorithm. No young researcher could have had the knowledge base or maturity to be able to put the pieces together the way that Babai did.
Babai's proof is an exceptional story, but it is exceptional. Most CS theorists have done their best work early in their career. I got myself into a twitter discussion on the topic. For me, I'm proud of the research I did through my forties, but I'll always be best known, research wise, for my work on interactive proofs around 1990. It would be hard to run a scientific study to determine cause and effect but here are some reasons, based on my own experiences, on why we don't see research dominated by the senior people in theory.
The field changes  Computation complexity has moved from a computationalbased discipline to one now dominated by combinatorics, algebra and analysis. I'm not complaining, a field should evolve over time but it plays less to my strengths. It's hard to teach this old dog new tricks.by Lance Fortnow (noreply@blogger.com) at October 15, 2021 01:32 PM UTC
York University in Toronto, Canada is inviting applications for five tenured or tenuretrack positions in the theory of computing or data science (both broadly interpreted). The review of applications will begin on November 15, and full applications are due by November 30.
Website: https://lassonde.yorku.ca/about/careers/facultyrecruitment
Email: eecsjoin@yorku.ca
by shacharlovett at October 15, 2021 01:01 PM UTC
at October 15, 2021 04:00 AM UTC
The NSF Quantum Leap Challenge Institute for Robust Quantum Simulation (https://rqs.umd.edu) seeks exceptional candidates for the RQS Postdoctoral Fellowship. Applications should be submitted through AcademicJobsOnline at https://academicjobsonline.org/ajo/jobs/19910.
Website: https://academicjobsonline.org/ajo/jobs/19910
Email: quicscoordinator@umiacs.umd.edu
by shacharlovett at October 15, 2021 12:45 AM UTC
Authors: Steven Delong, Alireza Farhadi, Rad Niazadeh, Balasubramanian Sivan
Download: PDF
Abstract: We study the classic online bipartite matching problem with a twist: offline
nodes are reusable any number of times. Every offline node $i$ becomes
available $d$ steps after it was assigned to. Nothing better than a
$0.5$approximation, obtained by the trivial deterministic greedy algorithm,
was known for this problem. We give the first approximation factor beating
$0.5$, namely a $0.505$ approximation, by suitably adapting and interpreting
the powerful technique of Online Correlated Selection.
at October 15, 2021 10:39 PM UTC
Authors: Daniel Grier, Daniel J. Brod, Juan Miguel Arrazola, Marcos Benicio de Andrade Alonso, Nicolás Quesada
Download: PDF
Abstract: Gaussian boson sampling is a model of photonic quantum computing that has
attracted attention as a platform for building quantum devices capable of
performing tasks that are out of reach for classical devices. There is
therefore significant interest, from the perspective of computational
complexity theory, in solidifying the mathematical foundation for the hardness
of simulating these devices. We show that, under the standard
AntiConcentration and PermanentofGaussians conjectures, there is no
efficient classical algorithm to sample from ideal Gaussian boson sampling
distributions (even approximately) unless the polynomial hierarchy collapses.
The hardness proof holds in the regime where the number of modes scales
quadratically with the number of photons, a setting in which hardness was
widely believed to hold but that nevertheless had no definitive proof.
Crucial to the proof is a new method for programming a Gaussian boson sampling device so that the output probabilities are proportional to the permanents of submatrices of an arbitrary matrix. This technique is a generalization of Scattershot BosonSampling that we call BipartiteGBS. We also make progress towards the goal of proving hardness in the regime where there are fewer than quadratically more modes than photons (i.e., the highcollision regime) by showing that the ability to approximate permanents of matrices with repeated rows/columns confers the ability to approximate permanents of matrices with no repetitions. The reduction suffices to prove that GBS is hard in the constantcollision regime.
at October 15, 2021 10:37 PM UTC
Authors: Yu Tong, Victor V. Albert, Jarrod R. McClean, John Preskill, Yuan Su
Download: PDF
Abstract: Quantum manybody systems involving bosonic modes or gauge fields have
infinitedimensional local Hilbert spaces which must be truncated to perform
simulations of realtime dynamics on classical or quantum computers. To analyze
the truncation error, we develop methods for bounding the rate of growth of
local quantum numbers such as the occupation number of a mode at a lattice
site, or the electric field at a lattice link. Our approach applies to various
models of bosons interacting with spins or fermions, and also to both abelian
and nonabelian gauge theories. We show that if states in these models are
truncated by imposing an upper limit $\Lambda$ on each local quantum number,
and if the initial state has low local quantum numbers, then an error at most
$\epsilon$ can be achieved by choosing $\Lambda$ to scale polylogarithmically
with $\epsilon^{1}$, an exponential improvement over previous bounds based on
energy conservation. For the HubbardHolstein model, we numerically compute a
bound on $\Lambda$ that achieves accuracy $\epsilon$, obtaining significantly
improved estimates in various parameter regimes. We also establish a criterion
for truncating the Hamiltonian with a provable guarantee on the accuracy of
time evolution. Building on that result, we formulate quantum algorithms for
dynamical simulation of lattice gauge theories and of models with bosonic
modes; the gate complexity depends almost linearly on spacetime volume in the
former case, and almost quadratically on time in the latter case. We establish
a lower bound showing that there are systems involving bosons for which this
quadratic scaling with time cannot be improved. By applying our result on the
truncation error in time evolution, we also prove that spectrally isolated
energy eigenstates can be approximated with accuracy $\epsilon$ by truncating
local quantum numbers at $\Lambda=\textrm{polylog}(\epsilon^{1})$.
at October 15, 2021 10:40 PM UTC
The Department of Computer Science at CSUEB invites applications for 2 tenuretrack appointments as Assistant Professor in Computer Science (considering all areas of computer science, capable of teaching in emerging areas) starting Fall 2022.
Website: http://www.csueastbay.edu/oaa/jobs/csueb.html
Email: cs@csueastbay.edu
by shacharlovett at October 14, 2021 05:42 PM UTC
In a series of two blog posts, we dive into how to build practical certified defenses against adversarial patches. In Part I below, we give an overview of smoothingbased defenses, a popular class of certified defenses, and walk through a specific example of using derandomized smoothing to defend against adversarial patch attacks. In Part II, we discuss our latest work that demonstrates how using vision transformers with derandomized smoothing (smoothed ViTs) can significantly improve these defenses. Our approach not only provides better certified defenses against adversarial patches but also maintains standard accuracy and inference times at levels comparable to that of regular (nonrobust) models. Thanks to these improvements, smoothed ViTs become a truly viable alternative to regular models.
Adversarial examples are small modifications of inputs that can induce machine learning models to misclassify those inputs. These attacks can be roughly categorized into two broad classes: small, synthetic perturbations that are visually indistinguishable from the original input, and physically realizable attacks that can break deployed machine learning systems in the wild. One popular attack in the latter category, known as adversarial patches, maliciously perturbs a bounded region in the image (typically a square).
Adversarial patches can be used to undermine a variety of visionbased tasks—in particular, have been used to deceive image classifiers, manipulate object detectors, and disrupt optical flow estimation. Such attacks tend to be relatively easy to implement too as they just require, e.g., printing adversarial stickers, creating adversarial pieces of clothing,or manufacturing adversarial glasses.
While several heuristic defenses have been proposed to protect against adversarial patches, some of these defenses were shown to not be fully effective. This is unfortunately a fairly common theme in adversarial examples research.
Certified defenses attempt to alleviate such empirical evaluation problems by providing robustness guarantees that are provable. However, certified guarantees tend to be modest and come at a steep cost. For instance, recently proposed certified defenses against adversarial patches can guarantee only 14% robust accuracy on ImageNet. Furthermore, certified models are often substantially slower (by even 24 orders of magnitude) and significantly less accurate (in terms of standard accuracy) than nonrobust models, which severely limits their practicality.
Smoothing models for provable robustness
We begin our discussion of certified defenses with an overview of smoothing, a general strategy for creating models that are provably robust to adversarial attacks. They revolve around the idea of deriving a (robust) prediction by aggregating many variations of the original input (e.g., ones that correspond to adding Gaussian noise, masking parts of the input, or augmenting it with generative models). The underlying intuition is that while it might be easy for an adversary to attack a single input, it is much harder for them to simultaneously attack all of the input variations. Thus, instead of directly classifying the original input, we classify each of such variations of that input and then output the most frequently predicted class—if the number of variations that agree with this class is high enough then the prediction is guaranteed to be robust.
Let’s dive into the details! To create a smoothed classifier, we change the typical forward pass of our classifier (often referred to as the base classifier) into a threestep procedure:
 Construct variations of the input.
 Classify all input variations individually.
 Output the most frequently predicted class.
Now, once we have such a smoothed classifier, how can we certify its robustness? The key assumption for certification is that an adversary can only affect a limited number of input variations. If this is true, certification amounts to a simple counting problem: if the difference between the first and secondmost frequently predicted classes is larger than what an adversary could ever change via their manipulation, then the prediction is guaranteed to be robust! Importantly, for this approach to work well, it is key that these variations remain informative of the correct class while remaining hard for the adversary to perturb many of them simultaneously. This gives rise to a natural tradeoff between robustness and accuracy.
By choosing different types of input variations, smoothing techniques can defend against different types of adversarial attacks. For example, smoothing classifiers with random additive noise can provide certified defenses against adversarial L2 and L1 attacks. On the other hand, smoothing over “pixel flows” can provide certified defenses against Wasserstein perturbations. One can even “smooth” the entire pipeline of both training and inference (treating it as a single module), to get robustness to data poisoning attacks. In general, if we can construct a set of input variations so that the adversary can only affect some fraction of them, we can use smoothing to defend against the corresponding attacks!
Derandomized Smoothing: A Smoothing Defense against Patch Attacks
How can we use smoothing to defend against adversarial patches specifically? Recall that to be able to create a smoothing defense, we need to come up with a way to generate variations of an image that limit the effect of an adversarial patch and still be indicative of the correct class label. Here, we exploit a key property of adversarial patches: they are limited to a small, contiguous region of the input image.
Specifically, for each variation, we mask out all of the image except for a small region. In most of these variations, the adversarial patch will be fully masked, and thus cannot affect the model’s prediction by construction. This technique was originally introduced in a certified defense known as derandomized smoothing (Levine & Feizi, 2020). Typically, we leave a column of the image unmasked as shown in the figure below, otherwise known as a column ablation (check out the original paper for some other ablation types).
Certified guarantee against patch attacks
Now that we have our smoothed classifier, how can we prove that the prediction is robust to patch attacks? Since a patch is of limited size, an adversary can only affect the few image variations that physically overlap with the patch. Consequently, if the margin of the winning prediction is large enough, then we can guarantee robustness to adversarial patches.
To make this more precise, let $\Delta$ be the number of ablations that a patch could simultaneously affect (in the worst case). If the winning margin is at least $2\Delta$, then the smoothing defense guarantees that the predicted class is robust to this patch size. The fraction of examples which meet this certification threshold is typically referred to as certified accuracy.
Challenges for Certified Patch Defenses
Although certified patch defenses can provide guarantees against adversarial attacks, like many other certified defenses, they face several major challenges that limit their practicality:
 They provide relatively modest guarantees, and can only guarantee robustness for a relatively small fraction of inputs and/or only a small patch size.
 The standard accuracy is typically much lower than that of a standard, nonrobust model, forcing on the practitioners an unfavorable “tradeoffs” between robustness and accuracy.
 Inference times tend to be several orders of magnitude larger than that of a standard, nonrobust model, making certified defenses difficult to deploy in realtime settings.
In the next post, we discuss our latest work on making certified defenses much more practical. Specifically, we show how leveraging vision transformers (ViT) enables better handling of input variations and significant improvement of the margins for certification. With some straightforward modifications to the ViT architecture, we are also able to develop certifiably robust models that not only outperform previous certified defenses but also have practical inference times and standard accuracies.
at October 14, 2021 12:00 AM UTC
In our previous post, we gave an overview of smoothingbased approaches to certified defenses, and described a certified patch defense known as derandomized smoothing. In our latest work, we show how to leverage vision transformers (ViTs) to significantly improve such certified patch defenses along all possible axes: standard accuracy, certified accuracy, and speed. In fact, with some straightforward modifications, our certified ViTs not only enable significantly better certified robustness but they also maintain standard accuracy and inference times that are comparable to that of regular (nonrobust) models. This progress marks a substantial step forward for certified defenses as a practical alternative to regular models.
The main appeal of certified defense is their ironclad guarantees: a certified result leaves no room for doubt and is not prone to blindspots of empirical evaluation. However, practitioners care not only about certified performance: the accuracy and the inference speed of the model are equally important factors too. Unfortunately, nearly all certified defenses come at a severe price: (1) the standard accuracy plummets in comparison to a standard model and/or (2) inference time is orders of magnitude slower than that of a standard model.
This tradeoff between robustness and performance can be untenable even for safetycritical applications. For example, object detection systems deployed in autonomous vehicles need to recognize obstacles and signs quickly in real time. Consequently, in addition to being robust, these systems also need to be fast and accurate.
To approach this tradeoff between robustness and accuracy, we take a closer look at certified defenses from an architectural perspective. Specifically, smoothing defenses use a backbone architecture that is typically implemented using classic convolutional networks, the default choice across much of computer vision. But are these architectures really the best tool for the job?
In our work, we show that when moving from standard models to certified defenses, it may be worth it to rethink the core architecture behind them. Specifically, we find that smoothingbased patch defenses can drastically improve when using vision transformers (ViTs) instead of convolutional networks as their backbone. With some additional adjustments to the transformer architecture, we not only achieve superior certified accuracy to prior work but also reach standard accuracies and inference times comparable to a standard, nonrobust ResNet!
Rethinking smoothing backbones
Recall from the previous post that smoothing defenses use a base classifier to make predictions on many variations of the original input. For patch defenses, this amounts to using the base classifier to classify column ablations (i.e., masked images that only reveal a small column of the original image). Consequently, the accuracy of the base classifier at classifying columns of an image directly translates into the performance of the certified defense.
In the light of this observation, it is crucial to realize that convolutional networks typically process complete (i.e., not masked) images. So, this type of architecture might not actually be best suited for classifying the highly masked images used in smoothing defenses.
Why might we expect convolutions to be suboptimal for masked images? CNNs slide filters across the image and grow the receptive field across layers of the network. This large receptive field, while usually being a strength of CNNs for standard settings, forces the network to process a large amount of masked pixels when processing ablated images. Since these pixels carry no information, the network must learn to detect and ignore these masked regions (in addition to performing unnecessary computation).
Vision transformers for ablated images
Unlike CNNs, vision transformers (ViTs) use attention modules instead of convolutions as their main building block. Crucially, attention modules can ignore masked regions instead of processing them.
To test out this intuition, we measured the accuracy of various convolutional and ViT architectures at classifying column ablations. It turns out that ViTs are significantly better at this task—for example, on ImageNet, a ViTS has 12% higher accuracy at classifying column ablations than a similarly sized ResNet50.
These improvements in ablation accuracy directly boosts both the certified and standard accuracies for the smoothed model. For example, the increase in ablation accuracy improves the certified accuracy of a smoothed ViTS by 13% (as compared to a similarly sized ResNet50) and the standard accuracy by 12%.
With some additional adjustments to the certification procedure described in our paper, our largest certifiably robust vision transformer can reach 73% standard accuracy—only 3% off from the standard accuracy of a regular, nonrobust ResNet50, and almost 30% higher than previous versions of derandomized smoothing!
Modifying ViTs to Speed up Inference Time
Also, recall that a smoothed classifier tends to be significantly slower than a standard model because it needs to classify a large number of input variations for each input image. Indeed, traditional randomized smoothing approaches have used upwards of 100,000 variations, resulting in a forward pass that is 45 orders of magnitude slower than a standard model. These costs are infeasible in scenarios where decisions need to be made in a fraction of a second, such as autonomous driving. As Levine & Feizi have shown, for patch defenses, a move towards deterministic smoothing (as opposed to randomized) can result in a great speed up (by reducing the number of variations needed to be classified). However, the resulting solution is still two orders of magnitude slower than standard models.
In order to speed up inference (much) further, we leverage the tokenbased nature of ViTs to gracefully handle image ablations. Specifically, note that ViTs process images as tokens, where each token represents a contiguous patch of the input image. Since the runtime of a ViT is proportional to the number of tokens, a natural approach to speed up inference is to simply drop the masked out tokens, as shown in the following figure:
We modify the ViT architecture to drop fully masked tokens in such a manner. As expected, this significantly speeds up the inference time of the smoothed model! Indeed, with this optimization, our smoothed ViTS is faster than a smoothed ResNet50 by an order of magnitude. In our paper, we discuss how additional improvements to the certification procedure can make our ViT architectures another order of magnitude faster. In total, our fastest smoothed ViT is actually comparable in terms of inference speed to a standard, nonrobust ResNet.
Conclusion
Smoothed ViTs present a simple but effective way to achieve stateofthe art certified accuracy. These models are also realistically deployable, as they maintain similar accuracy and inference speed as nonrobust models. The key here is the vision transformer’s inherent ability to quickly and accurately process image ablations. These advancements are thus a step towards certified defenses that can be practically deployed in realworld scenarios.
at October 14, 2021 12:00 AM UTC
I have a new preprint, “Finding Relevant Points for NearestNeighbor Classification”, arXiv:2110.06163, to appear in January at the SIAM Symposium on Simplicity in Algorithms (SOSA22). It’s about points in Euclidean spaces of dimension three or more, but I thought it would make a good warmup to discuss here the onedimensional version of the same problem, solved (together with the 2d version) by Bremner, Demaine, Erickson, Iacono, Langerman, Morin, and Toussaint in their paper “Outputsensitive algorithms for computing nearestneighbour decision boundaries”, Discrete Comput. Geom. 2005.
So in this problem, you have a collection of realvalued data points with known discrete classifications (say, a finite set of colors), and you want to guess the color of new points whose color is not already given. Nearest neighbor classification means simply that, to guess the color of \(x\), you find the closest known point \(y\) and guess that \(x\) has the same color as \(y\). One easy way to do this would be to store the known points in a sorted list and use binary search. There’s lots more to say about this (for instance its use in combination with random projections for highdimensional approximate nearest neighbors) but for today I want to focus on the size of this sorted list. We can store a list that is potentially much smaller, but always produces the same results, by keeping only points that have a neighbor with a different classification.
These points with differentlyclassified neighbors are the “relevant points” of the preprint title. Another way of describing them is that a point is relevant if deleting it would change the classification of some other (unknown) points in space that might later be queried. Among the set of decision boundaries, the ones separating parts of space with different classifications are the ones defined by relevant points. So if we store a nearestneighbor data structure with only relevant points, we will get the same answer as if we store all known points. But because we’re storing fewer points, it will take less memory and less time (especially if the reduced memory allows it to fit into cache).
If you have the points given to you in sorted order, then it’s easy enough to scan through them in that order, keeping track of which pairs have different classifications, and produce a filtered sequence of the relevant ones. Here it is in Python (with apologies for the lowcontrast syntax coloring):
However, as Bremner et al observed, sorting an input to make it usable by this scan does more work than necessary, because we don’t need the sorted ordering of all the other points between the relevant ones. Instead, we can use an idea resembling quickselect, where we modify the quicksort algorithm to stop recursing in subproblems where sorting is unnecessary. For finding relevant points, these subproblems are the homogeneous ones, in which all points have the same classification as each other. Bremner et al combined this idea with a version of quicksort that always partitions its subproblems at the exact median, in order to achieve a good worstcase time bound, but if we’re happy with expected analysis we can use the same randompivoting idea as the more usual form of quicksort:
The time complexity of this algorithm can be analyzed much like the analysis of quickselect, by observing that the time is proportional to the number of comparisons with pivots, computing the probability that each possible comparison happens, and summing. In quicksort, two distinct elements at distance \(d\) from each other in the final sorted output are compared whenever one of them is the first to be chosen as a pivot in the interval between them, which happens with probability exactly \(2/(d+1)\). In quickrelevant, this same probability holds for pairs that are separated by one of the decision boundaries. But pairs of elements that are within the same homogeneous block are less likely to be compared, because of the possibility that the recursion will stop before it separates or compares them.
If a pair of elements lies within a single block, has distance \(d\) separating them, and is $e$ units from both ends of the block, then it will only be compared if one of the two elements is first to be chosen in at least one of the two intervals of length \(d+e\) extending from the two elements towards an end of their block. This happens with probability at most \(4/(d+e)\), because there are two ways of choosing which element to pick first as the pivot and two ways of choosing which extended interval it is first in.
If we sum up these probabilities, for pairs involving a single element that is \(e\) units from its nearest block boundary among a set of \(n\) elements, we get \(O\bigl(\log(n/e)\bigr)\) as the expected contribution to the total time for that one element. If we sum the contributions from the elements within a block, for a block of length \(\ell_i\), we get a total expected contribution from that block of \(O\bigl(\ell_i\log(n/\ell_i)\bigr)\). And if we have \(k\) relevant points and \(O(k)\) blocks, and we sum over all blocks, the total time is maximized when all the blocks are of equal size, \(\Theta(n/k)\), for which we get total time \(O(n\log k)\).
For quicksort and quickselect, it’s possible to be more careful in the analysis, derive exact formulas for the probability of making each comparison, and from them get an analysis of the expected number of comparisons that does not use \(O\)notation for its leading term; see my linked post on the quickselect analysis. Probably it’s possible here too but it looks messier. Maybe it would make a good undergraduate research project. One thing to be careful of is that the comparisons are not over in the homogeneous case; finding the min and max simultaneously, in a block of length \(\ell_i\), takes roughly \(3\ell_i/2\) comparisons. But that should only be a lowerorder term compared to the \(O(n\log k)\) leading term of the analysis.
by David Eppstein at October 13, 2021 11:05 PM UTC
We are looking for exceptional junior scientists to work collaboratively with a group of faculty from Computer Science (Boaz Barak), Statistics (Lucas Janson), Electrical Engineering (Demba Ba) and Applied Mathematics (Cengiz Pehlevan) towards a theory of representations in artificial and natural systems.
Website: https://academicpositions.harvard.edu/postings/10730
Email: theorypostdocapply@seas.harvard.edu
by shacharlovett at October 13, 2021 10:38 PM UTC
We are looking for junior scientists in theoretical computer science, broadly construed.
The normal duration of the Rabin Fellowship is two years. Rabin Fellows will receive a generous salary as well as an allocation for research and travel expenses.
While interaction with Harvard faculty, students, and visitors is encouraged, Rabin Fellows are free to pursue their own interests.
Website: https://academicpositions.harvard.edu/postings/10730
Email: theorypostdocapply@seas.harvard.edu
by shacharlovett at October 13, 2021 10:36 PM UTC
at October 13, 2021 03:27 PM UTC
by shacharlovett at October 13, 2021 03:05 AM UTC
Many times in my career, I’ve been told by respected statisticians that machine learning is nothing more than nonparametric statistics. The longer I work in this field, the more I think this view is both misleading and unhelpful. Not only can I never get a consistent definition of what “nonparametric” means, but the jump from statistics to machine learning is considerably larger than most expect. Statistics is an important tool for understanding machine learning and randomness is valuable for machine learning algorithm design, but there is considerably more to machine learning than what we learn in elementary statistics.
Machine learning at its core is the art and science of prediction. By prediction, I mean the general problem of leveraging regularity of natural processes to guess the outcome of yet unseen events. As before, we can formalize the prediction problem by assuming a population of $N$ individuals with a variety of attributes. Suppose each individual has an associated variable $X$ and $Y$. The goal of prediction is to guess the value of $Y$ from $X$ that minimizes some error metric.
A classic prediction problem aims to find a function that makes the fewest number of incorrect predictions across the population. Think of this function like a computer program that takes $X$ as input and outputs a prediction of $Y$. For a fixed prediction function, we can sum up all of the errors made on the population. If we divide by the size of the population, this is the mean error rate of the function.
A particularly important prediction problem is classification. In classification, the attribute $Y$ takes only two values: the input $X$ could be some demographic details about a person, and $Y$ would be whether or not that person was taller than 6 feet. The input $X$ could be an image, and $Y$ could be whether or not the image contains a cat. Or the input could be a set of laboratory results about a patient, and $Y$ could be whether or not the patient is afflicted by a disease. Classification is the simplest and most common prediction problem, one that forms the basis of most contemporary machine learning systems.
For classification problems, it is relatively straightforward to compute the best error rate achievable. First, for every possible value of the attribute $X$, collect the subgroup of individuals of the population with that value. Then, the best assignment for the prediction function is the one that correctly labels the majority of this subgroup. For example, in our height example, we could take all women, aged 30, born in the United States, and reside in California. Then the optimal label for this group would be decided based on whether there are more people in the group who are taller than 6 feet or not. (Answer: no).
This minimum error rule is intuitive and simple, but computing the rule exactly requires examining the entire population. What can we do if we work from a subsample? Just as was the case in experiment design, we’d like to be able to design good prediction functions from a small sample of the population so we don’t have to inspect all individuals. For a fixed function, we could use the same lawoflargenumbers approximations to estimate the best decision. That is, if we decide in advance upon a prediction function, we could estimate the percentage of mistakes on the population by gathering a random sample and computing the proportion of mistakes on this subset. Then we could apply a standard confidence interval analysis to extrapolate to the population.
However, what if we’d like to find a good predictor on the population using only a set of examples sampled from the population. We immediately run into an issue: to find the best prediction function, we needed to observe all possible values of $X$. What if we’d like to make predictions about an individual with a set of attributes that was not observed in our sample?
How can we build accurate populationlevel predictors from small subsamples? In order to solve this problem, we must make some assumptions about the relationship between predictions at related, but different values of $X$. We can restrict our attention to a set of functions that respect regularity properties that we think the predictions should have. Then, with a subsample from the population, we find the function that minimizes the error on the sample and obeys the prescribed regularity properties.
This optimization procedure is called ”empirical risk minimization” and is the core predictive algorithm of machine learning. Indeed, for all of the talk about neuromorphic deep networks with fancy widgets, most of what machine learning does is try to find computer programs that make good predictions on the data we have collected and that respect some sort of rudimentary knowledge that we have about the broader population.
The flexibility in defining what “knowledge” or “regularity” means complicates the solution of such empirical risk minimization problems. What does the right set of functions look like? There are three immediate concerns:

What is the right representation? The set needs to contain enough functions to well approximate the true population prediction function. There are a variety of ways to express complex functions, and each expression has its own benefits and drawbacks.

The set of functions needs to be simple to search over, so we don’t have to evaluate every function in our set as this would be too time consuming. Efficient search for high quality solutions is called optimization.

How will the predictor generalize to the broader population? The functions cannot be too complex or else they will fail to capture the regularity and smoothness of the prediction problem (estimating functions of too high complexity is colloquially called “overfitting”).
Balancing representation, optimization, and generalization gets complicated quickly, and this is why we have a gigantic academic and industrial field devoted to the problem.
I’m repeating myself at this point, but I again want to pound my fist on the table and reiterate that nothing in our development here requires that the relationship between the variables $X$ and $Y$ be probabilistic. Statistical models are often the starting point of discussion in machine learning, but such models are just a convenient way to describe populations and their proportions. Prediction can be analyzed in terms of a deterministic population, and, just as we discussed in the case of randomized experiments, randomness can be introduced as a means of sampling the population to determine trends. Even generalization, which is usually studied as a statistical phenomenon, can be analyzed in terms of the randomness of the sampling procedure with no probabilistic modeling of the population.
On the other hand, some sort of knowledge about the population is necessary. The more we know about how prediction varies based on changes in the covariates, the better a predictor we can build. Engineering such prior knowledge into appropriate function classes and optimization algorithms form the art and science of contemporary machine learning.
This discussion highlights that while we can view prediction through the lens of statistical sampling, pigeonholing it as simply “nonparametric statistics” does not do the subject justice. While the jump from mean estimation to causal RCTs is small, the jump from mean estimation to prediction is much less immediate. And in machine learning practice, the intuitions from statistics often don’t apply. For example, conventional wisdom from statistics tells us that evaluating multiple models on the same data set amounts to multiple hypothesis testing, and will lead to overfitting on the test set. However, there is more and more evidence that using a traintest split leads does not lead to overfitting. Instead, the phenomena we see is that dataset benchmarks can remain useful for decades. Another common refrain from statistics is that model complexity must be explicitly constrained in order to extrapolate to new data, but this also does not seem to apply at all to machine learning practice.
Prediction predates probability and statistics by centuries. As Moritz and I chronicle in the introduction to Patterns, Predictions, and Actions astronomers were using pattern matching to predict celestial motions, and the astronomer Edmund Halley realized that similar techniques could be used to predict life expectancy when pricing annuities. Moreover, even though modern machine learning embraced contemporary developments in statistics by Neyman, Pearson, and Wald, the tools quickly grew more sophisticated and separate from core statistical practice. In the next post, I’ll discuss an early example of this divergence between machine learning and statistics, describing some of the theoretical understanding of the Perceptron in the 1960s and how its analysis was decidedly different from the theory advanced by statisticians.
at October 13, 2021 12:00 AM UTC
Today I would like to discuss the KhotNaor approximation algorithm for the 3XOR problem, and an open question related to it.
In 3XOR, we have a system of linear equations modulo 2, with three variables per equation, that might look something like
The above system is not satisfiable (if we add up the lefthand sides we get 0, if we add up the righthand sides we get 1), but it is possible to satisfy of the equations, for example by setting all the variables to 1. In Max 3XOR problem (which we will simply refer to as “3XOR” from now on), given a system of equations mod 2 with three variables per equation, we want to find an assignment that satisfies as many equations as possible.
Either setting all the variables to zero or setting all the variables to one will satisfy half of the equations, and the interesting question is how much better than 1/2 it is possible to do on a given instance. Khot and Naor provide an algorithm that, given an instance in which it is possible to satisfy an fraction of equations, finds a solution that satisfies at least fraction of equations, where is the number of variables. The algorithm is randomized, it runs in polynomial time, and it succeeds with high probability. I believe that it is still the state of the art in terms of worstcase approximation guarantee.
Like the approximation algorithm for sparsest cut in Abelian Cayley graphs implied by the result of Bauer et al. that was the subject of the last two posts, the result of Khot and Naor does not prove a bound on the integrality gap of a relaxation of the problem.
I will describe the KhotNaor algorithm and describe how it manages to use convex optimization to provide an approximation algorithm, but without establishing an integrality gap bound. I thank my student Lucas Pesenti for explaining the algorithm to me and for thinking together about this problem.
If our 3XOR instance has equations and variables, then the problem of maximizing the number of satisfied equations can be rewritten as
so that our goal is to approximate the combinatorial optimization problem
Up to a constant factor loss in the approximation guarantee, Khot and Naor show that the above is equivalent to
where is a symmetric 3tensor with entries in and with nonzero entries.
Before continuing, let us recall that if is an matrix, then its to operator norm has the characterization
We could also define the “Grothendieck norm” of a matrix as the following natural semidefinite programming relaxation of the to norm:
where the and are arbitrary vectors. The Grothendieck inequality is
where the is an absolute constant, known to be less than . Furthermore, the above inequality has a constructive proof, and it leads to a polynomial time constant factor approximation for the problem of finding values and in that maximize (2).
Basically, we can see problem (1) as the natural generalization of (2) to tensors, and one would like to see a semidefinite programming relaxation of (1) achieving something resembling the Grothendieck inequality, but with a loss of something like . As I mentioned above, this remains an open question, as far as I know.
The idea of Khot and Naor is the following. Suppose that we are given an instance of problem (1), and suppose that is an optimal solution, and let us call
the value of the optimum (the algorithm will not need to know or guess ).
The key step is now to see that if we pick a random , there is at least a probability that
This is a bit difficult, but it is really easy to see that with probability we have
and we can do that by seeing that by defining a vector such that , so that
So we have
which, using CauchySchwarz, gives
Now, for a random , we have
and
so by Paley–Zygmund we have, let’s say
which, together with the definition of and the fact that the distribution of is symmetric around zero, gives us the claim (3).
Now suppose that we have been lucky and that we have found such an . We define the matrix
and we see that our claim can be written as
At this point we just apply the algorithm implied by the Grothendieck inequality to the matrix , and we find and in such that
meaning that
Summarizing, our algorithm is to pick a random vector and to find a constantfactor approximation for the problem
using semidefinite programming. We do that times, and take the best solution.
The analysis can be turned into an upper bound certificate in the following way. For the (suboptimal) analysis using PaleyZygmund, we only need the entries of the random to be 4wise independent, and there are distributions on where the entries are unbiased and 4wise independent, and such that the sample space is of size polynomial in . Thus, one could write an SDP relaxation of (4) for each in the support of such a distribution, and then take the maximum of these SDPs, multiply it by , and it would be a certified upper bound. Such an upper bound, however, would not come from a relaxation of the 3XOR problem, and I find it really strange that it is not clear how to turn these ideas into a proof that, say, the standard degree4 sumofsquares semidefinite programming relaxation of 3XOR has an integrality gap at most .
by luca at October 12, 2021 12:55 PM UTC
The School of Mathematical Sciences at Queen Mary University of London is seeking to appoint a permanent faculty member in the area of optimisation, with a focus on applying rigorous research methods to address emerging problems in the general field of combinatorial optimisation. We especially welcome applicants who would complement existing expertise in the School’s Combinatorics group.
Website: https://webapps2.is.qmul.ac.uk/jobs/job.action?jobID=5954
Email: r.johnson@qmul.ac.uk
by shacharlovett at October 11, 2021 04:34 PM UTC
The Department of Computer Science at Williams College invites applications for two faculty positions beginning July 1,2022. One is a tenuretrack position at the rank of assistant professor with a threeyear initial term. The other is an open rank position with a preference for more advanced candidates. That position will have terms commensurate with prior experience.
Website: https://apply.interfolio.com/91229
Email: hiring@cs.williams.edu
by shacharlovett at October 11, 2021 02:51 PM UTC
Lance: How come you haven't blogged on your muffin book? You've blogged about two books by Harry Lewis (see here and here) one book by the lesswrong community (see here), and you even did a mashup of a post by two different Scott A's (see here), but not on your own work.
Bill: I thought I did a post on my muffin book.
Lance: No. You have blogged about the muffin problem, and sometimes you mention either the book or the problem in passing, but you haven't had a post that says
HEY, I wrote a book!
And this is all the more strange since you asked me to have the book on our blog page.
Bill: (Searches blog with keyword muffin and finds no ref to muffin book). Well pierce my ears and call be drafty! I have not posted on the muffin book! Do you recall my thoughts on when to tell people you are working on a book?
Lance: No
Bill: I had a college roommate who was an aspiring science fiction writer who told me there are two kinds of people: Those who talk about writing a book, and those who write a book. I have adapted this to:
Do not tell people you are writing a book until you are picking out the cover art.
Lance: I posted about my book when I hadn't even decided on the title. But your cover art is picked out (see here). And, by the way, its very nice, though it makes me hungry. So I think you can begin talking about the book.
Bill: Indeed! I will!

Hey I have a book! (See here to buy it on amazon.)
Title: Mathematical Muffin Morsels: Nobody Wants a Small Piece
by Gasarch, Metz, Prinz, Smolyak
(The other authors were undergraduates when we wrote the book. Prinz and Smolyak are now grad students in CS, Metz is in Finance.)
Origin:
Martin Gardner wrote a Mathematics Recreational column for Scientific American for many years, starting in 1956 and ending in the early 1980s. For many STEM people of my generation (Using my fake birthday of Oct 1, 1960, I am 62 years old) Martin Gardner's columns were both an inspiration and an early exposure to mathematics. His columns also made the line between Mathematical Recreation and socalled serious mathematics thin or nonexistent. (See here for a review of Martin Gardner in the 21st century, a book about the kind of math Gardner wrote of. The book makes a mockery of the distinction between recreational and serious mathematics.) He passed away in 2010 at the age of 95.
There is a gathering in his honor that is hold roughly every 2 years, called Gathering For Gardner. (It was cancelled in Spring 2020 and Spring 2021 because of COVID though its in Atlanta where the CDC is, so they could have had it as an experiment and told the CDC the results). You have to be invited to goto it. I got an invite for 2016 from my contact at World Scientific who published my previous book, Problems with a Point: Exploring Math and Computer Science coauthored with Clyde Kruskal (I had two blogs on it, here and here, and you can buy it on amazon here.) I did three posts on G4G2016 (here, here, and here).
Aside from seeing some great talks that I understood and liked, I also picked up a pamphlet titled:
The Julia Robinson Math Festival
A Sample of Mathematical Puzzles
Compiled By Nancy Blackman
One of the problems, credited to Alan Frank, was
How can you divide and distribute 5 muffins for 3 students so that everyone gets 5/3 and the smallest piece is as big as possible?
They had some other values for muffins and students as well.
I solved the (5,3) problem and the other ones as well. That was fun.
When I got home I began looking at the problem for m muffins and s students. I let f(m,s) be the biggest smallest piece possible for giving out m muffins to s students. I proved a general theorem, called the FloorCeiling theorem, that always gives an upper bound, FC(m,s) on f(m,s). I worked out formulas for
f(m,1) (trivial),
f(m,2) (trivial),
f(m,3) (its always FC(m,3),
f(m,4) (its always FC(m,4)).
While working on f(m,5) I found that f(m,5) was always FC(m,5) EXCEPT for m=11. So what's up with f(11,5)?
By the Floor Ceiling theorem f(11,5) \le 11/25. We (at that point several ugrads and HS students had joined the project) were unable to find a protocol that would show f(11,5)\ge 11/25. Personally I thought there WAS such an protocol but perhaps it was more complicated than the ones we had found (We were finding them by hand using some easy linear algebra.) Perhaps a computer program was needed. We did find a protocol for f(11,5)\ge 13/30, which surely was not optimal.
While on an Amtrak I began working out the following train of thought: The protocol for f(11,5)\le 11/25 MUST have
(1) every muffin cut into two pieces,
(2) 3 students get 4 pieces,
(3) 2 students get 5 pieces.
While working on getting a protocol for f(11,5)\le 11/25 with these properties I found that... there could be no such protocol! Then by reworking what I did I found that f(11,5)\le 13/30. So it was done! and we had a new technique, which we call The Half Method. To see the full proof see my slides here
The story above is typical: We get f(m,k) for all 1\le k\le SOMETHING, we get stuck, and then we find ANOTHER technique to show upper bounds (which in this case are limits on how well we can do). This happened about 8 times depending on how you count. After a while we realized that this could not just be an article, this was a book! World Scienfiic agreed to publish it, and its out now.
Misc Notes
1) I got a conference paper out of it, in the Fun with Algorithms Conference, with some of the coauthors on the book, and some other people. here is the conf paper.
2) Early on we realized that f(m,s) = (m/s)f(s,m) so we only had to look at the m>s case.
3) The fact that f(m,s) exists and is rational is not obvious, but is true. In fact, f(m,s) can be found by a mixedint program.
4) Late on in the process I found that there was a byinviteonly math newsgroup that had discussed the problem, and in fact was where Alan Frank first posted it. I obtained their materials and found that they had already shown f(m,s)=(m/s)f(s,m) and also that the answer is always rational and exists. Aside from that our results did not overlap.
5) Even later in the process Scott Huddleston emailed me (out of the blue) that he had a program that solved the muffin problem quickly. I was skeptical at first, but he did indeed have a whole new way to look at the problem and his code was very fast (I had Jacob Prinz, one of the coauthors on the book, recode it). Later Richard Chatwin (see here) seems to have proven that Scott's method always works. The approach of Scott and Richard is where to go if you want to do serious further research on Muffins. My book is where you want to go if you want to learn some easy and fun math (a HS student could read it).
6) I coauthored a column with Scott H, Erik Metz, Jacob Prinz on Muffins, featuring his technique, in Lane's complexity column, here.
7) I had an REU student, Stephanie Warman, write a muffin package based on the book.
8) I gave a talk an invited talk on The Muffin Problem at a Joint AMSMAA meeting.
9) I gave a talk at Gathering for Gardner 2018 on The Muffin Problem.
10) I often give talks on it to groups of High School students.
11) When I teach Discrete Math Honors I talk about it and assign problems on it it really is part of the course. As such its a good way to reinforce the pigeon hole principle.
12) I contacted Alan Frank about my work. We arranged to meet at an MIT combinatorics seminar where I was to give a talk on muffins. He brought 11 muffins, with 1 cut (1/2,1/2), 2 cut (14/30,16/30),
and 8 cut (13/30,17/30) so that the 11 of us could each get 11/5 with smallest piece 13/30.
13) Coda:
Why did I keep working on this problem? I kept working on it because I kept hitting barriers and (with coauthors) breaking them with new techniques that were interesting. If early on a barrier was not breakable then I would have stopped. If (say) Floorceiling solved everything than I might have gotten a paper out of this, but surely not a book.
Lesson for all of us: look around you! Its not clear what is going to inspire a project!
Lasting effect: I am reluctant to throw out old math magazines and pamphlets since you never know when one will lead to a book.
by gasarch (noreply@blogger.com) at October 11, 2021 01:18 AM UTC
In my last post, I wrote (among other things) about an ongoing scientific debate between the group of Chaoyang Lu at USTC in China, which over the past year has been doing experiments that seek to demonstrate quantum supremacy via Gaussian BosonSampling; and the group of Sergio Boixo at Google, which had a recent paper on a polynomialtime classical algorithm to sample approximately from the same distributions. I reported the facts as I understood them at the time. Since then, though, a long call with the Google team gave me a new and different understanding, and I feel dutybound to share that here.
A week ago, I considered it obvious that if, using a classical spoofer, you could beat the USTC experiment on a metric like total variation distance from the ideal distribution, then you would’ve completely destroyed USTC’s claim of quantum supremacy. The reason I believed that, in turn, is a proposition that I hadn’t given a name but needs one, so let me call it Hypothesis H:
The only way a classical algorithm to spoof BosonSampling can possibly do well in total variation distance, is by correctly reproducing the highorder correlations (correlations among the occupation numbers of large numbers of modes) — because that’s where the complexity of BosonSampling lies (if it lies anywhere).
Hypothesis H had important downstream consequences. Google’s algorithm, by the Google team’s own admission, does not reproduce the highorder correlations. Furthermore, because of limitations on both samples and classical computation time, Google’s paper calculates the total variation distance from the ideal distribution only on the marginal distribution on roughly 14 out of 144 modes. On that marginal distribution, Google’s algorithm does do better than the experiment in total variation distance. Google presents a claimed extrapolation to the full 144 modes, but eyeballing the graphs, it was far from clear to me what would happen: like, maybe the spoofing algorithm would continue to win, but maybe the experiment would turn around and win; who knows?
Chaoyang, meanwhile, made a clear prediction that the experiment would turn around and win, because of
 the experiment’s success in reproducing the highorder correlations,
 the admitted failure of Google’s algorithm in reproducing the highorder correlations, and
 the seeming impossibility of doing well on BosonSampling without reproducing the highorder correlations (Hypothesis H).
Given everything my experience told me about the central importance of highorder correlations for BosonSampling, I was inclined to agree with Chaoyang.
Now for the kicker: it seems that Hypothesis H is false. A classical spoofer could beat a BosonSampling experiment on total variation distance from the ideal distribution, without even bothering to reproduce the highorder correlations correctly.
This is true because of a combination of two facts about the existing noisy BosonSampling experiments. The first fact is that the contribution from the orderk correlations falls off like 1/exp(k). The second fact is that, due to calibration errors and the like, the experiments already show significant deviations from the ideal distribution on the order1 and order2 correlations.
Put these facts together and what do you find? Well, suppose your classical spoofing algorithm takes care to get the loworder contributions to the distribution exactly right. Just for that reason alone, it could already win over a noisy BosonSampling experiment, as judged by benchmarks like total variation distance from the ideal distribution, or for that matter linear crossentropy. Yes, the experiment will beat the classical simulation on the higherorder correlations. But because those higherorder correlations are exponentially attenuated anyway, they won’t be enough to make up the difference. The experiment’s lack of perfection on the loworder correlations will swamp everything else.
Granted, I still don’t know for sure that this is what happens — that depends on whether I believe Sergio or Chaoyang about the extrapolation of the variation distance to the full 144 modes (my own eyeballs having failed to render a verdict!). But I now see that it’s logically possible, maybe even plausible.
So, let’s imagine for the sake of argument that Google’s simulation wins on variation distance, even though the experiment wins on the highorder correlations. In that case, what would be our verdict: would USTC have achieved quantum supremacy via BosonSampling, or not?
It’s clear what each side could say.
Google could say: by a metric that Scott Aaronson, the coinventor of BosonSampling, thought was perfectly adequate as late as last week — namely, total variation distance from the ideal distribution — we won. We achieved lower variation distance than USTC’s experiment, and we did it using a fast classical algorithm. End of discussion. No moving the goalposts after the fact.
Google could even add: BosonSampling is a sampling task; it’s right there in the name! The only purpose of any benchmark — whether Linear XEB or highorder correlation — is to give evidence about whether you are or aren’t sampling from a distribution close to the ideal one. But that means that, if you accept that we are doing the latter better than the experiment, then there’s nothing more to argue about.
USTC could respond: even if Scott Aaronson is the coinventor of BosonSampling, he’s extremely far from an infallible oracle. In the case at hand, his lack of appreciation for the sources of error in realistic experiments caused him to fixate inappropriately on variation distance as the success criterion. If you want to see the quantum advantage in our system, you have to deliberately subtract off the loworder correlations and look at the highorder correlations.
USTC could add: from the very beginning, the whole point of quantum supremacy experiments was to demonstrate a clear speedup on some benchmark — we never particularly cared which one! That horse is out of the barn as soon as we’re talking about quantum supremacy at all — something the Google group, which itself reported the first quantum supremacy experiment in Fall 2019, again for a completely artificial benchmark — knows as well as anyone else. (The Google team even has experience with adjusting benchmarks: when, for example, Pan and Zhang pointed out that Linear XEB as originally specified is pretty easy to spoof for random 2D circuits, the most cogent rejoinder was: OK, fine then, add an extra check that the returned samples are sufficiently different from one another, which kills Pan and Zhang’s spoofing strategy.) In that case, then, why isn’t a benchmark tailored to the highorder correlations as good as variation distance or linear crossentropy or any other benchmark?
Both positions are reasonable and have merit — though I confess to somewhat greater sympathy for the one that appeals to my doofosity rather than my supposed infallibility!
OK, but suppose, again for the sake of argument, that we accepted the second position, and we said that USTC gets to declare quantum supremacy as long as its experiment does better than any known classical simulation at reproducing the highorder correlations. We’d still face the question: does the USTC experiment, in fact, do better on that metric? It would be awkward if, having won the right to change the rules in its favor, USTC still lost even under the new rules.
Sergio tells me that USTC directly reported experimental data only for up to order7 correlations, and at least individually, the order7 correlations are easy to reproduce on a laptop (although sampling in a way that reproduces the order7 correlations might still be hard—a point that Chaoyang confirms, and where further research would be great). OK, but USTC also reported that their experiment seems to reproduce up to order19 correlations. And order19 correlations, the Google team agrees, are hard to sample consistently with on a classical computer by any currently known algorithm.
So then, why don’t we have direct data for the order19 correlations? The trouble is simply that it would’ve taken USTC an astronomical amount of computation time. So instead, they relied on a statistical extrapolation from the observed strength of the lowerorder correlations — there we go again with the extrapolations! Of course, if we’re going to let Google rest its case on an extrapolation, then maybe it’s only sporting to let USTC do the same.
You might wonder: why didn’t we have to worry about any of this stuff with the other path to quantum supremacy, the one via random circuit sampling with superconducting qubits? The reason is that, with random circuit sampling, all the correlations except the highestorder ones are completely trivial — or, to say it another way, the reduced state of any small number of output qubits is exponentially close to the maximally mixed state. This is a real difference between BosonSampling and random circuit sampling—and even 56 years ago, we knew that this represented an advantage for random circuit sampling, although I now have a deeper appreciation for just how great of an advantage it is. For it means that, with random circuit sampling, it’s easier to place a “sword in the stone”: to say, for example, here is the Linear XEB score achieved by the trivial classical algorithm that outputs random bits, and lo, our experiment achieves a higher score, and lo, we challenge anyone to invent a fast classical spoofing method that achieves a similarly high score.
With BosonSampling, by contrast, we have various metrics with which to judge performance, but so far, for none of those metrics do we have a plausible hypothesis that says “here’s the best that any polynomialtime classical algorithm can possibly hope to do, and it’s completely plausible that even a noisy current or planned BosonSampling experiment can do better than that.”
In the end, then, I come back to the exact same three goals I would’ve recommended a week ago for the future of quantum supremacy experiments, but with all of them now even more acutely important than before:
 Experimentally, to increase the fidelity of the devices (with BosonSampling, for example, to observe a larger contribution from the highorder correlations) — a much more urgent goal, from the standpoint of evading classical spoofing algorithms, than further increasing the dimensionality of the Hilbert space.
 Theoretically, to design better ways to verify the results of samplingbased quantum supremacy experiments classically — ideally, even ways that could be applied via polynomialtime tests.
 For Gaussian BosonSampling in particular, to get a better understanding of the plausible limits of classical spoofing algorithms, and exactly how good a noisy device needs to be before it exceeds those limits.
Thanks so much to Sergio Boixo and Ben Villalonga for the conversation, and to Chaoyang Lu and Jelmer Renema for comments on this post. Needless to say, any remaining errors are my own.
by Scott at October 10, 2021 06:13 PM UTC
The CATCS is compiling a list of Women in Theoretical Computer Science (WinTCS), hoping to facilitate engagement and opportunities for Women in TCS in the future.
We cordially invite women (broadly defined as anyone who selfidentifies as a woman) as well as gender nonconforming TCS researchers to participate the following simple googleform survey to provide your information.
Please submit your information before Oct. 31, 2021. After that, information can still be provided through the CATCS website, and will be added to the list monthly.
**Please feel free to share this solicitation broadly within your networks.**
If you have any questions regarding this survey, please feel free to contact us at goldner@bu.edu or yusuwang@ucsd.edu.
– Kira Goldner and Yusu Wang (on behalf of the SIGACT CATCS)
by shuchic at October 08, 2021 09:56 PM UTC
Potbelly, a local sandwich chain, made me an offer I couldn't refuse: change my password and earn a free (and quite tasty) oatmeal chocolate chip cookie. A free cookie is a great motivator, and checking that this wasn't some clever phishing attack, changed my password and got my cookie. Not sure why Potbelly wanted me to change my password but happy to take their cookie.
Potbelly likely didn't make this offer to everyone so what if you want a cookie?
 Use an app to get a cookie delivered.
 Visit a specialty cookie store.
 Go to your local supermarket and pick up a package of Chip's Ahoy.
 Buy some premade cookie dough and put it in the oven.
 Buy some cookie mix, add ingredients and bake.
 Find a cookie recipe, buy the ingredients and get cooking
 Get fresh ingredients direct from a farm stand
 Grow and gather your own ingredients, ala Pancakes Pancakes
 Not even realize you are using machine learning, such as recommendations on Netflix or Facebook.
 Using ML implicitly, like talking to Alexa
 Using pretrained ML through an app, like Google Translate
 Using pretrained ML through an API
 Using a model like GPT3 with an appropriate prompt
 Use an easily trained model like Amazon Fraud Detector
 An integrated machine learning environment like Sagemaker
 Use prebuilt ML tools like TensorFlow or PyTorch
 Code up your own ML algorithms in C++
 Build your own hardware and software
by Lance Fortnow (noreply@blogger.com) at October 08, 2021 03:10 PM UTC
Continuing from the previous post, we are going to prove the following result: let be a regular Cayley graph of an Abelian group, be the normalized edge expansion of , be the value of the ARV semidefinite programming relaxation of sparsest cut on (we will define it below), and be the second smallest normalized Laplacian eigenvalue of . Then we have
which, together with the fact that and , implies the Buser inequality
The proof of (1), due to Shayan Oveis Gharan and myself, is very similar to the proof by Bauer et al. of (2).
1. Ideas
For a positive integer parameter , call the multigraph whose adjacency matrix is , where is the adjacency matrix of . So is a regular graph, and each edge in corresponds to a length walk in . Our proof boils down to showing
which gives (1) after we note that
and
and we combine the above inequalities with :
The reader will see that our argument could also prove (roughly as done by Bauer et al.) that , simply by reasoning about distributions of cuts instead of reasoning about ARV solutions, which would give (2) more directly. By reasoning about ARV solutions, however, we are also able to establish (3), which we think is independently interesting.
It remains to prove (4). I will provide a completely selfcontained proof, including the definition of Cayley graphs and of the ARV relaxation, but, first, here is a summary for the reader already familiar with this material: we take an optimal solution of ARV for , and bound how well it does for . We need to understand how much bigger is the fraction of edges cut by the solution in compared to the fraction of edges cut in ; a random edge in is obtained by randomly sampling generators and adding them together, which is roughly like sampling times each generator, each time with a random sign. Because of cancellations, we expect that the sum of these random generators can be obtained by summing roughly copies of each of the generators, or generators in total. So a random edge of corresponds roughly to edges of , and this is by how much, at most, the fraction of cut edges can grow.
2. Definitions
Now we present more details and definition. Recall that if is a group, for which we use additive notation, and is a set or multiset of group elements, then the Cayley graph is the graph that has a vertex for every group element and an edge for every group element and every element . We restrict ourselves to undirected graphs, so we will always assume that if is an element of , then is also an element of (with the same multiplicity, if if a multiset). Note that the resulting undirected graph is regular.
Several families of graphs can be seen to be Cayley graphs, including cycles, cliques, balanced complete bipartite graphs, hypercubes, toruses, and so on. All the above examples are actually Cayley graphs of Abelian groups. Several interesting families of graphs, for example several families of expanders, are Cayley graphs of nonAbelian groups, but the result of this post will apply only to Abelian groups.
To define the ARV relaxation, let us take it slowly and start from the definition of the sparsest cut problem. The edge expansion problem is closely related to the sparsest cut problem, which can be defined as
where is the normalized Laplacian of and is the normalized Laplacian of the clique on vertices. We wrote it this way to emphasize the similarity with the computation of the second smallest normalized Laplacian eigenvalue, which can be written as
where the second equality is perhaps not obvious but is easily proved (all the solutions have the same cost function on both sides, and the last expression is shift invariant, so there is no loss in optimizing over all or just in the space ). We see that the computation of is just a relaxation of the sparsest cut problem, and so we have
We can write the sparsest cut problem in a less algebraic version as
and recall that
The cost function for does not change if we switch with , so could be defined equivalently as an optimization problem over subsets , and at this point and are the same problem except for an extra factor of in the denominator of . Such a factor is always between and , so we have:
(Note that all these definitions have given us a particularly convoluted proof that .)
Yet another way to characterize is as
Where is the Frobenius inner product between matrices. This is also something that is not obvious but that it is not difficult to prove, the main point being that if we write a PSD matrix as , then
and so there is no loss in passing from an optimization over all PSD matrices versus all rank1 PSD matrices.
We can rewrite in terms of the Cholesky decomposition of as
where the correspondence between PSD matrices and vectors is that we have (that is, is the Cholesky decomposition of and is the Gram matrix of the ). An integral solution of the sparsest cut problem corresponds to choosing rank and solutions in which each 1dimensional is either 1 or 0, corresponding on whether is in the set or not. The ARV relaxation is
which is a relaxation of sparsest cut because the “triangle inequality” constraints that we introduced are satisfied by 1dimensional 0/1 solutions. Thus we have
3. The Argument
Let us take any solution for ARV, and let us symmetrize it so that the symmetrized solution satisfies
That is, make sure that the contribution of each edge to the numerator of the cost function depends only on the generator that defines the edge, and not on the pair of endpoints.
It is easier to see that this symmetrization is possible if we view our solution as a PSD matrix . In this case, for every group element , we can let be the solution (with the same cost) obtained by permuting rows and columns according to the mapping ; then we can consider the solution , which satisfies the required symmetry condition.
Because of this condition, the cost function applied to in is
and the cost function applied to in is
meaning that our goal is now simply to prove
or, if we take a probabilistic view
If we let be the number of times that generator appears in the sum , counting cancellations (so that, if appears 4 times and appears 6 times we let and ) we have
where multiplying an integer by a generator means adding the generator to itself that many times. Using the triangle inequality and the symmetrization we have
The next observation is that, for every ,
where the expectation is over the choice of . This is because can be seen as , where we define if , if and otherwise. We have
Combining everything, we have
So every solution of ARV for is also a solution for with a cost that increases at most by a factor, which proves (4) as promised.
4. A Couple of Interesting Questions
We showed that ARV has integrality gap at most for every regular Abelian Cayley graph, but we did not demonstrate a rounding algorithm able to achieve an approximation ratio.
If we follow the argument, starting from an ARV solution of cost , choosing of the order of we see that has an ARV solution (the same as before) of cost at most, say, , and so , implying that and so Fiedler’s algorithm according to a Laplacian eigenvalue finds a cut of sparsity at most .
We can also see that if is the matrix corresponding to an ARV solution of value for , then one of the eigenvectors of must be a test vector of Rayleigh quotient at most for the Laplacian , where is order of . However it is not clear how to get, from such a vector, a test vector for the Laplacian of of Rayleigh quotient at most for the Laplacian of , though one such vector should be for a properly chosen in the range .
If this actually work, then the following, or something like that, would be a rounding algorithm: given a PSD solution of ARV of cost , consider PSD matrices of the form , which do not necessarily satisfy the triangle inequalities any more, for , and try to round them using a random hyperplane. Would this make sense for other classes of graphs?
It is plausible that sparsest cut in Abelian Cayley graphs has actually a constantfactor approximation in polynomial or quasipolynomial time, and maybe either that ARV achieves constantfactor approximation.
by luca at October 08, 2021 02:58 PM UTC
If I could explain it to the average person, I wouldn’t have been worth the Nobel Prize— Richard Feynman
Composite from frontpage source 
Giorgio Parisi has just been awarded the 2021 Physics Nobel Prize for his work on the disorder in systems of all kinds. At the same time Syukuro Manabe and Klaus Hasselmann won for their joint work on the physical modeling of the Earth’s climate and reliably predicting global warming.
Today Ken and I look at some aspects of this year’s prizes.
Small issue: I do believe in humancaused global warming, but shouldn’t the citation read: for reliably predicting the future temperature of the planet? Currently it reads a bit like for reliably predicting the rise in the global stock market, as if the outcome not the model were primary.
Oh well. Every year there are Nobel prizes awarded for things that have at least some computational aspect. This year’s prize is certainly for something related to computation. One aspect that we are hoping for is Aspect—Alain Aspect. Ken was among those tipping him, John Clauser, and Anton Zeilinger for this year’s prize—for their work demonstrating quantum entanglement and nonclassical experimental outcomes. This work spans substantial areas of quantum computing.
General Comments
The Nobel committee’s technical review of the prizewinning trio’s accomplishments ends with an expansive comment:
Clearly this year’s Laureates have made groundbreaking contributions to our understanding of complex physical systems in their broadest sense, from the microscopic to the global. They show that without a proper accounting of disorder, noise and variability, determinism is just an illusion. Indeed, the work recognized here reflects in part the comment ascribed to Richard Feynman (Nobel Laureate 1965), that he “Believed in the primacy of doubt, not as a blemish on our ability to know, but as the essence of knowing.”
Recognizing the work of this troika reflects the importance of understanding that no single prediction of anything can be taken as inviolable truth, and that without soberly probing the origins of variability we cannot understand the behavior of any system. Therefore, only after having considered these origins do we understand that global warming is real and attributable to our own activities, that a vast array of the phenomena we observe in nature emerge from an underlying disorder, and that embracing the noise and uncertainty is an essential step on the road towards predictability.
Ken says about the globalwarming part of the last sentence that all you need to do is ask a wine grower, of which there are many in the Niagara region. The latitudes at which certain grape strains thrive have been shifting recently and steadily north.
For some general comments by Parisi, see his interview. He says lots of neat stuff including:
Well, things that Well, my mentor Nicola Cabibbo was usually saying that we should work on a problem only if working on the problem is fun. So, I mean, fun is not very clear what it means, but it’s something that we find deeply interesting, and that we strongly believe that it is I mean you won’t find fun in unclear because one gets a new idea of something unexpected and so on. So I tried to work on something that was interesting and which I believed that had some capacity to add something.
His Trick
The Nobel citation says: “One of the many theoretical tools Professor Parisi has used to establish his theory is the socalled ‘replica trick’—a mathematical method which takes a disordered system, replicates it multiple times, and compares how different replicas of the system behave. You can do this, for instance, by compressing marbles in a box, which will form a different configuration each time you make the compression. Over many repetitions, Parisi knew, telling patterns might emerge.” They point to a paper from talks by Parisi in 2013 also involving Flaviano Morone, Francesco Caltagirone, and Elizabeth Harrison.
The trick has a Wikipedia page, which says that its crux “is that while the disorder averaging is done assuming to be an integer, to recover the disorderaveraged logarithm one must send continuously to zero. This apparent contradiction at the heart of the replica trick has never been formally resolved, however in all cases where the replica method can be compared with other exact solutions, the methods lead to the same results.”
The mathematical identity underlying the replica trick is
The is the partition function of a physical system or some related thermodynamical measure. The power represents independent copies of the system—if is an integer, that is. But instead we treat as a real number going to zero. The formal justification is best seen via a Taylor series expansion:
But on the whole, this strikes me as a silly idea. Suppose that you have a nasty function where . How do you try to understand the behavior of ? Several ideas come to mind:
 Try to see how grows as ?
 Try to get an approximate formula for as grows?
 Try to understand ?
Wait, this is nuts. How can the value help us understand the limit of
Indeed. Another form is that
The trick here is that instead of letting go to infinity they set to zero. This is crazy. But it is so crazy that it yields a ton of insight into the behavior for large . I wish I could understand this better. Perhaps it could help us with our problems like ?
Indeed, one of Parisi’s main applications is to spin glass models, which are related to the Ising model and likewise have associated problems that are hard. This applies even to a spinglass model for which Parisi showed that exact solutions are computable.
Manabe and Hasselmann
The prize for Manabe is well summed by the title to this Italian news article, which translates as, “The scientist who put the climate into computers.” The article’s subhead makes a point that deserves further reflection: “Without his models it would have been impossible to do experiments on the climate.” Stated positively and picking up “reliable” from the Nobel citation, the point is:
Having reliable climate models has made it possible to do experiments on the climate.
Now we cannot actually do experiments on the climate—short of setting off a nuclear winter or erecting a Dyson sphere, maybe. We can observe changes caused by El Niño events and shifts in the jet stream, for instance, and try to build the framework of a controlled experiment around them. But that’s all.
So what is meant is that the models are robust enough, and have proven themselves on predictions at smaller or broader scales, that we can confidently regard computational experiments with them as indicative of realworld outcomes. The article mentions being able to tell what would happen if we made mountains disappear or shuffled the continents around. But in line with what we said in the intro, the logic sounds circular or selfconfirming unless its reasonableness is explained more. The Washington Post article on the prizes stops short of saying that Manabe’s modeling of the effect of atmospheric carbon dioxide is a truly confirmed prediction, but it does say that his 50 years of modeling choices have had an almost 1.000 batting average.
Hasselmann is hailed for supplying a different piece that promotes confidence and accounting of causality. This is to demonstrate consistent effects of shortterm local weather events on longerterm global climate. To those like us not versed in the background, this might again sound circular: didn’t the global configuration cause the local event? One particular effect Hasselmann traced is of weather events on ocean currents. The chain from atmosphere to storm to ocean subsurface is noncircular.
Perhaps all this can be put as pithily as John Wheeler’s noncircular explanation of general relativity: “Matter tells spacetime how to curve, and curved spacetime tells matter how to move.” But until then, we feel a need to hallow a more fundamental story of how computational modeling works and why it is effective.
Open Problems
One day we will see a more computationallyoriented Nobel Prize, but how soon? Until then, best to all working on theory. You are Laureates in our eyes.
[added equation with derivatives to “Trick” section]
by RJLipton+KWRegan at October 08, 2021 04:34 AM UTC
Place of work: Warsaw, Poland
The newly established IDEAS NCBR institute is looking for a position of Research Team Leader dealing with formal modeling and proving the security of cryptographic protocols used in blockchain technology. The research will be carried out in cooperation with the cryptography and blockchain laboratory headed by prof. Stefan Dziembowski at the University of Warsaw.
Website: https://ideasncbr.pl/en/researchteamleader/
Email: jobs@ideasncbr.pl
by shacharlovett at October 07, 2021 01:43 PM UTC
As life is tentatively returning to normal, I would like to once again post technical material here. Before returning to online optimization, I would like to start with something from 2015 that we never wrote up properly, that has to do with graph curvature and with Buser inequalities in graphs.
The starting point is the Cheeger inequalities on graphs.
If is a regular graph, and is its adjacency matrix, then we define the normalized Laplacian matrix of as , and we call the second smallest (counting multiplicities) eigenvalue of . It is important in the spectral theory of graphs that this eigenvalue has a variational characterization as the solution of the following optimization problem:
The normalized edge expansion of is defined as
where denotes the number of edges with one endpoint in and one endpoint outside . We have talked about these quantities several times, so we will just jump to the fact that the following Cheeger inequalities hold:
Those inequalities are called Cheeger inequality because the upper bound is the discrete analog of a result of Cheeger concerning Riemann manifolds. There were previous posts about this here and here, and for our current purposes it will be enough to recall that, roughly speaking, a Riemann manifold defines a space on which we can define realvalued and complexvalued functions, and such functions can be integrated and differentiated. Subsets of a Riemann manifold have, if they are measurable, a well defined volume, and their boundary a well defined lowerdimensional volume; it possible to define a Cheeger constant of a manifold in a way that is syntactically analogous to the definition of edge expansion:
It is also possible to define a Laplacian operator on smooth functions , and if is the second smallest eigenvalue of we have the characterization
and Cheeger proved
which is syntactically almost the same inequality and, actually, has a syntactically very similar proof (the extra factor comes from the fact that things are normalized slightly differently).
The “dictionary” between finite graphs and compact manifolds is that vertices correspond to points, degree correspond to dimensionality, adjacent vertices and correspond to “infinitesimally close” points along one dimension, and the set of edges in the cut correspond to the boundary of a set , “volume” corresponds to number of edges (both when we think of volume of the boundary and volume of a set), vectors correspond to smooth functions , and the collection of values of at the neighbors of a vertex corresponds to the gradient of a function at point .
Having made this long premise, the point of this post is that the inequality
which is the easy direction to show in the graph case, does not hold for manifolds.
The easy proof for graphs is that if is the cut that realizes the minimum edge expansion, we can take , or rather the projection of the above vector on the space orthogonal to , and a twoline calculation gives the inequality.
If, however, is a subset of points of the manifold that realizes the Cheeger constant, we cannot define to be the indicator of , because the function would not be smooth and its gradient would be undefined.
We could think of rescuing the proof by taking a smoothed version of the indicator of , something that goes to 1 on one side and to 0 on the other side of the cut, as a function of the distance from the boundary. The “quadratic form” of such a function, however, would depend not just on the area of the boundary of , but also on how quickly the volume of the subset of the manifold at distance from the boundary of grows as a function of .
If the Ricci curvature of the manifold is negative, this volume grows very quickly, and the quadratic form of the smoothed function is very bad. The graph analog of this would be a graph made of two expanders joined by a bridge edge, and the problem is that the analogy between graphs and manifolds breaks down because “infinitesimal distance” in the manifold corresponds to any distance on the graph, and although the bridge edge is a sparse cut in the graph, it is crossed by a lot of paths of length , or at least this is the best intuition that I have been able to build about what breaks down here in the analogy between graphs and manifolds.
If the Ricci curvature of the manifold is nonnegative and the dimension is bounded, there is however an inequality in manifolds that goes in the direction of the “easy graph Cheeger inequality,” and it has the form
This is the Buser inequality for manifolds of nonnegative Ricci curvature, and note how strong it is: it says that and approximate each other up to the constant factor .
If the Ricci curvature is negative, and is the dimension of the manifold, then one has the inequality
Given the importance that curvature has in the study of manifolds, there has been considerable interest in defining a notion of curvature for graphs. For example curvature relates to how quickly balls around a point grow with the diameter the ball, a concept that is of great importance in graphs as well; curvature is a locally defined concept that has a global consequences, and in graph property testing one is precisely interested in understanding global properties based on local ones; and a Busertype inequality in graphs would be very interesting because it would provide a class of graphs for which Fiedler’s algorithm provides constantfactor approximation for sparsest cut.
There have been multiple attempts at defining notions of curvature for graphs, the main ones being Olivier curvature and BakryEmery curvature. Each captures some but not all of the useful properties of Ricci curvature in manifolds. My (admittedly, poorly informed) intuition is that curvature in manifold is defined “locally” but, as we mentioned above, “local” in graphs can have multiple meanings depending on the distance scale, and it is difficult to come up with a clean and usable definition that talks about multiple distance scales.
Specifically to the point of the Buser inequality, Bauer et al and Klartag et al prove Buser inequalities for graphs with respect to the BakryEmery definition. Because Cayley graphs of Abelian groups happen to have curvature 0 according to this definition, their work gives the following result:
Theorem 1 (Buser inequality for Cayley graphs of Abelian groups) If is a regular Cayley graph of an Abelian group, is the second smallest normalized Laplacian eigenvalue of and is the normalized edge expansion of , we have
(Klartag et al. state the result as above; Bauer et al. state it only for certain groups, but I think that their proof applies to all Abelian groups)
In particular, the above statement implies that if we have a regular Abelian Cayley graph then provides an approximation of the sparsest cut up to a multiplicative error and that Fiedler’s algorithm is an approximate algorithm for sparsest cut in Abelian Cayley graphs.
At some point in 2015 (or maybe 2014, during the Simons program on spectral graph theory?), I read the paper of Bauer at al. and wondered about a few questions. First of all, is there a way to prove the above theorem without using any notion of curvature?
Secondly, the theorem implies that Fiedler’s algorithm has a good approximation, but it does so in a very unusual way: we have that is a continuous relaxation of the sparsest cut problem, and we have Fiedler’s algorithm that, as analyzed by Cheeger’s inequality, provides a somewhat bad rounding, and finally the Buser inequality is telling us that the relaxation has very poor integrality gap on all Abelian Cayley graphs, and so even though the rounding seems bad it is actually good compared to the integral optimum. There should be an underlying reason for all this, meaning some other relaxation that has an actual integrality gap for sparsest cut that is at most .
I mentioned these questions to Shayan Oveis Gharan, and we were able to prove that if is the value of the AroraRaoVazirani (or GoemansLinial) semidefinite programming relaxation of sparsest cut, then, for all regular Cayley graphs, we have
which proves the above theorem and, together with the Cheeger inequality , also shows
that is, the ARV relaxation has integrality gap at most on Abelian Cayley graphs. I will discuss the proof (which is a sort of “sumofsquares version” of the proof of Bauer et al.) in the next post.
by luca at October 07, 2021 10:38 AM UTC
We have 6 tenure/tenure track position out of which one is focused on theory. We are looking for folks in both algorithms and complexity. (The Nov 30 deadline is a suggestion, we will be considering applications a bit after that as well.)
Website: https://www.ubjobs.buffalo.edu/postings/30481
Email: atri@buffalo.edu
by shacharlovett at October 07, 2021 02:10 AM UTC
I’m looking for a postdoctoral researcher from Jan 22 to work on an EPSRCfunded project on passwordhashing algorithms and idealized models of computation at Durham Uni. Applicants with backgrounds in Cryptography, Algorithms and Complexity are very welcome. Durham is one of the top universities in the UK, & the CS dept. hosts one of the strongest theory groups across the ACiD and NESTiD groups.
Website: https://gow.epsrc.ukri.org/NGBOViewGrant.aspx?GrantRef=EP/V034065/1
Email: pooya.farshim@gmail.com
by shacharlovett at October 06, 2021 04:26 PM UTC
A nice array of papers for this month, ranging from distribution testing, dense graph property testing, sublinear graph algorithms, and various mixtures of them. Let us now sample our spread! (Adding another semistreaming paper who achieved similar results to another posted paper. Ed)
A Lower Bound on the Complexity of Testing Grained Distributions by Oded Goldreich and Dana Ron (ECCC). A discrete distribution is called \(m\)grained if all probabilities are integer multiples of \(1/m\). This paper studies the complexity of testing this property of distributions. For simplicity, consider the property of being \(n/2\)grained, where the support size is \(n\). The classic lower bound for testing uniformity shows that \(\Omega(\sqrt{n})\) samples are required to distinguish the uniform distribution from a distribution uniform on \(n/2\) elements. Thus, we get a lower bound of \(\Omega(\sqrt{n})\) for testing \(n/2\)grainedness (if I am permitted to use that word). This paper proves a lower bound of \(\Omega(n^c)\), for all constant \(c < 1\). It is conjectured that the lower bound is actually \(\Omega(n/\log n)\), which would match the upper bound (for any labelinvariant property).
Testing Distributions of Huge Objects by Oded Goldreich and Dana Ron (ECCC). This paper introduced a new model that marries distribution testing with property testing on strings. The “object” of interest is a distribution \(\mathcal{D}\) over strings of length \(n\). We wish to test if \(\mathcal{D}\) possesses some property. The tester can get a random string \(x\) from the distribution, and can query any desired index of \(x\). The distance between distributions is defined using the earthmover distance (where we use the Hamming distance between strings). This model is called the DoHO (Distributions of Huge Objects) model. There are many questions posed and connections drawn to classical property testing and distribution testing. What I find interesting is a compelling application: the distribution \(\mathcal{D}\) may represent noisy or perturbed versions of a single object. The DoHO model gives a natural generalization of standard property testing to noisy objects. This paper considers problems such as testing if \(\mathcal{D}\) is: a random perturbation of a string, or a random cyclic shift, or a random isomorphism of a graph.
Sublinear Time and Space Algorithms for Correlation Clustering via SparseDense Decompositions by Sepehr Assadi and Chen Wang (arXiv). Correlation clustering is a classic problem where edges in a graph are labeled ‘+’ or ‘‘, denoting whether these edges should be uncut or cut. The aim is to cluster the graph minimizing the total “disagreements” (cut ‘+’ edges or uncut ‘‘ edges). This paper gives an \(O(1)\)approximation algorithm that runs in \(O(n\log^2n)\) time; this is the first sublinear time approximation algorithm for this problem. Correlation clustering has seen results for the property testing/sublinear algorithms community, first by Bonchi, Garcia Soriano, and Kutzkov. But previous results were essentially on the dense graph model, giving \(O(\epsilon n^2)\) error assuming adjacency matrix input. This paper considers access to the adjacency list of ‘+’ edges. Interestingly (from a technical standpoint), the key tool is a new sparsedense decomposition. Such decompositions emerged from the seminal work of AssadiChenKhanna for sublinear \((\Delta+1)\)colorings, and it is great to see applications beyond coloring.
SublinearTime Computation in the Presence of Online Erasures by Iden Kalemaj, Sofya Raskhodnikova, and Nithin Varma (arXiv). Can property testing be done when portions of the input are hidden? This question was first raised by DixitRaskhodnikovaThakurtaVarma, who gave a model of erasureresilient testing. There is an adversary who hides (erases) part of the input function; queries to those parts just yield a dummy symbol. This paper defines an online version of this model. There is an erasure parameter \(t\). On each query by the property tester, the adversary can erase \(t\) values of the function. Consider the property of classic linearity of functions \(f:\{0,1\}^d \rightarrow \{0,1\}\). The BLR tester queries triples of pairs \((x,y, x \oplus y)\). Observe how this tester is easily defeated by our adversary, by erasing the value \(f(x\oplus y)\). One of the main results of this paper is a \(O(1/\varepsilon)\)query tester for linearity, that works for any constant erasure parameter \(t\). Note that this matches the bound for the standard setting. There are a number of results for other classic properties, such as monotonicity (sortedness) and Lipschitz.
Sublinear Time Eigenvalue Approximation via Random Sampling by Rajarshi Bhattacharjee, Cameron Musco, and Archan Ray (arXiv). Consider the problem of estimating all the eigenvalues of a real, symmetric \(n \times n\) matrix \(M\) with bounded entries, in sublinear time. The main result shows that the eigenvalues of a uniform random \(O(\epsilon^{4}\log n)\) principal submatrix can be used to approximate all eigenvalues of \(M\) up to additive error \(\epsilon n\). One can think of this as a sort of concentration inequality for eigenvalues. This result follows (and builds upon) work of BakshiChepurkoJayaram on property testing semidefiniteness. The key idea is that eigenvectors corresponding to large eigenvalues have small infinity norm: intuitively, since all entries in \(M\) are bounded, such an eigenvector must have its mass spread out among many coordinates. Hence, we can get information about it by randomly sampling a few coordinates. The paper also shows that approach of taking principal submatrices requires taking \(\Omega(\epsilon^{2})\) columns/rows.
Deterministic Graph Coloring in the Streaming Model by Sepehr Assadi, Andrew Chen, and Glenn Sun (arXiv). This is technically not a sublinear algorithms paper (well ok, streaming is sublinear, but we tend not to cover the streaming literature. Maybe we should? – Ed.) But, I think that the connections are of interest to our community of readers. The main tool of sublinear \((\Delta+1)\)coloring algorithm of AssadiChenKhanna is the palette sparsification lemma (\(\Delta\) is the maximum degree). This lemma shows that vertices can randomly shorten their ”palette” of colors, after which all colorings from these palettes lead to very few monochromatic edges. This is an immensely powerful tool, since one can get immediately sublinear complexity algorithms in many models: adjacency list, streaming, distributed. Is the randomness necessary? Note that these algorithms run in \(\Omega(n)\) time/space, so it is conceivable that deterministic sublinear algorithms exists. This paper shows that randomization is necessary in the semistreaming model (space \(O(n poly(\log n))\)). Indeed, there exist no deterministic semistreaming algorithms that can achieve even \(\exp(\Delta^{o(1)})\) colorings.
Adversarially Robust Coloring for Graph Stream by Amit Chakrabarti, Prantar Ghosh, and Manuel Stoeckl (arXiv). This paper studies the same problem as the above, but presents the results in a different way. In a randomized algorithm, we normally think of an adversary that fixes a (hard) input, and the algorithm then makes its random decisions. An adaptive adversary is one that changes the input (stream) based on the decisions of an algorithm. In this definition, a robust algorithm is one that can give correct answers, even for adversarially generated output. A deterministic algorithm is automatically robust. This paper show that there do not exist semistreaming algorithms that can achieve \((\Delta+1)\)colorings. The quantitative lower bound is weaker (\(\Omega(\Delta^2)\) colors), but it is against a stronger adversary.
by Seshadhri at October 06, 2021 04:31 AM UTC
The next TCS+ talk will take place next Wednesday, October 13th at 1:00 PM Eastern Time (10:00 AM Pacific Time, 19:00 Central European Time, 17:00 UTC: check your time zone!). Nutan Limaye from IT University of Copenhagen will speak about “Superpolynomial Lower Bounds Against LowDepth Algebraic Circuits” (abstract below).
You can reserve a spot as an individual or a group to join us live by signing up on the online form. Registration is not required to attend the interactive talk, and the link will be posted on the website the day prior to the talk; however, by registering in the form, you will receive a reminder, along with the link. (The recorded talk will also be posted on our website afterwards) As usual, for more information about the TCS+ online seminar series and the upcoming talks, or to suggest a possible topic or speaker, please see the website.
Abstract: Every multivariate polynomial P(X) can be written as a sum of monomials, i.e. a sum of products of variables and field constants. In general, the size of such an expression is the number of monomials that have a nonzero coefficient in P.
What happens if we add another layer of complexity, and consider sums of products of sums (of variables and field constants) expressions? Now, it becomes unclear how to prove that a given polynomial P(X) does not have small expressions. In this result, we solve exactly this problem.
More precisely, we prove that certain explicit polynomials have no polynomialsized “SigmaPiSigma” (sums of products of sums) representations. We can also show similar results for SigmaPiSigmaPi, SigmaPiSigmaPiSigma and so on for all “constantdepth” expressions.
The talk is based on a joint work of Nutan Limaye, Srikanth Srinivasan, and Sébastien Tavenas.
by plustcs at October 05, 2021 09:34 PM UTC
Today the Italian academic community, along with lots of other people, was delighted to hear that Giorgio Parisi is one of the three recipients of the 2021 Nobel Prize for Physics.
Parisi has been a giant in the area of understanding “complex” and “disordered” systems. Perhaps, his most influential contribution has been his “replica method” for the analysis of the SherringtonKirkpatrick model. His ideas have led to several breakthroughs in statistical physics by Parisi and his collaborators, and they have also found applications in computer science: to tight analyses on a number of questions about combinatorial optimization on random graphs, to results on random constraint satisfaction problems (including the famous connection with random kSAT analyzed by Mezard, Parisi and Zecchina) and random error correcting codes, and to understanding the solution landscape in optimization problems arising from machine learning. Furthermore these ideas have also led to the development and analysis of algorithms.
The news was particularly well received at Bocconi, where most of the faculty of the future CS department has done work that involved the replica method. (Not to be left out, even I have recently used replica methods.)
Mezard and Montanari have written a booklength treatment on the interplay between ideas from statistical physics, algorithms, optimization, information theory and coding theory that arise from this tradition. Readers of in theory looking for a shorter exposition aimed at theoretical computer scientists will enjoy these notes posted by Boaz Barak, or this even shorter post by Boaz.
In this post, I will try to give a sense to the reader of what the replica method for the SherringtonKirkpatrick model looks like when applied to the averagecase analysis of optimization problems, stripped of all the physics. Of course, without the physics, nothing makes any sense, and the interested reader should look at Boaz’s posts (and to references that he provides) for an introduction to the context. I did not have time to check too carefully what I wrote, so be aware that several details could be wrong.
What is the typical value of the max cut in a random graph with vertices?
Working out an upper bound using union bounds and Chernoff bound, and a lower bound by thinking about a greedy algorithm, we can quickly convince ourselves that the answer is . Great, but what is the constant in front of the ? This question is answered by the Parisi formula, though this fact was not rigorously established by Parisi. (Guerra proved that the formula gives an upper bound, Talagrand proved that it gives a tight bound.)
Some manipulations can reduce the question above to the following question: suppose that I pick a random symmetric matrix , say with zero diagonal, and such that (up to the symmetry requirement) the entries are mutually independent and each entry is equally likely to be or , or perhaps each entry is distributed according to a standard normal distribution (the two versions can be proved to be equivalent), what is the typical value of
up to additive terms,?
As a first step, we could replace the maximum with a “softmax,” and note that, for every choice of , we have
The above upper bound gets tighter and tighter for larger , so if we were able to estimate
for every (where the expectation is over the randomness of ) then we would be in good shape.
We could definitely use convexity and write
and then use linearity of expectation and independence of the entries of to get to
Now things simplify quite a bit, because for all the expression , in the Rademacher setting, is equally likely to be or , so that, for , we have
and
so that
which, choosing , gives an upper bound which is in the right ballpark. Note that this is exactly the same calculations coming out of a Chernoff bound and union bound. If we optimize the choice of we unfortunately do not get the right constant in front of .
So, if we call
we see that we lose too much if we do the step
But what else can we do to get rid of the logarithm and to reduce to an expression in which we take expectations of products of independent quantities (if we are not able to exploit the assumption that has mutually independent entries, we will not be able to make progress)?
The idea is that if is a small enough quantity (something much smaller than ), then is close to 1 and we have the approximation
and we obviously have
so we can use the approximation
and
Let’s forget for a moment that we want to be a very small parameter. If was an integer, we would have
Note that the above expression involves choices of tuples of feasible solutions of our maximization problem. These are the “replicas” in “replica method.”
The above expression does not look too bad, and note how we were fully able to use the independence assumption and “simplify” the expression. Unfortunately, it is actually still very bad. In this case it is preferable to assume the to be Gaussian, write the expectation as an integral, do a change of variable and some tricks so that we reduce to computing the maximum of a certain function, let’s call it , where the input is a matrix, and then we have to guess what is an input of maximum value for this function. If we are lucky, the maximum is equivalent by a in which all entries are identical, the replica symmetric solution. In the SherringtonKirkpatrick model we don’t have such luck, and the next guess is that the optimal is a blockdiagonal matrix, or a replica symmetrybreaking solution. For large , and large number of blocks, we can approximate the choice of such matrices by writing down a system of differential equations, the Parisi equations, and we are going to assume that such equations do indeed describe an optimal and so a solution to the integral, and so they give as a computation of .
After all this, we get an expression for for every sufficiently large integer , as a function of up to lower order errors. What next? Remember how we wanted to be a tiny real number and not a sufficiently large integer? Well, we take the expression, we forget about the error terms, and we set
by luca at October 05, 2021 08:53 PM UTC
1. Huge congratulations to the winners of this year’s Nobel Prize in Physics: Syukuro Manabe and Klaus Hasselmann for climate modelling, and separately, Giorgio Parisi for statistical physics. While I don’t know the others, I had the great honor to get to know Parisi three years ago, when he was chair of the committee that awarded me the TomassoniChisesi Prize in Physics, and when I visited Parisi’s department at Sapienza University of Rome to give the prize lecture and collect the award. I remember Parisi’s kindness, a lot of good food, and a lot of discussion of the interplay between theoretical computer science and physics. Note that, while much of Parisi’s work is beyond my competence to comment on, in computer science he’s very wellknown for applying statistical physics methods to the analysis of survey propagation—an algorithm that revolutionized the study of random 3SAT when it was introduced two decades ago.
2. Two weeks ago, a group at Google put out a paper with a new efficient classical algorithm to simulate the recent Gaussian BosonSampling experiments from USTC in China. They argued that this algorithm called into question USTC’s claim of BosonSamplingbased quantum supremacy. Since then, I’ve been in contact with Sergio Boixo from Google, Chaoyang Lu from USTC, and Jelmer Renema, a Dutch BosonSampling expert and friend of the blog, to try to get to the bottom of this. Very briefly, the situation seems to be that Google’s new algorithm outperforms the USTC experiment on one particular metric: namely, total variation distance from the ideal marginal distribution, if (crucially) you look at only a subset of the optical modes, say 14 modes out of 144 total. Meanwhile, though, if you look at the k^{th}order correlations for large values of k, then the USTC experiment continues to win. With the experiment, the correlations fall off exponentially with k but still have a meaningful, detectable signal even for (say) k=19, whereas with Google’s spoofing algorithm, you choose the k that you want to spoof (say, 2 or 3), and then the correlations become nonsense for larger k.
Now, given that you were only ever supposed to see a quantum advantage from BosonSampling if you looked at the k^{th}order correlations for large values of k, and given that we already knew, from the work of Leonid Gurvits, that very small marginals in BosonSampling experiments would be easy to reproduce on a classical computer, my inclination is to say that USTC’s claim of BosonSamplingbased quantum supremacy still stands. On the other hand, it’s true that, with BosonSampling especially, more so than with qubitbased random circuit sampling, we currently lack an adequate theoretical understanding of what the target should be. That is, which numerical metric should an experiment aim to maximize, and how well does it have to score on that metric before it’s plausibly outperforming any fast classical algorithm? One thing I feel confident about is that, whichever metric is chosen—Linear CrossEntropy or whatever else—it needs to capture the k^{th}order correlations for large values of k. No metric that’s insensitive to those correlations is good enough.
3. Like many others, I was outraged and depressed that MIT uninvited Dorian Abbot (see also here), a geophysicist at the University of Chicago, who was slated to give the Carlson Lecture in the Department of Earth, Atmospheric, and Planetary Sciences about the atmospheres of extrasolar planets. The reason for the cancellation was that, totally unrelatedly to his scheduled lecture, Abbot had argued in Newsweek and elsewhere that Diversity, Equity, and Inclusion initiatives should aim for equality for opportunity rather than equality of outcomes, a Twittermob decided to go after him in retaliation, and they succeeded. It should go without saying that it’s perfectly reasonable to disagree with Abbot’s stance, to counterargue—if those very concepts haven’t gone the way of floppy disks. It should also go without saying that the MIT EAPS department chair is free to bow to socialmedia pressure, as he did, rather than standing on principle … just like I’m free to criticize him for it. To my mind, though, cancelling a scientific talk because of the speaker’s centrist (!) political views completely, 100% validates the right’s narrative about academia, that it’s become a fanatically intolerant echo chamber. To my fellow progressive academics, I beseech thee in the bowels of Bertrand Russell: why would you commit such an unforced error?
Yes, one can imagine views (e.g., open Nazism) so hateful that they might justify the cancellation of unrelated scientific lectures by people who hold those views, as many physicists after WWII refused to speak to Werner Heisenberg. But it seems obvious to me—as it would’ve been obvious to everyone else not long ago—that no matter where a reasonable person draws the line, Abbot’s views as he expressed them in Newsweek don’t come within a hundred miles of it. To be more explicit still: if Abbot’s views justify deplatforming him as a planetary scientist, then all my quantum computing and theoretical computer science lectures deserve to be cancelled too, for the many attempts I’ve made on this blog over the past 16 years to share my honest thoughts and life experiences, to write like a vulnerable human being rather than like a university press office. While I’m sure some sneerers gleefully embrace that implication, I ask everyone else to consider how deeply they believe in the idea of academic freedom at all—keeping in mind that such a commitment only ever gets tested when there’s a chance someone might denounce you for it.
Update: Princeton’s James Madison Program has volunteered to host Abbot’s Zoom talk in place of MIT. The talk is entitled “Climate and the Potential for Life on Other Planets.” Like probably hundreds of others who heard about this only because of the attempted cancellation, I plan to attend!
Unrelated Bonus Update: Here’s a neat YouTube video put together by the ACM about me as well as David Silver of AlphaGo and AlphaZero, on the occasion of our ACM Prizes in Computing.
by Scott at October 05, 2021 07:23 PM UTC
On behalf of the FOCS 2021 PC, I am delighted to announce the Best Paper Awards.
Best paper:
Nutan Limaye, Srikanth Srinivasan and Sébastien Tavenas. Superpolynomial Lower Bounds Against LowDepth Algebraic Circuits
This paper makes a fundamental advance by proving superpolynomial lower bounds against algebraic circuits of arbitrary constant depth.
Paper: https://eccc.weizmann.ac.il/report/2021/081/
Machtey Award for Best Student Paper:
Xiao Mao. Breaking the Cubic Barrier for (Unweighted) Tree Edit Distance
This paper shows how, unlike the weighted case, the unweighted tree edit distance problem has a subcubic time algorithm.
Paper: https://arxiv.org/abs/2106.02026
Congratulations to the winners and see you all in Denver from Feb 710, 2022!
by nisheethvishnoi at October 04, 2021 06:58 PM UTC
at October 04, 2021 02:46 PM UTC
(Disclosure  Harry Lewis was my PhD advisor.)
It seems like just a few weeks ago I I blogged about a book of Harry Lewis's that was recently available (see here). And now I am blogging about another one. Writing two books in two years seems hard! I can only think of one other computer scientist who has done that recently (see here and here).
In 2008 Abelson, Ledeen, and Lewis wrote
Blown to Bits: Your Life, Liberty, and Happiness after the Digital Explosion
which I reviewed in SIGACT news, see here
Both computers and society have changed since 2008. Hence an update was needed.
In 2021 Adelson, Ledeen, Lewis, and Seltzer wrote a second edition.
Should you buy the new version if you bought the old version?
1) Not my problem I got them both for free since I reviewed them.
2) Not your problem The second edition is available freeonline here. Is that a link to some dark corner of the dark web? No, its the formal webpage about the book. So the book is available freeonline legally, if you care (and even if you don't care).
3) If you like paper, the book is on amazon. (If you don't like paper, the book is still on amazon).
I reviewed it in SIGACT news. A nonpaywalled link: here (is that link legal? I have no idea.)
In this post I'll just mention two things that changed since the last book
1) Shared Music and pirating were an issue back in 2008. It does not seem to be anymore since there is now a variety of services that seem to make pirating not worth it: itunes, streaming services, and some bands give it away for free and ask you to pay what its worth. Movies are still struggling with this issue.
2) AI systems that reinforce existing bias is a new problem.
by gasarch (noreply@blogger.com) at October 04, 2021 04:19 AM UTC
I just added to Wikipedia two articles on the Jordan–Pólya numbers and fibbinary numbers, two integer sequences used in my recent paper on Egyptian fractions. Jordan–Pólya numbers are the products of factorials, while the fibbinary numbers are the ones with binary representations having no two consecutive 1’s. The OEIS page on the fibbinary numbers, A003714, lists many ways of generating this sequence algorithmically, of which most are boring or slow (generate all binary numbers and test which ones belong to the sequence; you can test if a variable x
is fibbinary by checking that x&(x>>1)
is zero). I thought it might be interesting to highlight two of those methods that are a little more clever and generate these numbers in small numbers of operations.
Some functional languages, and in part Python even though it’s mostly not functional, have a notion of a stream, a potentially infinite sequence of values generated by a coroutine. In Python, you can program these using simple generators and the yield
keyword. I wrote here long ago about methods for using generators recursively: a generator can call itself, manipulate the resulting sequence of values, and pass them on to its output. It’s actually a very old idea, used for instance to generate regular numbers by Dijkstra in his 1976 book A Discipline of Programming. Reinhard Zumkeller used the same idea to generate the fibbinary numbers in Haskell, based on the observation that the sequence of positive fibbinary numbers can be generated, starting from the number 1, by two operations, doubling smaller values or replacing a smaller value \(x\) with \(4x+1\). Here is is, translated into Python:
It’s elegant, but has a couple of minor flaws. First, it omits the number \(0\), and while it can be modified to include \(0\), the modifications make the code messier. But second, it takes more than a constant amount of time per element to generate each sequence element. A fibbinary number \(x\) has to be generated from a sequence of smaller elements by repeated doubling and quadrupling, and that takes \(\log x\) steps per element. Even if we assume those steps to take constant time each, generating the first \(n\) elements in this way takes time \(\Theta(n\log n)\). It’s better than the \(\Theta(n^{\log_\varphi 2})\approx n^{1.44}\) that you would get from generateandtest, but still not as good as we might hope for. One way to fix this would be to memoize the generator, so that the recursive calls look at a stored copy of the sequence generated by the outer call rather than generating the same sequence redundantly, but this again makes the code messier and also takes more storage than necessary.
Instead, Jörg Arndt observed that you can generate each fibbinary number directly from the previous one by a process closely resembling binary addition. Adding one to a binary number sets the first available bit from zero to one, and zeros out all the smaller bits; here, a bit is available if it is already zero. Finding the next fibbinary number does the same thing, but with a different definition of availability: a bit is available if both it and the next larger bit are zero. We can find the available bit using binary addition on a modified word that fills in bits whose neighbor is nonzero. Using this idea, we can generate the fibbinary numbers in a constant number of bitwise binary wordlevel operations per number. Here it is again in Python, translated from Arndt’s C++ and simplified based on a comment by efroach76:
It’s even possible to use the same idea to generate the fibbinary numbers in a constant amortized number of bitlevel operations per number, although this ends up being a little less efficient in practice because highlevel languages end up translating all these bit operations into word operations anyway.
The inner loop ends immediately at fibbinary numbers whose successor is odd (at positions given by the ones of the Fibonacci word), whose fraction of the total is \(11/\varphi\approx 0.382\), where \(\varphi\) is the golden ratio. It ends in two steps for the remaining values when their next bit is odd, in the same proportion, and so on. So the average number of steps for the inner loop adds in a geometric series to \(O(1)\).
by David Eppstein at October 02, 2021 01:53 PM UTC
at October 01, 2021 04:38 PM UTC