# Theory of Computing Blog Aggregator

### Group Testing with a Graph Infection Spread Model

Authors: Batuhan Arasli, Sennur Ulukus
Abstract: We propose a novel infection spread model based on a random connection graph which represents connections between $n$ individuals. Infection spreads via connections between individuals and this results in a probabilistic cluster formation structure as well as a non-i.i.d. (correlated) infection status for individuals. We propose a class of two-step sampled group testing algorithms where we exploit the known probabilistic infection spread model. We investigate the metrics associated with two-step sampled group testing algorithms. To demonstrate our results, for analytically tractable exponentially split cluster formation trees, we calculate the required number of tests and the expected number of false classifications in terms of the system parameters, and identify the trade-off between them. For such exponentially split cluster formation trees, for zero-error construction, we prove that the required number of tests is $O(\log_2n)$. Thus, for such cluster formation trees, our algorithm outperforms any zero-error non-adaptive group test, binary splitting algorithm, and Hwang's generalized binary splitting algorithm. Our results imply that, by exploiting probabilistic information on the connections of individuals, group testing can be used to reduce the number of required tests significantly even when infection rate is high, contrasting the prevalent belief that group testing is useful only when infection rate is low.

### Minimum Cost Flows, MDPs, and $\ell_1$-Regression in Nearly Linear Time for Dense Instances

Authors: Jan van den Brand, Yin Tat Lee, Yang P. Liu, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, Di Wang
Abstract: In this paper we provide new randomized algorithms with improved runtimes for solving linear programs with two-sided constraints. In the special case of the minimum cost flow problem on $n$-vertex $m$-edge graphs with integer polynomially-bounded costs and capacities we obtain a randomized method which solves the problem in $\tilde{O}(m+n^{1.5})$ time. This improves upon the previous best runtime of $\tilde{O}(m\sqrt{n})$ (Lee-Sidford 2014) and, in the special case of unit-capacity maximum flow, improves upon the previous best runtimes of $m^{4/3+o(1)}$ (Liu-Sidford 2020, Kathuria 2020) and $\tilde{O}(m\sqrt{n})$ (Lee-Sidford 2014) for sufficiently dense graphs.

For $\ell_1$-regression in a matrix with $n$-columns and $m$-rows we obtain a randomized method which computes an $\epsilon$-approximate solution in $\tilde{O}(mn+n^{2.5})$ time. This yields a randomized method which computes an $\epsilon$-optimal policy of a discounted Markov Decision Process with $S$ states and $A$ actions per state in time $\tilde{O}(S^2A+S^{2.5})$. These methods improve upon the previous best runtimes of methods which depend polylogarithmically on problem parameters, which were $\tilde{O}(mn^{1.5})$ (Lee-Sidford 2015) and $\tilde{O}(S^{2.5}A)$ (Lee-Sidford 2014, Sidford-Wang-Wu-Ye 2018).

To obtain this result we introduce two new algorithmic tools of independent interest. First, we design a new general interior point method for solving linear programs with two sided constraints which combines techniques from (Lee-Song-Zhang 2019, Brand et al. 2020) to obtain a robust stochastic method with iteration count nearly the square root of the smaller dimension. Second, to implement this method we provide dynamic data structures for efficiently maintaining approximations to variants of Lewis-weights, a fundamental importance measure for matrices which generalize leverage scores and effective resistances.

### No-go Theorem for Acceleration in the Hyperbolic Plane

Authors: Linus Hamilton, Ankur Moitra
Abstract: In recent years there has been significant effort to adapt the key tools and ideas in convex optimization to the Riemannian setting. One key challenge has remained: Is there a Nesterov-like accelerated gradient method for geodesically convex functions on a Riemannian manifold? Recent work has given partial answers and the hope was that this ought to be possible.

Here we dash these hopes. We prove that in a noisy setting, there is no analogue of accelerated gradient descent for geodesically convex functions on the hyperbolic plane. Our results apply even when the noise is exponentially small. The key intuition behind our proof is short and simple: In negatively curved spaces, the volume of a ball grows so fast that information about the past gradients is not useful in the future.

### Necessary and Sufficient Condition for Satisfiability of a Boolean Formula in CNF and its Implications on P versus NP problem

Authors: Manoj Kumar
Abstract: In this paper, a necessary and sufficient condition for satisfiability of a boolean formula, in CNF, has been determined. It has been found that the maximum cardinality of satisfiable boolean formula increases exponentially, with increase in number of variables. Due to which, any algorithm require exponential time, in worst case scenario, depending upon the number of variables in a boolean formula, to check satisfiability of the given boolean formula. Which proves the non-existence of a polynomial time algorithm for satisfiability problem. As satisfiability is a NP-complete problem, and non-existence of a polynomial time algorithm to solve satisfiability proves exclusion of satisfiability from class P. Which implies P is not equal to NP. Further, the necessary and sufficient condition can be used to optimize existing algorithms, in some cases, the unsatisfiability of a given boolean function can be determined in polynomial time. For this purpose, a novel function has been defined, that can be used to determine cardinality of a given boolean formula, and occurances of a literal in the given formula, in polynomial time.

### Spectral Clustering Oracles in Sublinear Time

Authors: Grzegorz Gluch, Michael Kapralov, Silvio Lattanzi, Aida Mousavifar, Christian Sohler
Abstract: Given a graph $G$ that can be partitioned into $k$ disjoint expanders with outer conductance upper bounded by $\epsilon\ll 1$, can we efficiently construct a small space data structure that allows quickly classifying vertices of $G$ according to the expander (cluster) they belong to? Formally, we would like an efficient local computation algorithm that misclassifies at most an $O(\epsilon)$ fraction of vertices in every expander. We refer to such a data structure as a \textit{spectral clustering oracle}. Our main result is a spectral clustering oracle with query time $O^*(n^{1/2+O(\epsilon)})$ and preprocessing time $2^{O(\frac{1}{\epsilon} k^4 \log^2(k))} n^{1/2+O(\epsilon)}$ that provides misclassification error $O(\epsilon \log k)$ per cluster for any $\epsilon \ll 1/\log k$. More generally, query time can be reduced at the expense of increasing the preprocessing time appropriately (as long as the product is about $n^{1+O(\epsilon)}$) -- this in particular gives a nearly linear time spectral clustering primitive. The main technical contribution is a sublinear time oracle that provides dot product access to the spectral embedding of $G$ by estimating distributions of short random walks from vertices in $G$. The distributions themselves provide a poor approximation to the spectral embedding, but we show that an appropriate linear transformation can be used to achieve high precision dot product access. We then show that dot product access to the spectral embedding is sufficient to design a clustering oracle. At a high level our approach amounts to hyperplane partitioning in the spectral embedding of $G$, but crucially operates on a nested sequence of carefully defined subspaces in the spectral embedding to achieve per cluster recovery guarantees.

### Improving Run Length Encoding by Preprocessing

Authors: Sven Fiergolla, Petra Wolf
Abstract: The Run Length Encoding (RLE) compression method is a long standing simple lossless compression scheme which is easy to implement and achieves a good compression on input data which contains repeating consecutive symbols. In its pure form RLE is not applicable on natural text or other input data with short sequences of identical symbols. We present a combination of preprocessing steps that turn arbitrary input data in a byte-wise encoding into a bit-string which is highly suitable for RLE compression. The main idea is to first read all most significant bits of the input byte-string, followed by the second most significant bit, and so on. We combine this approach by a dynamic byte remapping as well as a Burrows-Wheeler-Scott transform on a byte level. Finally, we apply a Huffman Encoding on the output of the bit-wise RLE encoding to allow for more dynamic lengths of code words encoding runs of the RLE. With our technique we can achieve a lossless average compression which is better than the standard RLE compression by a factor of 8 on average.

### Metric Distortion of Social Choice Rules: Lower Bounds and Fairness Properties

Authors: Ashish Goel, Anilesh Kollagunta Krishnaswamy, Kamesh Munagala
Abstract: We study social choice rules under the utilitarian distortion framework, with an additional metric assumption on the agents' costs over the alternatives. In this approach, these costs are given by an underlying metric on the set of all agents plus alternatives. Social choice rules have access to only the ordinal preferences of agents but not the latent cardinal costs that induce them. Distortion is then defined as the ratio between the social cost (typically the sum of agent costs) of the alternative chosen by the mechanism at hand, and that of the optimal alternative chosen by an omniscient algorithm. The worst-case distortion of a social choice rule is, therefore, a measure of how close it always gets to the optimal alternative without any knowledge of the underlying costs. Under this model, it has been conjectured that Ranked Pairs, the well-known weighted-tournament rule, achieves a distortion of at most 3 [Anshelevich et al. 2015]. We disprove this conjecture by constructing a sequence of instances which shows that the worst-case distortion of Ranked Pairs is at least 5. Our lower bound on the worst case distortion of Ranked Pairs matches a previously known upper bound for the Copeland rule, proving that in the worst case, the simpler Copeland rule is at least as good as Ranked Pairs. And as long as we are limited to (weighted or unweighted) tournament rules, we demonstrate that randomization cannot help achieve an expected worst-case distortion of less than 3. Using the concept of approximate majorization within the distortion framework, we prove that Copeland and Randomized Dictatorship achieve low constant factor fairness-ratios (5 and 3 respectively), which is a considerable generalization of similar results for the sum of costs and single largest cost objectives. In addition to all of the above, we outline several interesting directions for further research in this space.

### ML Theory with bad drawings

This semester I am teaching a seminar on the theory of machine learning. For the first lecture, I would like to talk about what is the theory of machine learning. I decided to write this (very rough!) blog post mainly to organize my own thoughts.

In any science, ML included, the goals of theory and practice are not disjoint. We study the same general phenomena from different perspectives. We can think of questions in computation at a high level as trying to map the unknown terrain of the computational cost of achieving certain quality of output, given some conditions on the input.

Practitioners aim to find points on this terrain’s practically relevant regions, while theoreticians are more interested in its broad landscape, including at points far out into infinity. Both theoreticians and practitioners care about discontinuities, when small changes in one aspect correspond to large changes in another. As theoreticians, we care about computational/statistical tradeoffs, particularly about points where a small difference in quality yields an exponential difference in time or sample complexity. In practice, models such as GPT-3 demonstrate that a quantitative increase in computational resources can correspond to a qualitative increase in abilities.

Since theory and practice study the same phenomenon in different ways, their relation can vary. Sometimes theory is forward-looking or prescriptive – giving “proof of concepts” results that can inspire future application or impossibility results that can rule out certain directions. Sometimes it is backward-looking or descriptive – explaining phenomena uncovered by experiments and putting them in a larger context.

## What is machine learning?

Before we talk about what is machine-learning theory, we should talk about what is machine learning. It’s common to see claims of the form “machine learning is X”:

• Machine learning is just statistics
• Machine learning is just optimization
• Machine learning is just approximation or “curve fitting”

and probably many more.

I view machine learning as addressing the following setup:

There is a system $f$ that interacts with the world around it in some manner, and we have some sense of whether this interaction is successful or unsuccessful.

For example, a self-driving car is “successful” if it gets passengers from point A to point B safely, quickly, and comfortably. It is “unsuccessful” if it gets into accidents, takes a long time, or drives in a halting or otherwise uncomfortable manner.

### Three levels of indirection

The system $f$ is the “machine.” It is “learning” since we adapt $f$ so that it becomes more successful. Ideally, we would set $f$ to be the most successful system. However, what we actually do is at least thrice-removed from this ideal:

1. The model gap: We do not optimize overall possible systems, but rather a small subset of such systems (e.g., ones that belong to a certain family of models).
2. The metric gap: In almost all cases, we do not optimize the actual measure of success we care about, but rather another metric that is at best correlated with it.
3. The algorithm gap: We don’t even optimize the latter metric since it will almost always be non-convex, and hence the system we end up with depends on our starting point and the particular algorithms we use.

The magic of machine learning is that sometimes (though not always!) we can still get good results despite these gaps. Much of the theory of machine learning is about understanding under what conditions can we bridge some of these gaps.

The above discussion explains the “machine Learning is just X” takes. The expressivity of our models falls under approximation theory. The gap between the success we want to achieve and the metric we can measure often corresponds to the difference between population and sample performance, which becomes a question of statistics. The study of our algorithms’ performance falls under optimization.

The metric gap is perhaps the widest of them all. While in some settings (e.g., designing systems to play video games) we can directly measure a system’s success, this is typically the exception rather than rule. Often:

• The data we have access to is not the data the system will run on and might not even come from the same distribution.
• The measure of success that we care about is not necessarily well defined or accessible to us. Even if it was, it is not necessarily in a form that we can directly optimize.

Nevertheless, the hope is that if we optimize the hell out of the metrics we can measure, the system will perform well in the way we care about. In other words:

A priori, it is not clear that this should be the case, and indeed it’s not always is. Part of the magic of auto-differentiation is that it allows optimization of more complex metrics that match more closely with the success metric we have in mind. But, except for very special circumstances, the two metrics can never be the same. The mismatch between the goal we have and the metric we optimize can manifest in one of the following general ways:

• “No free lunch” or Goodhart’s law: If we optimize a metric $\mathcal{M}$ then we will get a system that does very well at $\mathcal{M}$ and not at all well in any other measure. For example, if we optimize accuracy in predicting images, then we may not do well on slightly perturbed versions of these images.
• “It’s not the destination, it’s the journey” or Anna Karenina principle: All successful systems are similar to one another, and hence if we make our system successful with respect to a metric $\mathcal{M}$ then it is likely to also be successful with respect to related measures. For example, image classifiers trained on ImageNet have been successfully used for very different images, and there is also evidence that success in ImageNet translates into success on a related distribution (i.e., ImageNet v2).

At the moment, we have no good way to predict when a system will behave according to Goodhart’s law versus the Anna Karenina principle. It seems that when it comes to learning representations, machine learning systems follow the Anna Karenina principle: all successful models tend to learn very similar representations of their data. In contrast, when it comes to making decisions, we get manifestations of Goodhart’s law, and optimizing for one metric can give very different results than the other.

The model gap is the gap between the set of all possible systems and the set that we actually optimize over. Even if a system can be captured as a finite function (say a function mapping $32 \times 32$ pixel images to $10$ classes), a simple counting argument shows that the vast majority of these functions require far too many gates to be efficiently computable by any reasonable computational model. (In case you are curious, the counting argument also works for quantum circuits.) Hence we necessarily have to deal with a subclass of all possible functions.

This raises the question of whether we should expect a realizable system to exist, let alone for us to find it. Often machine learning is used in artifical intelligence applications, where we are trying to mimic human performance. In such cases, human performance is an “existence proof” that some reasonably-sized circuit is successful in the goal. But whether this circuit can be embedded in our model family is still an open question.

Remember when I said that sometimes it’s not about the journey but about the destination? I lied. The algorithm gap implies that in modern machine learning, there is a certain sense in which it is always about the journey. In modern machine learning, the system $f$ is typically parameterized by a vector $w \in \mathbb{R}^N$ of real numbers. Hence the metric we optimize over is to minimize some loss function $\mathcal{L}(w)$. We use local search algorithms, which start off at some vector $w_0$ chosen via some distribution, and then iteratively takes small steps $w_1,w_2,w_3,\ldots$. For each $w_i$, the next step $w_{i+1}$ is chosen from some distribution of vectors close to $w_i$ such that (hopefully) in expectation $\mathcal{L}(w_{i+1}) < \mathcal{L}(w_i)$.

Modern machine learning systems are non convex, which means that the final point $w_t$ we end up in depends on the starting point, the algorithms, and the randomness. Hence, we can’t have a neat “separation of concerns” and decouple the architecture, metric, and algorithm. When we say that a system obeys the “Anna Karenina principle,” we really mean that for natural algorithms to optimize natural metrics on natural architectures, successful outputs (ones that do well on the metric) are likely to be similar to one another. The use of the “natural” qualifier is a warning sign that we don’t fully understand the conditions under which this happens, but it is clear that some conditions are necessary. Due to non-convexity, it is typically possible to find a “bad minima” that would be very good in some specific metric but terrible in other ones.

## Types of theory for machine learning (and for CS in general)

Given the above discussion, what kind of theoretical results should we expect in machine learning? Let’s try to classify the type of results we see in general theoretical computer science. In the discussion below, I will focus on the limitations of these theoretical results and use humorous names, but make no mistake: all of these categories correspond to valuable theoretical insights.

### End to end analysis

The quicksort algorithm provides a canonical example of algorithm analysis. At this point, we have a reasonably complete understanding of quicksort. We can analyze the distribution of its running time down to the constant.

Hence we have an efficient algorithm, used in practice, with rigoroulsy proven theorems that precisely characterize its performance. This is the “gold standard” of analysis, but also an unrealistic and unachievable goal in almost any other setting.

### Proof of concept

A more common situation is that we have algorithms that work in practice and algorithms with rigorous analysis, but they are not the same algorithms. A canonical example is linear programming. The simplex algorithm often works well in practice but was known to take exponential time on certain instances. For a long time, it was an open question whether or not there is an algorithm for linear programming that take polynomial time in the worst case.

In 1979, Khachiyan gave the Ellipsoid algorithm that runs in polynomial time, but with a polynomial so large that it is impractical. Still, the Ellipsoid algorithm was a great breakthrough, giving hope for better algorithms and a new approach for defining progress measures for linear programming instances. Indeed in 1984, Karmarkar came up with the interior points algorithm, with much better time dependence.

The interior-point algorithm is practical and often outperforms the simplex method on large enough instances. However, the algorithm as implemented is not identical to the one analyzed- implementations use different step sizes and other heuristics for which we do not have a precise analysis.

Nevertheless, even if (unlike quicksort) the rigorously analyzed algorithm is not identical to the practical implementations, the story of linear programming shows how crucial “proofs of concept” can be to introducing new ideas and techniques.

The flip side of “proof of concept” results are impossiblity results that rule out even “proof of concept ” algorithms. Impossibility results always come with fine print and caveats (for example, NP-hardness results always refer to worst case complexity). However, they still teach us about the structure of the problem and help define the contours of what we can expect to achieve.

## Character witness

“End to end analysis” is when we can prove the guarantees we need on algorithms we actually want to use. “Proof of concept” is when we prove the guarantees we need on impractical algorithms. A “character witness” result is when we prove something positive about an algorithm people use in practice, even if that positive property falls short of the guarantees we actually want.

While the term “character witness” sounds derogatory, such results can sometimes yield truly profound insights. A canonical example again comes from linear programming. While the simplex algorithm can be exponential in the worst case, in a seminal work, Spielman and Teng showed that it does run in polynomial-time if the input is slightly perturbed- this is so-called smoothed analysis. While the actual polynomial and the level of perturbation do not yield practical bounds, this is still an important result. It gives formal meaning to the intuition that the simplex only fails on “pathological” instances and initiated a new mode of analyzing algorithms between worst-case and average-case complexity.

In machine learning, a “character witness” result can take many forms. For example, some “character witness” results are analyses of algorithms under certain assumptions on the data, that even if not literally true, seem like they could be “morally true”. Another type of “character witness” result shows that an algorithm would yield the right results if it is allowed to run for an infinite or exponentially long time. Evaluating the significance of such results can be challenging. The main question is whether the analysis teaches us something we didn’t know before.

A third type of “character witness” results are quantitative bounds that are too weak for practical use but are still non-vacuous. For example, approximation algorithms with too big of an approximation factor, or generalization bounds with too big of a guaranteed gap. In such cases, one would hope that these bounds will be at least correlated with performance: algorithms with better bounds will also have higher quality output.

## Toy problems

The name “toy problem” also sounds derogatory, but toy problems or toy models can be extremely important in science, and machine learning is no different. A toy model is a way to abstract the salient issues of a model to enable analysis. Results on “toy models” can teach us about general principles that hold in more complex models.

When choosing a toy model, it is important not to mistake models that share superficial similarity with models that keep the salient issues we want to study. For example, consider the following two variants of deep neural networks:

• Networks that have standard architecture except that we make them extremely wide, with the width tending to infinity independently of all other parameters.
• Networks where all activation functions are linear.

Since the practical intuition is that bigger models are better, it may seem that such “ultra-wide” models are not toy models at all and should capture state of the art deep networks. However, the neural tangent kernel results show that these models become kernel models that do not learn their representation at all.

Intuitively, making activation functions linear seems pointless since the composition of linear functions is linear, and hence such linear networks are no more expressive than a layer one linear net. Thus such linear networks seem too much of a “toy model” (maybe a “happy meal model”?). Yet, it turns out that despite their limited expressivity, deep linear networks capture an essential feature of deep networks- the bias induced by the gradient descent algorithm. For example, running gradient descent on a depth two linear network translate to regularizing by a proxy of rank. In general, gradient descent on a deep linear network induces a very different geometry on the manifold of linear functions.

Hence, although the first model seems much more “real” than the second one, there are some deep learning questions where the second model is more relevant.

### Theory without theorems

One might think that the whole point of theory is to provide rigorous results: if we are not going to prove theorems, we might as well use benchmarks. Yet, there are important theoretical insights we can get from experiments. A favorite example of mine is the Zhang et al result that deep neural networks can fit random labels. This simple experiment ruled out in one fell swoop a whole direction for proving generalization bounds on deep nets. A theory work does not have to involve theorems: well-chosen experiments can provide important theoretical insights.

## Conclusion

The above was a stream-of-consciousness and very rough personal overview of questions and results in ML theory. In the coming seminar we will see results of all the types above, as well as many open questions.

Acknowledgements: Thanks to Yamini Bansal and Preetum Nakkiran for helpful comments (though they are not to blame for any mistakes!)

by Boaz Barak at January 15, 2021 09:41 PM UTC

from David Eppstein

by David Eppstein at January 15, 2021 06:18 PM UTC

### How to Estimate the Uncertainty of Predictions

from Aaron Roth

This is a post about a new paper Online Multivalid Learning: Means, Moments, and Prediction Intervals, that is joint work with Varun Gupta, Christopher Jung, Georgy Noarov, and Mallesh Pai. It is cross-posted to the new TOC4Fairness blog

Suppose you go and train the latest, greatest machine learning architecture to predict something important. Say (to pick an example entirely out of thin air) you are in the midst of a pandemic, and want to predict the severity of patients' symptoms in 2 days time, so as to triage scarce medical resources. Since you will be using these predictions to make decisions, you would like them to be accurate in various ways: for example, at the very least, you will want your predictions to be calibrated, and you may also want to be able to accurately quantify the uncertainty of your predictions (say with 95% prediction intervals). It is a fast moving situation, and data is coming in dynamically --- and you need to make decisions as you go. What can you do?

The first thing you might do is ask on twitter! What you will find is that the standard tool for quantifying uncertainty in settings like this is conformal prediction. The conformal prediction literature has a number of elegant techniques for endowing arbitrary point prediction methods with marginal prediction intervals: i.e intervals $(\ell(x), u(x))$ such that over the randomness of some data distribution over labelled examples $(x,y)$: $\Pr_{(x,y)}\left[y \in [\ell(x), u(x)]\right] \approx 0.95$ These would be 95% marginal prediction intervals --- but in general you could pick your favorite coverage probability $1-\delta$.

Conformal prediction has a lot going for it --- its tools are very general and flexible, and lead to practical algorithms. But it also has two well known shortcomings:

1. Strong Assumptions. Like many tools from statistics and machine learning, conformal prediction methods require that the future look like the past. In particular, they require that the data be drawn i.i.d. from some distribution --- or at least be exchangable (i.e. their distribution should be invariant to permutation). This is sometimes the case --- but it often is not. In our pandemic scenario, the distribution on patient features might quickly change in unexpected ways as the disease moves between different populations, as might the relationship between features and outcomes, as treatments advance. In other settings in which consequential decisions are being made about people --- like lending and hiring decisions --- people might intentionally manipulate their features in response to the predictive algorithms you deploy, in an attempt to get the outcome they want. Or you might be trying to predict outcomes in time series data, in which there are explicit dependencies across time. In all of these scenarios, exchangeability is violated.
2. Weak Guarantees. Marginal coverage guarantees are averages over people. 95% marginal coverage means that the true label falls within the predicted interval for 95% of people. It need not mean anything for people like you. For example, if you are part of a demographic group that makes up less than 5% of the population, it is entirely consistent with the guarantees of a 95% marginal prediction interval that labels for people from your demographic group fall outside of their intervals 100% of the time. This can be both an accuracy and a fairness concern --- marginal prediction works well for "typical" members of a population, but not necessarily for everyone else.

What kinds of improvements might we hope for? Lets start with how to strengthen the guarantee:

Multivalidity Ideally, we would want conditional guarantees --- i.e. the promise that for every $x$, that we would have $\Pr_{y}\left[y \in [\ell(x), u(x)] | x \right] \approx 0.95$. In other words, that somehow for each individual, the prediction interval was valid for them specifically, over the "unrealized" (or unmeasured) randomness of the world. Of course this is too much to hope for. In a rich feature space, we have likely never seen anyone exactly like you before (i.e. with your feature vector $x$). So strictly speaking, we have no information at all about your conditional label distribution. We still have to average over people. But we don't have to average over everybody. An important idea that has been investigated in several different contexts in recent years in the theory literature on fairness is that we might articulate a very rich collection of (generally intersecting) demographic groups $G$ corresponding to relevant subsets of the data domain, and ask for things that we care about to hold true as averaged over any group $S \in G$ in the collection. In the case of prediction intervals, this would correspond to asking for something like that simultaneously for every demographic group $S \in G$, $\Pr_{(x,y)}\left[y \in [\ell(x), u(x)] | x \in S \right] \approx 0.95$. Note here that an individual might be a member of many different demographic groups, and can interpret the guarantees of their prediction interval as averages over any of those demographic groups, at their option. This is what we can achieve --- at least for any such group that isn't too small.

And what kinds of assumptions do we need?

Adversarial Data Actually, its not clear that we need any! Many learning problems which initially appear to require distributional assumptions turn out to be solvable even in the worst case over data sequences --- i.e. even if a clever adversary, with full knowledge of your algorithm, and with the intent only to sabotage your learning guarantees, is allowed to adaptively choose data to present to your algorithm. This is the case for calibrated weather prediction, as well as general contextual prediction. It turns out to be the case for us as well. Instead of promising coverage probabilities of $1-\delta + O(1/T)$ after $T$ rounds on the underlying distribution, as conformal prediction is able to, (for us there is no underlying distribution) we offer empirical coverage rates of $1-\delta \pm O(1/\sqrt{T})$. This kind of guarantee is quite similar to what conformal prediction guarantees about empirical coverage.

More Generally Our techniques are not specific to prediction intervals. We can do the same thing for predicting label means, and predicting variances of the residuals of arbitrary prediction methods. For mean prediction, this corresponds to an algorithm for providing multi-calibrated predictions in the sense of Hebert-Johnson et al, in an online adversarial environment. For variances and other higher moments, it corresponds to an online algorithm for making mean-conditioned moment multicalibrated predictions in the sense of Jung et al.

Techniques At the risk of boring my one stubbornly remaining reader, let me say a few words about how we do it. We generalize an idea that dates back to an argument that Fudenberg and Levine first made in 1995 --- and is closely related to an earlier, beautiful argument by Sergiu Hart --- but that I just learned about this summer, and thought was just amazing. It applies broadly to solving any prediction task that would be easy, if only you were facing a known data distribution. This is the case for us. If, for each arriving patient at our hospital, a wizard told us their "true" distribution over outcome severity, we could easily make calibrated predictions by always predicting the mean of this distribution --- and we could similarly read off correct 95% coverage intervals from the CDF of the distribution. So what? That's not the situation we are in, of course. Absent a wizard, we first need to commit to some learning algorithm, and only then will the adversary decide what data to show us.

But lets put our game theory hats on. Suppose we've been making predictions for awhile. We can write down some measure of our error so far --- say the maximum, over all demographic groups in $G$, of the deviation of our empirical coverage so far from our 95% coverage target. For the next round, define a zero sum game, in which we (the learner) want to minimize the increase in this measure of error, and the adversary wants to maximize it. The defining feature of zero-sum games is that how well you can do in them is independent of which player has to announce their distribution on play first --- this is the celebrated Minimax Theorem. So to evaluate how well the learner could do in this game, we can think about the situation involving a Wizard above, in which for each arriving person, before we have to make a prediction for them, we get to observe their true label distribution. Of course in this scenario we can do well, because for all of our goals, our measure of success is based on how well our predictions match observed properties of these distributions. The Minimax theorem tells us that (at least in principle --- it doesn't give us the algorithm), there must therefore also be a learning algorithm that can do just as well, but against an adversary.

The minimax argument is slick, but non-constructive. To actually pin down a concrete algorithm, we need to solve for the equilibrium in the corresponding game. That's what we spend much of the paper doing, for each of the prediction tasks that we study. For multicalibration, we get a simple, elementary algorithm --- but for the prediction interval problem, although we get a polynomial time algorithm, it involves solving a linear program with a separation oracle at each round. Finding more efficient and practical ways to do this strikes me as an important problem.

Finally, I had more fun writing this paper --- learning about old techniques from the game theoretic calibration literature --- than I've had in awhile. I hope a few people enjoy reading it!

by Aaron (noreply@blogger.com) at January 15, 2021 02:01 PM UTC

### How to Estimate the Uncertainty of Predictions

from TOC for Fairness

This is a post about a new paper Online Multivalid Learning: Means, Moments, and Prediction Intervals, that is joint work with Varun Gupta, Christopher Jung, Georgy Noarov, and Mallesh Pai.

Suppose you go and train the latest, greatest machine learning architecture to predict something important. Say (to pick an example entirely out of thin air) you are in the midst of a pandemic, and want to predict the severity of patients’ symptoms in 2 days time, so as to triage scarce medical resources. Since you will be using these predictions to make decisions, you would like them to be accurate in various ways: for example, at the very least, you will want your predictions to be calibrated, and you may also want to be able to accurately quantify the uncertainty of your predictions (say with 95% prediction intervals). It is a fast moving situation, and data is coming in dynamically — and you need to make decisions as you go. What can you do?

The first thing you might do is ask on twitter! What you will find is that the standard tool for quantifying uncertainty in settings like this is conformal prediction. The conformal prediction literature has a number of elegant techniques for endowing arbitrary point prediction methods with marginal prediction intervals: i.e intervals $(\ell(x), u(x))$ such that over the randomness of some data distribution over labelled examples $(x,y)$: $\Pr_{(x,y)}\left[y \in [\ell(x), u(x)]\right] \approx 0.95$ These would be 95% marginal prediction intervals — but in general you could pick your favorite coverage probability $1-\delta$.

Conformal prediction has a lot going for it — its tools are very general and flexible, and lead to practical algorithms. But it also has two well known shortcomings:

1. Strong Assumptions. Like many tools from statistics and machine learning, conformal prediction methods require that the future look like the past. In particular, they require that the data be drawn i.i.d. from some distribution — or at least be exchangable (i.e. their distribution should be invariant to permutation). This is sometimes the case — but it often is not. In our pandemic scenario, the distribution on patient features might quickly change in unexpected ways as the disease moves between different populations, as might the relationship between features and outcomes, as treatments advance. In other settings in which consequential decisions are being made about people — like lending and hiring decisions — people might intentionally manipulate their features in response to the predictive algorithms you deploy, in an attempt to get the outcome they want. Or you might be trying to predict outcomes in time series data, in which there are explicit dependencies across time. In all of these scenarios, exchangeability is violated.
2. Weak Guarantees. Marginal coverage guarantees are averages over people. 95% marginal coverage means that the true label falls within the predicted interval for 95% of people. It need not mean anything for people like you. For example, if you are part of a demographic group that makes up less than 5% of the population, it is entirely consistent with the guarantees of a 95% marginal prediction interval that labels for people from your demographic group fall outside of their intervals 100% of the time. This can be both an accuracy and a fairness concern — marginal prediction works well for “typical” members of a population, but not necessarily for everyone else.

What kinds of improvements might we hope for? Lets start with how to strengthen the guarantee:

Multivalidity Ideally, we would want conditional guarantees — i.e. the promise that for every $x$, that we would have $\Pr_{y}\left[y \in [\ell(x), u(x)] | x \right] \approx 0.95$. In other words, that somehow for each individual, the prediction interval was valid for them specifically, over the “unrealized” (or unmeasured) randomness of the world. Of course this is too much to hope for. In a rich feature space, we have likely never seen anyone exactly like you before (i.e. with your feature vector $x$). So strictly speaking, we have no information at all about your conditional label distribution. We still have to average over people. But we don’t have to average over everybody. An important idea that has been investigated in several different contexts in recent years in the theory literature on fairness is that we might articulate a very rich collection of (generally intersecting) demographic groups $G$ corresponding to relevant subsets of the data domain, and ask for things that we care about to hold true as averaged over any group $S \in G$ in the collection. In the case of prediction intervals, this would correspond to asking for something like that simultaneously for every demographic group $S \in G$, $\Pr_{(x,y)}\left[y \in [\ell(x), u(x)] | x \in S \right] \approx 0.95$. Note here that an individual might be a member of many different demographic groups, and can interpret the guarantees of their prediction interval as averages over any of those demographic groups, at their option. This is what we can achieve — at least for any such group that isn’t too small.

And what kinds of assumptions do we need?

Adversarial Data Actually, its not clear that we need any! Many learning problems which initially appear to require distributional assumptions turn out to be solvable even in the worst case over data sequences — i.e. even if a clever adversary, with full knowledge of your algorithm, and with the intent only to sabotage your learning guarantees, is allowed to adaptively choose data to present to your algorithm. This is the case for calibrated weather prediction, as well as general contextual prediction. It turns out to be the case for us as well. Instead of promising coverage probabilities of $1-\delta + O(1/T)$ after $T$ rounds on the underlying distribution, as conformal prediction is able to, we offer empirical coverage rates of $1-\delta \pm O(1/\sqrt{T})$ (since for us there is no underlying distribution). This kind of guarantee is quite similar to what conformal prediction guarantees about empirical coverage.

More Generally Our techniques are not specific to prediction intervals. We can do the same thing for many other distributional quantities. We work this out in the case of predicting label means, and predicting variances of the residuals of arbitrary prediction methods. For mean prediction, this corresponds to an algorithm for providing multi-calibrated predictions in the sense of Hebert-Johnson et al, in an online adversarial environment. For variances and other higher moments, it corresponds to an online algorithm for making mean-conditioned moment multicalibrated predictions in the sense of Jung et al.

Techniques At the risk of boring my one stubbornly remaining reader, let me say a few words about how we do it. We generalize an idea that dates back to an argument that Fudenberg and Levine first made in 1995 — and is closely related to an earlier, beautiful argument by Sergiu Hart — but that I just learned about this summer, and thought was just amazing. It applies broadly to solving any prediction task that would be easy, if only you were facing a known data distribution. This is the case for us. If, for each arriving patient at our hospital, a wizard told us their “true” distribution over outcome severity, we could easily make calibrated predictions by always predicting the mean of this distribution — and we could similarly read off correct 95% coverage intervals from the CDF of the distribution. So what? That’s not the situation we are in, of course. Absent a wizard, we first need to commit to some learning algorithm, and only then will the adversary decide what data to show us.

But lets put our game theory hats on. Suppose we’ve been making predictions for awhile. We can write down some measure of our error so far — say the maximum, over all demographic groups in $G$, of the deviation of our empirical coverage so far from our 95% coverage target. For the next round, define a zero sum game, in which we (the learner) want to minimize the increase in this measure of error, and the adversary wants to maximize it. The defining feature of zero-sum games is that how well you can do in them is independent of which player has to announce their distribution on play first — this is the celebrated Minimax Theorem. So to evaluate how well the learner could do in this game, we can think about the situation involving a Wizard above, in which for each arriving person, before we have to make a prediction for them, we get to observe their true label distribution. Of course in this scenario we can do well, because for all of our goals, our measure of success is based on how well our predictions match observed properties of these distributions. The Minimax theorem tells us that (at least in principle — it doesn’t give us the algorithm), there must therefore also be a learning algorithm that can do just as well, but against an adversary.

The minimax argument is slick, but non-constructive. To actually pin down a concrete algorithm, we need to solve for the equilibrium in the corresponding game. That’s what we spend much of the paper doing, for each of the prediction tasks that we study. For multicalibration, we get a simple, elementary algorithm — but for the prediction interval problem, although we get a polynomial time algorithm, it involves solving a linear program with a separation oracle at each round. Finding more efficient and practical ways to do this strikes me as an important problem.

Finally, I had more fun writing this paper — learning about old techniques from the game theoretic calibration literature — than I’ve had in awhile. I hope a few people enjoy reading it!

by Aaron Roth at January 15, 2021 02:00 PM UTC

### Assistant Professor at University of Toronto, Toronto, Canada (apply by January 28, 2021)

from CCI: jobs

The Department of Computer Science at the University of Toronto invites applications for up to two full-time tenure stream positions in the areas of Security and Cryptography. The appointments will be at the rank of Assistant Professor and will commence on July 1, 2021, or shortly thereafter.

Website: https://jobs.utoronto.ca/job/Toronto-Assistant-Professor-Security-and-Cryptography-ON/543569117/
Email: recruit@cs.toronto.edu

by shacharlovett at January 14, 2021 10:40 PM UTC

### Faculty position in Algorithms and Complexity at University of Edinburgh (apply by February 21, 2021)

from CCI: jobs

The School of Informatics at the University of Edinburgh invites applicants for Lecturer (“Assistant Professor”), and Reader (“Associate Professor”) in Algorithms and Complexity.

Website: https://elxw.fa.em3.oraclecloud.com/hcmUI/CandidateExperience/en/sites/CX_1001/job/295
Email: kousha@inf.ed.ac.uk

by shacharlovett at January 14, 2021 06:21 PM UTC

### Priming Random Restrictions

from Richard Lipton

Can we expand the base of their use?

 Technion commemoration src

Bella Subbotovskaya was doing Boolean complexity lower bounds in 1961. She originated the method of restrictions of Boolean functions and proved that the formula size of the parity function of ${n}$ variables is at least ${n^{3/2}}$. The parity bound was improved to quadratic in 1971 by Valeriy Khrapchenko, who passed away only last year and was the only child of the Soviet literature and arts leader Mikhail Khrapchenko. Then Johan Håstad strengthened the whole method to quadratic.

Today we discuss not possible further sharpenings but rather how to extend the ways of applying it.

Subbotovskaya was most unfortunately killed in a mysterious manner in 1982. George Szpiro in 2007 wrote an article reviewing the circumstances amid her questioning by the KGB for co-founding and hosting the “Jewish People’s University” amid rampant anti-Semitism in Soviet academies. She was also a talented violist and vocalist and investigated computer-assisted musical composition and analysis. See also her MacTutor biography.

As presented by Stasys Jukna in his 2012 book Boolean Function Complexity, Subbotovskaya gave an algorithm for optimizing the trimming of a formula over the ${\wedge,\vee,\neg}$ basis at its leaves by substituting for ${m}$ of the ${n}$ variables, one at a time. She used an averaging argument that readily extends to give the same-order bounds on random restrictions. Random restrictions have other advantages that have led to wonderful work besides those above, including the recent oracle separation of ${\mathsf{BQP}}$ from the polynomial hierarchy. We want to discuss targeting the restrictions toward possible particular sensitivities of the Boolean functions computed by the formulas.

## The Method

Subbotovskaya’s argument considers every ${\wedge}$ or ${\vee}$ gate ${g}$, in a minimal formula ${\phi}$ computing the function ${f(x_1,\dots,x_n)}$, that has ${x_i}$ or ${\bar{x}_i}$ as an input, for some ${i = 1}$ to ${n}$. Note that because ${\phi}$ is a formula, one of the two possible assignments ${a_i}$ to ${x_i}$ renders the other input ${T_{g,i}}$ to ${g}$ immaterial. If ${\phi}$ has ${s_n}$ leaves, then we can not only choose ${i}$ so that ${x_i}$ or ${\bar{x}_i}$ occurs at least ${\frac{s_n}{n}}$ times, we can set it to wipe out at least half of the subtrees ${T_{g,i}}$.

The final note is that doing so does not double-cancel any variable: if ${T_{g_i}}$ contains ${x_i}$ or ${\bar{x}_i}$, then assigning ${1 - a_i}$ to that occurrence leaves the function computed by ${g}$ unchanged—but that contradicts the minimality of ${\phi}$. Hence no subformula ${T_{g,i}}$ contains ${x_i}$ or ${\bar{x}_i}$. The upshot is that we can set ${x_i}$ so that the resulting formula ${\phi'}$ in ${n-1}$ variables has ${s_{n-1} \leq s_n - \frac{3s_n}{2n}}$ leaves. It is convenient to weaken this inequality by

$\displaystyle s_{n-1} \leq s_n\left(1 - \frac{3}{2}\frac{1}{n}\right) \leq s_n\left(1 - \frac{1}{n}\right)^{3/2},$

so that by iterating we can exploit the identity that for any ${k \leq n}$,

$\displaystyle \left(1 - \frac{1}{n}\right)\left(1 - \frac{1}{n-1}\right)\cdots\left(1 - \frac{1}{k+1}\right) = \frac{k}{n}.$

The upshot is that for any ${m}$, we can set ${n-m}$ variables so that the size ${s_{m}}$ of the resulting formula in ${m}$ variables obeys the bound

$\displaystyle s_{m} \leq s_n\left(\frac{m}{n}\right)^{3/2},\qquad\text{so}\qquad s_n \geq s_{m}\left(\frac{n}{m}\right)^{3/2}.$

Because the parity function is such that any way of setting ${n-1}$ variables leaves the formula needing (at least) one read of the remaining variable, we can carry this all the way down to ${m=1}$ to deduce ${s_n \geq n^{3/2}}$. Note: this is concrete, not asymptotic.

Via induction on gates ${g}$ deeper than those at the leaves, and techniques that adapt to how closely the two subtrees of ${g}$ are balanced, Håstad increased the exponent from ${3/2}$ to ${2}$ in all cases, on pain of carrying some lower-order terms that make the bounds asymptotic. His lemmas rely more on the restrictions being random, including one that estimates the probability that the restriction already leaves a formula dependent on only one variable.

## An Intuitive Framing

For sake of exposition we wave away the lower-order terms and write ${\gtrdot}$ for the numerical inequalities. We frame the restrictions with notation that can help us target subsets ${I}$ of variables to restrict, outside of which ${f}$ remains sensitive.

Definition 1 Let ${f : \{0,1\}^n \rightarrow \{0,1\}}$ be a Boolean function. We say that

$\displaystyle \textsf{free}(f,I,\alpha)$

holds provided ${I \subset [n]}$ and ${\alpha: I \rightarrow \{0,1\}}$ and for some ${x \neq y}$ in ${\{0,1\}^n}$ the following hold:

1. The inequality ${f(x) \neq f(y)}$ holds.

2. For all ${k \in I}$,

$\displaystyle x_k = y_k = \alpha(k).$

Thus ${\textsf{free}(f,I,\alpha)}$ holds provided the value ${f(x)}$ is not determined by just knowing ${x}$ on the set of positions in ${I}$. The remaining ${m}$ positions are needed.

Theorem 2 (Main Theorem of Random Restriction Theory) Let ${f : \{0,1\}^n \rightarrow \{0,1\}}$ be a Boolean function. Suppose that for a constant positive fraction of ${I,\alpha}$ with ${I}$ of size ${n-m}$,

$\displaystyle \textsf{free}(f,I,\alpha)$

holds. Then

$\displaystyle L(f) \gtrdot \frac{n^2}{m^2}.$

## An Example and Nonexample

To see how this scheme applies in the case of the parity function, we can set ${m=1}$. No matter which inputs are selected, the value of the function stays undetermined after ${n-1}$ inputs are set. That is, the following is always true:

$\displaystyle \textsf{free}(\textit{parity},I,\alpha),$

for all ${I}$ of size at most ${n-1}$ and all ${\alpha}$. Thus the main theorem says that the formula lower bound for the parity function is order ${n^2}$.

Note that the only property of the parity function ${f}$ being used is the global way it satisfies ${\textsf{free}(f,I,\alpha)}$. We want to look for other patterns that can yield good bounds. Of course, patterns of sensitivity of Boolean functions have been looked at in many ways. We want to try to build a bridge from them to number theory.

For a non-example, consider the set ${E}$ of even numbers. Let us number the variables so that ${x_1}$ stands for the ones place in binary notation, ${x_2}$ for the twos place, and so on. Then ${E}$ depends only on ${x_1}$. Whether ${\textsf{free}(f,I,\alpha)}$ holds depends entirely on ${1}$ not belonging to ${I}$. If ${m = n/c}$ for some ${c > 1}$ then it holds with constant probability, and the framework gives

$\displaystyle L(E) \gtrdot c^2.$

Well, OK, we really know ${L(E) = 1}$ so this is where the overhead hidden by ${\gtrdot}$ can’t be ignored. But at least this illustrates how the framework doesn’t get too excited about a frequently alternating numerical function. Similar remarks apply when the function is multiples of ${2^k}$ where ${k}$ is small.

## A Prime Objective

We would like to apply the random restrictions method to the Boolean function ${\pi(x)}$ that identifies whether the ${n}$-bit binary string ${x}$ denotes a prime number. It has been known for decades that the Boolean circuit complexity of ${\pi}$ is polynomial. This follows from the existence of a polynomial random algorithm for ${\pi}$ due to Solvay-Strassen or the later Miller-Rabin algorithm and Adleman’s connection between such algorithms and Boolean circuit complexity.

There are strong lower bounds on special circuit models, but nothing so general as formulas, let alone unrestricted circuits. Certainly ${\pi}$ is a harder function that parity, but we don’t know of a quadratic formula lower bound for it. Can we prove one in the framework?

Our intuition comes from considering various ways in which the primes are dense and uniformly—but not too regularly—distributed. Two simple ways are:

1. The average size gap between prime numbers is ${O(\log n)}$.

2. The maximum size gap between prime numbers is ${O(\log^{2} n)}$.

The first is of course a consequence of the prime number theorem. The second follows from the widely believed Cramér conjecture.

We will not really need this conjecture—we will not need it to hold everywhere. Our basic framework will only need it to hold for a positive fraction of small intervals of numbers that we sample. What we need instead is for a positive-fraction version to hold under a more liberal notion of “gap.”

Here is an outline to try to prove a formula size lower bound of ${n^{2-\epsilon}}$ on ${\pi(x)}$, for some ${\epsilon > 0}$. Let

$\displaystyle t = n^{1-\epsilon}.$

We will use the random restriction method for ${p = 1/t}$. Apply the method. The issue is now if we leave only ${m}$ input bits unset, then what is the probability that

$\displaystyle \textsf{free}(\pi,I,\alpha)?$

Certainly we cannot always leave ${1}$ bit unset like with the XOR function. Look at this:

$\displaystyle 11*111.$

The input is either

$\displaystyle 111111 = 63 \text{ or } 110111 = 55 .$

And both are not a prime. In general, our finger has come to rest on the question of what is known or can be shown about the behavior of the set of primes on “hypercubic” sets—that is, sets that add or subtract all sums from a selected subset of the powers of ${2}$.

## Open Problems

There must be other lower bounds on the primes. We cannot seem to find them. Any help?

by RJLipton+KWRegan at January 14, 2021 04:51 PM UTC

### To cheer you up in difficult times 17: Amazing! The Erdős-Faber-Lovász conjecture (for large n) was proved by Dong Yeap Kang, Tom Kelly, Daniela Kühn, Abhishek Methuku, and Deryk Osthus!

from Gil Kalai

Dong Yeap Kang, Tom Kelly, Daniela Kühn, Abhishek Methuku, and Deryk Osthus have just uploaded a paper to the arXive, A proof of the Erdős-Faber-Lovász conjecture. (I am thankful to Nati Linial and Ryan Alweiss for telling me about it.)

Here is one formulation of the conjecture:

Erdős-Faber-Lovász Conjecture: Let $F$ be a family of $k$-subsets of $\{1,2,\dots,n\}$. Suppose that every two sets in $F$ have at most one element in common, then $F$ can be divided into at most $n$ families so that every two sets in each of these families are disjoint.

Using some standard hypergraph jargon the conjecture can be stated as follows:

Erdős-Faber-Lovász Conjecture: The chromatic index of any linear hypergraph on $n$ vertices is at most $n$.

Here a linear hypergraph is a hypergraph where every two edges share at most one vertex. The chromatic index of a hypergraph is the number of colors needed to color the edges so that  every two intersecting edges have distinct colors.

The new result asserts

Theorem  (Kang, Kelly, Kühn, Methuku, and Osthus): For every sufficiently large n, every linear hypergraph H on n vertices has chromatic index at most n.

A stability versions of this result, which confirm a prediction of Jeff Kahn, is also provided. Of course, it will be interesting to settle the conjecture for all values of n.

This is an amazing breakthrough! Congratulations!

(from Wikipedea)

Daniella Kühn, Deryk Osthus, Tom Kelly, and Abhishek Methuku  (I will add a picture of Dong Yeap Kang when I will find Update: found! see below.)

## A little more

Background and links: Here is the Wikipedea article. We mentioned the conjecture in this post and this one.

Matching and Fractional version. In 1982 Paul Seymour proved that any linear hypergraph $H$ has a matching of size $e(H)/n$. In 1992 Jeff Kahn and Paul Seymour proved that the fractional chromatic index of any linear hypergraph on $n$ vertices is at most $n$.

Approximate version via the semi random method. In 1992 Jeff Kahn proved that chromatic index of any linear hypergraph on $n$ vertices is at most $n(1+o(1))$. Kahn used the Rödl nibble (aka semi-random) method. Kahn extended the result to list-coloring.

### Open problems

There are many related open questions and some are mentioned in the paper. It will be interesting to prove EFL conjecture for all values of $n$. EFL-conjecture extends to  beautiful generalization of Vizing’s theorem which is still open. Also the EFL-conjecture for list coloring is still open. See also this 1994 article by Jeff Kahn on Hypergraphs matching, covering and coloring problems.

Another interesting problem is:

The restricted Larman’s conjecture: Let $F$ be an intersecting family of $k$-subsets of $\{1,2,\dots,n\}$.  Then $F$ can be divided into at most $n$ 2-intersecting families, namely so that every two sets in each of these families have at least $2$ elements in common.

Little Update: A few more details on the last problem. As far as I know for the restricted Larman conjecture it is not even known that $F$ can always be divided into $(1+o(1))$ 2-intersecting families.

Now, call G strongly 2-intersecting if the intersection of all sets in $G$ has at least two elements.  Furedi and Seymour proved that F can be fractionally covered by $n$ strongly 2-intersecting subfamilies. They conjectured that F can always be covered by $n$ strongly 2-intersecting families, however Jeff Kahn and I disproved this conjecture.

by Gil Kalai at January 14, 2021 01:51 PM UTC

### Postdoc at HSE University (apply by February 14, 2021)

from CCI: jobs

HSE University’s Faculty of Computer Science now receives applications for the postdoctoral positions in several laboratories.
At present, there are ten open postdoctoral positions in ten out of twelve Faculty’s laboratories. Research areas range from algebraic topology and bioinformatics to data analysis, neural networks and process mining.

Website: https://cs.hse.ru/en/news/433773574.html
Email: abasov@hse.ru

by shacharlovett at January 14, 2021 12:42 PM UTC

### Launching TOC4Fairness

from TOC for Fairness

I am excited to launch the Simons Foundation’s Collaboration on the Theory of Algorithmic Fairness, funded by the Simons Foundation. This is a large-scale multi-institutional effort to accelerate the (already powerful) impact of TOC on the emerging area of algorithmic fairness. Beyond the ambitious research goals of this project (which we will discuss in future posts), we hope it will play an important role in community building and bridge building to other fields of research. In addition to launching this blog and our site, we are also launching a research seminar that will feature a diverse set of talks. We are very exited to have Annette Zimmermann as our inaugural speaker (a week from today).

If you want to join our seminar, please email toc4fairness-director@cs.stanford.edu and we will add you to our email list (which we will be careful not to overuse).

by Omer Reingold at January 13, 2021 03:45 PM UTC

### TR21-005 | Robust testing of low-dimensional functions | Anindya De, Elchanan Mossel, Joe Neeman

from ECCC papers

A natural problem in high-dimensional inference is to decide if a classifier $f:\mathbb{R}^n \rightarrow \{-1,1\}$ depends on a small number of linear directions of its input data. Call a function $g: \mathbb{R}^n \rightarrow \{-1,1\}$, a linear $k$-junta if it is completely determined by some $k$-dimensional subspace of the input space. A recent work of the authors showed that linear $k$-juntas are testable. Thus there exists an algorithm to distinguish between: 1. $f: \mathbb{R}^n \rightarrow \{-1,1\}$ which is a linear $k$-junta with surface area $s$, 2. $f$ is $\epsilon$-far from any linear $k$-junta with surface area $(1+\epsilon)s$, where the query complexity of the algorithm is independent of the ambient dimension $n$. Following the surge of interest in noise-tolerant property testing, in this paper we prove a noise-tolerant (or robust) version of this result. Namely, we give an algorithm which given any $c>0$, $\epsilon>0$, distinguishes between 1. $f: \mathbb{R}^n \rightarrow \{-1,1\}$ has correlation at least $c$ with some linear $k$-junta with surface area $s$. 2. $f$ has correlation at most $c-\epsilon$ with any linear $k$-junta with surface area at most $s$. The query complexity of our tester is $k^{\mathrm{poly}(s/\epsilon)}$. Using our techniques, we also obtain a fully noise tolerant tester with the same query complexity for any class $\mathcal{C}$ of linear $k$-juntas with surface area bounded by $s$. As a consequence, we obtain a fully noise tolerant tester with query complexity $k^{O(\mathrm{poly}(\log k/\epsilon))}$ for the class of intersection of $k$-halfspaces (for constant $k$) over the Gaussian space. Our query complexity is independent of the ambient dimension $n$. Previously, no non-trivial noise tolerant testers were known even for a single halfspace.

### How to have irritating files in subdirectory…

from Sariel Har-Peled

If you are formatting a paper for lipics, but don’t want all their files in the paper directory, add the following to the top of the main latex file:

\makeatletter
\def\input@path{{lipics/}{../lipics/}}
\makeatother
\documentclass[a4paper,USenglish,cleveref,autoref,thm-restate]%
{socg-lipics-v2019}

This is under the assumption that lipics/ subdirectory contains the .cls file.

by Sariel at January 12, 2021 03:53 PM UTC

### CALL FOR PROPOSALS: EC 2021 WORKSHOPS AND TUTORIALS

The EC 2021 conference invites proposals for tutorials and workshops to be held in conjunction with the ACM Conference on Economics and Computation (ACM EC). EC’21 will be held either virtually or in a hybrid format. In the latter case, tutorial/workshop organizers will have the opportunity to decide whether to hold their event in person or remotely.

### Tutorials

Tutorials provide an opportunity to educate the community about emerging topics of interest, or about topics from related fields that merit additional attention from the EC community.  A tutorial is an opportunity to invite colleagues and young researchers to get excited about a well-defined topic, to prepare them to dive into the literature and to guide them to the most exciting developments and open problems. Typically, tutorials run for half a day (3-3.5 hours) and consist of a series of presentations by experts in the field.

Tutorial proposals should contain:

• the title of the tutorial
• the names, contact information, and short biographies of the organizers
• a description of the tutorial theme
• the organization/format of the tutorial
• any related tutorials that have been run
• required facilities for the tutorial

Tutorial proposals should be emailed to ec21-tutorial-chairs@acm.org.

### Workshops

Workshops provide an opportunity to bring together researchers to discuss emerging areas of research in an informal forum. Workshop schedules should be designed to promote discussion and debate. A workshop may include invited talks, contributed talks, panel discussions, poster sessions, open problem sessions, presentations of work in progress, or any other activities that stimulate new ideas for research. It is up to the workshop organizers to determine the format and technical content of each workshop and to solicit contributions, but we encourage workshops that devote some time to contributed content.

Workshop proposals should contain:

• the title of the workshop
• the names, contact information, and short biographies of the organizers
• the names of confirmed or candidate participants
• a description of the workshop theme
• the reviewing process for participants
• the organization/format of the workshop
• any previous versions of the workshop
• required facilities for the workshop
• the desired workshop length (half-day, full day, etc.)

For accepted workshops, the desired length will be honored as closely as possible.

We especially encourage proposals that bring together participants with diverse backgrounds and experience.

Workshop proposals should be emailed to ec21-workshop-chairs@acm.org.

### Key dates:

March 1, 2021: Due date for submitting workshop or tutorial proposal

March 22, 2021: Tutorial & workshop proposal accept/reject notifications

July 19, 2021: Conference Tutorials

July 20-22, 2021: Conference technical program

July 23, 2021: Conference Workshops

### Tutorial co-chairs

Shengwu Li, Harvard University

Sigal Oren, Ben-Gurion University

### Workshop co-chairs

Hu Fu, University of British Columbia

Annie Liang, Northwestern University

by Noam Nisan at January 11, 2021 08:41 AM UTC

### The ACM History Committee call for proposals looks wrong

In November I (and prob everyone in the ACM) got an email that was a call for proposal from the ACM History committee. Here is a pointer to it.

One of the passages in it just seems wrong to me:

This fellowship program is designed to support research and/or curatorial projects related to ACM's professional and educational activities and/or to ACM's rich institutional history including its organization, SIG activities, and conferences.

I will give  examples of history projects that are interesting but do not fit the description. I challenge you to give me a history project that is interesting that does fit the bill.

1) Compare and contrast how NP-completeness developed in America and in the USSR. For the USSR side it would be an expansion of  Trakhtenbrot's article here. OH, no good, Leonid Levin was not a member of the ACM.

2) Measure how various CS fields have changed by looking at the topics on the CALL FOR PAPERS for a conference. This would be a MASSIVE expansion of my blog post on how CCC call for papers, changed , see here. OH, I could only do ACM conferences. STOC but not FOCS. Okay, it makes perfect sense to study only ACM conference. Oh wait. IT MAKES NO SENSE WHATSOEVER.

3) When did mathematicians begin looking at P vs NP seriously? (e.g.,  In  The Handbook of Mathematical Logic from 1977 only one article mentions P vs NP: Michael Rabin's article on Decidable theories mentioned that even decidable theories may take a long time to decide and noted the importance of the P vs NP problem). How did MATH and CS  interact early on and now? This would need  one to look at math journals. How many of those are ACM? I would guess very few.

4) When did engineers begin looking at P vs NP seriously, or even notice it? Same issue as the last item.

5) Looking at how Programming lang design has changed one would have to only look at conference and journals that were ACM. And for those that were ACM but then dropped ACM you might only be able to use them up to the cutoff year.

6) Which academic studies eventually lead to practical products people can use? This is a really broad topic so one could narrow it down to just academic studies that appeared in ACM journals or conferences. Is that a good way to narrow the focus? Spoiler alert: NO.

7) I recently filled out a survey about the future of theory and if it is insular (the topic of another post surely). Most of the questions were about STOC-FOCS'' The people who did the survey clearly do not care that one is ACM and one is IEEE. Does anyone? I ask non-rhetorically.

Despite the capitol letters, I am not so much mad as puzzled. So I ask non-rhetorically:

Q1) Did the ACM really mean for people to do history in a way where you can only use ACM materials?

Q2) If no then please rewrite the Call for Proposals.

Q3) If yes then give examples of studies that would be interesting.

I am not asking to be snarky, I really want to know. And I note that interesting is subjective.

(ADDED LATER- two historians who have worked with this kind of history left comments that indicate the grant is FINE with non-ACM stuff, so Q1 above seems to have the answer NO. Given that they should rewrite the call for proposal next time they do this. ALSO- the historians left pointers to VERY INTERESTING papers they wrote.)

by gasarch (noreply@blogger.com) at January 11, 2021 04:39 AM UTC

### TR21-004 | Junta Distance Approximation with Sub-Exponential Queries | Vishnu Iyer, Avishay Tal, Michael Whitmeyer

from ECCC papers

Leveraging tools of De, Mossel, and Neeman [FOCS, 2019], we show two different results pertaining to the tolerant testing of juntas. Given black-box access to a Boolean function $f:\{\pm1\}^{n} \to \{\pm1\}$ we give a poly$(k, \frac{1}{\varepsilon})$ query algorithm that distinguishes between functions that are $\gamma$-close to $k$-juntas and $(\gamma+\varepsilon)$-far from $k'$-juntas, where $k' = O(\frac{k}{\varepsilon^2})$. In the non-relaxed setting, we extend our ideas to give a $2^{\tilde{O}(\sqrt{k}/\varepsilon)}$ (adaptive) query algorithm that distinguishes between functions that are $\gamma$-close to $k$-juntas and $(\gamma+\varepsilon)$-far from $k$-juntas. To the best of our knowledge, this is the first subexponential-in-$k$ query algorithm for approximating the distance of $f$ to being a $k$-junta (previous results of Blais, Canonne, Eden, Levi, and Ron [SODA, 2018] and De, Mossel, and Neeman [FOCS, 2019] required exponentially many queries in $k$). Our techniques are Fourier analytical and introduce the new notion of "normalized influences'' that might be of independent interest.

### Open problem session of HUJI-COMBSEM: Problem #5, Gil Kalai – the 3ᵈ problem

from Gil Kalai

This post continues to describe problems presented at our open problems session back in November 2020. Here is the first post in the series.  Today’s problem was presented by me, and it was an old 1989 conjecture of mine.

A convex set K is centrally symmetric if $x \in K$ implies that $-x \in K$.

### Conjecture: Let P be a centrally symmetric d-polytope. Then P has at least 3ᵈ non-empty faces.

Here we count P itself as a face.

Before we continue let me mention a winter school on “The Interplay between High-Dimensional Geometry and Probability”, January 11-15. Among other exciting lectures Bo’az Klartag will give 3 lectures on Yuansi Chen’s work on the KLS conjecture, that we discussed in this post.

Now back to the 3ᵈ conjecture. A case of equality is the $d$-dimensional cube $C_d$. The faces of $C_d$ corresponds to (-1,1,*)-vectors of length $d$. Given such a vector it corresponds to a face of the cube whose vertices are obtained by replacing the *’s with ±1 in all possible ways.  Of course the dual of $C_d$ also has $3^d$ non-empty faces.

Schlegel diagram of the octahedral prism, a 4-dimensional Hanner polytope.  (attribution: Robert Webb’s Stella software as the creator of this image along with a link to the website: http://www.software3d.com/Stella.php. via Wikipedea.)

Here are some points raised in the discussion and further comments.

1. There are many cases of equality – all Hanner polytopes. Those are obtained from intervals by taking the two operations, a) Cartesian product, b) moving from a polytope P to its polar-dual P*, and apply them in all possible ways.
2. Nati asked if a $(2+\epsilon)^d$ lower bound is known. To the best of my knowledge even this is not known.
3. Gideon asked about the connections with Mahler’s conjecture that for every d-dimentional centrally symmetric body K, $vol(K)vol(K*) \ge 4^n/n!$  . Indeed the cases of equality are conjectured to be the same – the class of Hanner polytopes. (Mahler’s conjecture and its relations to combinatorics would deserve a separate post.) The recent paper Flag numbers and floating bodies by F Besau, C Schütt, EM Werner looks very relevant.
4. For simplicial polytopes the conjecture follows from results of Stanley (1986) and is related t0 a result by Barany and Lovasz (1982).
5. The conjecture is related to a 1979 theorem of Figiel, Lindenstrauss, and Milman that asserts that for every centrally symmetric $d$ polytope $\log f_0(P) \log f_{d-1}(P) \ge \gamma d$, for some universal constant $\gamma$.
6. A somewhat more general conjecture would be that the number of $k$-faces of the barycentric subdivision of $K$ is always as large as the number of $k$-faces of the barycentric subdivision of $C_d$. The case $k=0$ is the original conjecture and I regard the case $k=d-1$ (sometimes referred to as the “flag-number conjecture”) as even closer to Mahler’s conjecture. See the paper by Schmidt and Ziegler, Ten Problems in Geometry.
7. In this paper by Raman Sanyal, Axel Werner, and Günter M. Ziegler, the conjecture is proved for $d \le 4$. The proof is rather involved and rely on an extension of the lower bound theorem. (It does not extend to polyhedral spheres.)  The paper also gives counterexamples for two much stronger conjectures that I made in my 1989 paper.

1. Hanner polytopes form an exciting family of polytopes. They have the maximum Banach-Mazur distance from the Euclidean ball, and it is conjectured that only these polytopes attain this maximum. They also have the 2-3 Helly-property (namely every three pairwise intersecting translates have a point in common. It was proved by Lima and Jensen that this property holds only by Hanner polytopes!
2. The proof for the conjecture for d=4 involves the “extended lower bound inequality”.
3. Another conjecture of mine of a similar spirit is: a polyhedral d-manifold (more generally, a regular CW manifold) in dimension d which is not a sphere has at least $2^{(3d+4)/2}$ faces. (Including the empty face.)
4. Flag $(d-1)$-dimensional spheres also have at least $3^d$ faces. This follows from a theorem by Roy Meshulam.
5. Completely balanced $(d-1)$-dimensional spheres also have at least $3^d$ faces. (I think.) (What about zonotopes, you may ask.)
6. The class of spheres in items 4 and 5 are simplicial. So an extensions of these classes to classes of polyhedral spheres (and regular CW spheres) would be interesting.
7. d-Polytopes with no triangular 2-faces also have at least $3^d$ non-empty faces this follows from a theorem of Blind and Blind.
8. Looking around in papers that referred to my old paper I discovered the following appealing class of polytopes and interesting conjecture about them. A polytope P is 2-level if for every facet F of P there is a hyperplane parallel to the hyperplane through F that contains all vertices not in F. The conjecture is that for such a d-polytope P, $f_0(P)f_{d-1}(P)\le d2^{d+1}$. This conjecture was made in the paper Enumeration of 2-level polytopes by  Bohn,  Faenza,  Fiorini,  Fisikopoulos, Macchia, and Pashkovich. (Update: Giulia commented that the conjecture on 2-level polytopes by Bohn et al. was recently solved by Andrey Kupavskii and Stefan Weltge: https://arxiv.org/abs/2008.07153.)
9. Interesting classes of polytopes arising from graphs that are very near the conjecture can be found in the paper Face numbers of centrally symmetric polytopes from split graphs by Freij, Henze, Schmitt, and Ziegler. See also Independence complexes of perfect graphs by Freij and Henze in this report.

Update (Jan 12,2021): Looking at old correspondence with Ragnar Freij, Matthias Henze, and Günter Ziegler, I came across the following conjecture that related the flag number conjecture and Mahler’s conjecture. Let P be a centrally symmetric d-polytope. Let $1+m(P)=vol(K)/vol(K*)/ (4^d/d!)$, and let $1+fn(P)=mf(P)/(2^dd!).$ For a centrally symmetric $d$-polytope $P$, Mahler’s conjecture asserts that $m(P) \ge 0$ and the flag number conjecture asserts that $fn(P) \ge 0$.

### Conjecture:  fn(P) ≥ 4 m(P)

It is also conjectured that equality holds for Jensen’s polytopes of perfect graphs.

by Gil Kalai at January 10, 2021 03:23 PM UTC

### TR21-003 | Inverse-Exponential Correlation Bounds and Extremely Rigid Matrices from a New Derandomized XOR Lemma | Lijie Chen, Xin Lyu

from ECCC papers

In this work we prove that there is a function $f \in \textrm{E}^\textrm{NP}$ such that, for every sufficiently large $n$ and $d = \sqrt{n}/\log n$, $f_n$ ($f$ restricted to $n$-bit inputs) cannot be $(1/2 + 2^{-d})$-approximated by $\textrm{F}_2$-polynomials of degree $d$. We also observe that a minor improvement (e.g., improving $d$ to $n^{1/2+\varepsilon}$ for any $\varepsilon > 0$) over our result will imply $\textrm{E}^\textrm{NP}$ cannot be computed by depth-$3$ $\textrm{AC}^0$-circuits of $2^{n^{1/2 + \varepsilon}}$ size, which is a notoriously hard open question in complexity theory. Using the same proof techniques, we are also able to construct extremely rigid matrices over $\textrm{F}_2$ in $\textrm{P}^\textrm{NP}$. More specifically, we show that for every constant $\varepsilon \in (0,1)$, there is a $\textrm{P}^\textrm{NP}$ algorithm which on input $1^n$ outputs an $n\times n$ $\textrm{F}_2$-matrix $H_n$ satisfying $\mathcal{R}_{H_n}(2^{\log^{1 - \varepsilon} n}) \ge (1/2 - \exp(-\log^{2/3 \cdot \varepsilon} n) ) \cdot n^2$, for every sufficiently large $n$. This improves the recent $\textrm{P}^\textrm{NP}$ constructions of rigid matrices in [Alman and Chen, FOCS 2019] and [Bhangale et al., FOCS 2020], which only gives $\Omega(n^2)$ rigidity. The key ingredient in the proof of our new results is a new derandomized XOR lemma based on approximate linear sums, which roughly says that given an $n$-input function $f$ which cannot be $0.99$-approximated by certain linear sum of $s$ many functions in $\mathcal{F}$ within $\ell_1$-distance, one can construct a new function $\textrm{Amp}^f$ with $\widetilde{O}(n)$ input bits, which cannot be $(1/2+s^{\Omega(1)})$-approximated by $\mathcal{F}$-functions. Taking $\mathcal{F}$ to be a function collection containing low-degree $\textrm{F}_2$-polynomials or low-rank $\textrm{F}_2$-matrices, our results are then obtained by first using the algorithmic method to construct a function which is weakly hard against linear sums of $\mathcal{F}$ in the above sense, and then apply the derandomized XOR lemma to $f$. We obtain our new derandomized XOR lemma by giving a generalization of the famous hardcore lemma by Impagliazzo. Our generalization in some sense constructs a non-Boolean hardcore of a weakly hard function $f$ with respect to $\mathcal{F}$-functions, from the weak inapproximability of $f$ by any linear sum of $\mathcal{F}$ with bounded $\ell_p$-norm. This generalization recovers the original hardcore lemma by considering the $\ell_{\infty}$-norm. Surprisingly, when we switch to the $\ell_1$-norm, we immediately rediscover Levin's proof of Yao's XOR Lemma. That is, these first two proofs of Yao's XOR Lemma can be unified with our new perspective. For proving the correlation bounds, our new derandomized XOR lemma indeed works with the $\ell_{4/3}$-norm.

### TR21-002 | Fooling Constant-Depth Threshold Circuits | Pooya Hatami, William Hoza, Avishay Tal, Roei Tell

from ECCC papers

We present new constructions of pseudorandom generators (PRGs) for two of the most widely-studied non-uniform circuit classes in complexity theory. Our main result is a construction of the first non-trivial PRG for linear threshold (LTF) circuits of arbitrary constant depth and super-linear size. This PRG fools circuits with depth $d\in\mathbb{N}$ and $n^{1+\delta}$ wires, where $\delta = 2^{-O(d)}$, using seed length $O(n^{1-\delta})$ and with error $2^{-n^{\delta}}$. This tightly matches the best known lower bounds for this circuit class. As a consequence of our result, all the known hardness for LTF circuits has now effectively been translated into pseudorandomness. This brings the extensive effort in the last decade to construct PRGs and deterministic circuit-analysis algorithms for this class to the point where any subsequent improvement would yield breakthrough lower bounds. Our second contribution is a PRG for De Morgan formulas of size $s$ whose seed length is $s^{1/3+o(1)} \cdot \mathrm{polylog}(1/\epsilon)$ for error $\epsilon$. In particular, our PRG can fool formulas of sub-cubic size $s=n^{3-\Omega(1)}$ with an exponentially small error $\epsilon=\exp({-n^{\Omega(1)}})$. This significantly improves the inverse-polynomial error of the previous state-of-the-art for such formulas by Impagliazzo, Meka, and Zuckerman (FOCS 2012), and again tightly matches the best currently-known lower bounds for this class. In both settings, a key ingredient in our constructions is a pseudorandom restriction procedure that has tiny failure probability, but simplifies the function to a non-natural "hybrid computational model" that combines several computational models. As part of our proofs we also construct "extremely low-error" PRGs for related circuit classes; for example, we construct a PRG for the class of functions computable by an arbitrary function of $s$ linear threshold functions that can handle even the extreme setting of parameters $s=n/\mathrm{polylog}(n)$ and $\epsilon=2^{-n/\mathrm{polylog}(n)}$.

### Samuel Eilenberg Assistant Professor at University of Warsaw (apply by February 12, 2020)

from CCI: jobs

The Faculty of Mathematics, Informatics and Mechanics at University of Warsaw (MIM UW) invites applications for positions of an assistant professor (“adiunkt” in Polish) in Computer Science, starting on 1st October 2021 (or 1st Feb 2022).

Note: compared to a standard assistant professor position, Samuel Eilenberg Assistant Professor comes with reduced teaching load and increased salary.

Website: https://www.mimuw.edu.pl/sites/default/files/konkursy/wmim_1210_ek_01_2021_en.pdf
Email: kowalik@mimuw.edu.pl

by shacharlovett at January 08, 2021 05:14 PM UTC

### Assistant professor at University of Warsaw (apply by February 12, 2021)

from CCI: jobs

The Faculty of Mathematics, Informatics and Mechanics at University of Warsaw (MIM UW) invites applications for positions of an assistant professor (“adiunkt” in Polish) in Computer Science, starting on 1st October 2021 (or 1st Feb 2022).

Website: https://www.mimuw.edu.pl/sites/default/files/konkursy/wmim_1210_ek_03_2021_en.pdf
Email: kowalik@mimuw.edu.pl

by shacharlovett at January 08, 2021 05:11 PM UTC

### To all Trumpists who comment on this blog

from Scott Aaronson

The violent insurrection now unfolding in Washington DC is precisely the thing you called me nuts, accused me of “Trump Derangement Syndrome,” for warning about since 2016. Crazy me, huh, always seeing brownshirts around the corner? And you called the other side violent anarchists? This is all your doing. So own it. Wallow in it. May you live the rest of your lives in shame.

Update (Jan. 7): As someone who hasn’t always agreed with BLM’s slogans and tactics, I viewed the stunning passivity of the police yesterday against white insurrectionists in the Capitol as one of the strongest arguments imaginable for BLM’s main contentions.

by Scott at January 06, 2021 09:01 PM UTC

### My Survey on Hilbert's 10th for particular (d,n) is ready for you to help me on!

Hilbert's 10th problem is  (in modern terminology) to find an algorithm that will, given a poly

p(x1,...,xn) in Z[x1,...,xn], determine if it has a solution in the integers.

This was shown undecidable by the combined efforts of Davis-Putnam-Robinson and Matiyasevich.

I posted here about the question of: For which (d,n) is H10 undecidable when restricted to degree d and number of vars n?

I was  invited to make an article out of the blog post and what I learned from the comments  for  the Algorithms column of BEATCS. That first draft is

Did I miss an important reference? (Note- I had meant to get out Matiyasevich's book on H10 but have been unable to because of the Pandemic, so if you have it my have stuff I need to know.)

Did I misspell Matiyasevich?

Did I say something stupid?

Incidentally I am usually the one who is asking someone else to do an open problems column, a book review, a guest blog. Nice to be the askee instead of the asker for a change of pace.

by gasarch (noreply@blogger.com) at January 06, 2021 01:13 AM UTC

### Predictions For 2021

from Richard Lipton

This year is a Blum integer, 43*47

 2017 article, not via Zoom

Allan Lichtman correctly predicted the 2020 presidential election, based on a 7-6 edge in “keys” to Joe Biden. His model, which he developed in 1981 with the late Vladimir Keilis-Borok, simply says that the incumbent party will lose the presidential election if 6 or more keys go against them. It has been right in every election since 1984.

Today we make some predictions for the new year, 2021.

Lichtman has not ventured a prediction for today’s Senate runoff elections in Georgia. Last November 19, he penned an article eviscerating the performance of poll aggregation in the November 3 election. We have considered writing a post about that and about statistical predictions during the pandemic, in the direction of taking more moderate assessments of how various polls, predictions, and policies have fared.

Polls of today’s races currently show a 1–2 percentage point edge for the Democratic candidates, but there are multiple caveats: The difference is just about one standard deviation, well within the margin of error even after aggregation. The polls have gyrated with multiple lead flips this past month; two polls from the past few days collected here show both Republican candidates tied or ahead. Many pollsters have avoided these races, and the five most recent polls all have sample sizes well under 1,000. The count beginning tonight may show a zigzag blue-to-red-to-“blur” shift for reasons of which votes are tallied when that we discussed generally here.

With that said, let us address non-election predictions.

## Previous Predictions

Back in 2019 we made some predictions that were easy:

• No circuit lower bound ${1000n}$ or better will be proved for SAT. Well that’s a freebie.

• A computer scientist will win a Nobel Prize. No—indeed, less close than other years.

• At least five claims that ${{\mathsf{P}=\mathsf{NP}}}$ and five that ${{\mathsf{P} \neq \mathsf{NP}}}$ will be made. Gerhard Woeginger’s claims page is still paused at 2016, so we can only judge from our own window of communications made to us or commented in the blog. All we can say is that the number in each case is “order-of” ${5}$.

• A “provably” secure crypto-system will be broken. For this one we don’t have to check any claims. We just pocket the “yes” answer.

We skipped 2020—maybe that was smart of us? Our point about predictions in 2019 was the importance of “when” as opposed to “what.” With regard to some predictions, we can take the pandemic is as a global postponement of “when.”

We claim complete and ongoing timeliness with another 2019 prediction, however. In full, we wrote:

• Quantum supremacy will be proved—finally. But be careful: there is a problem with this whole direction. See the next section.

The section two years ago talked about the difficulties of proving such a claim. A claim was duly made in October 2019, which we evaluated here, but arguments over “proof” have not abated. They have just recently been summarized by Gil Kalai here, and also here by Scott Aaronson in regard to an independent claim by researchers from USTC in China.

## Predictions for 2021

Here are some new predictions for 2021. We will try and skip easy ones like the above. We will try to make some that are surprising. We’ve made predictions about evidence for life on exoplanets before; we do something different there too. We do not venture any about the course of the pandemic.

1. Quantum algorithms: ${\mathsf{BQP}}$ will be proved to be in sub-exponential time.

2. Further work with AlphaFold will lead to a proof of a sub-exponential time algorithm for protein folding.

3. In consequence or independently, a SAT algorithm will be discovered that runs in sub-exponential time, thereby refuting various forms of the Exponential Time Hypothesis. What do we mean by “sub-exponential” time here? Dick says the time for the SAT algorithm will be ${2^{\sqrt{n}\log n}}$. Ken more modestly says time ${2^{O(n/\log n)}}$.

4. Another Clay prize problem (besides ${\mathsf{P}}$ versus ${\mathsf{NP}}$) will be claimed as solved, and then retracted.

5. A new kind of sorting network of size ${O(n \log n)}$ will be discovered, inspired by a connection to ${O(n\log n)}$ integer multiplication.

6. A new factoring algorithm will be discovered that beats the number field sieve.

7. A lower bound argument will show that clique requires formula size ${n^{10}}$ over the ${\vee,\wedge,\neg}$ basis.

8. Significant chemical evidence will be observed for the existence of life on a nonexoplanet.

9. Dogs will reach 25% of the proficiency of cats at navigating home obstacle courses, but not before the third quarter of 2021.

10. Privacy-preserving apps for doodling during Zoom meetings will become hot sellers by midyear.

## Open Problems

What are your predictions for 2021?

by RJLipton+KWRegan at January 05, 2021 08:25 PM UTC

### SODA 2021: Funds for Student Registration Fees

from Kamathematics

UPDATE: An update from Shang-Hua Teng:

Dear TCS Students:

SIAM has today reopened the application portal for SODA21 Student Registration Waiver. The new deadline will be this Friday, January 8 at 11:59pm EST.

SODA21 will be held virtually on Jan. 10 – Jan. 13, 2021.

Please note that having paper(s) in SODA and its associated Symposiums is not a requirement. For this year’s application, the letter of recommendation from advisors is not required as well. The goal of the support from Google, Microsoft, and ACM is to increase opportunity for students to attend this premier TCS conference.

Looking forward to seeing you in SODA’21.

ORIGINAL MESSAGE: Please see below for a message from Shang-Hua Teng, regarding the possibility of waivers for SODA 2021 registration for students.

Dear TCS students:

By now, it is hard to overestimate the impact of the COVID19 pandemic to society. However, like every challenge, it has created some opportunities. For example, essentially all major conferences in TCS this year have been transformed into virtual ones, making them more accessible to scholars/students across the world (of course at the expense of traditional interactions).

ACM-SIAM Symposium on Discrete Algorithms (SODA21) will be held virtually this year, on Jan. 10 – 13, 2021. As you may know, this is the premier conference on algorithms .

See https://www.siam.org/conferences/cm/conference/soda21

Thanks to our industry partners and ACM SIGACT group, SODA has some funds for covering student registrations. I am writing to informing you this opportunity and encourage you to apply:
See:
1. https://awards.siam.org/
2. https://www.siam.org/conferences/cm/lodging-and-support/travel-support/soda21-travel-support
That deadline is Dec. 27, 2020. Like before, having papers in SODA is not prerequisite.

Shang-Hua Teng
On Behalf of SODA Steering Committee

by Gautam at January 05, 2021 07:25 PM UTC

### Finding global minima with kernel approximations

from Francis Bach

Last month, I showed how global optimization based only on accessing function values can be hard with no convexity assumption. In a nutshell, with limited smoothness, the number of function evaluations has to grow exponentially fast in dimension, which is a rather negative statement. On the positive side, this number does not grow as fast for highly smooth functions. However, optimal algorithms end up approximating the whole function to minimize, and then minimizing the approximation, which takes exponential time in general. In this post, I will describe very recent work with Alessandro Rudi and Ulysse Marteau-Ferey [1] that essentially jointly approximates the function and minimizes the approximation.

Our task for this post will be to minimize a function $f: \mathbb{R}^d \to \mathbb{R}$, for which we know the global minimizer is already located in a certain bounded set $\Omega$ (typically a large ball). We can query values of $f$ at any point in $\Omega$.

It turns out it can be formulated as a convex optimization problem (there has to be a catch in this statement, stay tuned).

## All optimization problems are convex!

We consider the following minimization problem: $$\tag{1} \sup_{c \in \mathbb{R}} \ c \ \ \mbox{ such that } \ \ \forall x \in \Omega, \ f(x) \geqslant c.$$ As illustrated below, this corresponds to finding the largest lower bound $c$ on the function.

Assuming $f$ attains its global minimizer $f_\ast$ in $\Omega$, then the optimal value of the problem above is attained and is clearly $f_\ast$, that is, the problem in Eq. (1) leads to the minimal value of $f$.

Moreover, each constraint in Eq. (1) is a linear constraint in $c$, and the objective is linear. Hence we obtain a linear programming problem which is a particularly simple instance of a convex optimization problem. Thus, all optimization problems are convex!

The catch here is that the set of inequalities is dense, that is, in general, $\Omega$ contains infinitely many elements, and thus the number of constraints is infinite.

Duality. It is worth deriving the convex optimization dual (being loose here in terms of regularity and integrability conditions). We add a Lagrange multiplier $\mu(x) \geqslant 0$ for each of the constraints $f(x) \, – c \geqslant 0$, and consider the Lagrangian
$$\mathcal{L}(c,\mu) = c + \int_{\Omega} \! \mu(x) \big( f(x) \, – c\big) dx.$$ Maximizing it with respect to $c$ leads to the constraint $\int_\Omega \mu(x) dx = 1$, that is, $\mu$ represents a probability distribution, and the dual problem is $$\inf_{\mu \in \mathbb{R}^\Omega} \int_{\Omega} \! \mu(x) f(x) dx \ \ \mbox{ such that } \int_\Omega\! \mu(x) dx = 1 \mbox{ and } \forall x \in \Omega, \ \mu(x) \geqslant 0.$$ In words, we minimize the expectation of $f$ with respect to all probability distributions on $\Omega$ with densities. Here, the infimum is obtained from any probability measure that puts mass only on the minimizer of $f$ on $\Omega$. This measure is typically not absolutely continuous, and thus the infimum is not attained (no big deal).

This is also a convex optimization problem, with a linear objective function on the infinite-dimensional set of probability measures.

Overall, we get convex but (apparently) useless problems. We will first need an efficient way to represent non-negativity of functions (this will be applied to $f(x) \, – c$).

## Positivity with “sums of squares”

We now start to make some assumptions on the function $f$ to be minimized. We will assume that it can be written as $$\forall x \in \Omega, \ f(x) = \langle \phi(x), F \phi(x) \rangle,$$ for some feature vector $\phi(x)$ in some Hilbert space $\mathcal{H}$, and $F$ a symmetric linear operator on $\mathcal{H}$.

We will need infinite dimensions later, but you can safely assume that $\mathcal{H}$ is a classical Euclidean space (e.g., $\mathbb{R}^{{\rm dim}(\mathcal{H})}$), and that we have $$\forall x \in \Omega, \ f(x) = \phi(x)^\top F \phi(x) = \sum_{i,j=1}^{{\rm dim}(\mathcal{H})} F_{ij} \phi(x)_i \phi(x)_j.$$

We will also assume that the constant function on $\Omega$ can be represented as well, that is, there exists $u \in \mathcal{H}$ such that $\forall x \in \Omega, \ \langle u, \phi(x) \rangle = 1.$ This is to ensure that we can represent within the optimization problem above the function $f(x) \, – c = \langle \phi(x), (F – c \, u \otimes u) \phi(x) \rangle$.

The most classical example is the set of polynomials of degree $2r$ for which we can choose $\phi(x)$ to be of dimension $d+r \choose r$ and contain all monomials are degree less than $r$. We will go infinite-dimensional and link this to kernels later in this post.

You may wonder why we are not simply modeling the function $f$ with a classical formulation $f(x) = \langle w, \phi(x) \rangle$. This is precisely to be able to enforce positivity.

Positivity through “sum-of-squares”. We can assume that the operator $F$ is positive-semidefinite (which we will denote $F \succcurlyeq 0$), that is, all of its eigenvalues are non-negative, or, for all $h \in \mathcal{H}$, $\langle h, F h \rangle \geqslant 0$. Then this implies that the function $f$ is pointwise positive. Thus: $$\tag{2} F \succcurlyeq 0 \ \ \Rightarrow \ \ \forall x \in \Omega, \ f(x) = \langle \phi(x), F \phi(x) \rangle \geqslant 0.$$

We refer to such functions as “sum-of-squares”. Indeed, given the spectral decomposition of $F$ as $$F = \sum_{i \in I} \lambda_i h_i \otimes h_i,$$ with $I$ a countable set, and $(h_i)_{i \in I}$ an orthonormal family in $\mathcal{H}$, we have $\lambda_i \geqslant 0$ for all $i \in I$, and thus we can write $$f(x) = \langle \phi(x), F \phi(x) \rangle = \sum_{i \in I} \lambda_i \langle \phi(x) , (h_i \otimes h_i) \phi(x) \rangle = \sum_{i=1}^n \big( \sqrt{\lambda_i} \langle h_i , \phi(x) \rangle \big)^2.$$ That is, the function $f$ is a sum of squares of functions.

A key question is whether the implication in Eq. $(2)$ can be reversed. In other words, can we write all non-negative functions as sum-of-squares? The answer will depend on the chosen feature map $\phi: \Omega \to \mathcal{H}$.

Polynomials. For polynomials, the answer is known to be positive in dimension 1, but negative in higher dimensions (see the nice review in [2]): that is, unless $d=1$, or $d=2$ and $r=2$, there are polynomials of degree $2r$ in dimension $d$ that are non-negative but are not the sum of squares of polynomials.

Keeping in mind the difference being non-negative and being a sum of squares, let’s look how this can be used for global optimization of polynomials.

## Polynomial optimization with sums-of-squares

We now consider that $f$ is a polynomial of degree $2r$, and that $\phi(x)$ is of dimension $d+r \choose r$ and contain all monomials of degree less than $r$. We consider representing the constraint $f(x)\geqslant c$ in Eq. (1), as $f(x) = c + \langle \phi(x), A \phi(x) \rangle$ for some positive-definite operator (here a matrix) $A$. With the notations above, this is equivalent to $F = c u \otimes u + A$ and $A \succcurlyeq 0$. This leads to the following optimization problem (no localization set $\Omega$ is used): $$\tag{3} \max_{c \in \mathbb{R}, A \succcurlyeq 0} c \ \mbox{ such that } \ \forall x \in \mathbb{R}^d, \ f(x) = c + \langle \phi(x), A \phi(x) \rangle .$$ This in now a problem in finite dimension, which is a semidefinite programming problem, and can thus a priori be solved in polynomial time in $d+r \choose r$. See [3, 4] for more details.

At this point, beyond the potentially large dimension of the convex problem when $r$ and $d$ grow, there is a major concern: is the problem in Eq. (3) equivalent to the one in Eq. (1)? This is true if and only if the polynomial $f(x) \, – f_\ast$ is a a sum-of-squares, which given the discussion above may not happen as soon as $d \geqslant 2$.

In order to apply to all polynomials, the problem has to be changed. A simple reformulation is possible [3] by assuming that the global minimum of $f$ is attained within the set $\Omega = \{ x \in \mathbb{R}^d, \ x^\top x \leqslant R^2\}$ for some $R > 0$. It turns out (and this is a highly non-trivial fact) that all polynomials $f(x)$ which are positive on $\Omega$ can be written as $f(x) = q(x) + ( R^2 – x^\top x ) p(x)$, where $q(x)$ and $p(x)$ are sum-of-squares polynomials. The only issue is that the degrees of $p$ and $q$ are not known in advance so a sequence (a so-called “hierarchy”) of problems have to be solved with increasing degrees, each providing a lower bound on $f_\ast$.

When the degree is large enough so that the sum-of-squares representation exists for $f – f_\ast$, then the following problem $$\max_{c \in \mathbb{R}, A, B \succcurlyeq 0} c \mbox{ such that } \forall x \in \mathbb{R}^d, \ f(x) = c + \langle \phi(x), A \phi(x) \rangle + ( R^2 – x^\top x ) \langle \phi(x), B \phi(x) \rangle$$ is a semi-definite program, whose solution leads to $c = f_\ast$.

This remains a large semi-definite program whose order is not known in advance and for which complicated solvers are needed. I will now present our recent work where we go infinite-dimensional, and some of these issues are alleviated (but some others are created).

## Going infinite-dimensional

We now consider $\phi(x)$ to be infinite-dimensional. In order to allow finite-dimensional manipulations of various objects, we consider feature maps which are obtained from positive-definite kernels. More precisely, we consider the so-called Sobolev space $\mathcal{H}$ of functions on $\Omega$ with all partial derivatives up to order $s$ which are square-integrable. When $s > d/2$, then this is a reproducing kernel Hilbert space, that is, there exists a positive-definite kernel $k: \Omega \times \Omega \to \mathbb{R}$ such that the feature map $\phi(x) = k(\cdot,x)$ allows to access function values as $h(x) = \langle h, \phi(x) \rangle$ for all $h \in \mathcal{H}$. For the algorithm, we will only need access to the kernel; for example, for $s = d/2+1/2$, we can choose the usual exponential $k(x,y) = \exp( – \alpha \| x – y \|_2)$.

We now consider the same problem as for polynomials, but with this new feature map: $$\tag{4} \max_{c \in \mathbb{R}, A \succcurlyeq 0} c \ \mbox{ such that } \ \forall x \in \Omega, \ f(x) = c + \langle \phi(x), A \phi(x) \rangle .$$ In order to have an exact reformulation, we need that there exists some $A_\ast$ such that $\forall x \in \Omega, \ f(x) \, – f_\ast = \langle \phi(x), A_\ast \phi(x) \rangle$. It turns out to be true [1] with appropriate geometric conditions on $x$ (such as isolated second-order strict global minimizers within $\Omega$) and regularity (at least $s+3$ derivatives); see [1] for details.

This is all very nice, but the problem in Eq. (4) is still an infinite-dimensional problem (note that we have gone from a continuum of inequalities to elements of a Hilbert space, but that’s not much of a gain).

## Controlled approximation through sampling

We can now leverage classical and more recent techniques from kernel methods. The main idea is to subsample the set of equalities in Eq. (4), that is consider $n$ points $x_1,\dots,x_n \in \Omega$, and solve instead the regularized sampled problem $$\tag{5} \max_{c \in \mathbb{R}, A \succcurlyeq 0} c \ – \, \lambda \, {\rm tr}(A) \ \mbox{ such that }\ \forall i \in \{1,\dots,n\}, \ f(x_i) = c + \langle \phi(x_i), A \phi(x_i) \rangle ,$$ where $\lambda > 0$ is a regularization parameter.

Approximation guarantees. Beyond computational considerations, how big does $n$ need to be to obtain a value of $c$ which is at most $\varepsilon$ away from $f_\ast$? From last post, if $f$ is $m$-times differentiable, we cannot have $n$ smaller than $\varepsilon^{-d/m}$. As shown in [1], for points $x_1,\dots,x_n \in \Omega$ selected uniformly at random, then $n$ can be taken to of the order $\varepsilon^{-d/(m – d/2 – 3)}$ for a well-chosen Sobolev space $s$ and regularization parameter $\lambda$. This is not the optimal rate $\varepsilon^{-d/(m – d/2)}$ , but close when $m$ and $d$ are large.

Finite-dimensional algorithm. The problem in Eq. (5) is still infinite-dimensional. We can leverage a recent work, again with Ulysse Marteau-Ferey and Alessandro Rudi [5], that shows that a representer theorem exists for this type of problems: the optimization for the operator $A$ may be restricted to $A$ of the form $$A = \sum_{i,j} C_{ij} \phi(x_i) \otimes \phi(x_j)$$ for a positive definite matrix $C \in \mathbb{R}^{n \times n}$. This leads to a semi-definite programming problem of size $n$, whose running-time complexity is polynomial in $n$ (e.g, $O(n^{3.5})$), with standard general-purpose toolboxes (such as CVX).

In order to tackle large values $n$ which are larger than $1000$, simple iterative algorithms can be designed (see [1] for a damped Newton algorithm which does not require any singular value decompositions, with complexity $O(n^3)$ per iteration). Classical tools for efficient kernel methods, such as column sampling / Nyström method (see [6, 7] and references therein) or random features (see [8, 9] and references therein), could also be used to avoid a cubic or quadratic complexity in $n$.

Comparison with random search. Instead of solving the problem in Eq. (5), we could output the minimal value of $f(x_i)$ for $i \in \{1,\dots,n\}$, which is exactly random search if the points $x_i$ are random (this also happens to correspond to using $\lambda=0$ in Eq. (5)). This can only achieve an error $\varepsilon$ for $n$ of order $\varepsilon^{-d}$; in other words, it cannot leverage smoothness, while our algorithm can. The main technical reason is that subsampling inequalities provide weaker guarantees than subsampling equalities, and thus representing the non-negativity of a function through an equality (which our method essentially does) provides significant benefits.

## Duality

As always in convex optimization, it is beneficial to look at dual problems. It turns out that iterative algorithms are easier to build for the finite-dimensional dual problem to Eq. (5). We rather look now at the dual of Eq. (4) and make the parallel with the primal original problem in Eq. (1) and its dual.

Introducing a Lagrange multiplier $\mu(x)$ for the constraint $f(x) = c + \langle \phi(x), A \phi(x) \rangle$ (this time not constrained to be non-negative because we have an equality constraint), we can form the Lagrangian $$\mathcal{L}(c,A,\mu) = c + \int_{\Omega} \! \mu(x) \big( f(x) \, – c \, – \langle \phi(x), A \phi(x) \rangle \big) dx.$$ Maximizing with respect to $c$ again leads to the constraint $\int_\Omega \mu(x) dx = 1$, while minimizing with respect to $A \succcurlyeq 0$ leads to the constraint $\int_\Omega \mu(x) \phi(x) \otimes \phi(x) dx \succcurlyeq 0 .$ The dual problem thus becomes $$\inf_{\mu \in \mathbb{R}^\Omega} \int_{\Omega} \! \mu(x) f(x) dx \ \ \mbox{ such that } \int_\Omega\! \mu(x) dx = 1 \mbox{ and } \ \int_\Omega \mu(x) \phi(x) \otimes \phi(x)dx \succcurlyeq 0.$$ The impact of the representation of non-negative functions by quadratic forms in $\phi(x)$ is to replace the non-negativity contraint on all $\mu(x)$ by the positivity of the matrix $\int_\Omega \mu(x) \phi(x) \otimes \phi(x)dx$, which is weaker in general, but equivalent if $\phi(x)$ is “large” enough.

Note that the dual problem through a signed measure that should concentrate around the minimizers of $f$ leads to a natural candidate $\int_\Omega x \mu(x) dx$ for the optimizer (see more details in [1], and a similar dual formulation for polynomials in [10]).

## Simulations

In order to highlight how the optimization method works, I will first present how it minimizes the multi-modal function in dimension $d=2$ pictured below on $[-1,1]^2$.

We consider sampling a sequence of points $x_1, x_2,\dots \in \mathbb{R}^2$ from a quasi-random sequence (so that it fills the space more evenly that purely random). The random points are displayed in purple while the solution of our algorithm is displayed in red, with increasing numbers $n$ of sampled points. On the right plot, the model $c + \langle \phi(x), A \phi(x) \rangle$ is displayed, showing that the model becomes more accurate as $n$ grows.

We also consider a problem in dimension $d=8$, where we compare random sampling and our algorithm (“gloptikernel”), as $n$ grows (see [1] for the experimental set-up).

Clearly, the chosen baseline (random sampling) is rudimentary, though optimal up to logarithmic terms for non-smooth problems where only Lipschitz-continuity is assumed. We can also compare to gradient descent with random restarts (“random+GD” below, with the same overall number of function evaluations). As shown below, when the gradient information is robust, this baseline performs similarly (left plot), while when a high frequency component is added to the function (which makes the gradient information non reliable, right plot), the local descent algorithm needs more function evaluations.

It would be interesting to compare this new algorithms to other global optimization algorithms based on function evaluations, such as CMA-ES or Bayesian optimization.

## Conclusion

In this post, I presented an algorithm for global optimization that can provably leverage smoothness with an algorithm which has a running time which is polynomial in the dimension and the number of function evaluations. For large smoothness factors, the number of such evaluations has a dependence on the precision $\varepsilon$ which has no exponential dependence in dimension in the exponent (but still in the constants: we cannot optimize multivariate polynomials in polynomial time…).

These ideas can be extended in a number of ways.

General relaxation tool for dense inequalities. We can generalize the previous developments to all problems with a continuum of constraints on a variable $\theta$, such as, $$\forall x \in \Omega, \ H(\theta,x) \geqslant 0.$$ The corresponding Lagrange multipliers also satisfy $$\forall x \in \Omega, \ \mu(x) \geqslant 0$$ (there may be other ones like in the problem above). Now the question is: should we use the non-negative modelisation with quadratic forms in the primal or in the dual? Or equivalently, replace the non-negativity constraint by the positivity of the integral of $\phi(x) \otimes \phi(x)$ in the dual or in the primal?

The answer depends on the expected smoothness of the function which is non-negative. In the optimization problem, we expect $f(x) \, – c$ to be smooth at the optimum, while $\mu(x)$ which is approximating a Dirac clearly isn’t; therefore, the choice is clear (at least in retrospect since we originally tried to model $\mu(x)$ as a quadratic form…).

Constrained optimization. Following [3], we can apply the same algorithmic technique to constrained optimization, by formulating the problem of minimizing $f(x)$ such that $g(x) > 0$ as maximizing $c$ such that $f(x) = c + p(x) + g(x)q(x)$, and $p, q$ non-negative functions. We can (at least in principle) then replace the non-negative constraints by $p(x) = \langle \phi(x),A\phi(x)\rangle$ and $q(x) =\langle \phi(x),B\phi(x)\rangle$ for positive operators $A$ and $B$. All this remains to be explored in more details.

Acknowledgements. I would like to thank Alessandro Rudi and Ulysse Marteau-Ferey for proofreading this blog post and making good clarifying suggestions.

## References

[1] Alessandro Rudi, Ulysse Marteau-Ferey, Francis Bach. Finding Global Minima via Kernel Approximations. Technical report arXiv:2012.11978, 2020.
[2] Walter Rudin. Sums of squares of polynomialsThe American Mathematical Monthly107(9), 813-821, 2000.
[3] Jean-Bernard Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796–817, 2001.
[4] Jean-Bernard Lasserre. Moments, Positive Polynomials and their Applications, volume 1. World Scientific, 2010.
[5] Ulysse Marteau-Ferey, Francis Bach, and Alessandro Rudi. Non-parametric models for non-negative functions. Advances in Neural Information Processing Systems, 33, 2020.
[6] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. Conference on Learning Theory, pages 185–209, 2013.
[7] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems, 2015.
[8] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. Advances in Neural Information Processing Systems, 30, 2017.
[9] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(1):714–751, 2017.
[10] Jean-Bernard Lasserre. The moment-SOS hierarchy and the Christoffel-Darboux kernel. Technical Report arXiv:2011.08566, 2020.

by admin at January 05, 2021 12:32 PM UTC

### News for December 2020

Happy 2021! We now cover the last papers of 2020: a (smaller) spread with graph property testing, distribution property testing, and circuit complexity.

Almost Optimal Bounds for Sublinear-Time Sampling of k-Cliques: Sampling Cliques is Harder Than Counting by Talya Eden, Dana Ron, and Will Rosenbaum (arXiv). In many settings, sampling and counting are tightly coupled; the classic example being permanent estimation and sampling random bipartite matchings.This paper investigates the problem of sampling uniform random cliques, and unearths an interesting separation of approximately uniform sampling from approximate counting. Let’s consider the fundamental problem of edge and triangle counting, in graphs of constant arboricity. This class, which contains all minor-closed families, has deep connections to clique counting; clique counting can be done in linear time for bounded arboricity graphs.) The input connected graph $G$ has $n$ vertices, $m$ edges, and $t$ triangles. We assume the standard query model, with uniform random vertices, neighbor, and degree queries. Our aim is to get $(1+\varepsilon)$-approximations for $m$ and $t$ (in general, any $k$-clique count). It is known from previous work that edge estimation can be done in $O(1)$ time and triangle estimation can be done in $O(n/t)$ time (ignoring log and $\varepsilon$ dependencies). Moreover, these bounds is optimal. Can one get similar bounds for approximately sampling a uniform random edge/triangle? Previous work of Eden-Rosenbaum achieve such a bound for sampling edges. Surprisingly, this paper shows that triangle sampling is fundamentally harder, and requires $\Omega(n^{3/4}/\sqrt{t})$ queries. The main result gives a precise and optimal bound for uniform sampling $k$-cliques in graphs of arboricity $\alpha$.

Testing and reconstruction via decision trees by Guy Blanc, Jane Lange, and Li-Yang Tan (arXiv). A property tester is an (sublinear) algorithm that distinguishes an object with a property from those that are far. Reconstruction is a much stronger sublinear algorithm, which actually provides query access to the closest object. Suppose we have query access to an $n$-variate Boolean function $f$, and the property $\mathcal{P}$ of interest is size-$s$ decision trees. A reconstructor would give query access to a function $g \in \mathcal{P}$, such that $dist(f,g) = O(OPT_f)$, where $OPT_f$ is the distance of $f$ to $\mathcal{P}$. Observe that a reconstructor automatically gives a distance estimator (just estimate $dist(f,g)$ by random sampling). One of the main results of this paper is a (weaker, bicriteria) reconstructor, where $g$ is guaranteed to have decision-tree complexity $s^{O(\log^2s)}$. Each query to $g$ is computed in $poly(\log s, 1/\varepsilon)\cdot n$ time. This reconstructor leads to a number of testing results, and reconstructors for other properties like low Fourier degree. Another interesting result proven: if there exists an efficient reconstructor that gets $g$ of size $s$, then there exists an efficient proper learner for decision trees (a big open problem).

Testing Product Distributions: A Closer Look by Arnab Bhattacharyya, Sutanu Gayen, Saravanan Kandasamy, and N. V. Vinodchandran (arXiv). This paper is on distribution testing, where all distributions involved are promised to be product distributions. Let $P, Q$ be distributions over $\Sigma^n$, where $Sigma$ is some (small) finite set. The problems studied are identity testing ($Q$ is known/fixed, and one gets samples from unknown $P$) and closeness testing (both unknown). It is known from previous work that these problems can basically be solved with $O(n|\Sigma|\varepsilon^{-2})$ samples. Note that the domain size is exponential in $n$; thus, these bounds give an exponential improvement over identity/closeness testing in the vanilla (non-product) setting. Tolerant testing is of special significance here. Because the domain is so large, it makes sense to allow for some distance even in the “yes” case. The main result of this paper is to completely map out the complexities for identity/closeness testing where we wish to distinguish $d_1(P,Q) < \varepsilon_1$ from $d_2(P,Q) > \varepsilon_2$, where $d_1, d_2$ are potentially different distance measures. For example, the paper considers total variation (TV), KL, Hellinger, and $\chi^2$ distances. An important improvement achieved is for identity testing where $d_1$ is Hellinger and $d_2$ is TV distance: for certain values of $\varepsilon_1, \varepsilon_2$, one can get a tester of optimal sample complexity $O(\sqrt{n|\Sigma|})$. Previous bounds only gave a $O(n|\Sigma|)$ sample complexity. There are a number of new lower bounds for different combinations of $d_1, d_2$ distance pairs.

by Seshadhri at January 05, 2021 05:26 AM UTC

### Postdoctoral Fellow at The University of Texas at Austin (apply by January 25, 2021)

from CCI: jobs

The Computer Science Dept. at UT Austin invites applications for a Postdoctoral Fellow in theoretical computer science for the 21-22 academic year to work with David Zuckerman. Research interests should overlap with his: pseudorandomness, computational complexity, coding theory, and more. Review of applicants will begin on Jan. 25, but applications will be accepted until the position is filled.

Website: https://utaustin.wd1.myworkdayjobs.com/UTstaff/job/UT-MAIN-CAMPUS/Postdoctoral-Fellow_R_00011430-1
Email: maguilar@cs.utexas.edu

by shacharlovett at January 04, 2021 08:52 PM UTC

### Tenure -track Assistant Professor at Bocconi University (apply by January 8, 2021)

from CCI: jobs

Bocconi University, based in Milan, is Italy’s premiere private university. We invite applications for two tenure-track assistant professor positions in Computer Science. Compensation, which is negotiated on a case-by-case basis, will be highly competitive. The language of instruction is English.

Website: https://lucatrevisan.wordpress.com/2020/12/11/bocconi-is-looking-for-assistant-professors-in-computer-science/
Email: l.trevisan@unibocconi.it

by shacharlovett at January 04, 2021 05:32 PM UTC

### Application Deadline for Assistant Professor Positions at Bocconi

from Luca Trevisan

Bocconi University is looking for two Tenure-Track Assistant Professors in Computer Science, for positions starting in Fall 2021: the application deadline is this Friday, January 8, 2021.

### Assistant Professor in Computer Science (with possiblity of Tenure Track) at Department of Computer Science, Technical Faculty of IT, Aalborg University (apply by February 10, 2021)

from CCI: jobs

Department of Computer Science at Aalborg University’s Technical Faculty of IT is looking to appoint a number of Assistant Professors (with possibility of tenure-track) for our Aalborg and Copenhagen Campus, starting June 1, 2021 or soon after this.

Website: https://www.stillinger.aau.dk/vis-stilling/?vacancy=1136722
Email: jesper@cs.aau.dk

by shacharlovett at January 04, 2021 10:12 AM UTC

### Associate Professor in Computer Science at Department of Computer Science, Technical Faculty of IT, Aalborg University (apply by February 10, 2021)

from CCI: jobs

Department of Computer Science at Aalborg University’s Technical Faculty of IT is looking to appoint a number of Associate Professors for our Aalborg and Copenhagen Campus, starting June 1, 2021 or soon after this.

Website: https://www.stillinger.aau.dk/vis-stilling/?vacancy=1136728
Email: jesper@cs.aau.dk

by shacharlovett at January 04, 2021 10:08 AM UTC

### TR21-001 | Computation Over the Noisy Broadcast Channel with Malicious Parties | Raghuvansh Saxena, Gillat Kol, Klim Efremenko, Dmitry Paramonov

from ECCC papers

We study the $n$-party noisy broadcast channel with a constant fraction of malicious parties. Specifically, we assume that each non-malicious party holds an input bit, and communicates with the others in order to learn the input bits of all non-malicious parties. In each communication round, one of the parties broadcasts a bit to all other parties, and the bit received by each party is flipped with a fixed constant probability (independently for each recipient). How many rounds are needed? Assuming there are no malicious parties, Gallager gave an $\mathcal{O}(n \log \log n)$-round protocol for the above problem, which was later shown to be optimal. This protocol, however, inherently breaks down in the presence of malicious parties. We present a novel $n \cdot \tilde{\mathcal{O}}\left(\sqrt{\log n}\right)$-round protocol, that solves this problem even when almost half of the parties are malicious. Our protocol uses a new type of error correcting code, which we call a locality sensitive code and which may be of independent interest. Roughly speaking, these codes map "close" messages to "close" codewords, while messages that are not close are mapped to codewords that are very far apart. We view our result as a first step towards a theory of property preserving interactive coding, i.e., interactive codes that preserve useful properties of the protocol being encoded. In our case, the naive protocol over the noiseless broadcast channel, where all the parties broadcast their input bit and output all the bits received, works even in the presence of malicious parties. Our simulation of this protocol, unlike Gallager's, preserves this property of the original protocol.

### Postdoc in Parallel and Distributed Algorithms at University of Warwick (apply by January 13, 2021)

from CCI: jobs

In connection with a research grant of Professor Artur Czumaj, we are looking for excellent candidates for a postdoc position in the area of the design and analysis of parallel and distributed algorithms, focusing on fundamental research in this area.