Authors: Batuhan Arasli, Sennur Ulukus
Abstract: We propose a novel infection spread model based on a random connection graph
which represents connections between $n$ individuals. Infection spreads via
connections between individuals and this results in a probabilistic cluster
formation structure as well as a non-i.i.d. (correlated) infection status for
individuals. We propose a class of two-step sampled group testing algorithms
where we exploit the known probabilistic infection spread model. We investigate
the metrics associated with two-step sampled group testing algorithms. To
demonstrate our results, for analytically tractable exponentially split cluster
formation trees, we calculate the required number of tests and the expected
number of false classifications in terms of the system parameters, and identify
the trade-off between them. For such exponentially split cluster formation
trees, for zero-error construction, we prove that the required number of tests
is $O(\log_2n)$. Thus, for such cluster formation trees, our algorithm
outperforms any zero-error non-adaptive group test, binary splitting algorithm,
and Hwang's generalized binary splitting algorithm. Our results imply that, by
exploiting probabilistic information on the connections of individuals, group
testing can be used to reduce the number of required tests significantly even
when infection rate is high, contrasting the prevalent belief that group
testing is useful only when infection rate is low.
Authors: Jan van den Brand, Yin Tat Lee, Yang P. Liu, Thatchaphol Saranurak, Aaron Sidford, Zhao Song, Di Wang
Abstract: In this paper we provide new randomized algorithms with improved runtimes for
solving linear programs with two-sided constraints. In the special case of the
minimum cost flow problem on $n$-vertex $m$-edge graphs with integer
polynomially-bounded costs and capacities we obtain a randomized method which
solves the problem in $\tilde{O}(m+n^{1.5})$ time. This improves upon the
previous best runtime of $\tilde{O}(m\sqrt{n})$ (Lee-Sidford 2014) and, in the
special case of unit-capacity maximum flow, improves upon the previous best
runtimes of $m^{4/3+o(1)}$ (Liu-Sidford 2020, Kathuria 2020) and
$\tilde{O}(m\sqrt{n})$ (Lee-Sidford 2014) for sufficiently dense graphs.
For $\ell_1$-regression in a matrix with $n$-columns and $m$-rows we obtain a randomized method which computes an $\epsilon$-approximate solution in $\tilde{O}(mn+n^{2.5})$ time. This yields a randomized method which computes an $\epsilon$-optimal policy of a discounted Markov Decision Process with $S$ states and $A$ actions per state in time $\tilde{O}(S^2A+S^{2.5})$. These methods improve upon the previous best runtimes of methods which depend polylogarithmically on problem parameters, which were $\tilde{O}(mn^{1.5})$ (Lee-Sidford 2015) and $\tilde{O}(S^{2.5}A)$ (Lee-Sidford 2014, Sidford-Wang-Wu-Ye 2018).
To obtain this result we introduce two new algorithmic tools of independent interest. First, we design a new general interior point method for solving linear programs with two sided constraints which combines techniques from (Lee-Song-Zhang 2019, Brand et al. 2020) to obtain a robust stochastic method with iteration count nearly the square root of the smaller dimension. Second, to implement this method we provide dynamic data structures for efficiently maintaining approximations to variants of Lewis-weights, a fundamental importance measure for matrices which generalize leverage scores and effective resistances.
Authors: Linus Hamilton, Ankur Moitra
Abstract: In recent years there has been significant effort to adapt the key tools and
ideas in convex optimization to the Riemannian setting. One key challenge has
remained: Is there a Nesterov-like accelerated gradient method for geodesically
convex functions on a Riemannian manifold? Recent work has given partial
answers and the hope was that this ought to be possible.
Here we dash these hopes. We prove that in a noisy setting, there is no analogue of accelerated gradient descent for geodesically convex functions on the hyperbolic plane. Our results apply even when the noise is exponentially small. The key intuition behind our proof is short and simple: In negatively curved spaces, the volume of a ball grows so fast that information about the past gradients is not useful in the future.
Authors: Manoj Kumar
Abstract: In this paper, a necessary and sufficient condition for the satisfiability of a
boolean formula, in CNF, is determined. It is found that the maximum cardinality of a
satisfiable boolean formula increases exponentially with the number of variables.
Consequently, any algorithm requires exponential time, in the worst case, depending on
the number of variables in a boolean formula, to check the satisfiability of the given
boolean formula. This proves the non-existence of a polynomial-time algorithm for the
satisfiability problem. Since satisfiability is an NP-complete problem, the
non-existence of a polynomial-time algorithm to solve it excludes satisfiability from
the class P, which implies that P is not equal to NP. Further, the necessary and
sufficient condition can be used to optimize existing algorithms; in some cases, the
unsatisfiability of a given boolean formula can be determined in polynomial time. For
this purpose, a novel function is defined that can be used to determine the cardinality
of a given boolean formula, and the occurrences of a literal in the given formula, in
polynomial time.
Authors: Grzegorz Gluch, Michael Kapralov, Silvio Lattanzi, Aida Mousavifar, Christian Sohler
Abstract: Given a graph $G$ that can be partitioned into $k$ disjoint expanders with
outer conductance upper bounded by $\epsilon\ll 1$, can we efficiently
construct a small space data structure that allows quickly classifying vertices
of $G$ according to the expander (cluster) they belong to? Formally, we would
like an efficient local computation algorithm that misclassifies at most an
$O(\epsilon)$ fraction of vertices in every expander. We refer to such a data
structure as a \textit{spectral clustering oracle}. Our main result is a
spectral clustering oracle with query time $O^*(n^{1/2+O(\epsilon)})$ and
preprocessing time $2^{O(\frac{1}{\epsilon} k^4 \log^2(k))}
n^{1/2+O(\epsilon)}$ that provides misclassification error $O(\epsilon \log k)$
per cluster for any $\epsilon \ll 1/\log k$. More generally, query time can be
reduced at the expense of increasing the preprocessing time appropriately (as
long as the product is about $n^{1+O(\epsilon)}$) -- this in particular gives a
nearly linear time spectral clustering primitive. The main technical
contribution is a sublinear time oracle that provides dot product access to the
spectral embedding of $G$ by estimating distributions of short random walks
from vertices in $G$. The distributions themselves provide a poor approximation
to the spectral embedding, but we show that an appropriate linear
transformation can be used to achieve high precision dot product access. We
then show that dot product access to the spectral embedding is sufficient to
design a clustering oracle. At a high level our approach amounts to hyperplane
partitioning in the spectral embedding of $G$, but crucially operates on a
nested sequence of carefully defined subspaces in the spectral embedding to
achieve per cluster recovery guarantees.
Authors: Sven Fiergolla, Petra Wolf
Abstract: The Run Length Encoding (RLE) compression method is a long-standing, simple
lossless compression scheme which is easy to implement and achieves good
compression on input data containing repeating consecutive symbols. In its
pure form, RLE is not applicable to natural text or other input data with short
sequences of identical symbols. We present a combination of preprocessing steps
that turn arbitrary input data in a byte-wise encoding into a bit-string which
is highly suitable for RLE compression. The main idea is to first read all most
significant bits of the input byte-string, followed by the second most
significant bit, and so on. We combine this approach with dynamic byte
remapping as well as a Burrows-Wheeler-Scott transform at the byte level.
Finally, we apply a Huffman Encoding on the output of the bit-wise RLE encoding
to allow for more dynamic code word lengths for encoding runs of the RLE. With
our technique we achieve lossless compression which is better than
standard RLE compression by a factor of 8 on average.
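The vertical bit-reading at the heart of this approach can be sketched in a few lines. This is only a minimal illustration of the bit-plane reordering plus bit-wise RLE, not the authors' full pipeline (which also includes dynamic byte remapping, the Burrows-Wheeler-Scott transform, and Huffman coding):

```python
def bit_planes(data: bytes) -> str:
    # Vertical reading: emit bit 7 (most significant) of every byte,
    # then bit 6 of every byte, and so on down to bit 0.
    return "".join(str((b >> (7 - i)) & 1) for i in range(8) for b in data)

def rle(bits: str):
    # Collapse maximal runs of equal bits into (symbol, run length) pairs.
    runs, i = [], 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs.append((bits[i], j - i))
        i = j
    return runs

# Every ASCII byte has a 0 in its top bit, so the first bit plane of
# ASCII text collapses into a single long run -- exactly what RLE likes.
runs = rle(bit_planes(b"hello world"))  # runs[0] == ("0", 11)
```

The first bit plane of the 11-byte input is one run of eleven zeros, which illustrates why the reordering makes otherwise RLE-hostile text highly compressible.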
Authors: Ashish Goel, Anilesh Kollagunta Krishnaswamy, Kamesh Munagala
Abstract: We study social choice rules under the utilitarian distortion framework, with
an additional metric assumption on the agents' costs over the alternatives. In
this approach, these costs are given by an underlying metric on the set of all
agents plus alternatives. Social choice rules have access to only the ordinal
preferences of agents but not the latent cardinal costs that induce them.
Distortion is then defined as the ratio between the social cost (typically the
sum of agent costs) of the alternative chosen by the mechanism at hand, and
that of the optimal alternative chosen by an omniscient algorithm. The
worst-case distortion of a social choice rule is, therefore, a measure of how
close it always gets to the optimal alternative without any knowledge of the
underlying costs. Under this model, it has been conjectured that Ranked Pairs,
the well-known weighted-tournament rule, achieves a distortion of at most 3
[Anshelevich et al. 2015]. We disprove this conjecture by constructing a
sequence of instances which shows that the worst-case distortion of Ranked
Pairs is at least 5. Our lower bound on the worst case distortion of Ranked
Pairs matches a previously known upper bound for the Copeland rule, proving
that in the worst case, the simpler Copeland rule is at least as good as Ranked
Pairs. And as long as we are limited to (weighted or unweighted) tournament
rules, we demonstrate that randomization cannot help achieve an expected
worst-case distortion of less than 3. Using the concept of approximate
majorization within the distortion framework, we prove that Copeland and
Randomized Dictatorship achieve low constant factor fairness-ratios (5 and 3
respectively), which is a considerable generalization of similar results for
the sum of costs and single largest cost objectives. In addition to all of the
above, we outline several interesting directions for further research in this
space.
This semester I am teaching a seminar on the theory of machine learning. For the first lecture, I would like to talk about what is the theory of machine learning. I decided to write this (very rough!) blog post mainly to organize my own thoughts.
In any science, ML included, the goals of theory and practice are not disjoint. We study the same general phenomena from different perspectives. We can think of questions in computation at a high level as trying to map the unknown terrain of the computational cost of achieving certain quality of output, given some conditions on the input.
Practitioners aim to find points on this terrain’s practically relevant regions, while theoreticians are more interested in its broad landscape, including at points far out into infinity. Both theoreticians and practitioners care about discontinuities, when small changes in one aspect correspond to large changes in another. As theoreticians, we care about computational/statistical tradeoffs, particularly about points where a small difference in quality yields an exponential difference in time or sample complexity. In practice, models such as GPT-3 demonstrate that a quantitative increase in computational resources can correspond to a qualitative increase in abilities.
Since theory and practice study the same phenomenon in different ways, their relation can vary. Sometimes theory is forward-looking or prescriptive – giving “proof of concepts” results that can inspire future application or impossibility results that can rule out certain directions. Sometimes it is backward-looking or descriptive – explaining phenomena uncovered by experiments and putting them in a larger context.
Before we talk about what machine-learning theory is, we should talk about what machine learning is. It's common to see claims of the form "machine learning is just X", where X ranges over statistics, optimization, approximation theory, and probably many more.
I view machine learning as addressing the following setup:
There is a system that interacts with the world around it in some manner, and we have some sense of whether this interaction is successful or unsuccessful.
For example, a self-driving car is “successful” if it gets passengers from point A to point B safely, quickly, and comfortably. It is “unsuccessful” if it gets into accidents, takes a long time, or drives in a halting or otherwise uncomfortable manner.
The system is the “machine.” It is “learning” since we adapt it so that it becomes more successful. Ideally, we would simply pick the most successful system. However, what we actually do is at least thrice-removed from this ideal: we optimize a metric we can measure rather than the success we actually care about, we search over a restricted family of models rather than over all possible systems, and we use a heuristic search algorithm that is not guaranteed to find the best model in that family.
The magic of machine learning is that sometimes (though not always!) we can still get good results despite these gaps. Much of the theory of machine learning is about understanding under what conditions can we bridge some of these gaps.
The above discussion explains the “machine learning is just X” takes. The expressivity of our models falls under approximation theory. The gap between the success we want to achieve and the metric we can measure often corresponds to the difference between population and sample performance, which becomes a question of statistics. The study of our algorithms’ performance falls under optimization.
The metric gap is perhaps the widest of them all. While in some settings (e.g., designing systems to play video games) we can directly measure a system’s success, this is typically the exception rather than the rule. Often, the success we care about cannot be measured directly, and we must settle for proxy metrics that we can actually compute.
Nevertheless, the hope is that if we optimize the hell out of the metrics we can measure, the system will perform well in the way we care about.
A priori, it is not clear that this should be the case, and indeed it is not always so. Part of the magic of auto-differentiation is that it allows optimization of more complex metrics that match more closely the success metric we have in mind. But, except in very special circumstances, the two metrics can never be the same. The mismatch between the goal we have and the metric we optimize can manifest in one of two general ways: Goodhart’s law, where optimizing the measured metric ends up diverging from the goal we actually care about, and the Anna Karenina principle, where all successful systems turn out to be alike, so optimizing the measured metric serves the goal as well.
At the moment, we have no good way to predict when a system will behave according to Goodhart’s law versus the Anna Karenina principle. It seems that when it comes to learning representations, machine learning systems follow the Anna Karenina principle: all successful models tend to learn very similar representations of their data. In contrast, when it comes to making decisions, we get manifestations of Goodhart’s law, and optimizing for one metric can give very different results than the other.
The model gap is the gap between the set of all possible systems and the set that we actually optimize over. Even if a system can be captured as a finite function (say a function mapping pixel images to classes), a simple counting argument shows that the vast majority of these functions require far too many gates to be efficiently computable by any reasonable computational model. (In case you are curious, the counting argument also works for quantum circuits.) Hence we necessarily have to deal with a subclass of all possible functions.
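The counting argument alluded to above is Shannon's classic counting bound; a sketch with loose constants: a circuit with $s$ gates can be written down by listing, for each gate, its operation and its at most two input wires, at a cost of $O(\log s)$ bits per gate. Hence

$$\#\{\text{circuits with } s \text{ gates}\} \le 2^{O(s \log s)} \qquad\text{vs.}\qquad \#\{f : \{0,1\}^n \to \{0,1\}\} = 2^{2^n},$$

and unless $s \log s = \Omega(2^n)$, i.e. $s = \Omega(2^n/n)$, all but a vanishing fraction of the $2^{2^n}$ functions have no circuit with $s$ gates.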
This raises the question of whether we should expect a realizable system to exist, let alone whether we can find it. Often machine learning is used in artificial intelligence applications, where we are trying to mimic human performance. In such cases, human performance is an “existence proof” that some reasonably-sized circuit is successful at the goal. But whether this circuit can be embedded in our model family is still an open question.
Remember when I said that sometimes it’s not about the journey but about the destination? I lied. The algorithm gap implies that in modern machine learning, there is a certain sense in which it is always about the journey. In modern machine learning, the system is typically parameterized by a vector $w$ of real numbers, and the metric we optimize corresponds to minimizing some loss function $L(w)$. We use local search algorithms, which start off at some vector $w_0$ chosen via some distribution, and then iteratively take small steps $w_{t+1} = w_t + \delta_t$. For each $t$, the step $\delta_t$ is chosen from some distribution over small vectors such that (hopefully) in expectation $L(w_{t+1}) < L(w_t)$.
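The local search loop above can be sketched in a few lines. This is a crude accept-if-better variant with Gaussian proposals, not any particular practical optimizer, and all names here are illustrative:

```python
import random

def local_search(loss, w, step=0.1, iters=300):
    # Propose a small random step; keep it only if it does not
    # increase the loss (so the loss is non-increasing over time).
    for _ in range(iters):
        candidate = [wi + random.gauss(0, step) for wi in w]
        if loss(candidate) <= loss(w):
            w = candidate
    return w

random.seed(0)
quad = lambda w: sum(wi * wi for wi in w)    # a toy loss L(w) = ||w||^2
w0 = [random.uniform(-3, 3) for _ in range(2)]  # random starting point
w = local_search(quad, w0)                      # ends up near the minimum
```

Stochastic gradient descent replaces the blind proposal with a step along a noisy estimate of $-\nabla L(w)$, but the structure — start somewhere random, take small steps that decrease the loss in expectation — is the same.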
Modern machine learning systems are non-convex, which means that the final point we end up in depends on the starting point, the algorithm, and the randomness. Hence, we can’t have a neat “separation of concerns” and decouple the architecture, metric, and algorithm. When we say that a system obeys the “Anna Karenina principle,” we really mean that for natural algorithms optimizing natural metrics on natural architectures, successful outputs (ones that do well on the metric) are likely to be similar to one another. The use of the “natural” qualifier is a warning sign that we don’t fully understand the conditions under which this happens, but it is clear that some conditions are necessary. Due to non-convexity, it is typically possible to find a “bad minimum” that would be very good on some specific metric but terrible on other ones.
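The dependence on the starting point is easy to see even in one dimension. The sketch below (a toy example, not a realistic training setup) runs plain gradient descent on the non-convex function $f(x) = (x^2 - 1)^2$, which has two global minima at $x = \pm 1$; which one we reach is determined entirely by where we start:

```python
def gd(x, lr=0.05, iters=500):
    # Gradient descent on f(x) = (x^2 - 1)^2, whose two global minima
    # sit at x = -1 and x = +1.  f'(x) = 4x(x^2 - 1).
    for _ in range(iters):
        x -= lr * 4 * x * (x * x - 1)
    return x

# Same algorithm, same loss, different initialization -> different minimum.
left, right = gd(-0.5), gd(0.5)   # left -> -1, right -> +1
```

For a convex loss both runs would converge to the same point; non-convexity is what entangles the architecture, the metric, and the algorithm.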
Given the above discussion, what kind of theoretical results should we expect in machine learning? Let’s try to classify the type of results we see in general theoretical computer science. In the discussion below, I will focus on the limitations of these theoretical results and use humorous names, but make no mistake: all of these categories correspond to valuable theoretical insights.
The quicksort algorithm provides a canonical example of algorithm analysis. At this point, we have a reasonably complete understanding of quicksort. We can analyze the distribution of its running time down to the constant.
Hence we have an efficient algorithm, used in practice, with rigorously proven theorems that precisely characterize its performance. This is the “gold standard” of analysis, but also an unrealistic and unachievable goal in almost any other setting.
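For concreteness, here is the algorithm under discussion in its standard randomized form (a textbook formulation, not tied to any particular library implementation); with uniformly random pivots it makes about $2 n \ln n$ comparisons in expectation:

```python
import random

def quicksort(a):
    # Randomized quicksort: pick a uniformly random pivot, partition,
    # and recurse.  Expected running time is O(n log n).
    if len(a) <= 1:
        return a
    pivot = random.choice(a)
    less    = [x for x in a if x < pivot]
    equal   = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return quicksort(less) + equal + quicksort(greater)
```

It is exactly this combination of simplicity and randomness that makes the full distribution of its running time amenable to analysis.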
A more common situation is that we have algorithms that work in practice and algorithms with rigorous analysis, but they are not the same algorithms. A canonical example is linear programming. The simplex algorithm often works well in practice but was known to take exponential time on certain instances. For a long time, it was an open question whether or not there is an algorithm for linear programming that takes polynomial time in the worst case.
In 1979, Khachiyan gave the Ellipsoid algorithm that runs in polynomial time, but with a polynomial so large that it is impractical. Still, the Ellipsoid algorithm was a great breakthrough, giving hope for better algorithms and a new approach for defining progress measures for linear programming instances. Indeed, in 1984, Karmarkar came up with the interior-point algorithm, with much better time dependence.
The interior-point algorithm is practical and often outperforms the simplex method on large enough instances. However, the algorithm as implemented is not identical to the one analyzed: implementations use different step sizes and other heuristics for which we do not have a precise analysis.
Nevertheless, even if (unlike quicksort) the rigorously analyzed algorithm is not identical to the practical implementations, the story of linear programming shows how crucial “proofs of concept” can be to introducing new ideas and techniques.
The flip side of “proof of concept” results are impossibility results that rule out even “proof of concept” algorithms. Impossibility results always come with fine print and caveats (for example, NP-hardness results always refer to worst-case complexity). However, they still teach us about the structure of the problem and help define the contours of what we can expect to achieve.
“End to end analysis” is when we can prove the guarantees we need on algorithms we actually want to use. “Proof of concept” is when we prove the guarantees we need on impractical algorithms. A “character witness” result is when we prove something positive about an algorithm people use in practice, even if that positive property falls short of the guarantees we actually want.
While the term “character witness” sounds derogatory, such results can sometimes yield truly profound insights. A canonical example again comes from linear programming. While the simplex algorithm can be exponential in the worst case, in a seminal work, Spielman and Teng showed that it does run in polynomial time if the input is slightly perturbed: this is the so-called smoothed analysis. While the actual polynomial and the level of perturbation do not yield practical bounds, this is still an important result. It gives formal meaning to the intuition that the simplex method only fails on “pathological” instances, and it initiated a new mode of analyzing algorithms between worst-case and average-case complexity.
In machine learning, a “character witness” result can take many forms. For example, some “character witness” results are analyses of algorithms under certain assumptions on the data that, even if not literally true, seem like they could be “morally true”. Another type of “character witness” result shows that an algorithm would yield the right results if it is allowed to run for an infinite or exponentially long time. Evaluating the significance of such results can be challenging. The main question is whether the analysis teaches us something we didn’t know before.
A third type of “character witness” result consists of quantitative bounds that are too weak for practical use but are still non-vacuous: for example, approximation algorithms with too big an approximation factor, or generalization bounds with too big a guaranteed gap. In such cases, one would hope that these bounds are at least correlated with performance: algorithms with better bounds will also have higher-quality output.
The name “toy problem” also sounds derogatory, but toy problems or toy models can be extremely important in science, and machine learning is no different. A toy model is a way to abstract the salient issues of a model to enable analysis. Results on “toy models” can teach us about general principles that hold in more complex models.
When choosing a toy model, it is important not to mistake models that share superficial similarity with models that keep the salient issues we want to study. For example, consider the following two variants of deep neural networks: “ultra-wide” networks, whose width is taken to infinity, and deep linear networks, in which the activation functions are linear.
Since the practical intuition is that bigger models are better, it may seem that such “ultra-wide” models are not toy models at all and should capture state-of-the-art deep networks. However, the neural tangent kernel results show that these models become kernel models that do not learn their representation at all.
Intuitively, making activation functions linear seems pointless since the composition of linear functions is linear, and hence such linear networks are no more expressive than a one-layer linear net. Thus such linear networks seem like too much of a “toy model” (maybe a “happy meal model”?). Yet, it turns out that despite their limited expressivity, deep linear networks capture an essential feature of deep networks: the bias induced by the gradient descent algorithm. For example, running gradient descent on a depth-two linear network translates to regularizing by a proxy of rank. In general, gradient descent on a deep linear network induces a very different geometry on the manifold of linear functions.
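The rank bias shows up already in a tiny experiment. The sketch below (entirely synthetic: a rank-1 target, random Gaussian measurements, small balanced initialization) trains a depth-2 linear network $W = UV$ by gradient descent; the matrix it recovers is numerically rank 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-1 target matrix, observed through random linear measurements.
W_star = np.outer(rng.normal(size=3), rng.normal(size=3))
X = rng.normal(size=(20, 3))
Y = X @ W_star

# Depth-2 linear network W = U @ V, trained by gradient descent on the
# squared loss from a small initialization (the regime in which the
# low-rank bias of gradient descent appears).
U = 0.01 * rng.normal(size=(3, 3))
V = 0.01 * rng.normal(size=(3, 3))
lr = 0.01
for _ in range(20000):
    R = X @ (U @ V) - Y              # residuals
    gU = X.T @ R @ V.T / len(X)      # dL/dU for L = ||XUV - Y||^2 / (2m)
    gV = U.T @ (X.T @ R) / len(X)    # dL/dV
    U -= lr * gU
    V -= lr * gV

sv = np.linalg.svd(U @ V, compute_uv=False)  # trailing values near zero
```

A direct (one-layer) parameterization has no such preference; the bias comes from the factored parameterization together with the gradient descent trajectory.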
Hence, although the first model seems much more “real” than the second one, there are some deep learning questions where the second model is more relevant.
One might think that the whole point of theory is to provide rigorous results: if we are not going to prove theorems, we might as well use benchmarks. Yet, there are important theoretical insights we can get from experiments. A favorite example of mine is the Zhang et al. result that deep neural networks can fit random labels. This simple experiment ruled out in one fell swoop a whole direction for proving generalization bounds on deep nets. A theory work does not have to involve theorems: well-chosen experiments can provide important theoretical insights.
The above was a stream-of-consciousness and very rough personal overview of questions and results in ML theory. In the coming seminar we will see results of all the types above, as well as many open questions.
Acknowledgements: Thanks to Yamini Bansal and Preetum Nakkiran for helpful comments (though they are not to blame for any mistakes!)
Rigid heptagon linkage (\(\mathbb{M}\)). Improvements in the size of unit distance graphs whose unit distance representation is forced to contain a regular heptagon, from 59 to 35 edges. Based on a math stackexchange question. See also an old page on the same problem from Erich Friedman’s site, moved from its old location at Stetson University.
How Graphviz thinks the USA is laid out (\(\mathbb{M}\), via).
Adding comments to your static blog with Mastodon (\(\mathbb{M}\), via). It’s a cute method for using JavaScript to pull the comments onto the blog page itself, but the method I’ve been using (just add a link from the page to the Mastodon thread and let people follow it from there to Mastodon) seems simpler.
xkcd on depth-first and breadth-first search (\(\mathbb{M}\)), just in time for my lecture reviewing these algorithms.
The many ways of splitting a rectangle — or, how to use mathematics to make using Zoom much more complicated for no particular benefit (\(\mathbb{M}\), via). I am very fond of rectangle partition problems but I don’t want to have to think about them just to talk to multiple people online.
Doubly triangular number (\(\mathbb{M}\)), new Wikipedia article on triangular numbers with a triangular-number index. These numbers come up in counting pairs of pairs of things, for instance in the illustration, which describes colorings of the corners of a square (up to symmetry) as pairs of colorings of pairs of opposite corners.
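The definition is a one-liner to compute (the indexing convention here starts at $n = 0$):

```python
def triangular(n):
    # n-th triangular number: T(n) = n(n+1)/2
    return n * (n + 1) // 2

def doubly_triangular(n):
    # a triangular number whose index is itself triangular: T(T(n))
    return triangular(triangular(n))

first_few = [doubly_triangular(n) for n in range(5)]  # 0, 1, 6, 21, 55
```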
How assigning impossible-to-solve problems can get mathematics students (or researchers) to take a step back and look at the bigger picture (\(\mathbb{M}\)). Patrick Honner in Quanta. I think it needs interactivity, though. It would be mean to do this on a problem set or exam.
Mathematical Art Exhibition from the January 2020 Joint Mathematical Meetings (\(\mathbb{M}\), via).
My latest puzzle (a belated Christmas present to myself) is the Hanayama Twist (\(\mathbb{M}\)). I’m very pleased with it: an elegant symmetric two-piece design, solid feel in the hand, and a solution that is surprisingly complicated but not tediously long. My only complaint is that the solution is very linear, with almost no ways to go wrong if you keep moving on from things you’ve already done. Anyway here’s a map I drew to help me, also used as an example of an implicit graph in my graph algorithms lectures.
In pre-covid lectures I hand-wrote notes as I talked, both to keep content fresh and to avoid racing through it too quickly. When we locked down I tried pre-recorded lectures with prepared slides, but it took too long to record and I missed the spontaneity. This term I’m back to live zoom lectures with prepared slides, and it seems to work: zoom chat keeps it spontaneous and I haven’t found myself going faster than when we were in person. Lesson learned, just in time to (I hope) unlock in fall. (\(\mathbb{M}\); see discussion for Pat Morin’s home whiteboard cat interruption.)
Two useful folds and a decorative one (\(\mathbb{M}\)). Paula Beardall Krieg. The useful ones seal a foil snack bag and wrap a sandwich securely in paper, without needing any fasteners.
Thoughts on the Pythagorean theorem (\(\mathbb{M}\)). Or, what did Euclid actually mean by saying that two squares are equal to a third square? And how does this view relate to type-theoretic foundations? From the xena automatic theorem-proving project. Both the name and the horrifying illustrations for the blog posts come from the author’s daughter. For some heavier going, see the post on perfectoid spaces and sphere eversion.
Regular star patterns for expanded US flags (\(\mathbb{M}\), via). I suspect that the likelihood of quickly adding Puerto Rico or the District of Columbia as states is not high, but just in case, there are some nice regular point arrangements available.
The figure below shows two pentagons inscribed in each other: each vertex of one pentagon lies on a line through a side of the other pentagon (\(\mathbb{M}\)). You can’t do this with quadrilaterals in the Euclidean plane, but you can in the complex projective plane. From a thread of fun math facts for Christian Lawson-Perfect’s birthday.
New preprint claims a proof of the Erdős-Faber-Lovász conjecture for all sufficiently large \(k\) (\(\mathbb{M}\), via), by Dong Yeap Kang, Tom Kelly, Daniela Kühn, Abhishek Methuku, and Deryk Osthus. The conjecture states that the union of \(k\) edge-disjoint \(k\)-cliques is \(k\)-colorable. See its Wikipedia article for a real-world-ish formulation with faculty committees as cliques and chairs in a common meeting room as colors.
This is a post about a new paper Online Multivalid Learning: Means, Moments, and Prediction Intervals, that is joint work with Varun Gupta, Christopher Jung, Georgy Noarov, and Mallesh Pai. It is cross-posted to the new TOC4Fairness blog.
Suppose you go and train the latest, greatest machine learning architecture to predict something important. Say (to pick an example entirely out of thin air) you are in the midst of a pandemic, and want to predict the severity of patients' symptoms in two days' time, so as to triage scarce medical resources. Since you will be using these predictions to make decisions, you would like them to be accurate in various ways: for example, at the very least, you will want your predictions to be calibrated, and you may also want to be able to accurately quantify the uncertainty of your predictions (say with 95% prediction intervals). It is a fast-moving situation, and data is coming in dynamically --- and you need to make decisions as you go. What can you do?
The first thing you might do is ask on twitter! What you will find is that the standard tool for quantifying uncertainty in settings like this is conformal prediction. The conformal prediction literature has a number of elegant techniques for endowing arbitrary point prediction methods with marginal prediction intervals: i.e. intervals $(\ell(x), u(x))$ such that over the randomness of some data distribution over labelled examples $(x,y)$: $\Pr_{(x,y)}\left[y \in [\ell(x), u(x)]\right] \approx 0.95$. These would be 95% marginal prediction intervals --- but in general you could pick your favorite coverage probability $1-\delta$.
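For intuition, here is a minimal split-conformal sketch (the textbook recipe, not the method developed in our paper) that wraps an arbitrary point predictor in marginal 95% intervals, on entirely synthetic data:

```python
import numpy as np

def split_conformal(predict, X_cal, y_cal, X_test, delta=0.05):
    # Conformity scores on a held-out calibration set:
    scores = np.abs(y_cal - predict(X_cal))
    n = len(scores)
    # The (ceil((n+1)(1-delta)))-th smallest score gives a width with
    # marginal coverage >= 1 - delta under exchangeability.
    k = int(np.ceil((n + 1) * (1 - delta)))
    q = np.sort(scores)[min(k, n) - 1]
    preds = predict(X_test)
    return preds - q, preds + q          # (lower, upper) interval endpoints

rng = np.random.default_rng(1)
X = rng.normal(size=1000)
y = 2 * X + rng.normal(size=1000)        # true relationship plus noise
predict = lambda x: 2 * x                # any point predictor will do
lo, hi = split_conformal(predict, X[:500], y[:500], X[500:])
coverage = np.mean((y[500:] >= lo) & (y[500:] <= hi))  # close to 0.95
```

The guarantee is marginal: it averages over the draw of $(x, y)$, which is exactly the weakness discussed next.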
Conformal prediction has a lot going for it --- its tools are very general and flexible, and lead to practical algorithms. But it also has two well-known shortcomings: its guarantees are only marginal, averaged over the entire population rather than holding conditionally on meaningful subgroups, and they rely on distributional assumptions such as the exchangeability of the data.
What kinds of improvements might we hope for? Let's start with how to strengthen the guarantee:
Multivalidity: Ideally, we would want conditional guarantees --- i.e. the promise that for every $x$, we would have $\Pr_{y}\left[y \in [\ell(x), u(x)] | x \right] \approx 0.95$. In other words, that somehow for each individual, the prediction interval was valid for them specifically, over the "unrealized" (or unmeasured) randomness of the world. Of course this is too much to hope for. In a rich feature space, we have likely never seen anyone exactly like you before (i.e. with your feature vector $x$). So strictly speaking, we have no information at all about your conditional label distribution. We still have to average over people. But we don't have to average over everybody. An important idea, investigated in several different contexts in the recent theory literature on fairness, is that we might articulate a very rich collection of (generally intersecting) demographic groups $G$ corresponding to relevant subsets of the data domain, and ask that the things we care about hold true as averaged over any group $S \in G$ in the collection. In the case of prediction intervals, this corresponds to asking that simultaneously, for every demographic group $S \in G$, $\Pr_{(x,y)}\left[y \in [\ell(x), u(x)] | x \in S \right] \approx 0.95$. Note here that an individual might be a member of many different demographic groups, and can interpret the guarantees of their prediction interval as averages over any of those demographic groups, at their option. This is what we can achieve --- at least for any such group that isn't too small.
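Checking what multivalidity asks for is straightforward: instead of one marginal coverage number, compute empirical coverage within each of the (intersecting) groups. The sketch below uses entirely synthetic data, made-up groups, and a fixed interval purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
x_age = rng.integers(18, 90, size=n)     # synthetic features
x_site = rng.integers(0, 3, size=n)
y = rng.normal(size=n)                   # synthetic labels
lo, hi = -2.0, 2.0                       # a fixed interval, illustration only

# Intersecting demographic groups: multivalidity asks for ~95% coverage
# within every one of these, not just over the whole population.
groups = {
    "age<40": x_age < 40,
    "age>=40": x_age >= 40,
    "site0": x_site == 0,
    "site0 & age<40": (x_site == 0) & (x_age < 40),
}
covered = (y >= lo) & (y <= hi)
per_group = {name: covered[mask].mean() for name, mask in groups.items()}
```

A marginal guarantee only constrains `covered.mean()`; the point of multivalidity is that every entry of `per_group` must be close to the target coverage as well.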
And what kinds of assumptions do we need?
Adversarial Data Actually, its not clear that we need any! Many learning problems which initially appear to require distributional assumptions turn out to be solvable even in the worst case over data sequences --- i.e. even if a clever adversary, with full knowledge of your algorithm, and with the intent only to sabotage your learning guarantees, is allowed to adaptively choose data to present to your algorithm. This is the case for calibrated weather prediction, as well as general contextual prediction. It turns out to be the case for us as well. Instead of promising coverage probabilities of $1-\delta + O(1/T)$ after $T$ rounds on the underlying distribution, as conformal prediction is able to, (for us there is no underlying distribution) we offer empirical coverage rates of $1-\delta \pm O(1/\sqrt{T})$. This kind of guarantee is quite similar to what conformal prediction guarantees about empirical coverage.
More Generally Our techniques are not specific to prediction intervals. We can do the same thing for predicting label means, and predicting variances of the residuals of arbitrary prediction methods. For mean prediction, this corresponds to an algorithm for providing multi-calibrated predictions in the sense of Hebert-Johnson et al, in an online adversarial environment. For variances and other higher moments, it corresponds to an online algorithm for making mean-conditioned moment multicalibrated predictions in the sense of Jung et al.
Techniques At the risk of boring my one stubbornly remaining reader, let me say a few words about how we do it. We generalize an idea that dates back to an argument that Fudenberg and Levine first made in 1995 --- and is closely related to an earlier, beautiful argument by Sergiu Hart --- but that I just learned about this summer, and thought was just amazing. It applies broadly to solving any prediction task that would be easy, if only you were facing a known data distribution. This is the case for us. If, for each arriving patient at our hospital, a wizard told us their "true" distribution over outcome severity, we could easily make calibrated predictions by always predicting the mean of this distribution --- and we could similarly read off correct 95% coverage intervals from the CDF of the distribution. So what? That's not the situation we are in, of course. Absent a wizard, we first need to commit to some learning algorithm, and only then will the adversary decide what data to show us.
But lets put our game theory hats on. Suppose we've been making predictions for awhile. We can write down some measure of our error so far --- say the maximum, over all demographic groups in $G$, of the deviation of our empirical coverage so far from our 95% coverage target. For the next round, define a zero sum game, in which we (the learner) want to minimize the increase in this measure of error, and the adversary wants to maximize it. The defining feature of zero-sum games is that how well you can do in them is independent of which player has to announce their distribution on play first --- this is the celebrated Minimax Theorem. So to evaluate how well the learner could do in this game, we can think about the situation involving a Wizard above, in which for each arriving person, before we have to make a prediction for them, we get to observe their true label distribution. Of course in this scenario we can do well, because for all of our goals, our measure of success is based on how well our predictions match observed properties of these distributions. The Minimax theorem tells us that (at least in principle --- it doesn't give us the algorithm), there must therefore also be a learning algorithm that can do just as well, but against an adversary.
The minimax argument is slick, but non-constructive. To actually pin down a concrete algorithm, we need to solve for the equilibrium in the corresponding game. That's what we spend much of the paper doing, for each of the prediction tasks that we study. For multicalibration, we get a simple, elementary algorithm --- but for the prediction interval problem, although we get a polynomial time algorithm, it involves solving a linear program with a separation oracle at each round. Finding more efficient and practical ways to do this strikes me as an important problem.
Finally, I had more fun writing this paper --- learning about old techniques from the game theoretic calibration literature --- than I've had in awhile. I hope a few people enjoy reading it!
This is a post about a new paper, Online Multivalid Learning: Means, Moments, and Prediction Intervals, which is joint work with Varun Gupta, Christopher Jung, Georgy Noarov, and Mallesh Pai.
Suppose you go and train the latest, greatest machine learning architecture to predict something important. Say (to pick an example entirely out of thin air) you are in the midst of a pandemic, and want to predict the severity of patients’ symptoms in 2 days’ time, so as to triage scarce medical resources. Since you will be using these predictions to make decisions, you would like them to be accurate in various ways: for example, at the very least, you will want your predictions to be calibrated, and you may also want to be able to accurately quantify the uncertainty of your predictions (say with 95% prediction intervals). It is a fast moving situation, and data is coming in dynamically — and you need to make decisions as you go. What can you do?
The first thing you might do is ask on Twitter! What you will find is that the standard tool for quantifying uncertainty in settings like this is conformal prediction. The conformal prediction literature has a number of elegant techniques for endowing arbitrary point prediction methods with marginal prediction intervals: i.e. intervals $(\ell(x), u(x))$ such that over the randomness of some data distribution over labelled examples $(x,y)$: $\Pr_{(x,y)}\left[y \in [\ell(x), u(x)]\right] \approx 0.95$. These would be 95% marginal prediction intervals — but in general you could pick your favorite coverage probability $1-\delta$.
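To make the marginal guarantee concrete, here is a minimal sketch of the split-conformal recipe for wrapping a point predictor with an interval; the function names and the toy predictor are mine, not from the paper:

```python
import numpy as np

def split_conformal_interval(predict, X_cal, y_cal, x_new, delta=0.05):
    """Wrap a point predictor with a marginal 1 - delta prediction interval,
    using absolute residuals on a held-out calibration set as scores."""
    scores = np.abs(y_cal - predict(X_cal))
    n = len(scores)
    # Finite-sample corrected empirical quantile of the scores.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - delta)) / n))
    pred = predict(np.array([x_new]))[0]
    return pred - q, pred + q

# Toy usage: calibrate around the identity predictor on y = x + noise.
rng = np.random.default_rng(0)
X_cal = rng.uniform(0, 1, size=1000)
y_cal = X_cal + rng.normal(0, 0.1, size=1000)
predict = lambda X: X            # the point prediction method being wrapped
lo, hi = split_conformal_interval(predict, X_cal, y_cal, 0.5)
```

Note the guarantee this buys is exactly the marginal one discussed above: coverage on average over the data distribution, not per individual.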
Conformal prediction has a lot going for it — its tools are very general and flexible, and lead to practical algorithms. But it also has two well known shortcomings:
What kinds of improvements might we hope for? Let’s start with how to strengthen the guarantee:
Multivalidity Ideally, we would want conditional guarantees — i.e. the promise that for every $x$, we would have $\Pr_{y}\left[y \in [\ell(x), u(x)] \mid x \right] \approx 0.95$. In other words, that somehow for each individual, the prediction interval was valid for them specifically, over the “unrealized” (or unmeasured) randomness of the world. Of course this is too much to hope for. In a rich feature space, we have likely never seen anyone exactly like you before (i.e. with your feature vector $x$). So strictly speaking, we have no information at all about your conditional label distribution. We still have to average over people. But we don’t have to average over everybody. An important idea that has been investigated in several different contexts in recent years in the theory literature on fairness is that we might articulate a very rich collection of (generally intersecting) demographic groups $G$ corresponding to relevant subsets of the data domain, and ask for the things that we care about to hold true as averaged over any group $S \in G$ in the collection. In the case of prediction intervals, this would correspond to asking that simultaneously for every demographic group $S \in G$, $\Pr_{(x,y)}\left[y \in [\ell(x), u(x)] \mid x \in S \right] \approx 0.95$. Note here that an individual might be a member of many different demographic groups, and can interpret the guarantees of their prediction interval as averages over any of those demographic groups, at their option. This is what we can achieve — at least for any such group that isn’t too small.
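To see what the group-wise guarantee is measuring, here is a small sketch (all names and the synthetic population are hypothetical) that computes empirical coverage averaged over each of several intersecting groups, which is exactly the quantity a multivalid method must control simultaneously:

```python
import numpy as np

def group_coverage(groups, covered):
    """Empirical interval coverage averaged over each (possibly intersecting)
    group: `groups` maps names to boolean membership masks, and `covered[i]`
    records whether individual i's label fell inside their interval."""
    return {name: covered[mask].mean()
            for name, mask in groups.items() if mask.any()}

rng = np.random.default_rng(1)
n = 10_000
a = rng.random(n) < 0.5              # membership in group A (synthetic)
b = rng.random(n) < 0.3              # membership in group B, overlapping A
covered = rng.random(n) < 0.95       # stand-in for "label landed in interval"
groups = {"A": a, "B": b, "A and B": a & b, "everyone": np.ones(n, bool)}
rates = group_coverage(groups, covered)
```

A marginal method only controls the "everyone" entry; the multivalidity goal is that every entry of this dictionary is close to the target, including intersections.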
And what kinds of assumptions do we need?
Adversarial Data Actually, it’s not clear that we need any! Many learning problems which initially appear to require distributional assumptions turn out to be solvable even in the worst case over data sequences — i.e. even if a clever adversary, with full knowledge of your algorithm, and with the intent only to sabotage your learning guarantees, is allowed to adaptively choose data to present to your algorithm. This is the case for calibrated weather prediction, as well as general contextual prediction. It turns out to be the case for us as well. Instead of promising coverage probabilities of $1-\delta + O(1/T)$ after $T$ rounds on the underlying distribution, as conformal prediction is able to, we offer empirical coverage rates of $1-\delta \pm O(1/\sqrt{T})$ (since for us there is no underlying distribution). This kind of guarantee is quite similar to what conformal prediction guarantees about empirical coverage.
More Generally Our techniques are not specific to prediction intervals. We can do the same thing for many other distributional quantities. We work this out in the case of predicting label means, and predicting variances of the residuals of arbitrary prediction methods. For mean prediction, this corresponds to an algorithm for providing multi-calibrated predictions in the sense of Hebert-Johnson et al, in an online adversarial environment. For variances and other higher moments, it corresponds to an online algorithm for making mean-conditioned moment multicalibrated predictions in the sense of Jung et al.
Techniques At the risk of boring my one stubbornly remaining reader, let me say a few words about how we do it. We generalize an idea that dates back to an argument that Fudenberg and Levine first made in 1995 — and is closely related to an earlier, beautiful argument by Sergiu Hart — but that I just learned about this summer, and thought was just amazing. It applies broadly to solving any prediction task that would be easy, if only you were facing a known data distribution. This is the case for us. If, for each arriving patient at our hospital, a wizard told us their “true” distribution over outcome severity, we could easily make calibrated predictions by always predicting the mean of this distribution — and we could similarly read off correct 95% coverage intervals from the CDF of the distribution. So what? That’s not the situation we are in, of course. Absent a wizard, we first need to commit to some learning algorithm, and only then will the adversary decide what data to show us.
But let’s put our game theory hats on. Suppose we’ve been making predictions for a while. We can write down some measure of our error so far — say the maximum, over all demographic groups in $G$, of the deviation of our empirical coverage so far from our 95% coverage target. For the next round, define a zero sum game, in which we (the learner) want to minimize the increase in this measure of error, and the adversary wants to maximize it. The defining feature of zero-sum games is that how well you can do in them is independent of which player has to announce their distribution on play first — this is the celebrated Minimax Theorem. So to evaluate how well the learner could do in this game, we can think about the situation involving a Wizard above, in which for each arriving person, before we have to make a prediction for them, we get to observe their true label distribution. Of course in this scenario we can do well, because for all of our goals, our measure of success is based on how well our predictions match observed properties of these distributions. The Minimax theorem tells us that (at least in principle — it doesn’t give us the algorithm), there must therefore also be a learning algorithm that can do just as well, but against an adversary.
The minimax argument is slick, but non-constructive. To actually pin down a concrete algorithm, we need to solve for the equilibrium in the corresponding game. That’s what we spend much of the paper doing, for each of the prediction tasks that we study. For multicalibration, we get a simple, elementary algorithm — but for the prediction interval problem, although we get a polynomial time algorithm, it involves solving a linear program with a separation oracle at each round. Finding more efficient and practical ways to do this strikes me as an important problem.
Finally, I had more fun writing this paper — learning about old techniques from the game theoretic calibration literature — than I’ve had in a while. I hope a few people enjoy reading it!
The Department of Computer Science at the University of Toronto invites applications for up to two full-time tenure stream positions in the areas of Security and Cryptography. The appointments will be at the rank of Assistant Professor and will commence on July 1, 2021, or shortly thereafter.
Website: https://jobs.utoronto.ca/job/Toronto-Assistant-Professor-Security-and-Cryptography-ON/543569117/
Email: recruit@cs.toronto.edu
The School of Informatics at the University of Edinburgh invites applicants for Lecturer (“Assistant Professor”) and Reader (“Associate Professor”) positions in Algorithms and Complexity.
For a full advert for the post, and how to apply, please see the link below.
Website: https://elxw.fa.em3.oraclecloud.com/hcmUI/CandidateExperience/en/sites/CX_1001/job/295
Email: kousha@inf.ed.ac.uk
Can we expand the base of their use?
Technion commemoration
Bella Subbotovskaya was doing Boolean complexity lower bounds in 1961. She originated the method of restrictions of Boolean functions and proved that the formula size of the parity function of $n$ variables is at least $n^{3/2}$. The parity bound was improved to quadratic in 1971 by Valeriy Khrapchenko, who passed away only last year and was the only child of the Soviet literature and arts leader Mikhail Khrapchenko. Then Johan Håstad strengthened the whole method to quadratic.
Today we discuss not possible further sharpenings but rather how to extend the ways of applying it.
Subbotovskaya was most unfortunately killed in a mysterious manner in 1982. George Szpiro in 2007 wrote an article reviewing the circumstances, including her questioning by the KGB for co-founding and hosting the “Jewish People’s University” amid rampant anti-Semitism in Soviet academies. She was also a talented violist and vocalist and investigated computer-assisted musical composition and analysis. See also her MacTutor biography.
As presented by Stasys Jukna in his 2012 book Boolean Function Complexity, Subbotovskaya gave an algorithm for optimizing the trimming of a formula over the basis $\{\vee,\wedge,\neg\}$ at its leaves by substituting for some of the variables, one at a time. She used an averaging argument that readily extends to give the same-order bounds on random restrictions. Random restrictions have other advantages that have led to wonderful work besides those above, including the recent oracle separation of $\mathsf{BQP}$ from the polynomial hierarchy. We want to discuss targeting the restrictions toward possible particular sensitivities of the Boolean functions computed by the formulas.
Subbotovskaya’s argument considers every $\vee$ or $\wedge$ gate $g$, in a minimal formula $\phi$ computing the function $f$, that has $x_i$ or $\bar{x}_i$ as an input, for some $i = 1$ to $n$. Note that because $\phi$ is a formula, one of the two possible assignments to $x_i$ renders the other input $\psi$ to $g$ immaterial. If $\phi$ has $L$ leaves, then we can not only choose $i$ so that $x_i$ or $\bar{x}_i$ occurs at least $L/n$ times, we can set it to wipe out at least half of the subtrees $\psi$.
The final note is that doing so does not double-cancel any variable: if $\psi$ contains $x_i$ or $\bar{x}_i$, then assigning to that occurrence leaves the function computed by $\phi$ unchanged—but that contradicts the minimality of $\phi$. Hence no subformula $\psi$ contains $x_i$ or $\bar{x}_i$. The upshot is that we can set $x_i$ so that the resulting formula in $n-1$ variables has at most $L\left(1 - \frac{3}{2n}\right)$ leaves. It is convenient to weaken this inequality to $L' \le L\left(1 - \frac{1}{n}\right)^{3/2}$,
so that by iterating we can exploit the identity that for any $k < n$, $\prod_{m=k+1}^{n}\left(1 - \frac{1}{m}\right) = \frac{k}{n}$.
The upshot is that for any $k$, we can set $n-k$ variables so that the size $L'$ of the resulting formula in $k$ variables obeys the bound $L' \le \left(\frac{k}{n}\right)^{3/2} L$.
Because the parity function is such that any way of setting $n-1$ variables leaves the formula needing (at least) one read of the remaining variable, we can carry this all the way down to $k=1$ to deduce $L \ge n^{3/2}$. Note: this is concrete, not asymptotic.
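A quick numerical check of the iteration (the helper name is mine): the per-step factors telescope, so fixing all but $k$ variables shrinks the leaf count by exactly $(k/n)^{3/2}$, and carrying $k$ down to 1 recovers the $n^{3/2}$ parity bound:

```python
from math import isclose

def shrinkage_factor(n, k, exponent=1.5):
    """Product of the per-step factors (1 - 1/m)**exponent for m = n down to
    k + 1, i.e. the guaranteed shrinkage after fixing n - k variables."""
    f = 1.0
    for m in range(k + 1, n + 1):
        f *= (1.0 - 1.0 / m) ** exponent
    return f

n, k = 100, 7
# Telescoping: the product of (1 - 1/m) over m = k+1..n is exactly k/n,
# so with exponent 3/2 the total shrinkage is (k/n)**1.5.
assert isclose(shrinkage_factor(n, k, 1.0), k / n)
assert isclose(shrinkage_factor(n, k), (k / n) ** 1.5)

# Parity: the surviving formula at k = 1 still needs one leaf, so L >= n**1.5.
lower_bound = 1.0 / shrinkage_factor(n, 1)
```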
Via induction on gates deeper than those at the leaves, and techniques that adapt to how closely the two subtrees of a gate are balanced, Håstad increased the exponent from $3/2$ to $2$ in all cases, on pain of carrying some lower-order terms that make the bounds asymptotic. His lemmas rely more on the restrictions being random, including one that estimates the probability that the restriction already leaves a formula dependent on only one variable.
For sake of exposition we wave away the lower-order terms and write $\gtrsim$ for the numerical inequalities. We frame the restrictions with notation that can help us target subsets of variables to restrict, outside of which the function remains sensitive.
Definition 1 Let $f$ be a Boolean function on $n$ variables. We say that
$U(f,S,z)$
holds provided $S \subseteq [n]$ and $z \in \{0,1\}^{S}$ and for some $x, y \in \{0,1\}^n$ the following hold:
- The inequality $f(x) \neq f(y)$ holds.
- For all $i \in S$, $x_i = y_i = z_i$.
Thus $U(f,S,z)$ holds provided the value of $f$ is not determined by just knowing the bits $z$ on the set of positions in $S$. The remaining positions are needed.
Theorem 2 (Main Theorem of Random Restriction Theory) Let $f$ be a Boolean function. Suppose that for a constant positive fraction of pairs $(S,z)$ with $S$ of size $n-k$, $U(f,S,z)$
holds. Then $L(f) \gtrsim \left(\frac{n}{k}\right)^{2}$.
To see how this scheme applies in the case of the parity function, we can set $k = 1$. No matter which inputs are selected, the value of the function stays undetermined after $n-1$ inputs are set. That is, the following is always true:
the restriction leaves parity undetermined, for all $S$ of size at most $n-1$ and all $z$. Thus the main theorem says that the formula lower bound for the parity function is order $n^2$.
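The "undetermined" predicate can be checked by brute force on small functions; this sketch (function names mine) confirms that parity stays undetermined under every restriction that leaves at least one bit free:

```python
from itertools import product

def undetermined(f, n, S, z):
    """Does fixing the bits in S according to z leave the value of f
    undetermined?  True iff two completions x, y of z give f(x) != f(y)."""
    free = [i for i in range(n) if i not in S]
    values = set()
    for bits in product([0, 1], repeat=len(free)):
        assignment = {**dict(zip(free, bits)), **{i: z[i] for i in S}}
        values.add(f([assignment[i] for i in range(n)]))
    return len(values) > 1

# Parity stays undetermined under any restriction leaving a bit free.
parity = lambda x: sum(x) % 2
n = 6
S = set(range(n - 1))          # fix all but the last bit
z = {i: 0 for i in S}
ok = undetermined(parity, n, S, z)
```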
Note that the only property of the parity function being used is the global way it satisfies . We want to look for other patterns that can yield good bounds. Of course, patterns of sensitivity of Boolean functions have been looked at in many ways. We want to try to build a bridge from them to number theory.
For a non-example, consider the indicator function of the set of even numbers. Let us number the variables so that $x_1$ stands for the ones place in binary notation, $x_2$ for the twos place, and so on. Then the function depends only on $x_1$. Whether the restriction leaves it undetermined depends entirely on position $1$ not belonging to $S$. If $k \ge \epsilon n$ for some $\epsilon > 0$ then this holds with constant probability, and the framework gives a bound of order $(n/k)^2 \le (1/\epsilon)^2$, a constant.
Well, OK, we really know the formula size here is $1$, so this is where the overhead hidden by the $\gtrsim$ can’t be ignored. But at least this illustrates how the framework doesn’t get too excited about a frequently alternating numerical function. Similar remarks apply when the function is multiples of $2^j$ where $j$ is small.
We would like to apply the random restrictions method to the Boolean function $\mathrm{PRIMES}$ that identifies whether the $n$-bit binary string denotes a prime number. It has been known for decades that the Boolean circuit complexity of $\mathrm{PRIMES}$ is polynomial. This follows from the existence of a randomized polynomial-time algorithm for $\mathrm{PRIMES}$ due to Solovay-Strassen, or the later Miller-Rabin algorithm, and Adleman’s connection between such algorithms and Boolean circuit complexity.
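For reference, here is a standard sketch of the Miller-Rabin test mentioned above; a prime is always accepted, while a composite survives each random round with probability at most 1/4:

```python
import random

def miller_rabin(n, rounds=20, rng=random.Random(0)):
    """Miller-Rabin randomized primality test (sketch)."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:            # write n - 1 = d * 2**s with d odd
        d //= 2
        s += 1
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False         # witness found: n is composite
    return True
```

By Adleman's argument, a randomized algorithm like this one can be converted into polynomial-size circuits by fixing a good set of random strings.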
There are strong lower bounds in special circuit models, but nothing so general as formulas, let alone unrestricted circuits. Certainly $\mathrm{PRIMES}$ is a harder function than parity, but we don’t know of a quadratic formula lower bound for it. Can we prove one in the framework?
Our intuition comes from considering various ways in which the primes are dense and uniformly—but not too regularly—distributed. Two simple ways are:
The first is of course a consequence of the prime number theorem. The second follows from the widely believed Cramér conjecture.
We will not really need this conjecture—we will not need it to hold everywhere. Our basic framework will only need it to hold for a positive fraction of small intervals of numbers that we sample. What we need instead is for a positive-fraction version to hold under a more liberal notion of “gap.”
Here is an outline to try to prove a formula size lower bound of $n^{1+\delta}$ on $\mathrm{PRIMES}$, for some $\delta > 0$.
We will use the random restriction method for $\mathrm{PRIMES}$. Apply the method. The issue is now: if we leave only $k$ input bits unset, then what is the probability that the restricted function remains undetermined?
Certainly we cannot always leave one bit unset as with the XOR function. Look at this: if only the lowest-order bit is left unset, the input is either $1000_2 = 8$ or $1001_2 = 9$, and neither is a prime. In general, our finger has come to rest on the question of what is known or can be shown about the behavior of the set of primes on “hypercubic” sets—that is, sets that add or subtract all sums from a selected subset of the powers of $2$.
There must be other lower bounds on the primes. We cannot seem to find them. Any help?
Dong Yeap Kang, Tom Kelly, Daniela Kühn, Abhishek Methuku, and Deryk Osthus have just uploaded a paper to the arXiv, A proof of the Erdős-Faber-Lovász conjecture. (I am thankful to Nati Linial and Ryan Alweiss for telling me about it.)
Here is one formulation of the conjecture:
Erdős-Faber-Lovász Conjecture: Let $\mathcal{F}$ be a family of subsets of $\{1,2,\ldots,n\}$. Suppose that every two sets in $\mathcal{F}$ have at most one element in common; then $\mathcal{F}$ can be divided into at most $n$ families so that every two sets in each of these families are disjoint.
Using some standard hypergraph jargon the conjecture can be stated as follows:
Erdős-Faber-Lovász Conjecture: The chromatic index of any linear hypergraph on $n$ vertices is at most $n$.
Here a linear hypergraph is a hypergraph where every two edges share at most one vertex. The chromatic index of a hypergraph is the number of colors needed to color the edges so that every two intersecting edges have distinct colors.
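A concrete sanity check that the bound $n$ cannot be improved: the Fano plane is a linear hypergraph on $n=7$ vertices in which every two of the 7 lines meet, so no two lines can share a color and the chromatic index is exactly $7 = n$:

```python
from itertools import combinations

# The 7 lines of the Fano plane: a linear hypergraph on vertices 0..6.
lines = [{0, 1, 2}, {0, 3, 4}, {0, 5, 6}, {1, 3, 5},
         {1, 4, 6}, {2, 3, 6}, {2, 4, 5}]

# Linearity: every two lines share at most one vertex; here, exactly one,
# so every pair of lines intersects and all lines need distinct colors.
assert all(len(a & b) == 1 for a, b in combinations(lines, 2))
chromatic_index = len(lines)   # 7 colors are needed, and trivially suffice
```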
The new result asserts
Theorem (Kang, Kelly, Kühn, Methuku, and Osthus): For every sufficiently large n, every linear hypergraph H on n vertices has chromatic index at most n.
A stability version of this result, which confirms a prediction of Jeff Kahn, is also provided. Of course, it will be interesting to settle the conjecture for all values of n.
This is an amazing breakthrough! Congratulations!
(from Wikipedia)
Daniela Kühn, Deryk Osthus, Tom Kelly, and Abhishek Methuku (I will add a picture of Dong Yeap Kang when I find one. Update: found! See below.)
Background and links: Here is the Wikipedia article. We mentioned the conjecture in this post and this one.
Matching and fractional version. In 1982 Paul Seymour proved that any linear hypergraph with $e$ edges on $n$ vertices has a matching of size $e/n$. In 1992 Jeff Kahn and Paul Seymour proved that the fractional chromatic index of any linear hypergraph on $n$ vertices is at most $n$.
Approximate version via the semi-random method. In 1992 Jeff Kahn proved that the chromatic index of any linear hypergraph on $n$ vertices is at most $n + o(n)$. Kahn used the Rödl nibble (aka semi-random) method. Kahn extended the result to list-coloring.
There are many related open questions and some are mentioned in the paper. It will be interesting to prove the EFL conjecture for all values of $n$. The EFL conjecture extends to a beautiful generalization of Vizing’s theorem which is still open. Also the EFL conjecture for list coloring is still open. See also this 1994 article by Jeff Kahn on hypergraph matching, covering, and coloring problems.
Another interesting problem is:
The restricted Larman conjecture: Let $\mathcal{F}$ be an intersecting family of $k$-subsets of $\{1,2,\ldots,n\}$. Then $\mathcal{F}$ can be divided into at most $n$ 2-intersecting families, namely so that every two sets in each of these families have at least two elements in common.
Little Update: A few more details on the last problem. As far as I know, for the restricted Larman conjecture it is not even known that $\mathcal{F}$ can always be divided into 2-intersecting families.
Now, call a family $\mathcal{G}$ strongly 2-intersecting if the intersection of all sets in $\mathcal{G}$ has at least two elements. Füredi and Seymour proved that $\mathcal{F}$ can be fractionally covered by $n$ strongly 2-intersecting subfamilies. They conjectured that $\mathcal{F}$ can always be covered by $n$ strongly 2-intersecting families; however, Jeff Kahn and I disproved this conjecture.
HSE University’s Faculty of Computer Science is now accepting applications for postdoctoral positions in several laboratories.
At present, there are ten open postdoctoral positions in ten of the Faculty’s twelve laboratories. Research areas range from algebraic topology and bioinformatics to data analysis, neural networks, and process mining.
Website: https://cs.hse.ru/en/news/433773574.html
Email: abasov@hse.ru
I am excited to launch the Collaboration on the Theory of Algorithmic Fairness, funded by the Simons Foundation. This is a large-scale multi-institutional effort to accelerate the (already powerful) impact of TOC on the emerging area of algorithmic fairness. Beyond the ambitious research goals of this project (which we will discuss in future posts), we hope it will play an important role in community building and bridge building to other fields of research. In addition to launching this blog and our site, we are also launching a research seminar that will feature a diverse set of talks. We are very excited to have Annette Zimmermann as our inaugural speaker (a week from today).
If you want to join our seminar, please email toc4fairness-director@cs.stanford.edu and we will add you to our email list (which we will be careful not to overuse).
If you are formatting a paper for LIPIcs, but don’t want all their files in the paper directory, add the following to the top of the main LaTeX file:
\makeatletter
\def\input@path{{lipics/}{../lipics/}}
\makeatother
\documentclass[a4paper,USenglish,cleveref,autoref,thm-restate]%
{socg-lipics-v2019}
This assumes that the lipics/ (or ../lipics/) subdirectory contains the .cls file and any style files it loads; \input@path tells LaTeX to also search those directories when it looks for input files.
The EC 2021 conference invites proposals for tutorials and workshops to be held in conjunction with the ACM Conference on Economics and Computation (ACM EC). EC’21 will be held either virtually or in a hybrid format. In the latter case, tutorial/workshop organizers will have the opportunity to decide whether to hold their event in person or remotely.
Tutorials provide an opportunity to educate the community about emerging topics of interest, or about topics from related fields that merit additional attention from the EC community. A tutorial is an opportunity to invite colleagues and young researchers to get excited about a well-defined topic, to prepare them to dive into the literature and to guide them to the most exciting developments and open problems. Typically, tutorials run for half a day (3-3.5 hours) and consist of a series of presentations by experts in the field.
Tutorial proposals should contain:
Tutorial proposals should be emailed to ec21-tutorial-chairs@acm.org.
Workshops provide an opportunity to bring together researchers to discuss emerging areas of research in an informal forum. Workshop schedules should be designed to promote discussion and debate. A workshop may include invited talks, contributed talks, panel discussions, poster sessions, open problem sessions, presentations of work in progress, or any other activities that stimulate new ideas for research. It is up to the workshop organizers to determine the format and technical content of each workshop and to solicit contributions, but we encourage workshops that devote some time to contributed content.
Workshop proposals should contain:
For accepted workshops, the desired length will be honored as closely as possible.
We especially encourage proposals that bring together participants with diverse backgrounds and experience.
Workshop proposals should be emailed to ec21-workshop-chairs@acm.org.
March 1, 2021: Due date for submitting workshop or tutorial proposal
March 22, 2021: Tutorial & workshop proposal accept/reject notifications
July 19, 2021: Conference Tutorials
July 20-22, 2021: Conference technical program
July 23, 2021: Conference Workshops
Shengwu Li, Harvard University
Sigal Oren, Ben-Gurion University
Hu Fu, University of British Columbia
Annie Liang, Northwestern University
In November I (and probably everyone in the ACM) got an email that was a call for proposals from the ACM History Committee. Here is a pointer to it.
One of the passages in it just seems wrong to me:
This fellowship program is designed to support research and/or curatorial projects related to ACM's professional and educational activities and/or to ACM's rich institutional history including its organization, SIG activities, and conferences.
I will give examples of history projects that are interesting but do not fit the description. I challenge you to give me a history project that is interesting that does fit the bill.
1) Compare and contrast how NP-completeness developed in America and in the USSR. For the USSR side it would be an expansion of Trakhtenbrot's article here. OH, no good, Leonid Levin was not a member of the ACM.
2) Measure how various CS fields have changed by looking at the topics on the CALL FOR PAPERS for a conference. This would be a MASSIVE expansion of my blog post on how CCC call for papers changed, see here. OH, I could only do ACM conferences. STOC but not FOCS. Okay, it makes perfect sense to study only ACM conferences. Oh wait. IT MAKES NO SENSE WHATSOEVER.
3) When did mathematicians begin looking at P vs NP seriously? (E.g., in The Handbook of Mathematical Logic from 1977 only one article mentions P vs NP: Michael Rabin's article on decidable theories mentioned that even decidable theories may take a long time to decide and noted the importance of the P vs NP problem.) How did MATH and CS interact early on and now? This would need one to look at math journals. How many of those are ACM? I would guess very few.
4) When did engineers begin looking at P vs NP seriously, or even notice it? Same issue as the last item.
5) Looking at how programming language design has changed, one would have to look only at conferences and journals that were ACM. And for those that were ACM but then dropped ACM, you might only be able to use them up to the cutoff year.
6) Which academic studies eventually lead to practical products people can use? This is a really broad topic so one could narrow it down to just academic studies that appeared in ACM journals or conferences. Is that a good way to narrow the focus? Spoiler alert: NO.
7) I recently filled out a survey about the future of theory and if it is insular (the topic of another post surely). Most of the questions were about ``STOC-FOCS'' The people who did the survey clearly do not care that one is ACM and one is IEEE. Does anyone? I ask non-rhetorically.
Despite the capital letters, I am not so much mad as puzzled. So I ask non-rhetorically:
Q1) Did the ACM really mean for people to do history in a way where you can only use ACM materials?
Q2) If no then please rewrite the Call for Proposals.
Q3) If yes then give examples of studies that would be interesting.
I am not asking to be snarky, I really want to know. And I note that interesting is subjective.
(ADDED LATER- two historians who have worked with this kind of history left comments that indicate the grant is FINE with non-ACM stuff, so Q1 above seems to have the answer NO. Given that, they should rewrite the call for proposals next time they do this. ALSO- the historians left pointers to VERY INTERESTING papers they wrote.)
This post continues to describe problems presented at our open problems session back in November 2020. Here is the first post in the series. Today’s problem was presented by me, and it was an old 1989 conjecture of mine.
A convex set $K$ is centrally symmetric if $x \in K$ implies that $-x \in K$. The $3^d$ conjecture asserts that every centrally symmetric $d$-dimensional polytope $P$ has at least $3^d$ non-empty faces.
Here we count $P$ itself as a face.
Before we continue let me mention a winter school on “The Interplay between High-Dimensional Geometry and Probability”, January 11-15. Among other exciting lectures Bo’az Klartag will give 3 lectures on Yuansi Chen’s work on the KLS conjecture, that we discussed in this post.
Now back to the $3^d$ conjecture. A case of equality is the $d$-dimensional cube $C_d = [-1,1]^d$. The non-empty faces of $C_d$ correspond to $(-1,1,*)$-vectors of length $d$. Given such a vector, it corresponds to a face of the cube whose vertices are obtained by replacing the *’s with $\pm 1$ in all possible ways. Of course the dual of $C_d$ also has $3^d$ non-empty faces.
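The count for the cube can be checked mechanically: enumerating the $(-1,1,*)$-vectors gives exactly $3^d$ non-empty faces (the helper name is mine):

```python
from itertools import product

def cube_faces(d):
    """Non-empty faces of the cube [-1,1]^d, encoded as vectors over
    {-1, 1, '*'}: '*' coordinates are free, the rest are pinned to a facet;
    the all-'*' vector encodes the cube itself."""
    return list(product([-1, 1, "*"], repeat=d))

# The cube meets the 3^d bound with equality in every dimension.
for d in range(1, 7):
    assert len(cube_faces(d)) == 3 ** d
```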
Schlegel diagram of the octahedral prism, a 4-dimensional Hanner polytope.
Here are some points raised in the discussion and further comments.
Update (Jan 12, 2021): Looking at old correspondence with Ragnar Freij, Matthias Henze, and Günter Ziegler, I came across the following conjecture that relates the flag number conjecture and Mahler’s conjecture. Let $P$ be a centrally symmetric $d$-polytope. Let $m(P) = \mathrm{vol}(P)\cdot\mathrm{vol}(P^*)$, and let $s(P)$ denote the number of flags of $P$. For a centrally symmetric $d$-polytope $P$, Mahler’s conjecture asserts that $m(P) \ge 4^d/d!$ and the flag number conjecture asserts that $s(P) \ge 2^d\, d!$.
It is also conjectured that equality holds for Hansen polytopes of perfect graphs.
The Faculty of Mathematics, Informatics and Mechanics at University of Warsaw (MIM UW) invites applications for positions of an assistant professor (“adiunkt” in Polish) in Computer Science, starting on 1st October 2021 (or 1st Feb 2022).
Note: compared to a standard assistant professor position, Samuel Eilenberg Assistant Professor comes with reduced teaching load and increased salary.
Website: https://www.mimuw.edu.pl/sites/default/files/konkursy/wmim_1210_ek_01_2021_en.pdf
Email: kowalik@mimuw.edu.pl
The Faculty of Mathematics, Informatics and Mechanics at University of Warsaw (MIM UW) invites applications for positions of an assistant professor (“adiunkt” in Polish) in Computer Science, starting on 1st October 2021 (or 1st Feb 2022).
Website: https://www.mimuw.edu.pl/sites/default/files/konkursy/wmim_1210_ek_03_2021_en.pdf
Email: kowalik@mimuw.edu.pl
The violent insurrection now unfolding in Washington DC is precisely the thing you called me nuts, and accused me of “Trump Derangement Syndrome,” for warning about since 2016. Crazy me, huh, always seeing brownshirts around the corner? And you called the other side violent anarchists? This is all your doing. So own it. Wallow in it. May you live the rest of your lives in shame.
Update (Jan. 7): As someone who hasn’t always agreed with BLM’s slogans and tactics, I viewed the stunning passivity of the police yesterday against white insurrectionists in the Capitol as one of the strongest arguments imaginable for BLM’s main contentions.
Hilbert's 10th problem is (in modern terminology) to find an algorithm that will, given a polynomial p(x1,...,xn) in Z[x1,...,xn], determine if p(x1,...,xn) = 0 has a solution in the integers.
This was shown undecidable by the combined efforts of Davis-Putnam-Robinson and Matiyasevich.
I posted here about the question: for which (d,n) is H10 undecidable when restricted to degree d and number of vars n?
I was invited to make an article out of the blog post, and what I learned from the comments, for the Algorithms column of BEATCS. That first draft is here.
Did I miss an important reference? (Note: I had meant to get Matiyasevich's book on H10 out of the library, but have been unable to because of the pandemic, so if you have it, it may have stuff I need to know.)
Did I misspell Matiyasevich?
Did I say something stupid?
Please leave comments or email me directly if you have suggestions.
Incidentally I am usually the one who is asking someone else to do an open problems column, a book review, a guest blog. Nice to be the askee instead of the asker for a change of pace.
This year, 2021 = 43 × 47, is a Blum integer.
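A quick sanity check of the claim; the trial-division helpers below are just for illustration:

```python
def is_prime(m):
    # naive trial division, fine for small numbers
    return m > 1 and all(m % k for k in range(2, int(m ** 0.5) + 1))

def is_blum(n):
    # A Blum integer is a product p*q of two distinct primes,
    # each congruent to 3 modulo 4.  For a semiprime, checking
    # the smallest prime factor p and its cofactor q suffices.
    for p in range(2, int(n ** 0.5) + 1):
        if n % p == 0:
            q = n // p
            return (p != q and is_prime(p) and is_prime(q)
                    and p % 4 == 3 and q % 4 == 3)
    return False

print(is_blum(2021), 43 * 47)  # True 2021
```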
2017 article, not via Zoom
Allan Lichtman correctly predicted the 2020 presidential election, based on a 7-6 edge in “keys” to Joe Biden. His model, which he developed in 1981 with the late Vladimir Keilis-Borok, simply says that the incumbent party will lose the presidential election if 6 or more keys go against them. It has been right in every election since 1984.
Today we make some predictions for the new year, 2021.
Lichtman has not ventured a prediction for today’s Senate runoff elections in Georgia. Last November 19, he penned an article eviscerating the performance of poll aggregation in the November 3 election. We have considered writing a post about that and about statistical predictions during the pandemic, in the direction of taking more moderate assessments of how various polls, predictions, and policies have fared.
Polls of today’s races currently show a 1–2 percentage point edge for the Democratic candidates, but there are multiple caveats: The difference is just about one standard deviation, well within the margin of error even after aggregation. The polls have gyrated with multiple lead flips this past month; two polls from the past few days collected here show both Republican candidates tied or ahead. Many pollsters have avoided these races, and the five most recent polls all have sample sizes well under 1,000. The count beginning tonight may show a zigzag blue-to-red-to-“blur” shift for reasons, discussed generally here, of which votes are tallied when.
With that said, let us address non-election predictions.
Back in 2019 we made some predictions that were easy:
We skipped 2020—maybe that was smart of us? Our point about predictions in 2019 was the importance of “when” as opposed to “what.” With regard to some predictions, we can take the pandemic as a global postponement of “when.”
We claim complete and ongoing timeliness with another 2019 prediction, however. In full, we wrote:
The section two years ago talked about the difficulties of proving such a claim. A claim was duly made in October 2019, which we evaluated here, but arguments over “proof” have not abated. They have just recently been summarized by Gil Kalai here, and also here by Scott Aaronson in regard to an independent claim by researchers from USTC in China.
Here are some new predictions for 2021. We will try and skip easy ones like the above. We will try to make some that are surprising. We’ve made predictions about evidence for life on exoplanets before; we do something different there too. We do not venture any about the course of the pandemic.
What are your predictions for 2021?
UPDATE: An update from Shang-Hua Teng:
Dear TCS Students:
SIAM has today reopened the application portal for SODA21 Student Registration Waiver. The new deadline will be this Friday, January 8 at 11:59pm EST.
SODA21 will be held virtually on Jan. 10 – Jan. 13, 2021.
Please note that having paper(s) in SODA and its associated symposia is not a requirement. For this year’s application, a letter of recommendation from advisors is also not required. The goal of the support from Google, Microsoft, and ACM is to increase opportunities for students to attend this premier TCS conference.
Looking forward to seeing you in SODA’21.
ORIGINAL MESSAGE: Please see below for a message from Shang-Hua Teng, regarding the possibility of waivers for SODA 2021 registration for students.
Dear TCS students:
By now, it is hard to overestimate the impact of the COVID-19 pandemic on society. However, like every challenge, it has created some opportunities. For example, essentially all major conferences in TCS this year have been transformed into virtual ones, making them more accessible to scholars and students across the world (of course at the expense of traditional interactions).
ACM-SIAM Symposium on Discrete Algorithms (SODA21) will be held virtually this year, on Jan. 10 – 13, 2021. As you may know, this is the premier conference on algorithms.
See https://www.siam.org/conferences/cm/conference/soda21
Thanks to our industry partners and the ACM SIGACT group, SODA has some funds for covering student registrations. I am writing to inform you of this opportunity and to encourage you to apply:
See:
1. https://awards.siam.org/
2. https://www.siam.org/conferences/cm/lodging-and-support/travel-support/soda21-travel-support
That deadline is Dec. 27, 2020. Like before, having papers in SODA is not a prerequisite.
Shang-Hua Teng
On Behalf of SODA Steering Committee
Last month, I showed how global optimization based only on accessing function values can be hard with no convexity assumption. In a nutshell, with limited smoothness, the number of function evaluations has to grow exponentially fast in dimension, which is a rather negative statement. On the positive side, this number does not grow as fast for highly smooth functions. However, optimal algorithms end up approximating the whole function to minimize, and then minimizing the approximation, which takes exponential time in general. In this post, I will describe very recent work with Alessandro Rudi and Ulysse Marteau-Ferey [1] that essentially jointly approximates the function and minimizes the approximation.
Our task for this post will be to minimize a function \(f: \mathbb{R}^d \to \mathbb{R}\), for which we know the global minimizer is already located in a certain bounded set \(\Omega\) (typically a large ball). We can query values of \(f\) at any point in \(\Omega\).
It turns out it can be formulated as a convex optimization problem (there has to be a catch in this statement, stay tuned).
We consider the following minimization problem: $$\tag{1} \sup_{c \in \mathbb{R}} \ c \ \ \mbox{ such that } \ \ \forall x \in \Omega, \ f(x) \geqslant c.$$ As illustrated below, this corresponds to finding the largest lower bound \(c\) on the function.
Assuming \(f\) attains its global minimizer \(f_\ast\) in \(\Omega\), then the optimal value of the problem above is attained and is clearly \(f_\ast\), that is, the problem in Eq. (1) leads to the minimal value of \(f\).
Moreover, each constraint in Eq. (1) is a linear constraint in \(c\), and the objective is linear. Hence we obtain a linear programming problem which is a particularly simple instance of a convex optimization problem. Thus, all optimization problems are convex!
The catch here is that the set of inequalities is dense, that is, in general, \(\Omega\) contains infinitely many elements, and thus the number of constraints is infinite.
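As a sanity check, here is Eq. (1) discretized on a finite grid and handed to an LP solver; unsurprisingly, the LP just returns the minimum of \(f\) over the grid points (the test function and the grid are arbitrary choices, not from the post):

```python
import numpy as np
from scipy.optimize import linprog

# Omega = [-1, 1], discretized; f is an arbitrary test function.
f = lambda x: (x ** 2 - 0.5) ** 2 + 0.1 * x
xs = np.linspace(-1.0, 1.0, 201)

# maximize c such that c <= f(x_i) for all grid points x_i,
# written for linprog as: minimize -c subject to 1*c <= f(x_i).
res = linprog(c=[-1.0], A_ub=np.ones((len(xs), 1)), b_ub=f(xs),
              bounds=[(None, None)])
c_star = -res.fun
print(np.isclose(c_star, f(xs).min()))  # True
```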
Duality. It is worth deriving the convex optimization dual (being loose here in terms of regularity and integrability conditions). We add a Lagrange multiplier \(\mu(x) \geqslant 0\) for each of the constraints \(f(x) - c \geqslant 0\), and consider the Lagrangian
$$ \mathcal{L}(c,\mu) = c + \int_{\Omega} \! \mu(x) \big( f(x) - c\big) dx.$$ Maximizing it with respect to \(c\) leads to the constraint \(\int_\Omega \mu(x) dx = 1\), that is, \(\mu\) represents a probability distribution, and the dual problem is $$ \inf_{\mu \in \mathbb{R}^\Omega} \int_{\Omega} \! \mu(x) f(x) dx \ \ \mbox{ such that } \int_\Omega\! \mu(x) dx = 1 \mbox{ and } \forall x \in \Omega, \ \mu(x) \geqslant 0.$$ In words, we minimize the expectation of \(f\) with respect to all probability distributions on \(\Omega\) with densities. Here, the infimum is obtained from any probability measure that puts mass only on the minimizer of \(f\) on \(\Omega\). This measure is typically not absolutely continuous, and thus the infimum is not attained (no big deal).
This is also a convex optimization problem, with a linear objective function on the infinite-dimensional set of probability measures.
Overall, we get convex but (apparently) useless problems. We will first need an efficient way to represent non-negativity of functions (this will be applied to \(f(x) - c\)).
We now start to make some assumptions on the function \(f\) to be minimized. We will assume that it can be written as $$\forall x \in \Omega, \ f(x) = \langle \phi(x), F \phi(x) \rangle,$$ for some feature vector \(\phi(x)\) in some Hilbert space \(\mathcal{H}\), and \(F\) a symmetric linear operator on \(\mathcal{H}\).
We will need infinite dimensions later, but you can safely assume that \(\mathcal{H}\) is a classical Euclidean space (e.g., \(\mathbb{R}^{{\rm dim}(\mathcal{H})}\)), and that we have $$\forall x \in \Omega, \ f(x) = \phi(x)^\top F \phi(x) = \sum_{i,j=1}^{{\rm dim}(\mathcal{H})} F_{ij} \phi(x)_i \phi(x)_j.$$
We will also assume that the constant function on \(\Omega\) can be represented as well, that is, there exists \(u \in \mathcal{H}\) such that \(\forall x \in \Omega, \ \langle u, \phi(x) \rangle = 1.\) This is to ensure that we can represent within the optimization problem above the function \(f(x) - c = \langle \phi(x), (F - c \, u \otimes u) \phi(x) \rangle\).
The most classical example is the set of polynomials of degree \(2r\), for which we can choose \(\phi(x)\) to be of dimension \(d+r \choose r\) and to contain all monomials of degree at most \(r\). We will go infinite-dimensional and link this to kernels later in this post.
You may wonder why we are not simply modeling the function \(f\) with a classical formulation \(f(x) = \langle w, \phi(x) \rangle\). This is precisely to be able to enforce positivity.
Positivity through “sum-of-squares”. We can assume that the operator \(F\) is positive-semidefinite (which we will denote \(F \succcurlyeq 0\)), that is, all of its eigenvalues are non-negative, or, equivalently, \(\langle h, F h \rangle \geqslant 0\) for all \(h \in \mathcal{H}\). This implies that the function \(f\) is pointwise non-negative. Thus: $$ \tag{2} F \succcurlyeq 0 \ \ \Rightarrow \ \ \forall x \in \Omega, \ f(x) = \langle \phi(x), F \phi(x) \rangle \geqslant 0. $$
We refer to such functions as “sum-of-squares”. Indeed, given the spectral decomposition of \(F\) as $$ F = \sum_{i \in I} \lambda_i h_i \otimes h_i, $$ with \(I\) a countable set, and \((h_i)_{i \in I}\) an orthonormal family in \(\mathcal{H}\), we have \(\lambda_i \geqslant 0\) for all \(i \in I\), and thus we can write $$ f(x) = \langle \phi(x), F \phi(x) \rangle = \sum_{i \in I} \lambda_i \langle \phi(x) , (h_i \otimes h_i) \phi(x) \rangle = \sum_{i \in I} \big( \sqrt{\lambda_i} \langle h_i , \phi(x) \rangle \big)^2.$$ That is, the function \(f\) is a sum of squares of functions.
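The passage from a PSD operator to a sum-of-squares decomposition can be checked numerically in finite dimension; a minimal sketch with an arbitrary random PSD \(F\) on a 3-dimensional feature space (all concrete choices here are mine, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary PSD operator F on a 3-dimensional feature space,
# with feature map phi(x) = (1, x, x^2).
G = rng.standard_normal((3, 3))
F = G @ G.T  # F = G G^T is positive semidefinite

phi = lambda x: np.array([1.0, x, x * x])

# Spectral decomposition F = sum_i lambda_i h_i (x) h_i
lam, H = np.linalg.eigh(F)  # columns of H are the eigenvectors h_i

x = 0.7
f_x = phi(x) @ F @ phi(x)
# Sum-of-squares form: f(x) = sum_i ( sqrt(lambda_i) <h_i, phi(x)> )^2
sos = sum(l * (h @ phi(x)) ** 2 for l, h in zip(lam, H.T))
print(np.isclose(f_x, sos), f_x >= 0)  # True True
```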
A key question is whether the implication in Eq. \((2)\) can be reversed. In other words, can we write all non-negative functions as sum-of-squares? The answer will depend on the chosen feature map \(\phi: \Omega \to \mathcal{H}\).
Polynomials. For polynomials, the answer is known to be positive in dimension 1, but negative in higher dimensions (see the nice review in [2]): that is, unless \(d=1\), \(r=1\), or \(d=2\) and \(r=2\), there are polynomials of degree \(2r\) in dimension \(d\) that are non-negative but are not sums of squares of polynomials.
Keeping in mind the difference between being non-negative and being a sum of squares, let’s look at how this can be used for global optimization of polynomials.
We now consider that \(f\) is a polynomial of degree \(2r\), and that \(\phi(x)\) is of dimension \(d+r \choose r\) and contains all monomials of degree at most \(r\). We consider representing the constraint \(f(x)\geqslant c\) in Eq. (1) as \(f(x) = c + \langle \phi(x), A \phi(x) \rangle\) for some positive-semidefinite operator (here a matrix) \(A\). With the notations above, this is equivalent to \(F = c u \otimes u + A\) and \(A \succcurlyeq 0\). This leads to the following optimization problem (no localization set \(\Omega\) is used): $$ \tag{3} \max_{c \in \mathbb{R}, A \succcurlyeq 0} c \ \mbox{ such that } \ \forall x \in \mathbb{R}^d, \ f(x) = c + \langle \phi(x), A \phi(x) \rangle .$$ This is now a problem in finite dimension, namely a semidefinite programming problem, which can thus a priori be solved in polynomial time in \(d+r \choose r\). See [3, 4] for more details.
At this point, beyond the potentially large dimension of the convex problem when \(r\) and \(d\) grow, there is a major concern: is the problem in Eq. (3) equivalent to the one in Eq. (1)? This is true if and only if the polynomial \(f(x) - f_\ast\) is a sum of squares, which, given the discussion above, may not happen as soon as \(d \geqslant 2\).
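In dimension \(d = 1\), where non-negative polynomials are always sums of squares, the feasibility problem in Eq. (3) is simple enough to solve without an SDP solver: for a quartic with \(\phi(x) = (1, x, x^2)\), the Gram matrix \(A\) has a single free off-diagonal parameter. The sketch below (an illustration of Eq. (3) only, not the method of the post) grids over that parameter and bisects on \(c\):

```python
import numpy as np

def sos_lower_bound(coeffs, t_grid, tol=1e-9):
    """Largest c such that f(x) - c is a sum of squares, where
    f(x) = f0 + f1 x + f2 x^2 + f3 x^3 + f4 x^4.  With phi(x) = (1, x, x^2),
    f(x) - c = phi(x)^T A phi(x) for the Gram matrix
        A = [[f0 - c, f1/2, t], [f1/2, f2 - 2t, f3/2], [t, f3/2, f4]],
    where t is the single free parameter; we grid over t and bisect on c."""
    f0, f1, f2, f3, f4 = coeffs

    def feasible(c):
        for t in t_grid:
            A = np.array([[f0 - c, f1 / 2, t],
                          [f1 / 2, f2 - 2 * t, f3 / 2],
                          [t, f3 / 2, f4]])
            if np.linalg.eigvalsh(A)[0] >= -tol:  # smallest eigenvalue
                return True
        return False

    lo, hi = -10.0, 10.0  # assumes f_* lies in this bracket
    for _ in range(60):   # bisection: feasibility is monotone in c
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            lo = mid
        else:
            hi = mid
    return lo

# f(x) = (x^2 - 1)^2 = 1 - 2 x^2 + x^4, with global minimum f_* = 0
c = sos_lower_bound((1.0, 0.0, -2.0, 0.0, 1.0), np.linspace(-3.0, 1.0, 401))
print(round(c, 4))  # 0.0
```

Monotonicity of feasibility in \(c\) follows since decreasing \(c\) only adds a non-negative multiple of \(e_1 e_1^\top\) to \(A\).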
In order to apply to all polynomials, the problem has to be changed. A simple reformulation is possible [3] by assuming that the global minimum of \(f\) is attained within the set \(\Omega = \{ x \in \mathbb{R}^d, \ x^\top x \leqslant R^2\}\) for some \(R > 0\). It turns out (and this is a highly non-trivial fact) that all polynomials \(f(x)\) which are positive on \(\Omega\) can be written as \(f(x) = q(x) + ( R^2 - x^\top x ) p(x)\), where \(q(x)\) and \(p(x)\) are sum-of-squares polynomials. The only issue is that the degrees of \(p\) and \(q\) are not known in advance, so a sequence (a so-called “hierarchy”) of problems has to be solved with increasing degrees, each providing a lower bound on \(f_\ast\).
When the degree is large enough so that the sum-of-squares representation exists for \(f - f_\ast\), then the following problem $$ \max_{c \in \mathbb{R}, A, B \succcurlyeq 0} c \mbox{ such that } \forall x \in \mathbb{R}^d, \ f(x) = c + \langle \phi(x), A \phi(x) \rangle + ( R^2 - x^\top x ) \langle \phi(x), B \phi(x) \rangle$$ is a semi-definite program, whose solution leads to \(c = f_\ast\).
This remains a large semi-definite program whose order is not known in advance and for which complicated solvers are needed. I will now present our recent work where we go infinite-dimensional, and some of these issues are alleviated (but some others are created).
We now consider \(\phi(x)\) to be infinite-dimensional. In order to allow finite-dimensional manipulations of various objects, we consider feature maps which are obtained from positive-definite kernels. More precisely, we consider the so-called Sobolev space \(\mathcal{H}\) of functions on \(\Omega\) with all partial derivatives up to order \(s\) which are square-integrable. When \(s > d/2\), this is a reproducing kernel Hilbert space, that is, there exists a positive-definite kernel \(k: \Omega \times \Omega \to \mathbb{R}\) such that the feature map \(\phi(x) = k(\cdot,x)\) allows us to access function values as \(h(x) = \langle h, \phi(x) \rangle\) for all \(h \in \mathcal{H}\). For the algorithm, we will only need access to the kernel; for example, for \(s = d/2+1/2\), we can choose the usual exponential kernel \(k(x,y) = \exp( - \alpha \| x - y \|_2)\).
We now consider the same problem as for polynomials, but with this new feature map: $$\tag{4} \max_{c \in \mathbb{R}, A \succcurlyeq 0} c \ \mbox{ such that } \ \forall x \in \Omega, \ f(x) = c + \langle \phi(x), A \phi(x) \rangle .$$ In order to have an exact reformulation, we need that there exists some \(A_\ast\) such that \(\forall x \in \Omega, \ f(x) - f_\ast = \langle \phi(x), A_\ast \phi(x) \rangle\). This turns out to be true [1] under appropriate geometric conditions on \(f\) (such as isolated second-order strict global minimizers within \(\Omega\)) and regularity (at least \(s+3\) derivatives); see [1] for details.
This is all very nice, but the problem in Eq. (4) is still an infinite-dimensional problem (note that we have gone from a continuum of inequalities to elements of a Hilbert space, but that’s not much of a gain).
We can now leverage classical and more recent techniques from kernel methods. The main idea is to subsample the set of equalities in Eq. (4), that is, consider \(n\) points \(x_1,\dots,x_n \in \Omega\), and solve instead the regularized sampled problem $$\tag{5} \max_{c \in \mathbb{R}, A \succcurlyeq 0} c \, - \, \lambda \, {\rm tr}(A) \ \mbox{ such that }\ \forall i \in \{1,\dots,n\}, \ f(x_i) = c + \langle \phi(x_i), A \phi(x_i) \rangle , $$ where \(\lambda > 0\) is a regularization parameter.
Approximation guarantees. Beyond computational considerations, how big does \(n\) need to be to obtain a value of \(c\) which is at most \(\varepsilon\) away from \(f_\ast\)? From the last post, if \(f\) is \(m\)-times differentiable, we cannot have \(n\) smaller than \(\varepsilon^{-d/m}\). As shown in [1], for points \(x_1,\dots,x_n \in \Omega\) selected uniformly at random, \(n\) can be taken to be of the order \(\varepsilon^{-d/(m - d/2 - 3)}\) for a well-chosen Sobolev order \(s\) and regularization parameter \(\lambda\). This is not the optimal rate \(\varepsilon^{-d/(m - d/2)}\), but close when \(m\) and \(d\) are large.
Finite-dimensional algorithm. The problem in Eq. (5) is still infinite-dimensional. We can leverage a recent work, again with Ulysse Marteau-Ferey and Alessandro Rudi [5], which shows that a representer theorem exists for this type of problem: the optimization for the operator \(A\) may be restricted to \(A\) of the form $$A = \sum_{i,j} C_{ij} \phi(x_i) \otimes \phi(x_j)$$ for a positive semi-definite matrix \(C \in \mathbb{R}^{n \times n}\). This leads to a semi-definite programming problem of size \(n\), whose running-time complexity is polynomial in \(n\) (e.g., \(O(n^{3.5})\)), with standard general-purpose toolboxes (such as CVX).
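The representer theorem makes the model finite-dimensional: by the reproducing property, \(\langle \phi(x), A \phi(x) \rangle = k_x^\top C k_x\), where \((k_x)_i = k(x_i, x)\). A minimal numerical sketch of this identity's consequence (the kernel, the points, and the matrix \(C\) are arbitrary choices of mine, not an optimized solution of Eq. (5)):

```python
import numpy as np

rng = np.random.default_rng(1)

# Sobolev-type exponential kernel on R (alpha = 1 is an arbitrary choice)
k = lambda x, y: np.exp(-np.abs(x - y))

xs = rng.uniform(-1.0, 1.0, 5)   # sampled points x_1, ..., x_n
C = rng.standard_normal((5, 5))
C = C @ C.T                      # an arbitrary PSD coefficient matrix

# With A = sum_ij C_ij phi(x_i) (x) phi(x_j) and phi(x) = k(., x), the
# reproducing property h(x) = <h, phi(x)> gives
#   <phi(x), A phi(x)> = k_x^T C k_x,  where (k_x)_i = k(x_i, x).
x = 0.3
k_x = np.array([k(xi, x) for xi in xs])
model_value = k_x @ C @ k_x
print(model_value >= 0)  # True: pointwise non-negative since C is PSD
```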
In order to tackle values of \(n\) larger than \(1000\), simple iterative algorithms can be designed (see [1] for a damped Newton algorithm which does not require any singular value decompositions, with complexity \(O(n^3)\) per iteration). Classical tools for efficient kernel methods, such as column sampling / the Nyström method (see [6, 7] and references therein) or random features (see [8, 9] and references therein), could also be used to avoid a cubic or quadratic complexity in \(n\).
Comparison with random search. Instead of solving the problem in Eq. (5), we could output the minimal value of \(f(x_i)\) for \(i \in \{1,\dots,n\}\), which is exactly random search if the points \(x_i\) are random (this also happens to correspond to using \(\lambda=0\) in Eq. (5)). This can only achieve an error \(\varepsilon\) for \(n\) of order \(\varepsilon^{-d}\); in other words, it cannot leverage smoothness, while our algorithm can. The main technical reason is that subsampling inequalities provide weaker guarantees than subsampling equalities, and thus representing the non-negativity of a function through an equality (which our method essentially does) provides significant benefits.
As always in convex optimization, it is beneficial to look at dual problems. It turns out that iterative algorithms are easier to build for the finite-dimensional dual problem to Eq. (5). We rather look now at the dual of Eq. (4) and make the parallel with the primal original problem in Eq. (1) and its dual.
Introducing a Lagrange multiplier \(\mu(x)\) for the constraint \(f(x) = c + \langle \phi(x), A \phi(x) \rangle\) (this time not constrained to be non-negative because we have an equality constraint), we can form the Lagrangian $$\mathcal{L}(c,A,\mu) = c + \int_{\Omega} \! \mu(x) \big( f(x) - c - \langle \phi(x), A \phi(x) \rangle \big) dx.$$ Maximizing with respect to \(c\) again leads to the constraint \(\int_\Omega \mu(x) dx = 1\), while minimizing with respect to \(A \succcurlyeq 0\) leads to the constraint \(\int_\Omega \mu(x) \phi(x) \otimes \phi(x) dx \succcurlyeq 0 .\) The dual problem thus becomes $$ \inf_{\mu \in \mathbb{R}^\Omega} \int_{\Omega} \! \mu(x) f(x) dx \ \ \mbox{ such that } \int_\Omega\! \mu(x) dx = 1 \mbox{ and } \ \int_\Omega \mu(x) \phi(x) \otimes \phi(x)dx \succcurlyeq 0.$$ The impact of representing non-negative functions by quadratic forms in \(\phi(x)\) is to replace the non-negativity constraint on all \(\mu(x)\) by the positivity of the matrix \(\int_\Omega \mu(x) \phi(x) \otimes \phi(x)dx\), which is weaker in general, but equivalent if \(\phi(x)\) is “large” enough.
Note that the dual problem, through a signed measure that should concentrate around the minimizers of \(f\), leads to a natural candidate \(\int_\Omega x \, \mu(x) dx\) for the optimizer (see more details in [1], and a similar dual formulation for polynomials in [10]).
In order to highlight how the optimization method works, I will first present how it minimizes the multi-modal function in dimension \(d=2\) pictured below on \([-1,1]^2\).
We consider sampling a sequence of points \(x_1, x_2,\dots \in \mathbb{R}^2\) from a quasi-random sequence (so that it fills the space more evenly than a purely random one). The random points are displayed in purple while the solution of our algorithm is displayed in red, with increasing numbers \(n\) of sampled points. On the right plot, the model \(c + \langle \phi(x), A \phi(x) \rangle\) is displayed, showing that the model becomes more accurate as \(n\) grows.
We also consider a problem in dimension \(d=8\), where we compare random sampling and our algorithm (“gloptikernel”), as \(n\) grows (see [1] for the experimental set-up).
Clearly, the chosen baseline (random sampling) is rudimentary, though optimal up to logarithmic terms for non-smooth problems where only Lipschitz-continuity is assumed. We can also compare to gradient descent with random restarts (“random+GD” below, with the same overall number of function evaluations). As shown below, when the gradient information is robust, this baseline performs similarly (left plot), while when a high-frequency component is added to the function (which makes the gradient information unreliable, right plot), the local descent algorithm needs more function evaluations.
It would be interesting to compare this new algorithm to other global optimization algorithms based on function evaluations, such as CMA-ES or Bayesian optimization.
In this post, I presented an algorithm for global optimization that provably leverages smoothness, with a running time which is polynomial in the dimension and the number of function evaluations. For large smoothness, the number of such evaluations has a dependence on the precision \(\varepsilon\) with no exponential dependence on the dimension in the exponent (but still in the constants: we cannot optimize multivariate polynomials in polynomial time…).
These ideas can be extended in a number of ways.
General relaxation tool for dense inequalities. We can generalize the previous developments to all problems with a continuum of constraints on a variable \(\theta\), such as $$ \forall x \in \Omega, \ H(\theta,x) \geqslant 0.$$ The corresponding Lagrange multipliers also satisfy $$ \forall x \in \Omega, \ \mu(x) \geqslant 0$$ (there may be other ones, like in the problem above). Now the question is: should we use the non-negativity modeling with quadratic forms in the primal or in the dual? Or equivalently, replace the non-negativity constraint by the positivity of the integral of \(\phi(x) \otimes \phi(x)\) in the dual or in the primal?
The answer depends on the expected smoothness of the function which is non-negative. In the optimization problem, we expect \(f(x) - c\) to be smooth at the optimum, while \(\mu(x)\), which is approximating a Dirac, clearly isn’t; therefore, the choice is clear (at least in retrospect, since we originally tried to model \(\mu(x)\) as a quadratic form…).
Constrained optimization. Following [3], we can apply the same algorithmic technique to constrained optimization, by formulating the problem of minimizing \(f(x)\) such that \(g(x) > 0\) as maximizing \(c\) such that \(f(x) = c + p(x) + g(x)q(x)\), with \(p, q\) non-negative functions. We can (at least in principle) then replace the non-negativity constraints by \(p(x) = \langle \phi(x),A\phi(x)\rangle\) and \(q(x) =\langle \phi(x),B\phi(x)\rangle\) for positive operators \(A\) and \(B\). All this remains to be explored in more detail.
Acknowledgements. I would like to thank Alessandro Rudi and Ulysse Marteau-Ferey for proofreading this blog post and making good clarifying suggestions.
[1] Alessandro Rudi, Ulysse Marteau-Ferey, Francis Bach. Finding Global Minima via Kernel Approximations. Technical report arXiv:2012.11978, 2020.
[2] Walter Rudin. Sums of squares of polynomials. The American Mathematical Monthly, 107(9), 813-821, 2000.
[3] Jean-Bernard Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796–817, 2001.
[4] Jean-Bernard Lasserre. Moments, Positive Polynomials and their Applications, volume 1. World Scientific, 2010.
[5] Ulysse Marteau-Ferey, Francis Bach, and Alessandro Rudi. Non-parametric models for non-negative functions. Advances in Neural Information Processing Systems, 33, 2020.
[6] Francis Bach. Sharp analysis of low-rank kernel matrix approximations. Conference on Learning Theory, pages 185–209, 2013.
[7] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems, 2015.
[8] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. Advances in Neural Information Processing Systems, 30, 2017.
[9] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. Journal of Machine Learning Research, 18(1):714–751, 2017.
[10] Jean-Bernard Lasserre. The moment-SOS hierarchy and the Christoffel-Darboux kernel. Technical Report arXiv:2011.08566, 2020.
Happy 2021! We now cover the last papers of 2020: a (smaller) spread with graph property testing, distribution property testing, and circuit complexity.
Almost Optimal Bounds for Sublinear-Time Sampling of k-Cliques: Sampling Cliques is Harder Than Counting by Talya Eden, Dana Ron, and Will Rosenbaum (arXiv). In many settings, sampling and counting are tightly coupled; the classic example being permanent estimation and sampling random bipartite matchings. This paper investigates the problem of sampling uniform random cliques, and unearths an interesting separation of approximately uniform sampling from approximate counting. Let’s consider the fundamental problem of edge and triangle counting in graphs of constant arboricity. (This class, which contains all minor-closed families, has deep connections to clique counting; clique counting can be done in linear time for bounded arboricity graphs.) The input connected graph \(G\) has \(n\) vertices, \(m\) edges, and \(t\) triangles. We assume the standard query model, with uniform random vertex, neighbor, and degree queries. Our aim is to get \((1+\varepsilon)\)-approximations for \(m\) and \(t\) (in general, any \(k\)-clique count). It is known from previous work that edge estimation can be done in \(O(1)\) time and triangle estimation can be done in \(O(n/t)\) time (ignoring log and \(\varepsilon\) dependencies). Moreover, these bounds are optimal. Can one get similar bounds for approximately sampling a uniform random edge/triangle? Previous work of Eden-Rosenbaum achieves such a bound for sampling edges. Surprisingly, this paper shows that triangle sampling is fundamentally harder, and requires \(\Omega(n^{3/4}/\sqrt{t})\) queries. The main result gives a precise and optimal bound for uniformly sampling \(k\)-cliques in graphs of arboricity \(\alpha\).
Testing and reconstruction via decision trees by Guy Blanc, Jane Lange, and Li-Yang Tan (arXiv). A property tester is a (sublinear) algorithm that distinguishes an object with a property from those that are far. Reconstruction is a much stronger sublinear algorithm, which actually provides query access to the closest object. Suppose we have query access to an \(n\)-variate Boolean function \(f\), and the property \(\mathcal{P}\) of interest is size-\(s\) decision trees. A reconstructor would give query access to a function \(g \in \mathcal{P}\), such that \(dist(f,g) = O(OPT_f)\), where \(OPT_f\) is the distance of \(f\) to \(\mathcal{P}\). Observe that a reconstructor automatically gives a distance estimator (just estimate \(dist(f,g)\) by random sampling). One of the main results of this paper is a (weaker, bicriteria) reconstructor, where \(g\) is guaranteed to have decision-tree complexity \(s^{O(\log^2s)}\). Each query to \(g\) is computed in \(poly(\log s, 1/\varepsilon)\cdot n\) time. This reconstructor leads to a number of testing results, and to reconstructors for other properties like low Fourier degree. Another interesting result proven: if there exists an efficient reconstructor that gets \(g\) of size \(s\), then there exists an efficient proper learner for decision trees (a big open problem).
Testing Product Distributions: A Closer Look by Arnab Bhattacharyya, Sutanu Gayen, Saravanan Kandasamy, and N. V. Vinodchandran (arXiv). This paper is on distribution testing, where all distributions involved are promised to be product distributions. Let \(P, Q\) be distributions over \(\Sigma^n\), where \(\Sigma\) is some (small) finite set. The problems studied are identity testing (\(Q\) is known/fixed, and one gets samples from unknown \(P\)) and closeness testing (both unknown). It is known from previous work that these problems can basically be solved with \(O(n|\Sigma|\varepsilon^{-2})\) samples. Note that the domain size is exponential in \(n\); thus, these bounds give an exponential improvement over identity/closeness testing in the vanilla (non-product) setting. Tolerant testing is of special significance here. Because the domain is so large, it makes sense to allow for some distance even in the “yes” case. The main result of this paper is to completely map out the complexities for identity/closeness testing where we wish to distinguish \(d_1(P,Q) < \varepsilon_1\) from \(d_2(P,Q) > \varepsilon_2\), where \(d_1, d_2\) are potentially different distance measures. For example, the paper considers total variation (TV), KL, Hellinger, and \(\chi^2\) distances. An important improvement achieved is for identity testing where \(d_1\) is Hellinger and \(d_2\) is TV distance: for certain values of \(\varepsilon_1, \varepsilon_2\), one can get a tester of optimal sample complexity \(O(\sqrt{n|\Sigma|})\). Previous bounds only gave a \(O(n|\Sigma|)\) sample complexity. There are a number of new lower bounds for different combinations of \(d_1, d_2\) distance pairs.
The Computer Science Dept. at UT Austin invites applications for a Postdoctoral Fellow in theoretical computer science for the 21-22 academic year to work with David Zuckerman. Research interests should overlap with his: pseudorandomness, computational complexity, coding theory, and more. Review of applicants will begin on Jan. 25, but applications will be accepted until the position is filled.
Website: https://utaustin.wd1.myworkdayjobs.com/UTstaff/job/UT-MAIN-CAMPUS/Postdoctoral-Fellow_R_00011430-1
Email: maguilar@cs.utexas.edu
Bocconi University, based in Milan, is Italy’s premier private university. We invite applications for two tenure-track assistant professor positions in Computer Science. Compensation, which is negotiated on a case-by-case basis, will be highly competitive. The language of instruction is English.
Website: https://lucatrevisan.wordpress.com/2020/12/11/bocconi-is-looking-for-assistant-professors-in-computer-science/
Email: l.trevisan@unibocconi.it
Bocconi University is looking for two Tenure-Track Assistant Professors in Computer Science, for positions starting in Fall 2021: the application deadline is this Friday, January 8, 2021.
More information in an earlier post.
The Department of Computer Science at Aalborg University’s Technical Faculty of IT is looking to appoint a number of Assistant Professors (with the possibility of tenure track) at our Aalborg and Copenhagen campuses, starting June 1, 2021, or as soon as possible thereafter.
Website: https://www.stillinger.aau.dk/vis-stilling/?vacancy=1136722
Email: jesper@cs.aau.dk
The Department of Computer Science at Aalborg University’s Technical Faculty of IT is looking to appoint a number of Associate Professors at our Aalborg and Copenhagen campuses, starting June 1, 2021, or as soon as possible thereafter.
Website: https://www.stillinger.aau.dk/vis-stilling/?vacancy=1136728
Email: jesper@cs.aau.dk
In connection with a research grant of Professor Artur Czumaj, we are looking for excellent candidates for a postdoc position in the design and analysis of parallel and distributed algorithms, with a focus on fundamental research in this area.
Website: https://warwick.ac.uk/fac/sci/dcs/people/artur_czumaj/postdoc-advert-jan-2021
Email: aczumaj@acm.org