Theory of Computing Blog Aggregator

Last year, the CONCUR conference series inaugurated its Test-of-Time Award, whose purpose is to recognise important achievements in Concurrency Theory that were published at the CONCUR conference and have stood the test of time. This year, the following four papers were chosen to receive the CONCUR Test-of-Time Awards for the periods 1994–1997 and 1996–1999 by a jury consisting of Rob van Glabbeek (chair), Luca de Alfaro, Nathalie Bertrand, Catuscia Palamidessi, and Nobuko Yoshida: 

Last year, I interviewed the CONCUR 2020 Test-of-Time Award recipients and was asked by Javier Esparza (chair of the CONCUR SC) and Ilaria Castellani (outgoing chair of the IFIP WG 1.8 on Concurrency Theory) to do the same with the current batch of awardees. (In passing, let me thank Nathalie Bertrand and Nobuko Yoshida for their kind help with the interviews!)

This post is devoted to the interview I conducted via email with Rajeev Alur, Thomas A. Henzinger, Orna Kupferman and Moshe Y. Vardi. Reading the answers I received from that dream team of colleagues was like a masterclass for me and I trust that their thoughts on their award-winning paper will be of interest to many of the readers of this blog. Enjoy!

Luca: You receive the CONCUR ToT Award 2021 for your paper Alternating Refinement Relations, which appeared at CONCUR 1998. In that article, you gave what I consider to be a fundamental contribution, namely the introduction of refinement relations for alternating transition systems. Could you briefly explain to our readers what alternating transition systems are? Could you also tell us how you came to study the question addressed in your award-winning article and why you focused on simulation- and trace-based refinement relations? Which of the results in your paper did you find most surprising or challenging? 

Answer: When we model a system by a graph, our model abstracts away some details of the system. In particular, even when systems are deterministic, states in the model may have several successors. The nondeterminism introduced in the model often corresponds to different actions taken by the system when it responds to different inputs from its environment. Indeed, a transition in a graph that models a composite system corresponds to a step of the system that may involve some components. Alternating transition systems (ATSs) enable us to model composite systems in more detail. In an ATS, each transition corresponds to a possible move in a game between the components, which are called agents. In each move of the game, all agents choose actions, and the successor state is deterministically determined by all actions. Consequently, ATSs can distinguish between collaborative and adversarial relationships among components in a composite system. For example, the environment is typically viewed adversarially, meaning that a component may be required to meet its specification no matter how the environment behaves.

In an earlier paper, some of us introduced ATSs and Alternating Temporal Logics, which can specify properties of agents in a composite system. The CONCUR 1998 paper provided refinement relations between ATSs which correspond to alternating temporal logics. Refinement is a central issue in a formal approach to the design and analysis of reactive systems. The relation “I refines S '' intuitively means that system S has more behaviors than system I. It is useful to think about S being a specification and I an implementation. Now, if we consider a composite implementation I||E and specification S||E and we want to check that the component I refines the component S, then the traditional refinement preorders are inappropriate, as they allow I to achieve refinement of I||E with respect to S||E by constraining its environment E. Alternating refinement relations are defined with respect to ATSs that model the interaction among the underlying components, and they enable us to check, for example, that component I has fewer behaviors than component S no matter how component E behaves. They are called “alternating” because refinement may restrict implementation actions but must not restrict environment actions. In other words, refinement may admit fewer system actions but, at the same time, more environment actions. 

It was nice to see how theoretical properties of preorders in the traditional setting are carried over to the game setting, and so are the results known then about the computational price of moving to a game setting. First, the efficiency of the local preorder of simulation with respect to the global preorder of trace containment is maintained. As in the traditional setting, alternating simulation can be checked in polynomial time, whereas alternating trace-containment is much more complex. Second, the branching vs. linear characterizations of the two preorders is preserved: alternating simulation implies alternating trance containment, and the logical characterization of simulation and trace-containment by CTL and LTL, respectively, is carried over to their alternating temporal logics counterparts. The doubly-exponential complexity of alternating trace containment, as opposed to the PSPACE complexity of trace containment, is nicely related to the doubly-exponential complexity of LTL synthesis, as opposed to its PSPACE model-checking complexity,

Luca: In your paper, you give logical characterisations of your alternating refinement relations in terms of fragments of alternating temporal logic. Logical characterisations of refinement relations are classic results in our field and I find them very satisfying. Since I teach a number of those results in my courses, I'd be interested in hearing how you would motivate their interest and usefulness to a student or a colleague. What would your "sales pitch" be? 

Answer: There is extensive research on the expressive power of different formalisms. Logical characterization of refinement relations tells us something about the distinguishing power of formalisms. For example, while the temporal logic CTL* is more expressive than the temporal logic CTL, the two logics have the same distinguishing power: if you have two systems and can distinguish between them with a CTL* formula (that is, your formula is satisfied only in one of the systems), then you should be able to distinguish between the two systems also with a CTL formula. Moreover, while CTL is not more expressive than LTL, we know that CTL is “more distinguishing” than LTL. These results have to do with the logical characterizations of trace containment and simulation. The distinguishing power of a specification formalism is useful when we compare systems, in particular an implementation and its abstraction: if we know that the properties we care about are specified in some formalism L, and our system refines the abstraction according to a refinement relation in which the satisfaction of specifications in L is preserved, then we can perform verification on the abstraction.

Luca: I am interested in how research collaborations start, as I like to tell "research-life stories" to PhD students and young researchers of all ages. Could you tell us how you started your collaboration on the award-winning paper? 

Answer:Subsets of us were already collaborating on other topics related to reactive models and model checking, and all of us shared a common belief that the field was in need to move from the limited setting of closed systems to a more general setting of open systems, that is, systems that interact with an environment. Open systems occur not only when the environment is fully or partly unknown, but also when a closed system is decomposed into multiple components, each of them representing an open system. To build “openness” into models and specifications as first-class citizens quickly leads to the game-theoretic (or “alternating”) setting. It was this realization and the joint wish to provide a principled and systematic foundation for the modeling and verification of open systems which naturally led to this collaboration.

Luca: Did any of your subsequent research build explicitly on the results and the techniques you developed in your award-winning paper? Which of your subsequent results on alternating transition systems and their refinement relations do you like best? Is there any result obtained by other researchers that builds on your work and that you like in particular or found surprising? 

Answer: Various subsets of us pursued multiple research directions that developed the game-theoretic setting for modeling and verification further, and much remains to be done. Here are two examples. First, the game-theoretic setting and the alternating nature of inputs and outputs are now generally accepted as providing the proper semantic foundation for interface and contract formalisms for component-based design. Second, studying strategic behavior in multi-player games quickly leads to the importance of probabilistic behavior, say in the form of randomized decisions and strategies, of equilibria, when players have non-complementary objectives, and of auctions, when players need to spend resources for decisions. All of these are still very active topics of research in computer-aided verification, and they also form a bridge to the algorithmic game theory community.

Luca: One can view your work as a bridge between concurrency theory and multi-agent systems. What impact do you think that your work has had on the multi-agent-system community? And what has our community learnt from the work done in the field of multi-agent systems? To your mind, what are the main differences and points of contact in the work done within those communities? 

Answer: Modeling interaction in multi-agent systems is of natural interest to planning problems studied in the AI community. In 2002, the International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS) was formed and the annual International Conference on Autonomous Agents and Multiagent Systems (AAMAS) was launched. The models, logics, and algorithms developed in the concurrency and formal methods communities have had a strong influence on research presented at AAMAS conferences over the past twenty years. Coincidentally, this year our paper on Alternating-Time Temporal Logic was chosen for the IFAAMAS Influential Paper Award.

Luca: What are the research topics that you find most interesting right now? Is there any specific problem in your current field of interest that you'd like to see solved?

Answer:Research on formal verification and synthesis, including our paper, assumes that the model of the system is known. Reinforcement learning has emerged as a promising approach to the design of policies in scenarios where the model is not known and has to be learned by agents by exploration. This leads to an opportunity for research at the intersection of reactive synthesis and reinforcement learning. A potentially promising direction is to consider reinforcement learning for systems with multiple agents with both cooperative and adversarial interactions.

The realization that reactive systems have to satisfy their specifications in all environments has led to extensive research relating formal methods with game theory. Our paper added alternation to refinement relations. The transition from one to multiple players has been studied in computer science in several other contexts. For the basic problem of reachability in graphs, it amounts to moving from reachability to alternating reachability. We recently studied this shift in other fundamental graph problems, like the generation of weighted spanning trees, flows in networks, vertex covers, and more. In all these extensions, we consider a game between two players that take turns in jointly generating the outcome. One player aims at maximizing the value of the outcome (e.g., maximize the weight of the spanning tree, the amount of flow that travels in the network, or the size of the vertex cover), whereas the second aims at minimizing the value. It is interesting to see how some fundamental properties of graph algorithms are lost in the alternating setting. For example, following a greedy strategy is not beneficial in alternating spanning trees, optimal strategies in alternating flow networks may use fractional flows, and while the vertex-cover problem is NP-complete, an optimal strategy for the maximizer player can be found in polynomial time. Many more questions in this setting are still open. 

Luca: What advice would you give to a young researcher who is keen to start working on topics related to alternating transition systems and logics?

Answer: One important piece of advice to young researchers is to question the orthodoxy. Sometimes it is necessary to learn everything that is known about a topic but then take a step back, look at the bigger picture, reexamine some of the fundamental assumptions behind the established ways of thinking, change the models that everyone has been using, and go beyond the incremental improvement of previous results. This is particularly true in formal methods, where no single model or approach fits everything. And young researchers stand a much better chance of having a really fresh new thought than those who have been at it for many years.

by Luca Aceto (noreply@blogger.com) at June 18, 2021 04:37 PM UTC

Most machine learning classes and textbooks mention that there is no universal supervised learning algorithm that can do reasonably well on all learning problems. Indeed, a series of “no free lunch theorems” state that even in a simple input space, for any learning algorithm, there always exists a bad conditional distribution of outputs given inputs where this algorithm performs arbitrarily bad.

For example, from the classic book of Luc Devroye, László Györfi, and Gábor Lugosi [1, Theorem 7.2 and its extensions], for inputs uniformly distributed in \([0,1]\), for any decreasing sequence \((\varepsilon_n)_{n \geqslant 0}\) which is less than 1/16, for any learning algorithm (which takes pairs of observations and outputs a prediction function), there exists a conditional distribution on \(\{-1,1\}\) for which the expected risk of the classifier learned from \(n\) independent and identically distributed observations of (input,output) pairs is greater than \(\varepsilon_n\) for all \(n \geqslant 1\), while the best possible expected error rate is zero.

Such theorems do not imply that all learning methods are equally bad, but rather that all learning methods will suffer from some weaknessess. Throughout this blog we will try to better understand the weaknessess and strengths of popular methods through learning theory.

The key is to control the potential weaknesses of a learning method by making sure that in “favorable” scenarios, it leads to strong guarantees. When taking the simplest example of vectorial inputs in \(\mathbb{R}^d\), we can construct model classes of increasing complexity for which we can start to draw useful comparisons between learning methods.

Several aspects of the joint distribution of (input,output) \((X,Y)\) make the problem easy or hard. For concreteness and simplicity, I will focus on regression problems where the output space is \(\mathcal{Y} = \mathbb{R}\) and with the square loss, so that the optimal function, the “target” function, is \(f^\ast(x) = \mathbb{E}(Y|X=x)\). But much of the discussion extends to classification or even more complex outputs (see, e.g., [1]).

Curse of dimensionality. In the context of vectorial inputs, the slowness of universal learning algorithms can be characterized more precisely, and leads to the classical curse of dimensionality. Only assuming that the optimal target function is Lipschitz-continuous, that is, for all \(x,x’\), \(| f^\ast(x)-f^\ast(x’)| \leqslant L \| x – x’\|\) (for any arbitrary norm on \(\mathbb{R}^d\)), the optimal excess risk of a prediction function \(\hat{f}\) obtained from \(n\) observations, that is, the expected squared difference between \(f^\ast(x)\) and \(\hat{f}(x)\) cannot be less than a constant times \(n^{-2/(d+2)}\), that is, in order for this rate to be smaller than some \(\varepsilon <1 \), we need \(n\) to be larger than \(\displaystyle ( {1}/{\varepsilon} )^{d/2+1}\), with thus an exponential dependence in \(d\) (see [2]).

In other words, exponentially many observations are needed for a reasonable performance on all problems with a minimal set of assumptions (here Lipschitz-continuity), and this bad behavior is unavoidable unless extra assumptions are added, which we now describe.

Support, smoothness and latent variables

Low-dimensional support. If the data occupy only a \(r\)-dimensional subspace of \(\mathbb{R}^d\), with \(r \leqslant d\) (and typically much smaller), then one should expect a better convergence rate. This also extends to data supported on a (smooth) manifold, as illustrated below. Note that this assumption does not concern the outputs, and can reasonably be checked given the data by performing principal component analysis or some form of manifold learning. Essentially, in terms of convergence rates, \(d\) is replaced by \(r\). This is obvious if the learning algorithm has access to the \(r\)-dimensional representation, and it requires more work if not.

Left: samples from the uniform distribution in the unit disk in \(\mathbb{R}^2\). Right: samples from the uniform distribution in the unit circle in \(\mathbb{R}^2\), which is a one-dimensional smooth manifold.

Smoothness of the target function. This is done by assuming some bounded derivatives for the target function \(f^\ast\). With bounded \(s\)-th order derivatives, we could expect that the problem is easier (note that Lipschitz-continuity corresponds to \(s=1\)). This is illustrated below with one-dimensional inputs. Essentially, in terms of rates, \(d\) is replaced by \(d/s\) [2, Theorem 3.2]. Therefore, when the smoothness order \(s\) is of order \(d\), the dependence in the dimension disappears.

Regression problem with one-dimensional inputs, with target function in red, and observations in blue. Left: non-smooth (even not Lipschitz-continuous) function, right: smooth function.

Latent variables. If we assume that the target function depends only on a \(r\)-dimensional linear projection of the input, then we should expect a better complexity. The most classical example is the dependence on a subset of the \(d\) original variables. Essentially, in terms of rates, \(d\) is replaced by \(r\). This is obvious when the latent variables are known (as we can replace input data by the \(r\) latent variables), totally not otherwise, as this requires some form of adaptivity (see below).

Regression problem with two-dimensional inputs, with target function (with no noise). Left: function that cannot be written with a latent variable, right: function that has a (linear) latent variable representation (there exists a direction with no variation).

Need for adaptivity. We typically don’t know in advance if these properties are satisfied or not, as some are easily testable (support of distribution) without knowing the target function \(f^\ast\), while others are not.

The goal is to have a single method that can adapt to all of these situations (which are non-exclusive). That is, if the problem has any of these reasons to be easy, will the learning method benefit from it? Typically, most learning methods have at least one hyperparameter controlling overfitting (e.g., a regularization parameter), and the precise value of this hyperparameter will depend on the difficulty of the problem, and we will assume that we have a reasonable way to estimate this hyperparameter (e.g., cross-validation). A method is then said adaptive if with a well chosen value of the hyperparameter, we get the optimal (or close to optimal) rate of estimation that benefits from the extra assumption.

Quest for adaptivity: who wins? Among classical learning techniques, which ones are adaptive to which properties? In short, barring computational and optimization issues: $$\mbox{ local averaging } < \mbox{ positive definite kernels } < \mbox{ neural networks }.$$ Every time, the next method in the list gains adaptivity to support, smoothness and then latent variables. Note however that optimization for neural networks is more delicate (see below).

Let’s now briefly look at these methods one by one. I will assume basic knowledge of these, for more details see [4, 5] or my new book in preparation.

Local averaging

The earliest and simplest learning methods that could adapt to any target functions were local averaging methods aiming at approximating directly the conditional expectation \(f^\ast(x) = \mathbb{E}(Y|X=x)\), with the most classical examples being k-nearest neighbors and Nadaraya-Watson estimators.

These methods are naturally adaptive to having a reduced support for inputs. Indeed, in the simplest case of a distribution supported in a low-dimensional subspace, the global metric is equivalent to a local metric on the support, without any need to explicitly know the subspace, a situation which extends to smooth manifolds. In order to be adaptive, the hyperparameter has to depend on the dimension \(r\) of the manifold, e.g., for \(k\)-nearest-neighbors, \(k\) has to be taken proportional to \(n^{2/(2+r)}\) (see [3]).

However, there is no adaptivity to the smoothness of the target function or to potential latent variables unless dedicated algorithms are used such as local regression. Kernel methods and neural networks lead to such adaptivity.

From kernels to neural networks

We consider prediction functions of the form $$f(x) = \sum_{j=1}^m a_j ( b_j^\top x )_+,$$ which is the traditional single hidden layer fully connected neural network with ReLU activation functions (a constant term can be added to the linear term within the ReLU by simply appending \(1\) to \(x\), please do not call these terms “biases” as this has another meaning in statistics).

The vector \(a \in \mathbb{R}^m\) represents output weights, while the matrix \(b \in \mathbb{R}^{m \times d}\) represents input weights.

Empirical risk minimization (ERM). We assume given \(n\) i.i.d. observations \((x_1,y_1),\dots,(x_n,y_n) \in \mathbb{R}^d \times \mathbb{R}\), and we will fit models by minimizing the \(\ell_2\)-regularized empirical risk (you can call it “weight decay” but it already has a better name in machine learning and statistics). That is, we minimize $$R(a,b) = \frac{1}{2n} \sum_{i=1}^n \Big( y_i \, – \sum_{j=1}^m a_j ( b_j^\top x_i )_+ \Big) ^2 + \frac{\lambda}{2} \sum_{j=1}^m \Big\{ a_j^2 + \| b_j\|_2^2 \Big\},$$ where \(\lambda > 0\) is a regularization parameter.

Overparametrization. We will consider the limit of large number \(m\) of hidden neurons. Depending whether we optimize over both input weights \(b\) and output weights \(a\), or simply output weights (with then a proper initialization of the input weights), we get different behaviors.

Kernel regime

We assume that the input weights \(b_j \in \mathbb{R}^d\) are sampled uniformly from the Euclidean sphere of radius \(1 / \sqrt{m}\), and that we only optimize over the output weights \(a_j, j = 1,\dots,m\). This is exactly a ridge regression problem (square loss and squared Euclidean penalty), for which the theory of positive definite kernels applies and will lead to an interesting behavior for infinite \(m\) [7, 8]. That is, the solution can be obtained by using the kernel function \(\hat{k}\) defined as $$\hat{k}(x,x’) = \sum_{j=1}^m ( b_j^\top x )_+( b_j^\top x’ )_+,$$ and looking for prediction functions of the form $$f(x) = \sum_{i=1}^n \alpha_i \hat{k}(x,x_i).$$ See, e.g., [6] for details. In particular, as long as \(\hat{k}(x,x’)\) can be computed efficiently, the complexity of solving the ERM problem is independent of the number of neurons \(m\). We will now discuss in further detail the over-parameterized case where \(m=\infty\).

When \(m\) tends to infinity, If the input weights are fixed and initialized randomly from the \(\ell_2\)-sphere of radius \(1/\sqrt{m}\), then by the law of large numbers, \(\hat{k}(x,x’)\) tends to $$k(x,x’) = \mathbb{E}_{b \sim {\rm uniform} (\mathbb{S}^{d-1}) } (b^\top x )_+ (b^\top x’)_+,$$ where \(\mathbb{S}^{d-1}\) is the unit \(\ell_2\)-sphere in \(d\) dimensions. This kernel has a closed form and happens to be equal to (see [9]) $$ \frac{1}{2\pi d} \|x\|_2 \| x’\|_2 \big[ ( \pi \, – \varphi) \cos \varphi + \sin \varphi \big],$$ where \(\cos \varphi = \frac{ x^\top x’ }{{ \|x\|_2 } { \| x’\|_2 } }\). It can also be seen (see, e.g., [10] for details) as using predictors of the form $$f(x) = \int_{\mathbb{S}^{d-1}} ( b^\top x)_+ d \mu(b),$$ for some measure \(\mu\) on \(\mathbb{S}^{d-1}\), with the penalty $$\frac{\lambda}{2} \int_{\mathbb{S}^{d-1}} \big| \frac{d\mu}{d\tau}(b) \big|^2 d\tau(b),$$ for the uniform probability measure \(d\tau\) on the hypersphere \(\mathbb{S}^{d-1}\). This representation will be useful for the comparison with neural networks.

Note that here we use random features in a different way from their common use [8], where the kernel \(k\) comes first and is approximated by \(\hat{k}\) to obtain fast algorithms, typically with \(m < n\). Here we start from \(\hat{k}\) and find the limiting \(k\) to understand the behavior of overparameterization. The resulting function space is a (Sobolev) space of functions on the sphere with all \(s\)-th order derivatives which are square-integrable, with \(s = d/2+3/2\) [10].

Finally, the number of neurons \(m\) needed to reach the kernel regime is well understood, at least in simple situations [15, 16].

We can now look at the various forms of adaptivity of kernel methods based on Sobolev spaces.

Adaptivity to reduced support. Like model averaging techniques, adaptivity to input data supported on a subspace is rather straightforward, and it extends to smooth manifolds (see, e.g., [12] for a proof for the Gaussian kernel).

Adaptive to smoothness. A key attractive feature of kernel methods is that they can circumvent the curse of dimensionality for smooth target functions, and, by simply ajusting the regularization parameter \(\lambda\), ridge regression will typically adapt to the smoothness of the target function, and thus benefit from easy problems.

The simplest instance is when the target function is within the space of functions defined above, where we immediately get estimation rates which are independent of dimension (at least in the exponent). This however requires at least \(s>d/2\) derivatives (because Sobolev spaces are reproducing kernel Hilbert spaces (RKHS) only in this situation), but the adaptivity extends to functions outside of the RKHS (see, e.g., [11]).

Adaptivity to linear latent variables. Unfortunately, kernel methods are not adaptive even to basic linear structures. That is, if \(f^\ast\) depends only on the first component of \(x\), then ridge regression, if not modified, will not take advantage of it. In the context of neural networks, one striking example is that a single neuron \(x \mapsto (b^\top x)_+\) does not belong to the RKHS, and leads to bad estimation rates (see, e.g., [10, 13]).

Before looking at some experiments, let’s look at some optimization considerations, that will be important later for neural networks.

Avoiding overparameterized neuron representations. In the kernel regime, optimizing directly the cost function with \(m\) neurons by gradient descent is problematic for two reasons when \(m\) is large:

  • The optimization problem, although convex, is severely ill-conditioned as can be seen from the spectrum of the covariance matrix of the feature vector, with the \(i\)-th eigenvalue equivalent to \(i^{-1-3/d}\) for the uniform input distribution (see [14, Section 2.3]). Therefore, (stochastic) gradient descent will take a lot of time to converge.
  • We can use kernels directly (for \(m = +\infty\)) to avoid using \(m\) very large (which would be useless in practice). The running-time complexity is then of the order \(O(n^2)\) if done naively, with lots of ways to go below \(O(n^2)\), such as column sampling (see a nice analysis in [15]).

Experiments. We first consider a very simple one-dimensional example, where we look at how the estimation with the kernel function \(k\) varies as a function of the hyperparameter \(\lambda\), from underfitting to overfitting. We can observe that all learned functions are smooth, as expected.

Regression problems in one dimension with kernels (left: smooth target, right: non-smooth target), with \(n=32\) observations, with varying regularization parameters, with the optimal one shown below.

We can now compare the convergence rates for the excess risk (expected squared distance beween \(f\) and \(f^\ast\)). We can see that the rates are better for smooth functions, and with the proper choice of regularization parameter, the kernel method adapts to it.

Left: estimation with the optimal regularization parameter for smooth target function and kernel methods. Middle: estimation with a non-smooth target function. Right: convergence rates when the number \(n\) of observations increases, averaged over 32 replications.

Neural networks

We can now optimize over all weights, output \(a\) and input \(b\). Four natural questions come to mind:

  1. Where does it converge to when the number \(m\) of hidden neurons tends to infinity?
  2. What are the adaptivity properties of the resulting predictor?
  3. How big \(m\) needs to be to achieve the infinite width behavior?
  4. Can we actually solve the non-convex optimization problem?

Overparameterization limit (feature learning regime). When optimizing over both sets of weights, we can first note that the prediction function is invariant by the change of variable $$ a_j \leftarrow \mu_j a_j \ , \ \ b_j \leftarrow \mu_j^{-1} b_j.$$ Then by optimizing over \(\mu_j\) which is only involved in the penalty term \(\frac{\lambda}{2} \big( a_j^2 + \|b_j\|_2^2 \big)\), we get \(\mu_j^\ast = a_j^{-2} \| b_j \|_2^2\), and the penalty \(\lambda |a_j| \| b_j\|_2\). We thus get the equivalent optimization problem of minimizing $$\tilde{R}(a,b) = \frac{1}{2n} \sum_{i=1}^n \Big( y_i \, – \sum_{j=1}^m a_j ( b_j^\top x_i )_+ \Big) ^2 + {\lambda} \sum_{j=1}^m |a_j| \| b_j\|_2 .$$ By restricting the \(b_j\) on the unit sphere (which is OK because by optimizing over \(\mu_j\) we become scale-invariant), we can write $$\sum_{j=1}^m a_j ( b_j^\top x_i )_+ = \int_{\mathbb{S}^{d-1}} (b^\top x)_+ d\mu(b), $$ for $$ \mu = \sum_{j=1}^m a_j \delta_{b_j}$$ a weighted sum of Diracs, and \(\sum_{j=1}^m |a_j| \| b_j\|_2 = \int_{\mathbb{S}^{d-1}} \! | d\mu(b)|\) the total variation of \(\mu\). Thus, letting \(m\) go to infinity, the infinite sums become integrals of general measures, and we end up considering the set of functions that can be written as (see [10] and the many references therein for details): $$f(x) = \int_{\mathbb{S}^{d-1}} ( b^\top x)_+ d \mu(b),$$ with the penalty $$\lambda \int_{\mathbb{S}^{d-1}} \! | d\mu(b)|.$$ It has an \(\ell_1\)-norm flavor (as opposed to the \(\ell_2\)-norm for kernels), and leads to further adaptivity, if optimization problems are resolved (which is itself not easy in polynomial time, see below).

The minimal number of neurons to achieve the limiting behavior in terms of predictive performance (assuming optimization problems are resolved) can also be characterized, and grows exponentially in dimension without strong assumptions like the ones made in this blog post (see [10]). We will now study the adaptivity properties of using the function space above.

Adaptive to low-dimensional support and smoothness. It turns out that the adaptivity properties of kernel methods are preserved, both with respect to the support and the smoothness of the target function (see, e.g., [17, 10]). However, neural networks can do better!

Adaptive to linear substructures for one hidden layer. Given that the training of neural networks involves finding the best possible input weights, which are involved in the first linear layer, it is no surprise that we obtain adaptivity to linear latent variables. That is, if we can write \(f^\ast(x) = g(c_1^\top x, \dots, c_r^\top x)\) for some function \(g: \mathbb{R}^r \to \mathbb{R}\), then the vectors \(c_j\)’s will be essentially estimated among the \(b_j\)’s by the optimization algorithm. Therefore, when using the \(\ell_1\)-based function space, the convergence rate in the excess risk will depend on \(r\) and not \(d\) in the exponent [10, 13]. In particular, neural networks can perform non-linear variable selection (while the Lasso only performs linear variable selection). See a nice experiment in [13] for binary classification.

One could imagine that with more hidden layers, this extends to non-linear smooth projections of the data, that is, to cases where we assume that \(f^\ast(x) = g(h(x))\) where \(h: \mathbb{R}^d \to \mathbb{R}^r\) is a smooth function.

Thus, we obtain a stronger adaptivity for infinite \(m\) and the good choice of regularization parameter \(\lambda\). We could then try to reproduce for neural networks the figures obtained for kernel methods with varying \(\lambda\). Unfortunately, this is where non-convex optimization will make everything harder.

Overfitting with a single hidden layer is hard!

Global convergence of gradient flow for infinite width. The traditional algorithm to minimize the empirical risk is gradient descent and its stochastic extensions. In this blog post, we consider gradient descent with small step-sizes, which can be approximated by a gradient flow (as explained in last June blog post). All parameters are randomly initialized, and we consider \(b_j\) uniformly distributed on the sphere of radius \(1/\sqrt{m}\), and \(a_j\) uniformly distributed in \(\{-1/\sqrt{m},1/\sqrt{m}\}\) (this is essentially equivalent to the traditional “Glorot” initialization [18]).

This corresponds to the “mean-field” scaling of initialization where neurons from both layers move (see more details in last July blog post for other scalings). As shown in a joint work with Lénaïc Chizat [19] and explained in last June blog post, when \(m\) tends to infinity, then the gradient flow can only converge to the global minimum of the objective function, which is a non-trivial result because the cost function is not convex in \((a,b)\).

Apparent good properties for small \(m\). A key property which is not yet well understood is that the global convergence behavior can be observed for \(m\) relatively small. For example, considering the one-dimensional regression example below, where the target function is the combination of 5 hidden neurons, with \(m=32\) hidden neurons, it is already possible to learn a good function with high probability (this would not be the case with \(m=5\)).

One-dimensional regression from \(n = 64\) observations, by minimizing with gradient descent (and small step-size) the unregularized empirical risk.

One extra benefit here is that no regularization (e.g., no penalty) seems needed to obtain this good behavior. Therefore it seems that we get the best of both worlds: reasonably small \(m\) (so not too costly algorithmically), and no need to regularize.

However, like in any situation where overfitting is not observed, there is the possibility of underfitting. To study this, let us consider a noiseless problem where \(y\) is a deterministic function of \(x\), and with a sligthly more complicated target function, which now requires 9 hidden neurons.

One-dimensional noiseless regression from \(n = 64\) observations, by minimizing with gradient descent (and small step-size) the unregularized empirical risk.

Even with \(m = 8192\) hidden neurons, the resulting function is not able to fit the data, which is one extreme form of simplicity bias [21]. There is thus strong underfitting with a single hidden layer (the situation seems to be different with deeper networks). How can this be alleviated?

Kernel learning to the rescue

A simple way to avoid the underfitting phenomenon above (but potentially get overfitting, see below) is to minimize the cost \(R(a,b)\) by leveraging a convex sub-problem, as done in the glorious days of kernel methods.

In the cost function $$R(a,b) = \frac{1}{2n} \sum_{i=1}^n \Big( y_i \, – \sum_{j=1}^m a_j ( b_j^\top x_i )_+ \Big) ^2 + \frac{\lambda}{2} \sum_{j=1}^m \Big\{ a_j^2 + \| b_j\|_2^2 \Big\},$$ the problem is convex with respect to all \((a_j)\)’s; moreover, it is a least-squares problems which can be solved in closed form by matrix inversion (with algorithms that are much more robust to ill-conditioning than gradient descent).

Then, if we denote by \(\Phi(b) \in \mathbb{R}^{n \times m}\) the matrix with elements \((b_j^\top x_i)_+\), the cost function can be written as a function of \(a \in \mathbb{R}^m\) as $$R(a,b) = \frac{1}{2n} \| y \, – \Phi(b) a \|_2^2 +\frac{\lambda}{2}\| a\|_2^2 + \frac{\lambda}{2} \sum_{j=1}^m \| b_j\|_2^2.$$ It can be minimized in closed form as $$ a = ( \Phi(b)^\top \Phi(b) + n \lambda I)^{-1} \Phi(b)^\top y = \Phi(b)^\top ( \Phi(b)\Phi(b)^\top + n \lambda I)^{-1} y,$$ where the matrix inversion lemma has been used. This leads to: $$S(b) = \inf_{a \in \mathbb{R}^m} R(a,b) = \frac{\lambda}{2} y^\top ( \Phi(b)\Phi(b)^\top + n \lambda I)^{-1} y + \frac{\lambda}{2} \sum_{j=1}^m \| b_j\|_2^2.$$ The matrix \(\Phi(b)\Phi(b)^\top \in \mathbb{R}^{n \times n}\) is the traditional kernel matrix of the associated non-linear features. The resulting optimization problem is exactly one used in kernel learning [20].

Computing \(S(b)\) and its gradient with respect to \(b\) can be done in time \(O(n^2 m)\) when \(m\) is large. Since given the hidden neurons, we get the global optimum, we avoid the underfitting problem as well as the ill-conditioning issues.

When \(m\) tends to infinity, then the optimization problem fits our analysis in [19], that is, we can see the limiting flow as a Wasserstein gradient flow that can only converge to the global optimum of the associated cost function although the overall problem is not convex (as opposed to what happens in multiple kernel learning described in this older post). While there is currently no further quantitative theoretical evidence that a good optimization behavior can be reached for a smaller \(m\), it empirically solves the underfitting issue above, as shown below.

Comparison of the the two flows. Right: regular gradient flow on both sets of weights (normal training), left: gradient flow on the cost function where the output weights are maximized out.

We can now perform experiments with varying \(\lambda\)’s.

Experiments. The two plots below depict the learnt functions for a smooth and non-smooth objective as \(\lambda\) is varied, highlighting the importance of regularization.

Regression problems in one dimension with neural networks (left: smooth target, right: non-smooth target), with \(n=32\) observations, with varying regularization parameters, with the optimal one shown below.

We can now compare the convergence rates for the excess risk (expected squared distance beween \(f\) and \(f^\ast\)). We can see that the rates are better for smooth functions, and with the proper choice of regularization parameter, neural networks adapt to it. For adaptivity to linear latent variables, see [13].

Left: estimation with the optimal regularization parameter for smooth target function and neural networks. Middle: estimation with a non-smooth target function. Right: convergence rates when the number \(n\) of observations increases, averaged over 32 replications.

Conclusion

As I tried to show in this blog post, adaptivity is a key driver of good predictive performance of learning methods. It turns out that for the three classes of methods I considered, the more complex algorithms led to more adaptivity: no optimization for local averaging, finite-dimensional convex optimization for kernel methods and then non-convex optimization for neural networks (or equivalently infinite-dimensional convex optimization).

While neural networks led to more adaptivity, they are still not totally well understood, in particular in terms of the various biases that their training implies, both for shallow and deep networks. This makes a lot of research directions to follow!

Acknowledgements. I would like to thank Lénaïc Chizat and Lawrence Stewart for proofreading this blog post and making good clarifying suggestions.

References

[1] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition,
volume 31. Springer Science & Business Media, 1996.
[2] László Györfi, Michael Kohler, Adam Krzyżak, Harro Walk. A Distribution-free Theory of Nonparametric Regression. New York : Springer, 2002.
[3] Samory Kpotufe. k-NN regression adapts to local intrinsic dimension. In Advances in Neural Information Processing Systems, 2011.
[4] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
[5] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[6] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, 2001.
[7] Radford M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.
[8] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2008.
[9] Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, 2009.
[10] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
[11] Andreas Christmann, Ingo Steinwart. Support Vector Machines. Springer, 2008.
[12] Thomas Hamm, Ingo Steinwart. Adaptive Learning Rates for Support Vector Machines Working on Data with Low Intrinsic Dimension. Technical report, Arxiv 2003.06202, 2021.
[13] Lénaïc Chizat, Francis Bach. Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic LossProceedings of the Conference on Learning Theory (COLT), 2020.
[14] Francis Bach. On the Equivalence between Kernel Quadrature Rules and Random Feature ExpansionsJournal of Machine Learning Research, 18(19):1-38, 2017. 
[15] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.
[16] Alessandro Rudi, Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, 2017.
[17] Ryumei Nakada, Masaaki Imaizumi. Adaptive Approximation and Generalization of Deep Neural Network with Intrinsic Dimensionality. Journal of Machine Learning Research, 21(174):1−38, 2020.
[18] Xavier Glorot, Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networksProceedings of the International Conference on Artificial Intelligence and Statistics, 2010.
[19] Lénaïc Chizat, Francis Bach. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal TransportAdvances in Neural Information Processing Systems, 2018
[20] Olivier Chapelle, Vladimir Vapnik, Olivier Bousquet, Sayan Mukherjee. Choosing multiple parameters for support vector machinesMachine Learning, 46(1):131-159, 2002.
[21] Harshay Shah, Kaustav Tamuly, Aditi Raghunathan, Prateek Jain, Praneeth Netrapalli. The Pitfalls of Simplicity Bias in Neural Networks. In Advances in Neural Information Processing Systems, 2020.

by Francis Bach at June 17, 2021 04:51 PM UTC


Benny Chor passed away on June 10, 2021. Luca Trevisan had a blog post on the news.

We present a guest post on Benny Chor's life and works by Oded Goldreich. The post has many refs to papers. They will appear at the end with pointers to them.


Benny Chor was born on December 23rd 1956  and grew-up in Tel-Aviv, Israel. He studied Mathematics in the Hebrew University, receiving a B.Sc. in 1980 and an M.Sc. in 1981. He then switched to studying Computer Science at MIT, and graduated in 1985 with a PhD thesis titled

 Two Issues in Public Key Cryptography -- RSA Bit Security and a New Knapsack Type System

which received an ACM Distinguished Dissertation award. After post-doctoral periods at MIT and Harvard, he took a faculty position at the Computer Science Department of the Technion (1987-2001), and then at Tel-Aviv University, where he served as the chairman of the department for two years. He died on June 10th 2021, from a terminal disease.

Although Benny was a very articulated and verbal person, I find it impossible to describe his personality in words. The point is that words cannot capture the experience of interacting with him, which was always sheer fun. Still, I guess I should say something personal as a close friend of his. So I will just mention that he lived happily with Metsada, whom he met in the summer of 1981, and that they lovingly raised three kids: Arnon, Omer and Aya. Actually, let me also mention his life-long close relationship with his brother, Kobi.

Focusing on his contributions to science, I will confine myself to the areas of cryptography and randomized computation. Still, let me mention that in the mid 1990s, Benny's research interests gradually shifted to computational biology, but I will not review his contributions to that area, since it is very remote from my own expertise.

In my opinion, Benny's most important contribution to cryptography is the fundamental
paper on Verifiable Secret Sharing [CGMA].  Loosely speaking, Verifiable Secret Sharing
is a method to share a secret so that any majority of share-holders may reconstruct the
secret, whereas no minority may do so, and still each share-holder can verify that the
share that they got is a valid one.  This primitive had a tremendous impact on the
development of secure multi-party protocols, and almost all subsequent works in this
area utilize it.

Additional research milestones of Benny, in the area of Cryptography, include the proof
of the existence of a ``hard-core'' in the RSA and Rabin functions [BCS, ACGS], the
introduction and design of Private Information Retrieval (PIR) schemes [CGKS] (as well
as their computational counterparts [CG97]), and the introduction and design of
Tracing Traitor schemes [CFNP].  Each of these works deserves more than the foregoing
brief mention, yet I'm not even mentioning other important works like [CK].

Turning to the area of Randomized Computation, Benny's contributions span diverse areas
ranging from the manipulation of randomness to the use of randomness in complexity theory
and distributed computing.  In particular, his work on weak sources of randomness [CG88]
identified min-entropy (of the source) as the key parameter for randomness extraction and
presented a simple two-source extractor.  Distilling an aspect of [ACGS], his work on
pairwise-independent sampling [CG89] demonstrates the general applicability of this method. His works in distributed computing include a randomized Byzantine Agreement protocol [CC] that beats the deterministic bound without using unproven assumptions.  Lastly, let me just mention a couple of his complexity-theoretic works: [BCGL, CCGHHRR].

Benny was a superb teacher and a devoted graduate student adviser.  It is tempting to try
listing the numerous educational projects that he initiated, and his former students, but
these lists will be too long and the risk of a crucial omissions is too high.  So let me
end by expressing the feeling of loss that is shared by many in our community,
which adds to my personal sorrow.

Oded and Benny

Benny with his family's namesake (Chor = Bull)

[ACGS] Werner Alexi, Benny Chor, Oded Goldreich, Claus-Peter Schnorr:
RSA and Rabin Functions: Certain Parts are as Hard as the Whole.
SIAM J. Comput. 17(2): 194-209 (1988) LINK

[BCGL] Shai Ben-David, Benny Chor, Oded Goldreich, Michael Luby:
On the Theory of Average Case Complexity. J. Comput. Syst. Sci. 44(2): 193-219 (1992) LINK

[BCS] Michael Ben-Or, Benny Chor, Adi Shamir: On the Cryptographic Security of Single RSA Bits STOC 1983: 421-430. LINK

[CCGHHRR] Richard Chang, Benny Chor, Oded Goldreich, Juris Hartmanis, Johan Håstad, Desh Ranjan, Pankaj Rohatgi: The Random Oracle Hypothesis Is False.
J. Comput. Syst. Sci. 49(1): 24-39 (1994) LINK

[CC] Benny Chor, Brian A. Coan:
A Simple and Efficient Randomized Byzantine Agreement Algorithm.
IEEE Trans. Software Eng. 11(6): 531-539 (1985) LINK

[CFNP] Benny Chor, Amos Fiat, Moni Naor, Benny Pinkas:
Tracing traitors. IEEE Trans. Inf. Theory 46(3): 893-910 (2000) LINK

[CG97] Benny Chor, Niv Gilboa:
Computationally Private Information Retrieval. STOC 1997: 304-313 LINK

[CG88] Benny Chor, Oded Goldreich:
Unbiased Bits from Sources of Weak Randomness and Probabilistic Communication Complexity.
SIAM J. Comput. 17(2): 230-261 (1988) LINK

[CG89] Benny Chor, Oded Goldreich:
On the power of two-point based sampling. J. Complex. 5(1): 96-106 (1989) LINK

[CGKS] Benny Chor, Oded Goldreich, Eyal Kushilevitz, Madhu Sudan:
Private Information Retrieval. J. ACM 45(6): 965-981 (1998) LINK

[CGMA] Benny Chor, Shafi Goldwasser, Silvio Micali, Baruch Awerbuch:
Verifiable Secret Sharing and Achieving Simultaneity in the Presence of Faults.
FOCS 1985: 383-395. LINK

[CK] Benny Chor, Eyal Kushilevitz:
A Zero-One Law for Boolean Privacy.  STOC 1989: 62-72. LINK




by gasarch (noreply@blogger.com) at June 17, 2021 02:07 PM UTC

Authors: Kathryn Gray, Mingwei Li, Reyan Ahmed, Stephen Kobourov
Download: PDF
Abstract: Evolving trees arise in many real-life scenarios from computer file systems and dynamic call graphs, to fake news propagation and disease spread. Most layout algorithms for static trees, however, do not work well in an evolving setting (e.g., they are not designed to be stable between time steps). Dynamic graph layout algorithms are better suited to this task, although they often introduce unnecessary edge crossings. With this in mind we propose two methods for visualizing evolving trees that guarantee no edge crossings, while optimizing (1) desired edge length realization, (2) layout compactness, and (3) stability. We evaluate the two new methods, along with four prior approaches (two static and two dynamic), on real-world datasets using quantitative metrics: stress, desired edge length realization, layout compactness, stability, and running time. The new methods are fully functional and available on github.

at June 17, 2021 10:51 PM UTC

Authors: Sebastian Schlag, Tobias Heuer, Lars Gottesbüren, Yaroslav Akhremtsev, Christian Schulz, Peter Sanders
Download: PDF
Abstract: This paper considers the balanced hypergraph partitioning problem, which asks for partitioning the vertices into $k$ disjoint blocks of bounded size while minimizing an objective function over the hyperedges. Here, we consider the most commonly used connectivity metric. We describe our open source hypergraph partitioner KaHyPar which is based on the successful multi-level approach -- driving it to the extreme of one level for (almost) every vertex. Using carefully designed data structures and dynamic update techniques, this approach offers a very good time-quality tradeoff. We present two preprocessing techniques -- pin sparsification using locality sensitive hashing and community detection based on the Louvain algorithm. The community structure is used to guide the coarsening process that incrementally contracts vertices. Portfolio-based partitioning of the contracted hypergraph already achieves good initial solutions. While reversing the contractions, a combination of highly-localized direct $k$-way local search and flow-based techniques that take a more global view, refine the partition to achieve high quality. Optionally, a memetic algorithm evolves a pool of solution candidates to obtain even higher quality.

We evaluate KaHyPar on a large set of instances from a wide range of application domains. With respect to quality, KaHyPar outperforms all previously considered systems that can handle large hypergraphs such as hMETIS, PaToH, Mondriaan, or Zoltan. KaHyPar is also faster than most of these systems except for PaToH which represents a different speed-quality tradeoff. The results even extend to the special case of graph partitioning, where specialized systems such as KaHIP should have an advantage.

at June 17, 2021 10:41 PM UTC

Authors: David Garcia-Soriano, Francesco Bonchi
Download: PDF
Abstract: We study a novel problem of fairness in ranking aimed at minimizing the amount of individual unfairness introduced when enforcing group-fairness constraints. Our proposal is rooted in the distributional maxmin fairness theory, which uses randomization to maximize the expected satisfaction of the worst-off individuals. We devise an exact polynomial-time algorithm to find maxmin-fair distributions of general search problems (including, but not limited to, ranking), and show that our algorithm can produce rankings which, while satisfying the given group-fairness constraints, ensure that the maximum possible value is brought to individuals.

at June 17, 2021 10:42 PM UTC

Authors: Ilias Diakonikolas, Daniel M. Kane, Daniel Kongsgaard, Jerry Li, Kevin Tian
Download: PDF
Abstract: We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset. Specifically, we are given a set $T$ of $n$ points in $\mathbb{R}^d$ and a parameter $0< \alpha <\frac 1 2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a well-behaved distribution $\mathcal{D}$ and the remaining $(1-\alpha)$-fraction of the points are arbitrary. The goal is to output a small list of vectors at least one of which is close to the mean of $\mathcal{D}$. As our main contribution, we develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees, with running time $n^{1 + o(1)} d$. All prior algorithms for this problem had additional polynomial factors in $\frac 1 \alpha$. As a corollary, we obtain the first almost-linear time algorithms for clustering mixtures of $k$ separated well-behaved distributions, nearly-matching the statistical guarantees of spectral methods. Prior clustering algorithms inherently relied on an application of $k$-PCA, thereby incurring runtimes of $\Omega(n d k)$. This marks the first runtime improvement for this basic statistical problem in nearly two decades.

The starting point of our approach is a novel and simpler near-linear time robust mean estimation algorithm in the $\alpha \to 1$ regime, based on a one-shot matrix multiplicative weights-inspired potential decrease. We crucially leverage this new algorithmic framework in the context of the iterative multi-filtering technique of Diakonikolas et. al. '18, '20, providing a method to simultaneously cluster and downsample points using one-dimensional projections -- thus, bypassing the $k$-PCA subroutines required by prior algorithms.

at June 17, 2021 10:37 PM UTC

Authors: Vincent Cohen-Addad, Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Nikos Parotsidis, Jakub Tarnawski
Download: PDF
Abstract: Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In correlation clustering, one receives as input a signed graph and the goal is to partition it to minimize the number of disagreements. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. In particular, our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running only for a constant number of rounds. To the best of our knowledge, our algorithm is the first that can provably approximate a clustering problem on graphs using only a constant number of MPC rounds in the sublinear memory regime. We complement our analysis with an experimental analysis of our techniques.

at June 17, 2021 10:41 PM UTC

Authors: Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner
Download: PDF
Abstract: We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $ \pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimation of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to \[ \ \log (1/\varepsilon) \cdot n^{1-\Theta(1/\log(1/\varepsilon))}. \] We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from {Hsu et al, ICLR'19} as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state of the art algorithm.

at June 17, 2021 10:45 PM UTC

Authors: Ankur Moitra, Elchanan Mossel, Colin Sandon
Download: PDF
Abstract: In this work, we study the computational complexity of determining whether a machine learning model that perfectly fits the training data will generalizes to unseen data. In particular, we study the power of a malicious agent whose goal is to construct a model g that fits its training data and nothing else, but is indistinguishable from an accurate model f. We say that g strongly spoofs f if no polynomial-time algorithm can tell them apart. If instead we restrict to algorithms that run in $n^c$ time for some fixed $c$, we say that g c-weakly spoofs f. Our main results are

1. Under cryptographic assumptions, strong spoofing is possible and 2. For any c> 0, c-weak spoofing is possible unconditionally

While the assumption of a malicious agent is an extreme scenario (hopefully companies training large models are not malicious), we believe that it sheds light on the inherent difficulties of blindly trusting large proprietary models or data.

at June 17, 2021 10:48 PM UTC

We show that one-way functions exist if and only if there is some samplable distribution D such that it is hard to approximate the Kolmogorov complexity of a string sampled from D. Thus we characterize the existence of one-way functions by the average-case hardness of a natural \emph{uncomputable} problem on samplable distributions, extending a recent line of work by Liu and Pass (FOCS'20, STOC'21) and Ren and Santhanam (CCC'21). We also show that the average-case hardness of approximating Minimum Circuit Size on a locally samplable distribution (where the sampler runs in sub-linear time by using random access to its input) is equivalent to the existence of one-way functions. This is the first characterization of one-way functions by a natural average-case hardness assumption on the Minimum Circuit Size Problem. We present several other characterizations and connections between one-way functions and average-case hardness of meta-complexity problems (problems about complexity) on samplable distributions. We give various applications of these results to the foundations of cryptography and the theory of meta-complexity. We show that the average-case hardness of deciding k-SAT or Clique on any samplable distribution of high enough entropy implies the existence of one-way functions. Thus one-way functions follow from general assumptions on the average-case hardness of NP-complete problems. We observe that our assumptions are implied by standard cryptographic assumptions such as the Planted Clique hypothesis and the pseudorandomness of Goldreich's local functions. Our results imply a range of \emph{equivalences} between various meta-complexity problems, showing that the theory of meta-complexity is very \emph{robust} when considering average-case complexity. We use our results to unconditionally solve various meta-complexity problems in CZK (computational zero-knowledge) on average, and give implications of our results for the classic question of proving NP-hardness for the Minimum Circuit Size Problem.

at June 16, 2021 03:57 PM UTC

by David Eppstein at June 15, 2021 11:31 PM UTC


There are two ways to write error free programs; only the third one works–Alan Perlis

Perlis on Coding Joy source

Alan Perlis, the first Turing Award winner, summarized the whole issue of program correctness in his single quote. Maybe there is nothing more to say about it.

But this coming Thursday, if you want to hear more about it, you can tune in for a debate with Rich de Millo and myself. It will be 7:00–8:30pm Eastern time.

Harry Lewis of Harvard University will moderate. It will be broadcast LIVE HERE. It’s free.

 

 

The Paper

Perlis and Rich DeMillo and I wrote the paper “Social Processes and Proofs of Theorems
and Programs” in the middle of the second half of the last century—pushing towards fifty years ago. We were responding to the then-common view that programs should be proved correct—proved in the same manner that one proves theorems like:

Theorem 1 (Euclid) There are an infinite number of prime numbers.

We begged to object: We felt that correctness of programs was fundamentally a different issue. This is what the debate is all about.

The debate will give us all a chance to reflect now almost 50 years later when programs enrich and regulate so much more of our lives. Perhaps the real winner in society goes according to the proverb: “the one who ends up with the most toys.” In that case, Harry the moderator will win hands down:

The Writing

Mary-Claire van Leunen is an expert on writing, especially for technical articles. See her famous book.

The article was heavily written by Rich with input from Alan and myself. The ideas are due to all of us. The actual details, the words, the punctuation owe more to Rich with strong input from Mary-Claire.

Our paper is one of forty-six papers included in the book Ideas That Created the Future: Classic Papers of Computer Science, which was edited by Harry for MIT Press (2020). They are presented in chronological order beginning with Aristotle (~350 BCE). The median year is 1962-63.

Open Problems

See How to Have Your Abstract Rejected for a tongue-in-cheek view of writing technical material. It was written by Mary-Claire and myself.

See you Thursday.

by rjlipton at June 15, 2021 02:57 PM UTC

An Algebraic Circuit for a polynomial $P\in F[x_1,\ldots,x_N]$ is a computational model for constructing the polynomial $P$ using only additions and multiplications. It is a \emph{syntactic} model of computation, as opposed to the Boolean Circuit model, and hence lower bounds for this model are widely expected to be easier to prove than lower bounds for Boolean circuits. Despite this, we do not have superpolynomial lower bounds against general algebraic circuits of depth 3 (except over constant-sized finite fields) and depth 4 (over any field), while constant-depth Boolean circuit lower bounds have been known since the early 1980s. In this paper, we prove the first superpolynomial lower bounds against general algebraic circuits of all constant depths over all fields of characteristic $0$ (or large). We also prove the first lower bounds against homogeneous algebraic circuits of constant depth over any field. Our approach is surprisingly simple. We first prove superpolynomial lower bounds for constant-depth Set-Multilinear circuits. While strong lower bounds were already known against such circuits, most previous lower bounds were of the form $f(d)\cdot poly(N)$, where $d$ denotes the degree of the polynomial. In analogy with Parameterized complexity, we call this an FPT lower bound. We extend a well-known technique of Nisan and Wigderson (FOCS 1995) to prove non-FPT lower bounds against constant-depth set-multilinear circuits computing the Iterated Matrix Multiplication polynomial $IMM_{n,d}$ (which computes a fixed entry of the product of $d$ $n\times n$ matrices). More precisely, we prove that any set-multilinear circuit of depth $\Delta$ computing $IMM_{n,d}$ must have size at least $n^{d^{\exp(-O(\Delta))}}.$ This result holds over any field. We then show how to convert any constant-depth algebraic circuit of size $s$ to a constant-depth set-multilinear circuit with a blow-up in size that is exponential in $d$ but only polynomial in $s$ over fields of characteristic $0$. (For depths greater than $3$, previous results of this form increased the depth of the resulting circuit to $\Omega(\log s)$.) This implies our constant-depth circuit lower bounds for $d$ that is a slow-growing function of $n$. Finally, we observe that our superpolynomial lower bound for constant-depth circuits implies the first deterministic sub-exponential time algorithm for solving the Polynomial Identity Testing (PIT) problem for all small depth circuits using the known connection between algebraic hardness and randomness.

at June 15, 2021 05:50 AM UTC

In this post, we explore a theorem of Clement, Junqueira, Kate, and Rodrigues from PODC 2012 regarding the limits of non-equivocation. Informally, this theorem says that neither Non-equivocation nor Transferability alone is enough for tolerating minority corruptions in asynchrony. Theorem CJKR12: Neither non-equivocation nor transferability is individually sufficient to solve...

at June 14, 2021 04:04 PM UTC

[H/T Jelani Nelson]

In recent years there has been increasing interest in using machine
learning to improve the performance of classical algorithms in
computer science, by fine-tuning their behavior to adapt to the
properties of the input distribution. This “data-driven” or
“learning-based” approach to algorithm design has the potential to
significantly improve the efficiency of some of the most widely used
algorithms. For example, it has been used to design better data
structures, online algorithms, streaming and sketching algorithms,
market mechanisms and algorithms for combinatorial optimization,
similarity search and inverse problems. This virtual workshop will
feature talks from experts at the forefront of this exciting area.

The workshop will take place virtually on July 13-14, 2021.
Registration is free but mandatory. Link to register: https://fodsi.us/ml4a.html

Confirmed Speakers:

  • Alex Dimakis (UT Austin)
  • Yonina Eldar (Weizmann)
  • Anna Goldie (Google Brain, Stanford)
  • Reinhard Heckel (Technical University of Munich)
  • Stefanie Jegelka (MIT)
  • Tim Kraska (MIT)
  • Benjamin Moseley (CMU)
  • David Parkes (Harvard)
  • Ola Svensson (EPFL)
  • Tuomas Sandholm (CMU, Optimized Markets, Strategy Robot, Strategic Machine)
  • Sergei Vassilvitski (Google)
  • Ellen Vitercik (CMU/UC Berkeley)
  • David Woodruff (CMU)

Organizers:

  • Costis Daskalakis (MIT)
  • Paul Hand (Northeastern)
  • Piotr Indyk (MIT)
  • Michael Mitzenmacher (Harvard)
  • Ronitt Rubinfeld (MIT)
  • Jelani Nelson (UC Berkeley)

Unrelated announcement: JACM looking for editor in chief

(h/t Salil Vadhan)

The Journal of the ACM is looking for a new editor in chief: see the call for nominations. The (soft) deadline to submit nominations (including self nominations) is July 19th and you can do so by emailing Chris Hankin at c.hankin@imperial.ac.uk

by Boaz Barak at June 14, 2021 01:29 PM UTC


A call for nominations for her successor

In her original orange

Éva Tardos’s sentence as Editor-in-Chief (EiC) of the Journal of the ACM (JACM) is coming to an end. She has served almost six years. She will be paroled soon for her good behavior.

Today we announce that a grand jury is being convened. This group is charged with finding the right person to be indicted as the next editor-in-chief of the JACM.

Tips on possible suspects should be directed to lieutenant Chris Hankin here. Note that the location is not far from Scotland Yard. He is in charge of the grand jury:

Laura Haas (University of Massachusetts Amherst, US)
Orna Kupferman (Hebrew University, Israel)
Marta Kwiatkowska (University of Oxford, UK)
Brad Myers (Carnegie Mellon University, US)
Salil Vadhan (Harvard University, US)
Albert Zomaya (University of Sydney, Australia)
Divesh Srivastava (AT&T, US), ACM Pubs Board Liaison

History

The JACM was established in 1954 and has published over 3,000 papers, since then. There were 196 submissions received in 2020 alone. The current editorial board is this. Previous editors-in-chief are:

1954 – 1958: Franz Alt
1959 – 1962: Mario Juncosa
1963 – 1965: Richard Hamming
1966 – 1968: Calvin Gotlieb
1969 – 1972: Gerard Salton
1973 – 1975: Raymond Miller
1976 – 1979: Edward Coffman, Jr.
1979 – 1982: Michael Garey
1983 – 1986: Michael Fischer
1986 – 1990: Daniel Rosenkrantz
1991 -1997: Tom Leighton
1997 – 2003: Joseph Halpern
2003 – 2009: Prabhakar Raghavan
2009 – 2015: Victor Vianu
2015 – currently: Éva Tardos

Selection

Typically many candidates are available, but only the most dramatic cases will be considered. Nominations should include a vita along with a brief statement of why the nominee should be considered. Why they are un-worthy to be made into the editor-in-chief? A vision statement is highly encouraged. The relevant law is found here:

Roles and Responsibilities in ACM Publishing

ACM’s Evaluation Criteria for Editors-in-Chief

Open Problems

All kidding aside. We thank Éva for her hard work keeping the JACM one of the top journals in the world. We can only hope that the next chief will be nearly as sucessful as she has been.

by rjlipton at June 12, 2021 12:25 PM UTC

You may have heard of the Zeckendorf representation according to which any positive integer can be represented as a sum of non-consecutive Fibonacci numbers. Its uses include the optimal strategy in the game of Fibonacci nim. But did you know that it’s possible to efficiently add and subtract Zeckendorf representations?1 The algorithm from the paper linked above takes three passes over the input digit sequences using finite state automata, much like binary number addition can be performed by a single pass of a finite state automaton. I thought it might be interesting to describe an alternative path to the same result, using chip-firing games.

The chip-firing antimatroid

Chip-firing games, or sandpile models, are systems described by a graph, with markers of some kind such as coins on its vertices. The graph may possibly be directed, although in the simplest examples it is undirected, and it may be infinite. Each vertex may have arbitrarily many coins on it. But, if a vertex has at least as many coins as it has outgoing edges, we can “fire” the vertex, moving one coin to each neighbor. This repeats until no more such events are possible. Doing this on an infinite square grid with an initial state that piles many coins on a single vertex leads to pretty fractals; here’s a detail of the result of starting with 30 million coins:

Detail of abelian sandpile model on square grid with 3x10^7 starting coins, extracted from CC-BY-SA image https://commons.wikimedia.org/wiki/File:Sandpile_on_infinite_grid,_3e7_grains.png by colt_browning

However, we’ll be looking at this sort of system on one-dimensional infinite graphs, where their behavior is not quite as complicated.

Chip-firing, on any graph and any placement of coins for which it terminates, forms an antimatroid.2 3 A vertex \(v\) can fire for the \(i\)th time, as long as it has already fired \(i-1\) times and a total of \(i\cdot\deg(v)\) coins have reached it. Once these conditions are met, they remain true until \(v\) fires. This idea, that once the item \((v,i)\) becomes available to be added to the sequence of firings it remains available until it is added, is the defining principle of an antimatroid. From it one can prove that, if any firing sequence terminates in a stable configuration, then all sequences terminate, they all fire the same vertices the same numbers of times, and they all end in the same stable configuration.

Binary carrying as chip-firing

A binary representation of a number \(x\) is just a set of distinct powers of two, summing to \(x\). If you add two binary numbers \(x\) and \(y\), you can combine their sets into a single multiset; carrying can be thought of as a systematic method of getting rid of the duplicate powers of two in this multiset. Whenever you have two equal powers of two in a multiset, whose sum you are trying to represent, you can merge them with the fusion rule

\[2^i+2^i \Rightarrow 2^{i+1}.\]

This can be thought of as a chip-firing game on a graph where there is a vertex for every power of two, an edge to the next larger power of two, and an edge to a “bit bucket” vertex with no outgoing edges (which cannot be fired). Each instantiation of the fusion rule fires vertex \(2^i\), moving one coin to \(2^{i+1}\) and one to the bit bucket.

This is exactly what we are doing when we carry a term in binary addition: taking two powers of two from a column of the addition problem and fusing them into a single power of two in the next column. We can also think of this physically, using a one-dimensional array of cells with poker chips or coins on them, with a coin on cell \(i\) representing the number \(2^i\). The fusion rule takes two coins from any cell and replaces them by a single coin in the next cell. In binary addition of pairs of numbers, there are at most three coins per cell (the two that started out there and one carry), but you could use the same fusion rule for addition of more than two numbers, using an array of cells that can each contain a large pile of coins.

Five stacks of coins. Image Money-2180330 1920.jpg on Wikimedia commons, uploaded by Stella Vogt from pixabay free images, described as "probably free" but tagged with a CC-BY-SA license.

Carrying the coin analogy further, we might imagine that the coins have values that are powers of two, and that we are making change by replacing pairs of small coins by a single larger coin of equal value.

Conventional binary arithmetic does these fusion steps in a systematic order, from low-order bits of the binary representation (smaller powers of two) to higher-order bits (larger powers of two). It’s that systematic order, together with the observation that each pile has at most three coins, that makes this method suitable for a finite state machine. But actually, you could apply the fusion rule in any order, and it would work equally well. Each step reduces the number of coins by one, so you can never do more steps than your starting number of coins. We can only halt when we reach the binary representation of the sum, which is uniquely determined, so this process is confluent: every sequence of choices leads to the same eventual outcome. In this case, confluence can also be seen from the antimatroid property of chip-firing games.

This observation about reducing the number of powers of two gives an immediate proof of a fact about binary representations that you may not have known: the binary representation of \(x+y\) has at most as many nonzero bits as there are in \(x\) and \(y\) separately.

Chip-firing for Fibonacci numbers

Now, instead of coins with power-of-two values, let’s suppose that they have Fibonacci numbers as their values, representing any multiset of Fibonacci numbers. We can still combine pairs of Fibonacci numbers by a fusion rule, that now applies when we have two coins in adjacent piles:

\[F_i + F_{i+1} \Rightarrow F_{i+2}.\]

But in order to get the Zeckendorf representation of the sum of the multiset, we also need to deal with pairs of coins that have the same value, because repeated copies of the same Fibonacci number are not allowed in the Zeckendorf representation. We can accomplish this by a “fission” rule, that steps backwards according to the fusion rule in order to take a bigger step forwards:

\[F_i+F_i \Rightarrow F_{i-2}+F_{i-1}+F_i\Rightarrow F_{i-2}+F_{i+1}.\]

This makes sense even when \(F_i\) is \(1\) or \(2\): for \(F_i=1\), the \(F_{i-2}\) term is zero and can be dropped from the sum, and for \(F_i=2\) it equals one, a Fibonacci number. So as special cases of this rule we have \(1+1\Rightarrow 2\) and \(2+2\Rightarrow 1+3\).

Several issues complicate the analysis of this replacement system. First, although fission by itself is a chip-firing rule on an infinite graph with outdegree two, fission and fusion together make a more complicated system that is not an antimatroid, so different orderings of choosing what to do may lead to different numbers of steps. Second, the fission rule changes the piles of coins both to the left and to the right of the pile to which it applies, making it harder to find a consistent ordering in which to apply these rules. And third, fission preserves the number of coins rather than reducing them, making both the eventual termination of the system and the analysis of how many steps it takes less obvious. But assuming it does terminate, it can only terminate at the Zeckendorf representation, which is unique. So termination implies confluence.

Fission without fusion

First let’s see what happens if we just use the fission rule, forgetting about the fusion rule. Because fission by itself is a chip-firing rule, any initial state has an invariant number of firings and an invariant final state, regardless of the order of operations. For an initial state with \(n\) copies of \(F_i\), the final state gives us an expansion of \(F_i\) into a sum of \(n\) distinct Fibonacci numbers:

\[F_i = F_i\] \[2\,F_i = F_{i-2}+F_{i+1}\] \[3\,F_i=F_{i-2}+F_i+F_{i+1}\] \[4\,F_i=F_{i-4}+F_{i-3}+F_i+F_{i+2}\] \[\vdots\]

The indexes appearing in these identities can be summarized in a triangle of numbers:

\[\begin{array}{ccccccc} &&&&&&0&&&&&&\\ &&&&&-2&&\ 1\ &&&&&\\ &&&&-2&&0&&\ 1\ &&&&\\ &&&-4&&-3&&0&&\ 2\ &&&\\ &&-4&&-3&&-2&&1&&\ 2\ &&\\ &-4&&-3&&-2&&0&&1&&\ 2\ &\\ -6&&-5&&-4&&-3&&-1&&1&&\ 3\ \\ &&&&&&\vdots&&&&&&\\ \end{array}\]

The largest number in each row grows only logarithmically with \(n\), because of the way the values of the Fibonacci numbers grow exponentially as a function of their index. The fission rule cannot create gaps of three or more empty cells, so consecutive numbers in a row differ by at most three, implying that the smallest number in each row is greater than \(-3n\). Because of this linear bound on how far fission can move the coins, we get a quadratic bound on the total number of firing events: each firing decreases the total distance of the coins from cell \(-3n\). This total distance starts at \(3n^2\) and remains positive, so there can be at most \(3n^2\) firings. In computational experiments up to \(n=10000\) the actual number of firings never exceeded \(n^2\) and appeared to be growing as \(n^2\bigl(1-o(1)\bigr)\). This is a big contrast from the behavior of the similar-looking chip-firing rule on a line of cells that replaces pairs of coins by a coin one cell to the left and a coin one cell to the right, which (with \(n\) chips starting in a pile on one cell) produces a square pyramidal number of firings, cubic in \(n\).

For initial coins that are not all in one big pile, it remains impossible for fission to produce new long gaps, and we can decompose the sequence of cells into subsequences of linear length separated by gaps too long to be crossed. The same analysis of total distance of coins from the starts of sequences of cells shows that, regardless of starting state, and even if we also allow fusion rules, we get an \(O(n^2)\) bound on the total number of steps. In particular, this system always terminates.

Taking few steps

Fortunately, we don’t have to take quadratic time to simplify sums of Fibonacci numbers into their Zeckendorf representations using these rules. The problem with the previous analysis was that we were doing too much fission before we did any fusion. Suppose we sidestep that with the following prioritization:

  • If any fusion step would put a coin into a pile that is currently empty, do it.
  • Otherwise, perform a fission step at the pile with the largest possible index.

If any fusion steps are possible, there must be at least one (for instance, the one with the largest index) that is prioritized. To analyze these prioritized steps, define \(N\) to be the current number of coins, and \(M\) to be the number of coins that are at or below a pile with two or more coins on it. The key insight is that whenever a step moves a coin to a pile with a larger index, it cannot form a new larger-index multi-coin pile, so both \(M\) and \(N\) are non-decreasing. They can stay the same only for a fission step at a pile of three or more coins, but in that case the next step must be a fusion. \(M\) and \(N\) are both \(\le n\), and their sum decreases at least once for every two steps, so the total number of steps is \(\le 4n\). Thus, this prioritized chip-firing method can reduce any sum of Fibonacci numbers to its Zeckendorf representation in a linear number of steps. It’s also not difficult to implement so that it takes linear time.

Zeckendorf arithmetic

Anyway, now that we have a way of simplifying sums of Fibonacci numbers to their Zeckendorf form, let’s use it for arithmetic on Zeckendorf representations, without the need to reduce them to simpler forms.

First, addition. This is very easy: Add the two Zeckendorf representations as multisets (producing piles of coins that might be adjacent, or might have two coins in them) and then simplify. The result is a linear time for adding two Zeckendorf representations, as was already known. But because it isn’t based on finite state machines, it can also handle more than two coins, for instance by adding multiple numbers at once rather than having to reduce the problem to multiple pairwise additions. As with binary arithmetic, the chip-firing method gives a quick proof of a non-obvious fact: the Zeckendorf representation of \(x+y\) has at most as many nonzero terms as there are in \(x\) and \(y\) put together.

It’s also possible to use this technique (or really, any linear-time Zeckendorf addition algorithm) as part of a long multiplication algorithm for Zeckendorf representations. To multiply two numbers \(x\) and \(y\), both given by their Zeckendorf representations, first perform a sequence of additions to compute the representations of the numbers \(F_i\cdot x\), each computed as \((F_{i-1}\cdot x)+(F_{i-2}\cdot x)\) in a single addition from earlier numbers in the sequence. Then pick out the subset of these representations for which \(F_i\) belongs to the Zeckendorf representation of \(y\), and add them together. (Or interleave the computation of \(F_i\cdot x\) and the sum of these representations, to save space.) The total time for \(\ell\)-bit inputs is \(O(\ell^2)\), as it is for the usual long multiplication algorithm. I’m using the different variable \(\ell\), because the time here is a function of all the bits in the input Zeckendorf representation, rather than just the number of nonzero bits.

You can convert a binary number to Zeckendorf with the same idea, computing the sequence of powers of two and then picking out and adding together the subset of these powers used in the binary representation of the input. In the other direction, you can convert a Zeckendorf representation to binary by computing the Fibonacci sequence in binary, picking out the terms from the given Zeckendorf representation, and adding them. Both directions of conversion take time \(O(\ell^2)\). A recent paper by Sergeev shows how to do these conversions much more efficient, only logarithmically slower than binary-number multiplication.4 Based on Sergeev’s results it would also be much more efficient (in theory at least, if not necessarily in practice) to multiply Zeckendorf representations by converting to binary, multiplying in binary, and converting back.

As the article I linked at the start hints, the direct algorithms described here are a bit unsatisfactory from the point of view of circuit complexity: the circuit size is ok but the circuit depth is much larger than it would be for binary arithmetic. I’m not convinced that this is a serious deficiency, because we’re not likely to build computers that use this arithmetic at the circuit level. Any algorithms like this are more likely to be used only in specialized situations where Zeckendorf arithmetic is relevant, like the analysis of certain combinatorial games.

  1. Connor Ahlbach, Jeremy Usatine, Christiane Frougny, and Nicholas Pippenger (2013), “Efficient algorithms for Zeckendorf arithmetic”, Fibonacci Quarterly 51 (3): 249–255, arXiv:1207.4497, MR3093678

  2. Anders Björner, László Lovász, and Peter Shor (1991), “Chip-firing games on graphs”, European Journal on Combinatorics 12 (4): 283–291, doi:10.1016/S0195-6698(13)80111-4, MR1120415

  3. Kolja Knauer (2009), “Chip-firing, antimatroids, and polyhedra”, EuroComb 2009, Electronic Notes in Discrete Mathematics 34: 9–13, doi:10.1016/j.endm.2009.07.002, MR2591410

  4. I. S. Sergeev (2018), “On the complexity of Fibonacci coding”, Problems of Information Transmission 54: 343–350, doi:10.1134/S0032946018040038, MR3917588

(Discuss on Mastodon)

by David Eppstein at June 12, 2021 10:47 AM UTC

While in theory bipartite matching should be easy, it has been observed in practice that real instances of matching theoreticians with jobs are hard. It’s been a particularly unusual year with the entire cycle going virtual. Congrats to matched vertices on both sides!

Here is a link to a crowdsourced spreadsheet created to collect information about theory hires this year. I put in a biased pseudorandom seed, please help populate and share! Rules for the spreadsheet have been copied from previous years and all edits to the document are anonymized. Please, feel free to contact me directly or post a comment if you have any suggestions about the rules.

  • You are welcome to add yourself, or people your department has hired.
  • Separate sheets for faculty, industry and postdocs/visitors.
  • Hires should be connected to theoretical computer science, broadly defined.
  • Only add jobs that you are absolutely sure have been offered and accepted. This is not the place for speculation and rumors. Please, be particularly careful when adding senior hires (people who already have an academic or industrial job) -- end dates of their current positions might be still in the future.

Theory Jobs 2021 was originally published by Grigory Yaroslavtsev at The Big Data Theory on June 12, 2021.

by Grigory Yaroslavtsev (grigory@grigory.us) at June 12, 2021 12:00 AM UTC

Scribe notes by Junu Lee, Yash Nair, and Richard Xu.

Previous post: Toward a theory of generalization learning Next post: TBD.

See also all seminar posts and course webpage.

lecture slides (pdf)lecture slides (Powerpoint with animation and annotation)video

Much of the material on causality is taken from the wonderful book by Hardt and Recht.

For fairness, a central source was the book in preparation by Barocas, Hardt, and Narayanan as well as the related NeurIPS 2017 tutorial, and other papers mentioned below.

Causality

We may have heard that “correlation does not imply causation”. How can we mathematically represent this statement, and furthermore differentiate the two rigorously?

Roughly speaking, A and B are correlated if P(B=b|A=a) is different for different values of a. To represent causation, we change the second part of the formula: A causes B if intervening to change A to some value a changes the probability of B. That is,

Pr(B=b|\text{ do }A\leftarrow a)

depends on a.

Example of Causality

Suppose we have the random variables X,W,H (taken over choice of a random person), which represent eXercising, being overWeight and having Heart disease, respectively. We put forth the following (hypothetical! this is not medical advice!) scenarios for their relationships:

Scenario 1. X\sim Bern(1/2). Now W, the overweight indicator, follows the causal relation:

W \sim \begin{cases} 0 & \text{if } X = 1\\ Bern(1/2) & \text{if } X=0\end{cases}

and the heart disease indicator H follows the same rule

H \sim \begin{cases} 0 & \text{if } X = 1\\ Bern(1/2) & \text{if } X=0\end{cases}

So, in this scenario, exercise prevents heart disease and being overweight, while if we don’t exercise, we may be overweight or suffer from heart disease with probability 1/2 independently..

Scenario 2. W\sim Bern(1/4)

X\sim \begin{cases} 0 & \text{if } W= 1\\ Bern(2/3) & \text{if } W=0 \end{cases}

and H still depends on X in the same rule in the previous scenario. So, in this scenario, people are naturally prone to being overweight with probability 1/4, and being overweight makes you less likely to exercise, rather than the causal relation being in the other way around. As before, exercise prevents heart disease, and someone who did not exercise will get heart disease with probability 1/2.

We find that in scenario 1, P(W=1|X=0)=1/2. In scenario 2,

P(W=1|X=0)=\frac{P(W=1)P(X=0|W=1)}{P(X=0)}=\frac{\frac14\cdot 1}{\frac14+\frac34\frac13}=\frac12.

In fact, as this table shows, the probabilities for all combinations of X,W,H are identical in the two scenarios!

Now, consider the intervention of setting X=0, i.e. stop exercising. That is, we change the generating model for X to be X:=0. In scenario 1, P(W=1|\text{ do }X=0) is still 1/2. In scenario 2, X=0 tells us nothing about W now so we get P(W=1|\text{ do }X=0)=1/4. Now that we added in an intervention, the two scenarios are different!

This is an example of why correlations are not causations: while the conditional probabilities \Pr( W=1 |X=0) identical in the two scenarios, the causal probabilities are diffent P(W=1|\text{ do }X=0).

NOTE: Working out this example, and understanding (a) why the two scenarios induce identical probabilities, and in particular all conditional probabilities are identical and (b) why the causal probabilities differ from the conditional probabilities in Scenario 2, is a great way to get intuition for causality and its pitfalls.

Causal Probabilities and confounders

Consider Scenario 1, where the causal structure is as follows:

Looking at the table above, we see that the unconditional probability P(H=1) equals 1/4. Since in this scenario, there is no causal relation between being overweight and suffering from heat disease, the causal probability P(H=1 | \text{ do } W \leftarrow 0) is also equal to 1/4.

However, we can calculate the conditional probability from the table and see that P(H=1|W=0)=1/6.
That means that even though in this scenario, there is no causal relation between being overweight and getting heart disease, conditioning on not being overweight reduces the probability of getting heart disease.
Once again we see here a gap between the conditional and causal probabilities.

The reason is for this gap is that there is a counfounding variable, namely X that is a common cause of both H and W.

Definition: H,W are confounded if there are values h,w such that

P(H=h|\text{ do }W=w)\neq P(H=h|W=e),

To fix the effect of a confounder, we condition on X. It also allows us to find the probability of an intervention. The general deconfounding formula is

P(H=h|\text{ do }W=w)=\sum_x P(H=h|W=w, X=x) P(X=x)\;\; (★),

where X ranges over all the immediate causes of W.

Contrast this with the formula for computing the conditional probability which is

P(H=h | W=w) = \sum_x P(H=h | W=w, X=x) P(X=x | W=w)

Using the deconfounding formula (★) requires (a) knowing the causal graph, and (b) observing the confounders. If we get this wrong and control for the wrong confounders we can get the causal probabilities wrong, as demonstrated by the following example.


One way to describe causality theory is that it aims to clarify the situations under which correlation does in fact equal causation (i.e., the conditional probabilities are equal to the causal probabilities), and how (by appropriately controlling for confounders) we can get to such a situation.

Example (two diseases) Consider the diagram below where there are two diseases X and Y such that each occurs independently with probability p. We assume each will send you to the hospital (variable Z) and those are the only reason to arrive at the hospital.

If you control for Z (i.e look at only people who went to the hospital), we find that the probabilities are now correlated: A priori the probability is p/(2p-p^2)\approx 1/2, and conditioned on X=1, the probability is p^2/p=p <1/2.

This relates to the joke “the probability of having 2 bombs on a plane is very low, so if I bring a bomb then it is very unlikely that there will be another bomb.”

In general, the causal graph can look as one of the following shapes:

If Z is a fork then controlling for Z can tease out the causal relation. If Z is a mediator or collider then controlling for Z can actually make things worse. –>

Backdoor paths: If X and Y are two random variables, we say that there is a “backdoor path” from X to Y if there is direct ancestor Z of X that is connected in the undirected version of the causal graph in a path not going through X.

We can show the following theorem:

Theorem: If there is no backdoor path then P(Y=y | \text{ do } X \leftarrow X) = P(Y=y|X=x)

Here is a “proof by picture”:

If there isn’t a backdoor path, we sort the graph in topological order, so that all the events that happen before X are not connected to Y except through X. So we can first generate all the variables A that result in X. Then the probability distribution of the events B between X and Y only depends on the value x of X, and so similarly Y is generated from some probability distribution that only depends on x.

Experimental Design

When we design experiments, we often want to estimate causal effects, and to do so we try to make sure we eliminate backdoor paths.
Consider the example of a COVID vaccine trial.
We let V=1 be the event that a trial participant obtained a vaccine, and C=1 be the event that the participant was infected with COVID.
We want to figure out P(C=1|\text{ do }V=1).
However, there is a “backdoor path”.
You will not get the vaccine if you don’t participate in the trial (which we denote by V=1), but particpating in the trial could change your behavior and hence have a causal effect on C.

To fix this we can cut the backdoor path using a placebo: it cuts the backward path by removing the confounding variable of participation, since it ensure that (conditioning on P=1), V is now an independent variable from any behavioral changes that might impact C.

Conditioning

In general, how does conditioning on some variable Z affect correlations? It may introduce correlations in events that occur before Z, but cuts any path that depends on Z.

Average Treatment Effect and Propensity Score

Suppose we have some treatment variable T that we don’t get to control (e.g. in a natural experiment). Let Y_t = Y| \text{ do } T=t , and we hope to estimate E(Y_1)-E(Y_0) which is known as the the treatment effect.
However, we worry that some underlying variable Z (e.g. healthy lifestyle) can affect both Y and T.

The propensity score, defined as e(z)=E(T|Z=z), allows us to calculate E(Y|\text{ do T}=1). We claim that as long as Z is a valid confounder (for which the formula (★) holds)

E(Y|\text{ do T}=1)=E(YT/e(Z)).

The proof is obtained by expanding out the claim, see below

Intuitively, knowing the probability that different groups of people get treatment allows us to make T independent from (Y_0,Y_1) and calculate the treatment effect.

Calculating treatment effect using ML. Suppose that the treatment effect is \tau and Y=\psi(Z)+\tau T+\text{ noise}. Now, if we learn a model f(z)\approx E(Y|Z=z), then

Y-f(z)\approx \tau(T-e(z)).

Since both Y-f(z) and T-e(z) are calculable, we only need to do a linear regression.

Instrumental Variables

When we cannot observe the counfounding variable, we can still sometimes use instrumental variables to estimate a causal effect.

Assume a linear model Y=\tau T+f(W), where W is the stuff we don’t observe. If Z is some variable that satisfies Cov(Z,f(W))=0 then

\tau=\frac{Cov(Z,Y)}{Cov(Z,T)},

which is the ratio between two observable quantities.

Fairness

We focus on fairness in classification problems, rather than fairness in learning generative models or representation (which also have their own issues, see in particular this paper by Bender, Gebru, McMillan-Major, and “Shmitchell”).

In the public image, AI has been perceived to be very successful for some tasks, and some people might hope that it is more “objective” or “impartial” than human decisions which are known to be fraught with bias). However, there are some works suggesting this might not be the case:

  • Usage in prediciting recidivism for bail decisions. For example in ProPublica Angwin, Larson, Mattu, and Kirchner showed that 44.9% of African Americans who didn’t reoffend were labeled higher risk, whereas only 23.5% of white defendants who didn’t reoffend were labeled as such.
  • Machine vision can sometimes work better on some segments of population than others. For example, Buolamwini and Gebrue showed that some “gender classifiers” achieve 99.7% accuracy on white men but only 65.3% accuracy (not much better than coin toss) on black women.
  • Lum and Isaac gave an example of a “positive feedback loop” in predictive policing in Oakland, CA. While drug use is fairly uniform across the city, the arrests are centered on particular neighborhood (that have more minority residents). In predictive policing, more police would be sent out to the places where arrests occured, hence only exercabating this disparate treatment.

Drug use in Oakland:

Drug arrests in Oakland:

While algorithms can sometimes also help, the populations they help might not be distributed equally. For example, see this table from Gates, Perry and Zorn. A more accurate underwriting model (that can better predict the default probability) enables a lender to use a more agressive risk cut off and so end up lending to more people.

However, this is true within each subpopulation too, so it may be that if the model is less accurate in a certain subpopulation, then a profit-maximizing lender will unfairly offer fewer loans to this subpopulation.

Formalizing Unfairness

In the case of employment discrimination in the U.S., we have the following components:

  • Protected class
    • categories such as race, sex, nationality, citizenship, veteran status, etc.
  • An unfairness metric, measuring either:
    • disparate treatment
    • disparate impact.

Employers are not allowed to discriminate across protected classes when hiring. The unfairness metric gives us a way to measure if there is discrimination with respected to a protected class. In particular, disparate impacts across different protected classes is often necessary but not sufficient evidence of discrimination.

Algorithms for Fairness: an Example

To see why algorithms, which at first glance seem agnostic to group membership, may exhibit disparate treatment or impact, we consider the following Google visualization by Wattenberg, Viégas, and Hardt.

Consider a blue population and an orange population for which there is no difference in the probability of a member of either population paying back the loan, but for which our model has different accuracies—in particular, the model is more accurate on the orange population. This is described by the plot below, in which the scores correspond to the model’s prediction of the probability of paying back the loan and opaque circles correspond to those who actually do not pay back the loan, whereas filled in circles correspond to those who do.

Suppose we are in charge of making a lending decision given the model prediction.
A scenario in which we give everyone a loan would be fair, but would be bad us —we would go bankrupt!

Profit when giving everyone a loan:

If we wanted to maximize profit, we would, however, give more loans to the orange population (since we’re more sure about which members of the orange population would actually pay back their loans) by setting a lower threshold (in terms of the score given by our algorithm) above which we give out loans.

This maximizes profit but is blatantly unfair. We are treating the identical blue and orange groups differently, just because our model is more accurate on one than the other, and we also have disparate impact on the two groups. A non-defaulting applicant would be 78% likely to get a loan if they are a member of the orange group, but only 60% likely to get a loan if they are a member of the orange group.

This “profit maximization” is likely the end result of any sufficiently complex lending algorithm in the absence of a fairness intervention. Even if the algorithm does not explicitly rely on the group membership attribute, by simply optimizing it to maximize profit, it may well pick up on attributes that are correlated with group membership.

Suppose on the other hand that we wanted to mandate “equal treatment” in the sense of keeping the same thresholds for the blue and orange group. The result would be the following:

In this case, since the threshold are identical, the algorithm will be calibrated. 79% of the decisions we make will be the correct ones, for both the blue and orange population. So, from our point of view, the algorithm is fair and treats the blue and orange populations identically. However, from the point of view of the applicants, this is not the case. If you are a blue applicant that will pay your loan, you have 81% chance of getting a loan, but if you are an orange customer you only have 60% of getting it. This demonstrates that defining fairness is quite delicate. In particular the above “color blind” algorithm is still arguable unfair.

This difference between the point of view of the lender and lendee also arose in the recidivism case mentioned above. From the point of view of the defendant that would not recidivate, the algorithm was more likely to label them as “high risk” if they were Black than if they were white. From the point of view of the decision maker, the algorithm was calibrated, and if anything it was a bit more likely that a white defendant labeled high risk would not recidivate than a Black defendant. See (slightly rounded and simplified) data below

If we wanted to achieve demographic parity (both populations get same total number of loans) or equal opportunity (true positive rate same for both) then we can do so, but again using different thresholds for each group:

Fico Scores and Different Types of Fairness

While the above was a hypothetical scenario, a real life example was shown by Hardt, Price and Srebro using credit (also known as FICO) scores, as described by the plot below:

For a single threshold, around 75% of Asian candidates will get loans, whereas only around 20% of Black candidates will get loans. To ensure that all groups get loans at the same rate, we would need to set the thresholds differently. In order to equalize opportunity, we’d also need to initialize the thresholds differently as well.

We see that we have different notions of what it means to be fair and that each of these different notions result in different algorithms.

Fairness and Causality

Berkeley graduate admissions in 1973 had the following statistics:

  • 44% male applicants admitted, 35% female applicants admitted;
  • However, female acceptance rate was higher at the department level, for most departments.

This paradox is commonly referred to as Simpson’s Paradox.

A “fair” causal model for this scenario might be as follows:

In the above, perhaps gender has a causal impact on the choice of department to which the applicant applies. However, a fair application process would, conditional on the department, be independent of gender of the applicant.

However, not all models that follow this causal structure are necessarily fair. In the case Griggs v. Duke Power Co., 1971, the court ruled that decision-making under the following causal model was unfair:

While the model appears to be fair, since the job offer is conditionall independent of race, given the diploma, the court ruled that the job did not actually require a high school diploma. Hence, using the diploma as a factor in hiring decisions was really just a proxy for race, resulting in essentially purposeful unfair discrimination based on race. This creation of proxies is referred to as redlining.

Bottom Line

We cannot come up with universal fairness criteria. The notion of fairness itself is based on assumptions about:

  • representation of data
  • relationships to unmeasured inputs and outcomes
  • causal relation of inputs, predictions, outcomes.

Fairness depends on what we choose to measure to observe, in both inputs and outputs, and how we choose to act upon them. In particular, we have the following causal structure, wherein measure inputs, decision-making, and measured outcomes all play a role in affecting the real-world and function together in a feedback cycle:

A more comprehensive illustration is given in this paper of Friedler, Scheidegger, and Venkatasubramanian:

by Boaz Barak at June 11, 2021 08:42 PM UTC

I just heard that Benny Chor died this morning. Chor did very important work on computational biology and distributed algorithms, but I (and probably many of my readers) know him primarily for his work on cryptography, for his work on randomness extraction and for introducing the notion of private information retrieval.

I only met him once, at the event for Oded Goldreich’s 60th birthday. On the occasion, he gave a talk on the Chor-Goldreich paper, which introduced the problem of randomness extraction from independent sources, and which introduced min-entropy as the right parameter by which to quantify the randomness content of random sources. He did so using the original slides used for the FOCS 1985 talk.

I took a picture during the talk, which I posted online, and later he sent me an email asking for the original. Sadly, this was the totality of our correspondence. I heard that besides being a brilliant and generous researchers, he was a very playful, likeable and nice person. My thoughts are with his family and his friends.

by luca at June 10, 2021 08:18 PM UTC

This coming Fall semester the Simons Institute for the Theory of Computing in Berkeley will have in-person activities, including the really interesting program on the complexity of statistical inference, within which I will co-organize a workshop on cryptography, average-case complexity, and the complexity of statistical problems.

As it had been the case before the pandemic, all Simons Institute events will be streamed and available remotely. This includes a new series of Public Lectures called “Breakthroughs” that starts next week with a talk by Virginia Williams on matrix multiplication.

by luca at June 10, 2021 08:00 PM UTC

The other night Dana and I watched “The Internet’s Own Boy,” the 2014 documentary about the life and work of Aaron Swartz, which I’d somehow missed when it came out. Swartz, for anyone who doesn’t remember, was the child prodigy who helped create RSS and Reddit, who then became a campaigner for an open Internet, who was arrested for using a laptop in an MIT supply closet to download millions of journal articles and threatened with decades in prison, and who then committed suicide at age 26. I regret that I never knew Swartz, though he did once send me a fan email about Quantum Computing Since Democritus.

Say whatever you want about the tactical wisdom or the legality of Swartz’s actions; it seems inarguable to me that he was morally correct, that certain categories of information (e.g. legal opinions and taxpayer-funded scientific papers) need to be made freely available, and that sooner or later our civilization will catch up to Swartz and regard his position as completely obvious. The beautifully-made documentary filled me with rage and guilt not only that the world had failed Swartz, but that I personally had failed him.

At the time of Swartz’s arrest, prosecution, and suicide, I was an MIT CS professor who’d previously written in strong support of open access to scientific literature, and who had the platform of this blog. Had I understood what was going on with Swartz—had I taken the time to find out what was going on—I could have been in a good position to help organize a grassroots campaign to pressure the MIT administration to urge prosecutors to drop the case (like JSTOR had already done), which could plausibly have made a difference. As it was, I was preoccupied in those years with BosonSampling, getting married, etc., I didn’t bother to learn whether anything was being done or could be done about the Aaron Swartz matter, and then before I knew it, Swartz had joined Alan Turing in computer science’s pantheon of lost geniuses.

But maybe there was something deeper to my inaction. If I’d strongly defended the substance of what Swartz had done, it would’ve raised the question: why wasn’t I doing the same? Why was I merely complaining about paywalled journals from the comfort of my professor’s office, rather than putting my own freedom on the line like Swartz was? It was as though I had to put some psychological distance between myself and the situation, in order to justify my life choices to myself.

Even though I see the error in that way of “thinking,” it keeps recurring, keeps causing me to make choices that I feel guilt or at least regret about later. In February 2020, there were a few smart people saying that a new viral pneumonia from Wuhan was about to upend life on earth, but the people around me certainly weren’t acting that way, and I wasn’t acting that way either … and so, “for the sake of internal consistency,” I didn’t spend much time thinking about it or investigating it. After all, if the fears of a global pandemic had a good chance of being true, I should be dropping everything else and panicking, shouldn’t I? But I wasn’t dropping everything else and panicking … so how could the fears be true?

Then I publicly repented, and resolved not to make such an error again. And now, 15 months later, I realize that I have made such an error again.

All throughout the pandemic, I’d ask my friends, privately, why the hypothesis that the virus had accidentally leaked from the Wuhan Institute of Virology wasn’t being taken far more seriously, given what seemed like a shockingly strong prima facie case. But I didn’t discuss the lab leak scenario on this blog, except once in passing. I could say I didn’t discuss it because I’m not a virologist and I had nothing new to contribute. But I worry that I also didn’t discuss it because it seemed incompatible with my self-conception as a cautious scientist who’s skeptical of lurid coverups and conspiracies—and because I’d already spent my “weirdness capital” on other issues, and didn’t relish the prospect of being sneered at on social media yet again. Instead I simply waited for discussion of the lab leak hypothesis to become “safe” and “respectable,” as today it finally has, thanks to writers who were more courageous than I was. I became, basically, another sheep in one of the conformist herds that we rightly despise when we read about them in history.

(For all that, it’s still plausible to me that the virus had a natural origin after all. What’s become clear is simply that, even if so, the failure to take the possibility of a lab escape more seriously back when the trail of evidence was fresher will stand as a major intellectual scandal of our time.)

Sometimes people are wracked with guilt, but over completely different things than the world wants them to be wracked with guilt over. This was one of the great lessons that I learned from reading Richard Rhodes’s The Making of the Atomic Bomb. Many of the Manhattan Project physicists felt lifelong guilt, not that they’d participated in building the bomb, but only that they hadn’t finished the bomb by 1943, when it could have ended the war in Europe and the Holocaust.

On a much smaller scale, I suppose some readers would still like me to feel guilt about comment 171, or some of the other stuff I wrote about nerds, dating, and feminism … or if not that, then maybe about my defense of a two-state solution for Israel and Palestine, or of standardized tests and accelerated math programs, or maybe my vehement condemnation of Trump and his failed insurrection. Or any of the dozens of other times when I stood up and said something I actually believed, or when I recounted my experiences as accurately as I could. The truth is, though, I don’t.

Looking back—which, now that I’m 40, I confess is an increasingly large fraction of my time—the pattern seems consistent. I feel guilty, not for having stood up for what I strongly believed in, but for having failed to do so. This suggests that, if I want fewer regrets, then I should click “Publish” on more potentially controversial posts! I don’t know how to force myself to do that, but maybe this post itself is a step.

by Scott at June 10, 2021 05:38 PM UTC

Faculty hiring in computer science is a process long due for an overhaul. The pandemic certainly changed some of the dynamics moving most of the interviews online and saving a ton of money and time. Will this be the start of a fresh approach to recruiting?

A typical search in the past few years had some schools flying in 30-40 candidates, typically costing over a $1000 each and a full-time job for a staff member during the search. We'd justify the expense as small compared to the millions we'd invest in a faculty member throughout their career, but it is generally the largest discretionary expense for a CS department. It also gives advantages to rich departments over others.

During the pandemic all those interviews moved online and worked reasonably well at virtually no additional cost. Also no need to scrounge around to find faculty willing to skip family meals to have dinner with the candidates. And if a faculty had a conflict with a candidate on the interview day, they could schedule on a different day. There really is no reason to have all the meetings on the same day.

With the pandemic mostly behind us, will we go back to in-person interviews moving forward. I suspect the airport interview, where you fly out 20 or so candidates to have hour long interviews in a hotel near an airport with a search committee for an administrative position, will be the first to go completely virtual. 

Even for regular faculty interviews, there will be great pressure to reduce the number of in-person visits, perhaps to just the top candidates, or just the ones who have offers--make the "second visit" the only visit. Richer departments may find the expense worthwhile to make a bigger impression on the candidates and that will only expand the advantage of wealthier universities.

Times like this are the perfect opportunity for CS leadership to come in and give some sanity to the hiring process but I'm not holding my breath.

by Lance Fortnow (noreply@blogger.com) at June 10, 2021 01:10 PM UTC

We propose a new approach to the hardness-to-randomness framework and to the promise-BPP=promise-P conjecture. Classical results rely on non-uniform hardness assumptions to construct derandomization algorithms that work in the worst-case, or rely on uniform hardness assumptions to construct derandomization algorithms that work only in the average-case. In both types of results, the derandomization algorithm is ``black-box'' and uses the standard PRG approach. In this work we present results that closely relate **new and natural uniform hardness assumptions** to **worst-case derandomization** of promise-BPP, where the algorithms underlying the latter derandomization are **non-black-box**. In our main result, we show that promise-BPP=promise-P if the following holds: There exists a multi-output function computable by logspace-uniform circuits of polynomial size and depth $n^2$ that cannot be computed by uniform probabilistic algorithms in time $n^c$, for some universal constant $c>1$, on **almost all inputs**. The required failure on ``almost all inputs'' is stronger than the standard requirement of failing on one input of each length; however, the same assumption without the depth restriction on $f$ is **necessary** for the conclusion. This suggests a potential equivalence between worst-case derandomization of promise-BPP of any form (i.e., not necessarily by a black-box algorithm) and the existence of efficiently computable functions that are hard for probabilistic algorithms on almost all inputs. In our second result, we introduce a new and uniform hardness-to-randomness tradeoff for the setting of **superfast average-case derandomization**; prior to this work, superfast average-case derandomization was known only under non-uniform hardness assumptions. In an extreme instantiation of our new tradeoff, under appealing uniform hardness assumptions, we show that for every polynomial $T(n)$ and constant $\epsilon>0$ it holds that $BPTIME[T]\subseteq heur-DTIME[T\cdot n^{\epsilon}]$, where the ``heur'' prefix means that no polynomial-time algorithm can find, with non-negligible probability, an input on which the deterministic simulation errs. Technically, our approach is to design **targeted PRGs and HSGs**, as introduced by Goldreich (LNCS, 2011). The targeted PRGs/HSGs ``produce randomness from the input'', as suggested by Goldreich and Wigderson (RANDOM 2002); and their analysis relies on non-black-box versions of the reconstruction procedure of Impagliazzo and Wigderson (FOCS 1998). Our main reconstruction procedure crucially relies on the ideas underlying the proof system of Goldwasser, Kalai, and Rothblum (J. ACM 2015).

at June 10, 2021 07:46 AM UTC

We prove that there exists an absolute constant $\delta>0$ such any binary code $C\subset\{0,1\}^N$ tolerating $(1/2-\delta)N$ adversarial deletions must satisfy $|C|\le 2^{\poly\log N}$ and thus have rate asymptotically approaching $0$. This is the first constant fraction improvement over the trivial bound that codes tolerating $N/2$ adversarial deletions must have rate going to $0$ asymptotically. Equivalently, we show that there exists absolute constants $A$ and $\delta>0$ such that any set $C\subset\{0,1\}^N$ of $2^{\log^A N}$ binary strings must contain two strings $c$ and $c'$ whose longest common subsequence has length at least $(1/2+\delta)N$. As an immediate corollary, we show that $q$-ary codes tolerating a fraction $1-(1+2\delta)/q$ of adversarial deletions must also have rate approaching $0$. Our techniques include string regularity arguments and a structural lemma that classifies binary strings by their oscillation patterns. Leveraging these tools, we find in any large code two strings with similar oscillation patterns, which is exploited to find a long common subsequence.

at June 09, 2021 04:59 PM UTC

After 16 months without lecturing to an audience in my same location, I gave yesterday two lectures at the Technion in front of a live audience (and some additional audience in remote locations). The main lecture was in COMSOC 2021, an international conference on computational social choice,  and earlier I gave a guest lecture in Roy Meshulam’s class about simple polytopes. I also met many friends. 

Reshef Meir who organized (with Bill Zwicker) COMSOC 2021 wrote:

Hi all, 
today was beyond expectations – the first feeling of a real actual conference after almost a year and a half!  We had about 40 people attending, viewing posters, and listening to talks. I truly hope this will return to be a common scene and that we can all meet face to face soon.
 

In my COMSOC lecture I talked about some earlier ideas and results in my work on social choice, starting with my paper with Ariel Rubinstein and Rani Spiegler on rationalizing individual choice by multiple rationals, and my subsequent attempt to use learnability as a tool for understanding choices of economic agents. This led to interesting questions on social choice that are discussed in this 2009 post.

In Roy’s course I explained h-vectors of polytopes and the Dehn-Sommerville relations based on counting outdegrees of the graph of the polytope when we direct its edges based on a generic abstract objective function. I moved on to present a proof of Blind-Mani’s theorem that the graph of the polytope determines the full combinatorics. This proof is probably the one proof I presented the most and it is given in this 2009 post.

sc1

In my  COMSOC lecture I described how to fill the two question marks in the table above.

sc2

by Gil Kalai at June 09, 2021 06:30 AM UTC

We hope you are all staying safe. With massive vaccination programs across the globe we hope you and your loved ones are getting back to what used to be normal. With that out of the way, let us circle back to Property Testing. This month was less sleepy as compared to the two preceding months and we saw six papers in total (two of them explore problems in quantum property testing). Without further ado, let us take a deeper dive.

GSF-locality is not sufficient for proximity-oblivious testing, by Isolde Adler, Noleen Kohler, Pan Peng (arXiv) The notion of proximity oblivious testers was made explicit in the seminal work of Goldreich and Ron in 2009 [GR09]. A proximity oblivious tester for a graph property is a constant query tester that rejects a graph with probability that monotonically increases with distance to the property. (Edit: Correction) A property is called proximity oblivious testable (or PO testable) if it has a one sided proximity oblivious tester. [GR09] gave a characterization of which properties \(\Pi\) are PO testable in the bounded degree model if and only if it is a “local” property of some kind which satisfies a certain non propagation condition. [GR09] conjectured that all such “local” properties satisfy this non propagation condition. This paper refutes the above conjecture from [GR09].

Coming up next. More action on triangle freeness.

Testing Triangle Freeness in the General Model in Graphs with Arboricity \(O(\sqrt n)\), by Reut Levi (arXiv) PTReview readers are likely to be aware that triangle freeness has been a rich source of problems for developing new sublinear time algorithms. This paper considers the classic problem of testing triangle freeness in general graphs. In the dense case, algorithms with running time depending only on \(\varepsilon\) are known thanks to the work of Alon, Fischer, Krivelevich and Szegedy. In the bounded degree case, Goldreich and Ron gave testers with query complexity \(O(1/\varepsilon)\). This paper explores the problem in general graph case and proves an upper bound of \(O(\Gamma/d_{avg} + \Gamma)\) where \(\Gamma\) is the arboricity of the graph. The author also shows that this upperbound is tight for graphs with arboricity at most \(O(\sqrt n)\). Curiously enough, the algorithm does not take arboricity of the graph as an input and yet \(\Gamma\) (the arboricity) shows up in the upper and lower bounds.

Testing Dynamic Environments: Back to Basics, by Yonatan Nakar and Dana Ron (arXiv) Goldreich and Ron introduced the problem of testing “dynamic environments” in 2014. Here is the setup for this problem. You are given an environment that evolves according to a local rule. Your goal is to query some of the states in the system at some point of time and determine if the system is evolving according to some fixed rule or is far from it. In this paper, the authors consider environments defined by elementary cellular automata which evolve according to threshold rules as one of the first steps towards understanding what makes a dynamic environment tested efficiently. The main result proves the following: if your local rules satisfy some conditions, you can use a meta algorithm with query complexity \(poly(1/\varepsilon)\) which is non adaptive and has one sided error. And all the threshold rules indeed satisfy these conditions which means they can be tested efficiently.

Identity testing under label mismatch, by Clement Canonne and Karl Wimmer (arXiv) This paper considers a classic problem distribution testing with the following twist. Let \(q\) denote a distribution supported on \([n]\). You are given access to samples from another distribution \(p\) where \(p = q \circ \pi\) where \(\pi\) is some unknown permutation. Thus, I relabel the data and I give you access to samples from the relabeled dataset. Under this promise, note that identity testing becomes a trivial problem if \(q\) is known to be uniform over \([n]\). The authors develop algorithms for testing and tolerant testing of distributions under this additional promise of \(p\) being a permutation of some known distribution \(q\). The main result shows as exponential gap between the sample complexity of testing and tolerant testing under this promise. In particular, identity testing under the promise of permutation has sample complexity \(\Theta(\log^2 n)\) whereas tolerant identity testing under this promise has sample complexity \(\Theta(n^{1-o(1)})\).

Testing symmetry on quantum computers, by Margarite L. LaBorde and Mark M. Wilde (arXiv) This paper develops algorithms which test symmetries of a quantum states and changes generated by quantum circuits. These tests additionally also quantify how symmetric these states (or channels) are. For testing what are called “Bose states” the paper presents efficient algorithms. The tests for other kinds of symmetry presented in the paper rely on some aid from a quantum prover.

Quantum proofs of proximity, by Marcel Dall’Agnol, Tom Gur, Subhayan Roy Moulik, Justin Thaler (ECCC) The sublinear time (quantum) computation model has been gathering momentum steadily over the past several years. This paper seeks to understand the power of \({\sf QMA}\) proofs of proximity for property testing (recall \({\sf QMA}\) is the quantum analogue of \({\sf NP}\)). On the algorithmic front, the paper develops sufficient conditions for properties to admit efficient \({\sf QMA}\) proofs of proximity. On the complexity front, the paper demonstrates a property which admits an efficient \({\sf QMA}\) proof but does not admit a \({\sf MA}\) or an interactive proof of proximity.

by Akash at June 09, 2021 05:31 AM UTC

I now have a feature article up at Quanta magazine, entitled “What Makes Quantum Computing So Hard To Explain?” I.e., why do journalists, investors, etc. so consistently get central points wrong, even after the subject has been in public consciousness for more than 25 years? Perhaps unsurprisingly, I found it hard to discuss that meta-level question, as Quanta‘s editors asked me to do, without also engaging in the object-level task of actually explaining QC. For regular Shtetl-Optimized readers, there will be nothing new here, but I’m happy with how the piece turned out.

Accompanying the Quanta piece is a 10-minute YouTube explainer on quantum computing, which (besides snazzy graphics) features interviews with me, John Preskill, and Dorit Aharonov.

On a different note, my colleague Mark Wilde has recorded a punk-rock song about BosonSampling. I can honestly report that it’s some of the finest boson-themed music I’ve heard in years. It includes the following lyrics:

Quantum computer, Ain’t no loser
Quantum computer, Quantum computer

People out on the streets
They don’t know what it is
They think it finds the cliques
Or finds graph colorings
But it don’t solve anything
Said it don’t solve anything
Bosonic slot machine
My lil’ photonic dream

Speaking of BosonSampling, A. S. Popova and A. N. Rubtsov, of the Skolkovo Institute in Moscow, have a new preprint entitled Cracking the Quantum Advantage threshold for Gaussian Boson Sampling. In it, they claim to give an efficient classical algorithm to simulate noisy GBS experiments, like the one six months ago from USTC in China. I’m still unsure how well this scales from 30-40 photons up to 50-70 photons; which imperfections of the USTC experiment are primarily being taken advantage of (photon losses?); and how this relates to the earlier proposed classical algorithms for simulating noisy BosonSampling, like the one by Kalai and Kindler. Anyone with any insight is welcome to share!

OK, one last announcement: the Simons Institute for the Theory of Computing, in Berkeley, has a new online lecture series called “Breakthroughs,” which many readers of this blog might want to check out.

by Scott at June 08, 2021 08:29 PM UTC

We're having an online workshop on "Machine Learning for Algorithms" on July 13-14, with a great group of speakers.  Announcement below, link at https://fodsi.us/ml4a.html, free registration (but please register in advance)!

In recent years there has been increasing interest in using machine learning to improve the performance of classical algorithms in computer science, by fine-tuning their behavior to adapt to the properties of the input distribution. This "data-driven" or "learning-based" approach to algorithm design has the potential to significantly improve the efficiency of some of the most widely used algorithms. For example, they have been used to design better data structures, online algorithms, streaming and sketching algorithms, market mechanisms and algorithms for combinatorial optimization, similarity search and inverse problems.  This virtual workshop will feature talks from experts at the forefront of this exciting area.

The workshop is organized by Foundations of Data Science Institute (FODSI), a project supported by the NSF TRIPODS program (see fodsi.us). To attend, please register at    
 
https://fodsi.us/ml4a.html  

by Michael Mitzenmacher (noreply@blogger.com) at June 08, 2021 07:04 PM UTC

We give a direct product theorem for the entanglement-assisted interactive quantum communication complexity of an $l$-player predicate $V$. In particular we show that for a distribution $p$ that is product across the input sets of the $l$ players, the success probability of any entanglement-assisted quantum communication protocol for computing $n$ copies of $V$, whose communication is $o(\log(\mathrm{eff}^*(V,p))\cdot n)$, goes down exponentially in $n$. Here $\mathrm{eff}^*(V, p)$ is a distributional version of the quantum efficiency or partition bound introduced by Laplante, Lerays and Roland (2014), which is a lower bound on the distributional quantum communication complexity of computing a single copy of $V$ with respect to $p$. As an application of our result, we show that it is possible to do device-independent quantum key distribution (DIQKD) without the assumption that devices do not leak any information after inputs are provided to them. We analyze the DIQKD protocol given by Jain, Miller and Shi (2017), and show that when the protocol is carried out with devices that are compatible with $n$ copies of the Magic Square game, it is possible to extract $\Omega(n)$ bits of key from it, even in the presence of $O(n)$ bits of leakage. Our security proof is parallel, i.e., the honest parties can enter all their inputs into their devices at once, and works for a leakage model that is arbitrarily interactive, i.e., the devices of the honest parties Alice and Bob can exchange information with each other and with the eavesdropper Eve in any number of rounds, as long as the total number of bits or qubits communicated is bounded.

at June 08, 2021 12:31 PM UTC

    Paper    Project Repo    Documentation and Guides    Blog Demo

In our latest paper, in collaboration with Microsoft Research, we introduce 3DB: an extendable, unified framework for debugging and analyzing vision models using photorealistic simulation. We’re releasing 3DB as a package, accompanied by extensive API documentation, guides, and demos.

Note: You are now viewing the Javascript-free/lightweight version of this post—to see the full version (with interactive plots, diagrams, and models!), click here

Identifying failure modes and biases in vision models is a rapidly emerging challenge in machine learning. In high-stakes applications, simply deploying models and collecting failures that arise in the wild is often difficult, expensive, and irresponsible. To this end, a recent line of work in vision focuses on identifying model failure modes via in-depth analyses of image transformations and corruptions, object orientations, backgrounds, or shape-texture conflicts. These studies (and other similarly important ones) reveal a variety of patterns of performance degradation in vision models. Still, performing each such study requires time, developing (often complex) toolingFor our study of image backgrounds, for example, we used a combination of bounding boxes and classical computer vision tools to crop out image backgrounds. We then had to manually filter out the images for which the tools failed. Even for the images where the toolkit succeeded, there remained inevitable cropping artifacts., and a willingness to settle for less than perfect simulations of each potential failure mode. Our question is: can we support reliable discovery of model failures in a systematic, automated, and unified way?

3DB: A Rendering-based Debugging Platform

A sampling of the analyses enabled by 3DB.

A sampling of the analyses enabled by 3DB.

In our latest paper, we try to make progress on this question and propose 3DB, a platform for automatically identifying and analyzing the failure modes of computer vision models using 3D rendering. 3DB aims to allow users to go from a testable, robustness-based hypothesis to concrete, photorealistic experimental evidence with minimal time and effort.

The platform revolves around the modular workflow pictured below. First, users specify a set of 3D objects and environments, as well as a set of 3D (or 2D) transformations called controls that determine the space of admissible object-environment configurations. 3DB then renders a myriad of admissible scenes and feeds them through the user’s computer vision model of choice. The user can finally stratify, aggregate, or otherwise analyze the results either by reading the outputted JSON, or through the pre-packaged dashboard.

An illustration of the 3DB workflow.

3DB easily adapts to a variety of use cases: in particular, users can modify and swap out any part of this pipeline (e.g., the renderer, the logger, the model type, or the controls) for their own custom-written components, without needing to modify any of the 3DB codebase. We’ve compiled guides, extensive API documentation, and a full demo showing how 3DB streamlines model debugging.

In fact, this blog post will double as another demo! We’ll present the (short) code necessary to reproduce every plot in the post below using 3DB. You can download the aggregated code for this blog post here.

To set up, follow the steps below—then, in the remainder of this post, press “Show/hide code and instructions” to see the steps necessary to reproduce each experiment below.

  1. Clone the blog demo repo
  2. Run
    cd blog_demo
    , then
    bash setup.sh
    (assumes
    unzip
    is installed) to download a large Blender environment, then
    cd ../
  3. Install 3DB:
    curl -L https://git.io/Js8eT | bash /dev/stdin threedb
  4. Run
    conda activate threedb
  5. Our experiments below will need a
    BLENDER_DATA
    folder that contains two subfolders:
    blender_models/
    containing 3D models (
    .blend
    files with a single object whose name matches the filename), and
    blender_environments/
    containing environments. We will provide you with these later
  6. Separately, make a file called
    base.yaml
    and paste in the configuration from the next pane.

inference:
  module: 'torchvision.models'
  label_map: 'blog_demo/resources/imagenet_mapping.json'
  class: 'resnet18'
  normalization:
    mean: [0.485, 0.456, 0.406]
    std: [0.229, 0.224, 0.225]
  resolution: [224, 224]
  args:
    pretrained: True
evaluation:
  module: 'threedb.evaluators.classification'
  args:
    classmap_path: 'blog_demo/resources/ycb_to_IN.json'
    topk: 1
render_args:
  engine: 'threedb.rendering.render_blender'
  resolution: 256
  samples: 16
policy:
  module: "threedb.policies.random_search"
  samples: 5
logging:
  logger_modules:
    - "threedb.result_logging.image_logger"
    - "threedb.result_logging.json_logger"

Using 3DB

Prior works have already used 3D rendering (to great effect) to study biases of machine learning models, including pose and context-based biases. Our goal is not to propose a specific 3D-rendering based analysis, but rather to provide an easy-to-use, highly extendable framework that unifies prior analyses (both 3D and 2D) while enabling users to (a) conduct a host of new analyses with the same ease and with realistic results; and (b) effortlessly compose different factors of variation to understand their interplay.

We’ll dedicate the rest of this post to illustrating how one might actually use 3DB in practice, focusing on a single example 3D model3DB works with any 3D model, and we refer the reader to our paper for more examples and details.:

The 3D mug model that we'll be using throughout this post.

In what follows, we will walk through example applications of 3DB to discover biases of ML models (some previously documented, others not). For the sake of brevity, we’ll highlight just a few of these (re-)discoveries—to see more, check out the paper. We’ll then demonstrate that the discoveries of 3DB transfer pretty reliably to the real world!

Our experiments will all operate on an ImageNet-pretrained ResNet-18The classifier has a ~70% validation-set accuracy. that has 42% accuracy on images from the “coffee mug” ImageNet subclass. While we only study classification in this blog post, 3DB also supports object detection and can be easily extended to support other image-based tasks, such as semantic segmentation, too.

Image Background Sensitivity

In one of our previous posts, we continued a long line of prior work (see, e.g., here, here, here, etc.) showing that models can be over-reliant on image backgrounds, and demonstrated that they are easily broken by adversarially chosen backgrounds. To accomplish this, our prior analysis used classical computer vision tools to separate foregrounds from backgrounds, then pasted foregrounds from one image onto backgrounds from another. This process was slow and required extensive quality control to ensure that backgrounds and foregrounds were being extracted properly—and even when they were, a few artifacts remained:

3DB lets us reproduce these findings effortlessly and without introducing such artifacts. To demonstrate this, we use 3DB to render our mug 3D model on hundreds of HDRI backgrounds, resulting in images such as:

We then analyze the performance of a pretrained ResNet-18Recall that this model obtains 42% accuracy on the corresponding "coffee mug" ImageNet class subset.on these images. We find that the performance of the classifier varies significantly across backgrounds, and that accuracy correlates with a crude measure of “background simplicity” (the JPEG compressed size of the image—with smaller size corresponding to being more simple).

Simplicity versus model accuracy for HDRI backgrounds

A graph plotting average accuracy of our pre-trained ImageNet model (y-axis) on images rendered by 3DB while varying the complexity of the rendering backgrounds (x-axis).

A note on compositionality: An important part of 3DB that we don’t discuss here is compositionality, i.e., the ability to put together multiple controls and study their joint effect. For example, in our paper we studied how a model’s prediction vary with various zoom levels and backgrounds of an object. We found that the optimal zoom level varies a lot by background.

Show/hide code and instructions

# $BLENDER_DATA/blender_environments` contains several backgrounds and
# $BLENDER_DATA/blender_models contains the 3D model of a mug.
export BLENDER_DATA=$(realpath blog_demo)/data/backgrounds
# if you want to use the pre-written material in blog_demo, uncomment:
# cd blog_demo

# (Optional) Download additional backgrounds you want---e.g., from
# https://hdrihaven.com/hdris/ (both `.hdr` and `.blend` files work) and put
# them in BLENDER_DATA/blender_environments. 
wget https://hdrihaven.com/hdris/PATH/TO/HDRI \
    -O $BLENDER_DATA/blender_environments

# Direct results
export RESULTS_FOLDER='results_backgrounds'

# Run 3DB (with the YAML file from the next pane saved as `backgrounds.yaml`):
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA backgrounds.yaml $RESULTS_FOLDER 5555

# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!

# Finally, run analysis using pandas (third pane)
python analyze_bgs.py


base_config: "base.yaml"
policy:
  module: "threedb.policies.random_search"
  samples: 20
controls:
  - module: "threedb.controls.blender.orientation"
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.denoiser"


import pandas as pd
import numpy as np
import json

log_lines = open('results_backgrounds/details.log').readlines()
class_map = json.load(open('results_backgrounds/class_maps.json'))
df = pd.DataFrame.from_records(list(map(json.loads, log_lines)))
df['prediction'] = df['prediction'].apply(lambda x: class_map[x[0]])
df['is_correct'] = (df['is_correct'] == 'True')

res = df.groupby('environment').agg(accuracy=('is_correct', 'mean'),
        most_frequent_prediction=('prediction', lambda x: x.mode()))
print(res)

Texture Bias

Another recent study of neural network biases showed that in contrast to humans, convolutional neural networks (CNNs) rely more on texture to recognize objects than on shape. The example below typifies this phenomenon—a cat with an elephant texture is recognized as a cat by humans, but as an elephant by CNNs:

Cue-conflict images introduced by Geirhos et al.

An example of the cue-conflict images introduced by Geirhos et al. Combining the elephant texture with the cat shape results in a mixed image on which CNNs consistently predict with the texture signal.

This example and others like it (dubbed ‘cue-conflict’ images) provide a striking illustration of the contrast between human and CNN-based classification mechanisms. Still, just as in the case of image backgrounds, creating such images typically necessitates time, technical skill, quality control, and/or introduction of unwanted artifacts (for example, in the above figure, ideally we would modify only the texture of the cat without altering the background).

However, using 3DB we can easily collect photorealistic empirical evidence of texture bias. Without modifying the internal 3DB codebase at all, one can write a custom control that modifies the texture of objects in the scene while keeping the rest intact. With this custom control in placeIn fact, the texture-swapping control for this experiment is now pre-packaged with 3DB, since we already wrote it ourselves!, one can simply randomize the texture of the mug across various backgrounds, poses and camera parameters before stratifying results:

Chungus

The performance of the pretrained model on mugs (and other objects) deteriorates severely upon replacing the mug’s texture with a “wrong” one, providing clear corroborating evidence of the texture bias! We noticed in our experiments that for some textures (e.g., zebra), the coffee mug was consistently misclassified as the corresponding animal, whereas for others (e.g., crocodile), the mug is misclassified as either a related class (e.g., turtle or other reptile), or as an unrelated object (e.g., a trash can).

Show/hide code and instructions

# ${BLENDER_DATA}/blender_environments contains several backgrounds,
# ${BLENDER_DATA}/blender_models contain the 3D model of a mug.
export BLENDER_DATA=$(realpath blog_demo)/data/texture_swaps

# List the materials that we will use for this post:
ls blog_demo/data/texture_swaps/blender_control_material
# You can also make or download blender materials corresponding 
# to other textures you want to test, and add them to that folder 

# if you want to use the pre-written material in blog_demo, uncomment:
# cd blog_demo

export RESULTS_FOLDER=results_texture

# Run 3DB (with the YAML file from the next pane saved as texture_swaps.yaml):
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA texture_swaps.yaml $RESULTS_FOLDER 5555

# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!

# Finally, run analysis using pandas (copy from third pane)
python analyze_ts.py


base_config: "base.yaml"
controls:
  - module: "threedb.controls.blender.orientation"
    rotation_x: -1.57
    rotation_y: 0.
    rotation_z: [-3.14, 3.14]
  - module: "threedb.controls.blender.position"
    offset_x: 0.
    offset_y: 0.5
    offset_z: 0.
  - module: "threedb.controls.blender.pin_to_ground"
    z_ground: 0.25
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    view_point_x: 1.
    view_point_y: 1.
    view_point_z: [0., 1.]
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.material"
    replacement_material: ["cow.blend", "elephant.blend", "zebra.blend", "crocodile.blend", "keep_original"]
  - module: "threedb.controls.blender.denoiser"


import pandas as pd
import numpy as np
import json

log_lines = open('results_texture/details.log').readlines()
class_map = json.load(open('results_texture/class_maps.json'))
df = pd.DataFrame.from_records(list(map(json.loads, log_lines)))
df = df.drop('render_args', axis=1).join(pd.DataFrame(df.render_args.values.tolist()))
df['prediction'] = df['prediction'].apply(lambda x: class_map[x[0]])
df['is_correct'] = (df['is_correct'] == 'True')

res = df.groupby('MaterialControl.replacement_material').agg(acc=('is_correct', 'mean'),
      most_frequent_prediction=('prediction', lambda x: x.mode()))
print(res)

Part-of-Object Attribution

Beyond general hypotheses about model biases, 3DB allows us to test vision systems on a more fine-grained level. In the case of our running mug example, for instance, we can use the platform to understand which specific parts of its 3D mesh correlate with classifier accuracy. Specifically, below we generate (and classify) scenes with random mug positions, rotations, and backgrounds. Since 3DB stores texture-coordinate information for each rendering, we can reconstruct a three-dimensional heatmap that encodes, for each point on the surface of the mug, the classifier’s accuracy conditioned on that point being visible:

A part-of-object attribution heatmap for our mug 3D model. Red pixels indicate areas whose visibility most improves accuracy, whereas blue areas' visibility correlates with incorrect classifications.

A number of phenomena stand out from this heatmap, including:

  1. The classifier is worse when the side of the mug opposite the handle is seen.
  2. The classifier is more accurate when the bottom rim is visible.
  3. The classifier performs worst when the inside of the mug is visible.
Show/hide code and instructions

# point BLENDER_DATA to the environments and models for this experiment
export BLENDER_DATA=$(realpath blog_demo)/data/part_of_object
# if you want to use the pre-written material in blog_demo, uncomment:
# cd blog_demo

# Optionally: download additional backgrounds (`.hdr` or `.blend`) e.g.,
wget URL -O $BLENDER_DATA/blender_environments/new_env.hdr

# Point results folder to where you want output written
export RESULTS_FOLDER='results_part_of_object'

# Run 3DB (with the YAML file from the next pane):
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA part_of_object.yaml $RESULTS_FOLDER 5555

# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!

# Run `part_of_object.py` (third pane) to generate the heat map of the mug.
python po_analysis.py


base_config: "base.yaml"
policy:
  module: "threedb.policies.random_search"
  samples: 20
render_args:
  engine: 'threedb.rendering.render_blender'
  resolution: 256
  samples: 16
  with_uv: True
controls:
  - module: "threedb.controls.blender.orientation"
    rotation_x: -1.57
    rotation_y: 0.
    rotation_z: [-3.14, 3.14]
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    view_point_x: 1.
    view_point_y: 1.
    view_point_z: 1.
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.denoiser"
  - module: "threedb.controls.blender.background"
    H: 1.
    S: 0.
    V: 1.

import pandas as pd
import numpy as np
import json
from PIL import Image

DIR = 'results_part_of_object'
log_lines = open(f'{DIR}/details.log').readlines()
df = pd.DataFrame.from_records(list(map(json.loads, log_lines)))

# From class index to class name (for readability)
class_map = json.load(open(f'{DIR}/class_maps.json'))
df['prediction'] = df['prediction'].apply(lambda x: class_map[x[0]])

# We'll be a little lenient here to get a more interesting heatmap
df['is_correct'] = df['prediction'].isin(['cup', 'coffee mug'])

uv_num_correct = np.zeros((256, 256))
uv_num_visible = np.zeros((256, 256))
for imid in df["id"].unique().tolist():
    is_correct = float(df.set_index('id').loc[imid]['is_correct'])
    vis_coords_im = Image.open(f'{DIR}/images/{imid}_uv.png')
    vis_coords = np.array(vis_coords_im).reshape(-1, 3)
    # R and G channels encode texture coordinates (x, y), 
    # B channel is 255 for object and 0 for background
    # So we will filter by B then only look at R and G.
    vis_coords = vis_coords[vis_coords[:,2] > 0][:,:2]

    uv_num_visible[vis_coords[:,0], vis_coords[:,1]] += 1.
    uv_num_correct[vis_coords[:,0], vis_coords[:,1]] += is_correct

# Accuracy = # correct / # visible
uv_accuracy = uv_num_correct / (uv_num_visible + 1e-4)

# Saves a black-and-white heatmap
Image.fromarray((255 * uv_accuracy).astype('uint8'))


Now that we have hypotheses regarding model performance, we can test them! Inspecting the ImageNet validation set, we found that our classifier indeed (a) struggles on coffee mugs when the handle is not showing (providing a feasible explanation for (1), since the side opposite the handle is only visible when the handle itself isn’t), and (b) performs worse at higher camera angles (providing a plausible explanation for (2)). We want to focus, however, on the third phenomenon, i.e., that the classifier performs quite poorly whenever the inside of the mug is visible. Why could this be the case? We can use 3DB to gain insight into the phenomenon. Specifically, we want to test the following hypothesis: when classifying mugs, does our ImageNet model rely on the exact liquid inside the cup?

We investigate this hypothesis by writing a custom control that fills our mug with various liquids (more precisely, a parameterized mixture of water, milk, and coffee):

Examples of the mugs rendered in this experiment

Our running example mug, filled with different parameterized liquids: 100% water (top left), 100% coffee (top right), 100% milk (bottom right), and a coffee-milk mixture (bottom left)

In contrast to the last experiment (where we varied the orientation of the mug), we render scenes containing the mug in a fixed set of poses that reveal the contents—just as in the last experiment, however, we still vary background and mug location. We visualize the results below—each cell in the heatmap corresponds to a fixed mixture of coffee, water, and milk (i.e., the labeled corners are 100% coffee, 100% milk, and 100% water, and the other cells are linear interpolations of these ratios) and the color of the cell encodes the relative accuracy of the classifier when the mug is filled with that liquid:

Measuring the relative effect of the liquid mixture in the mug on model predictions. Each cell represents a specific liquid mixture, and the color of the cell represents the tendency of the model (averaged over random viewpoints and relative to the other cells) to predict "cup"/"pill bottle," "bucket," or "mug."

It turns out that mug content indeed highly impacts classification: our model is much less likely to correctly classify a mug that doesn’t contain coffee! This is just one example of how 3DB can help in proving or disproving hypotheses about model behavior.

From Simulation to Reality

So far, we’ve used 3DB to discover ML models’ various failure modes and biases via photorealistic rendering. To what extent though do the insights gleaned from simulated 3DB experiments actually “transfer” to the physical world?

To test such transferability, we began by creating a 3D model of a physical room we had access to. We also collected eight different 3D models with closely matching physical world counterparts—including the mug analyzed above. Next, we used 3DB to find correctly and incorrectly classified configurations (pose, orientation, location) of these eight objects inside that room. Finally, we replicated these poses (to the best of our abilities) in the physical room, and took photos with a cellphone camera:

Samples from our physical-world experiment.

Examples of simulated scenes (top) and their re-created counterparts (bottom) from our physical-world experiment.

We classified these photos with the same vision model as before and measured how often the simulated classifier correctness matched correctness on the real photographs. We observed an ~85% match! So the failure modes identified by 3DB are not merely simulation artifacts, and can indeed arise in the real world.

Conclusion

3DB is a flexible, easy-to-use, and extensible framework for identifying model failure modes, uncovering biases, and testing fine-grained hypotheses about model behavior. We hope it will prove to be a useful tool for debugging vision models.

Bonus: Object Detection, Web Dashboard, and more!

We’ll wrap up by highlighting some additional capabilities of 3DB that we didn’t get to demonstrate in this blog post:

3DBoard: a web interface for exploring results

In all of the code examples above, we showed how to analyze the results of a 3DB experiment by loading the output into a pandas dataframe. For additional convenience, however, 3DB also comes with a web-based dashboard for exploring experimental results. The following command suffices to visualize the texture swaps experiment from earlier:


python -m threedboard results_texture/ --port 3000


Navigating to YOUR_IP:3000 should lead you to a page that looks like this:

Dashboard screenshot

Object detection and other tasks

In this blog post, we focused on using 3DB to analyze image classification models. However, the library also supports object detection out-of-the-box, and can easily be extended to support a variety of image-based tasks (e.g., segmentation or regression-based tasks). For example, below we provide a simple end-to-end object detection example:

Show/hide code and instructions

# The object detection example is separate from the rest of the blog demo, so
# run the following in a separate repo:
git clone https://github.com/3db/object_detection_demo

# The repo has a data/ folder containing the Blender model (a banana) and some
# HDRI backgrounds, a classmap.json file mapping the UID of the model to a COCO
# class, and the detection.yaml file from the next pane.
cd object_detection_demo/

export BLENDER_DATA=data/
export RESULTS_FOLDER=results/

# Run 3DB
threedb_workers 1 $BLENDER_DATA 5555 > client.log & 
threedb_master $BLENDER_DATA detection.yaml $RESULTS_FOLDER 5555

# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!


inference:
  module: 'torchvision.models.detection'
  class: 'retinanet_resnet50_fpn'
  label_map: './resources/coco_mapping.json'
  normalization:
    mean: [0., 0., 0.]
    std: [1., 1., 1.]
  resolution: [224, 224]
  args:
    pretrained: True
evaluation:
  module: 'threedb.evaluators.detection'
  args:
    iou_threshold: 0.5
    nms_threshold: 0.1
    max_num_boxes: 10
    classmap_path: 'classmap.json'
render_args:
  engine: 'threedb.rendering.render_blender'
  resolution: 256
  samples: 16
  with_segmentation: true
policy:
  module: "threedb.policies.random_search"
  samples: 2
logging:
  logger_modules:
    - "threedb.result_logging.image_logger"
    - "threedb.result_logging.json_logger"
controls:
  - module: "threedb.controls.blender.orientation"
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.denoiser"

at June 08, 2021 12:00 AM UTC

    Paper    Project Repo    Documentation and Guides    Blog Demo

In our latest paper, in collaboration with Microsoft Research, we introduce 3DB: an extendable, unified framework for debugging and analyzing vision models using photorealistic simulation. We’re releasing 3DB as a package, accompanied by extensive API documentation, guides, and demos.

Note: this post contains some interactive plots and 3D models that use JavaScript: click here for a JS-free version of this post.

Identifying failure modes and biases in vision models is a rapidly emerging challenge in machine learning. In high-stakes applications, simply deploying models and collecting failures that arise in the wild is often difficult, expensive, and irresponsible. To this end, a recent line of work in vision focuses on identifying model failure modes via in-depth analyses of image transformations and corruptions, object orientations, backgrounds, or shape-texture conflicts. These studies (and other similarly important ones) reveal a variety of patterns of performance degradation in vision models. Still, performing each such study requires time, developing (often complex) toolingFor our study of image backgrounds, for example, we used a combination of bounding boxes and classical computer vision tools to crop out image backgrounds. We then had to manually filter out the images for which the tools failed. Even for the images where the toolkit succeeded, there remained inevitable cropping artifacts., and a willingness to settle for less than perfect simulations of each potential failure mode. Our question is: can we support reliable discovery of model failures in a systematic, automated, and unified way?

3DB: A Rendering-based Debugging Platform

A sampling of the analyses enabled by 3DB.

A sampling of the analyses enabled by 3DB.

In our latest paper, we try to make progress on this question and propose 3DB, a platform for automatically identifying and analyzing the failure modes of computer vision models using 3D rendering. 3DB aims to allow users to go from a testable, robustness-based hypothesis to concrete, photorealistic experimental evidence with minimal time and effort.

The platform revolves around the modular workflow pictured below. First, users specify a set of 3D objects and environments, as well as a set of 3D (or 2D) transformations called controls that determine the space of admissible object-environment configurations. 3DB then renders a myriad of admissible scenes and feeds them through the user’s computer vision model of choice. The user can finally stratify, aggregate, or otherwise analyze the results either by reading the outputted JSON, or through the pre-packaged dashboard.

An illustration of the 3DB workflow.

3DB easily adapts to a variety of use cases: in particular, users can modify and swap out any part of this pipeline (e.g., the renderer, the logger, the model type, or the controls) for their own custom-written components, without needing to modify any of the 3DB codebase. We’ve compiled guides, extensive API documentation, and a full demo showing how 3DB streamlines model debugging.

In fact, this blog post will double as another demo! We’ll present the (short) code necessary to reproduce every plot in the post below using 3DB. You can download the aggregated code for this blog post here.

To set up, follow the steps below—then, in the remainder of this post, press “Show/hide code and instructions” to see the steps necessary to reproduce each experiment below.

  1. Clone the blog demo repo
  2. Run
    cd blog_demo
    , then
    bash setup.sh
    (assumes
    unzip
    is installed) to download a large Blender environment, then
    cd ../
  3. Install 3DB:
    curl -L https://git.io/Js8eT | bash /dev/stdin threedb
  4. Run
    conda activate threedb
  5. Our experiments below will need a
    BLENDER_DATA
    folder that contains two subfolders:
    blender_models/
    containing 3D models (
    .blend
    files with a single object whose name matches the filename), and
    blender_environments/
    containing environments. We will provide you with these later
  6. Separately, make a file called
    base.yaml
    and paste in the configuration from the next pane.

inference:
  module: 'torchvision.models'
  label_map: 'blog_demo/resources/imagenet_mapping.json'
  class: 'resnet18'
  normalization:
    mean: [0.485, 0.456, 0.406]
    std: [0.229, 0.224, 0.225]
  resolution: [224, 224]
  args:
    pretrained: True
evaluation:
  module: 'threedb.evaluators.classification'
  args:
    classmap_path: 'blog_demo/resources/ycb_to_IN.json'
    topk: 1
render_args:
  engine: 'threedb.rendering.render_blender'
  resolution: 256
  samples: 16
policy:
  module: "threedb.policies.random_search"
  samples: 5
logging:
  logger_modules:
    - "threedb.result_logging.image_logger"
    - "threedb.result_logging.json_logger"

Using 3DB

Prior works have already used 3D rendering (to great effect) to study biases of machine learning models, including pose and context-based biases. Our goal is not to propose a specific 3D-rendering based analysis, but rather to provide an easy-to-use, highly extendable framework that unifies prior analyses (both 3D and 2D) while enabling users to (a) conduct a host of new analyses with the same ease and with realistic results; and (b) effortlessly compose different factors of variation to understand their interplay.

We’ll dedicate the rest of this post to illustrating how one might actually use 3DB in practice, focusing on a single example 3D model3DB works with any 3D model, and we refer the reader to our paper for more examples and details.:

The 3D mug model that we'll be using throughout this post. Click to enable interactivity.

In what follows, we will walk through example applications of 3DB to discover biases of ML models (some previously documented, others not). For the sake of brevity, we’ll highlight just a few of these (re-)discoveries—to see more, check out the paper. We’ll then demonstrate that the discoveries of 3DB transfer pretty reliably to the real world!

Our experiments will all operate on an ImageNet-pretrained ResNet-18The classifier has a ~70% validation-set accuracy. that has 42% accuracy on images from the “coffee mug” ImageNet subclass. While we only study classification in this blog post, 3DB also supports object detection and can be easily extended to support other image-based tasks, such as semantic segmentation, too.

Image Background Sensitivity

In one of our previous posts, we continued a long line of prior work (see, e.g., here, here, here, etc.) showing that models can be over-reliant on image backgrounds, and demonstrated that they are easily broken by adversarially chosen backgrounds. To accomplish this, our prior analysis used classical computer vision tools to separate foregrounds from backgrounds, then pasted foregrounds from one image onto backgrounds from another. This process was slow and required extensive quality control to ensure that backgrounds and foregrounds were being extracted properly—and even when they were, a few artifacts remained:

3DB lets us reproduce these findings effortlessly and without introducing such artifacts. To demonstrate this, we use 3DB to render our mug 3D model on hundreds of HDRI backgrounds, resulting in images such as:

We then analyze the performance of a pretrained ResNet-18Recall that this model obtains 42% accuracy on the corresponding "coffee mug" ImageNet class subset.on these images. We find that the performance of the classifier varies significantly across backgrounds, and that accuracy correlates with a crude measure of “background simplicity” (the JPEG compressed size of the image—with smaller size corresponding to being more simple).

A graph plotting average accuracy of our pre-trained ImageNet model (y-axis) on images rendered by 3DB while varying the complexity of the rendering backgrounds (x-axis).

A note on compositionality: An important part of 3DB that we don’t discuss here is compositionality, i.e., the ability to put together multiple controls and study their joint effect. For example, in our paper we studied how a model’s prediction vary with various zoom levels and backgrounds of an object. We found that the optimal zoom level varies a lot by background.

Show/hide code and instructions

# $BLENDER_DATA/blender_environments` contains several backgrounds and
# $BLENDER_DATA/blender_models contains the 3D model of a mug.
export BLENDER_DATA=$(realpath blog_demo)/data/backgrounds
# if you want to use the pre-written material in blog_demo, uncomment:
# cd blog_demo

# (Optional) Download additional backgrounds you want---e.g., from
# https://hdrihaven.com/hdris/ (both `.hdr` and `.blend` files work) and put
# them in BLENDER_DATA/blender_environments. 
wget https://hdrihaven.com/hdris/PATH/TO/HDRI \
    -O $BLENDER_DATA/blender_environments

# Direct results
export RESULTS_FOLDER='results_backgrounds'

# Run 3DB (with the YAML file from the next pane saved as `backgrounds.yaml`):
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA backgrounds.yaml $RESULTS_FOLDER 5555

# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!

# Finally, run analysis using pandas (third pane)
python analyze_bgs.py


base_config: "base.yaml"
policy:
  module: "threedb.policies.random_search"
  samples: 20
controls:
  - module: "threedb.controls.blender.orientation"
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.denoiser"


import pandas as pd
import numpy as np
import json

log_lines = open('results_backgrounds/details.log').readlines()
class_map = json.load(open('results_backgrounds/class_maps.json'))
df = pd.DataFrame.from_records(list(map(json.loads, log_lines)))
df['prediction'] = df['prediction'].apply(lambda x: class_map[x[0]])
df['is_correct'] = (df['is_correct'] == 'True')

res = df.groupby('environment').agg(accuracy=('is_correct', 'mean'),
        most_frequent_prediction=('prediction', lambda x: x.mode()))
print(res)

Texture Bias

Another recent study of neural network biases showed that in contrast to humans, convolutional neural networks (CNNs) rely more on texture to recognize objects than on shape. The example below typifies this phenomenon—a cat with an elephant texture is recognized as a cat by humans, but as an elephant by CNNs:

Cue-conflict images introduced by Geirhos et al.

An example of the cue-conflict images introduced by Geirhos et al. Combining the elephant texture with the cat shape results in a mixed image on which CNNs consistently predict with the texture signal.

This example and others like it (dubbed ‘cue-conflict’ images) provide a striking illustration of the contrast between human and CNN-based classification mechanisms. Still, just as in the case of image backgrounds, creating such images typically necessitates time, technical skill, quality control, and/or introduction of unwanted artifacts (for example, in the above figure, ideally we would modify only the texture of the cat without altering the background).

However, using 3DB we can easily collect photorealistic empirical evidence of texture bias. Without modifying the internal 3DB codebase at all, one can write a custom control that modifies the texture of objects in the scene while keeping the rest intact. With this custom control in placeIn fact, the texture-swapping control for this experiment is now pre-packaged with 3DB, since we already wrote it ourselves!, one can simply randomize the texture of the mug across various backgrounds, poses and camera parameters before stratifying results:

Choose an Image

Average accuracy (compare to 42% on ImageNet val set)

Interactive demo: select any image in the top two rows to see additional samples of that class.

The performance of the pretrained model on mugs (and other objects) deteriorates severely upon replacing the mug’s texture with a “wrong” one, providing clear corroborating evidence of the texture bias! We noticed in our experiments that for some textures (e.g., zebra), the coffee mug was consistently misclassified as the corresponding animal, whereas for others (e.g., crocodile), the mug is misclassified as either a related class (e.g., turtle or other reptile), or as an unrelated object (e.g., a trash can).

Show/hide code and instructions

# ${BLENDER_DATA}/blender_environments contains several backgrounds,
# ${BLENDER_DATA}/blender_models contain the 3D model of a mug.
export BLENDER_DATA=$(realpath blog_demo)/data/texture_swaps

# List the materials that we will use for this post:
ls blog_demo/data/texture_swaps/blender_control_material
# You can also make or download blender materials corresponding 
# to other textures you want to test, and add them to that folder 

# if you want to use the pre-written material in blog_demo, uncomment:
# cd blog_demo

export RESULTS_FOLDER=results_texture

# Run 3DB (with the YAML file from the next pane saved as texture_swaps.yaml):
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA texture_swaps.yaml $RESULTS_FOLDER 5555

# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!

# Finally, run analysis using pandas (copy from third pane)
python analyze_ts.py


base_config: "base.yaml"
controls:
  - module: "threedb.controls.blender.orientation"
    rotation_x: -1.57
    rotation_y: 0.
    rotation_z: [-3.14, 3.14]
  - module: "threedb.controls.blender.position"
    offset_x: 0.
    offset_y: 0.5
    offset_z: 0.
  - module: "threedb.controls.blender.pin_to_ground"
    z_ground: 0.25
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    view_point_x: 1.
    view_point_y: 1.
    view_point_z: [0., 1.]
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.material"
    replacement_material: ["cow.blend", "elephant.blend", "zebra.blend", "crocodile.blend", "keep_original"]
  - module: "threedb.controls.blender.denoiser"


import pandas as pd
import numpy as np
import json

log_lines = open('results_texture/details.log').readlines()
class_map = json.load(open('results_texture/class_maps.json'))
df = pd.DataFrame.from_records(list(map(json.loads, log_lines)))
df = df.drop('render_args', axis=1).join(pd.DataFrame(df.render_args.values.tolist()))
df['prediction'] = df['prediction'].apply(lambda x: class_map[x[0]])
df['is_correct'] = (df['is_correct'] == 'True')

res = df.groupby('MaterialControl.replacement_material').agg(acc=('is_correct', 'mean'),
      most_frequent_prediction=('prediction', lambda x: x.mode()))
print(res)

Part-of-Object Attribution

Beyond general hypotheses about model biases, 3DB allows us to test vision systems on a more fine-grained level. In the case of our running mug example, for instance, we can use the platform to understand which specific parts of its 3D mesh correlate with classifier accuracy. Specifically, below we generate (and classify) scenes with random mug positions, rotations, and backgrounds. Since 3DB stores texture-coordinate information for each rendering, we can reconstruct a three-dimensional heatmap that encodes, for each point on the surface of the mug, the classifier’s accuracy conditioned on that point being visible:

A part-of-object attribution heatmap for our mug 3D model. Red pixels indicate areas whose visibility most improves accuracy, whereas blue areas' visibility correlates with incorrect classifications. Click here to enable interactivity.

A number of phenomena stand out from this heatmap, including:

  1. The classifier is worse when the side of the mug opposite the handle is seen.
  2. The classifier is more accurate when the bottom rim is visible.
  3. The classifier performs worst when the inside of the mug is visible.
Show/hide code and instructions

# point BLENDER_DATA to the environments and models for this experiment
export BLENDER_DATA=$(realpath blog_demo)/data/part_of_object
# if you want to use the pre-written material in blog_demo, uncomment:
# cd blog_demo

# Optionally: download additional backgrounds (`.hdr` or `.blend`) e.g.,
wget URL -O $BLENDER_DATA/blender_environments/new_env.hdr

# Point results folder to where you want output written
export RESULTS_FOLDER='results_part_of_object'

# Run 3DB (with the YAML file from the next pane):
threedb_workers 1 $BLENDER_DATA 5555 > client.log &
threedb_master $BLENDER_DATA part_of_object.yaml $RESULTS_FOLDER 5555

# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!

# Run `part_of_object.py` (third pane) to generate the heat map of the mug.
python po_analysis.py


base_config: "base.yaml"
policy:
  module: "threedb.policies.random_search"
  samples: 20
render_args:
  engine: 'threedb.rendering.render_blender'
  resolution: 256
  samples: 16
  with_uv: True
controls:
  - module: "threedb.controls.blender.orientation"
    rotation_x: -1.57
    rotation_y: 0.
    rotation_z: [-3.14, 3.14]
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    view_point_x: 1.
    view_point_y: 1.
    view_point_z: 1.
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.denoiser"
  - module: "threedb.controls.blender.background"
    H: 1.
    S: 0.
    V: 1.

import pandas as pd
import numpy as np
import json
from PIL import Image

DIR = 'results_part_of_object'
log_lines = open(f'{DIR}/details.log').readlines()
df = pd.DataFrame.from_records(list(map(json.loads, log_lines)))

# From class index to class name (for readability)
class_map = json.load(open(f'{DIR}/class_maps.json'))
df['prediction'] = df['prediction'].apply(lambda x: class_map[x[0]])

# We'll be a little lenient here to get a more interesting heatmap
df['is_correct'] = df['prediction'].isin(['cup', 'coffee mug'])

uv_num_correct = np.zeros((256, 256))
uv_num_visible = np.zeros((256, 256))
for imid in df["id"].unique().tolist():
    is_correct = float(df.set_index('id').loc[imid]['is_correct'])
    vis_coords_im = Image.open(f'{DIR}/images/{imid}_uv.png')
    vis_coords = np.array(vis_coords_im).reshape(-1, 3)
    # R and G channels encode texture coordinates (x, y), 
    # B channel is 255 for object and 0 for background
    # So we will filter by B then only look at R and G.
    vis_coords = vis_coords[vis_coords[:,2] > 0][:,:2]

    uv_num_visible[vis_coords[:,0], vis_coords[:,1]] += 1.
    uv_num_correct[vis_coords[:,0], vis_coords[:,1]] += is_correct

# Accuracy = # correct / # visible
uv_accuracy = uv_num_correct / (uv_num_visible + 1e-4)

# Saves a black-and-white heatmap
Image.fromarray((255 * uv_accuracy).astype('uint8'))


Now that we have hypotheses regarding model performance, we can test them! Inspecting the ImageNet validation set, we found that our classifier indeed (a) struggles on coffee mugs when the handle is not showing (providing a feasible explanation for (1), since the side opposite the handle is only visible when the handle itself isn’t), and (b) performs worse at higher camera angles (providing a plausible explanation for (2)). We want to focus, however, on the third phenomenon, i.e., that the classifier performs quite poorly whenever the inside of the mug is visible. Why could this be the case? We can use 3DB to gain insight into the phenomenon. Specifically, we want to test the following hypothesis: when classifying mugs, does our ImageNet model rely on the exact liquid inside the cup?

We investigate this hypothesis by writing a custom control that fills our mug with various liquids (more precisely, a parameterized mixture of water, milk, and coffee):

Examples of the mugs rendered in this experiment

Our running example mug, filled with different parameterized liquids: 100% water (top left), 100% coffee (top right), 100% milk (bottom right), and a coffee-milk mixture (bottom left)

In contrast to the last experiment (where we varied the orientation of the mug), we render scenes containing the mug in a fixed set of poses that reveal the contents—just as in the last experiment, however, we still vary background and mug location. We visualize the results below—each cell in the heatmap corresponds to a fixed mixture of coffee, water, and milk (i.e., the labeled corners are 100% coffee, 100% milk, and 100% water, and the other cells are linear interpolations of these ratios) and the color of the cell encodes the relative accuracy of the classifier when the mug is filled with that liquid:

Measuring the relative effect of the liquid mixture in the mug on model predictions. Each cell represents a specific liquid mixture, and the color of the cell represents the tendency of the model (averaged over random viewpoints and relative to the other cells) to predict "cup"/"pill bottle," "bucket," or "mug."

It turns out that mug content indeed highly impacts classification: our model is much less likely to correctly classify a mug that doesn’t contain coffee! This is just one example of how 3DB can help in proving or disproving hypotheses about model behavior.

From Simulation to Reality

So far, we’ve used 3DB to discover ML models’ various failure modes and biases via photorealistic rendering. To what extent though do the insights gleaned from simulated 3DB experiments actually “transfer” to the physical world?

To test such transferability, we began by creating a 3D model of a physical room we had access to. We also collected eight different 3D models with closely matching physical world counterparts—including the mug analyzed above. Next, we used 3DB to find correctly and incorrectly classified configurations (pose, orientation, location) of these eight objects inside that room. Finally, we replicated these poses (to the best of our abilities) in the physical room, and took photos with a cellphone camera:

Samples from our physical-world experiment.

Examples of simulated scenes (top) and their re-created counterparts (bottom) from our physical-world experiment.

We classified these photos with the same vision model as before and measured how often the simulated classifier correctness matched correctness on the real photographs. We observed an ~85% match! So the failure modes identified by 3DB are not merely simulation artifacts, and can indeed arise in the real world.

Conclusion

3DB is a flexible, easy-to-use, and extensible framework for identifying model failure modes, uncovering biases, and testing fine-grained hypotheses about model behavior. We hope it will prove to be a useful tool for debugging vision models.

Bonus: Object Detection, Web Dashboard, and more!

We’ll wrap up by highlighting some additional capabilities of 3DB that we didn’t get to demonstrate in this blog post:

3DBoard: a web interface for exploring results

In all of the code examples above, we showed how to analyze the results of a 3DB experiment by loading the output into a pandas dataframe. For additional convenience, however, 3DB also comes with a web-based dashboard for exploring experimental results. The following command suffices to visualize the texture swaps experiment from earlier:


python -m threedboard results_texture/ --port 3000


Navigating to YOUR_IP:3000 should lead you to a page that looks like this:

Dashboard screenshot

Object detection and other tasks

In this blog post, we focused on using 3DB to analyze image classification models. However, the library also supports object detection out-of-the-box, and can easily be extended to support a variety of image-based tasks (e.g., segmentation or regression-based tasks). For example, below we provide a simple end-to-end object detection example:

Show/hide code and instructions

# The object detection example is separate from the rest of the blog demo, so
# run the following in a separate repo:
git clone https://github.com/3db/object_detection_demo

# The repo has a data/ folder containing the Blender model (a banana) and some
# HDRI backgrounds, a classmap.json file mapping the UID of the model to a COCO
# class, and the detection.yaml file from the next pane.
cd object_detection_demo/

export BLENDER_DATA=data/
export RESULTS_FOLDER=results/

# Run 3DB
threedb_workers 1 $BLENDER_DATA 5555 > client.log & 
threedb_master $BLENDER_DATA detection.yaml $RESULTS_FOLDER 5555

# Analyze results in the dashboard
python -m threedboard $RESULTS_FOLDER --port 3000
# Navigate to localhost:3000 to view the results!


inference:
  module: 'torchvision.models.detection'
  class: 'retinanet_resnet50_fpn'
  label_map: './resources/coco_mapping.json'
  normalization:
    mean: [0., 0., 0.]
    std: [1., 1., 1.]
  resolution: [224, 224]
  args:
    pretrained: True
evaluation:
  module: 'threedb.evaluators.detection'
  args:
    iou_threshold: 0.5
    nms_threshold: 0.1
    max_num_boxes: 10
    classmap_path: 'classmap.json'
render_args:
  engine: 'threedb.rendering.render_blender'
  resolution: 256
  samples: 16
  with_segmentation: true
policy:
  module: "threedb.policies.random_search"
  samples: 2
logging:
  logger_modules:
    - "threedb.result_logging.image_logger"
    - "threedb.result_logging.json_logger"
controls:
  - module: "threedb.controls.blender.orientation"
  - module: "threedb.controls.blender.camera"
    zoom_factor: [0.7, 1.3]
    aperture: 8.
    focal_length: 50.
  - module: "threedb.controls.blender.denoiser"

at June 08, 2021 12:00 AM UTC

The Simons institute started a new virtual seminar series highlighting recent advances in theoretical computer science. The first two talks in the series will be:

by Boaz Barak at June 07, 2021 06:18 PM UTC

ICML 2021, one of the biggest conferences in machine learning, naturally has a ton of interesting sounding papers on the topic of differential privacy. We went through this year’s accepted papers and aggregated all the relevant papers we could find. In addition, this year features three workshops on the topic of privacy, as well as a tutorial. As always, please inform us if we overlooked any papers on differential privacy.

Workshops

Tutorial

Papers

by Gautam Kamath at June 07, 2021 04:30 PM UTC

I have recently posted the paper [Vio21] (download) which does something that I have been trying to do for a long time, more than ten years, on and off. Consider the basic data-structure problem of storing m bits of data x\in \{0,1\}^{m} into m+r bits so that the prefix-sum queries

\begin{aligned} \mathbb {\text {\textsc {Rank}}}(i):=\sum _{j\le i}x_{j} \end{aligned}

can be computed by probing q cells (or words) of w bits each. (You can think w=\log m throughout this post.) The paper [PV10] with Pǎtraşcu shows that r\ge m/w^{O(q)}, and this was recently shown to be tight by Yu [Yu19] (building on the breakthrough data structure [Pǎt08] which motivated the lower bound and is not far from it).

As is common in data-structure lower bounds, the proof in [PV10] is an encoding argument. In the recently posted paper, an alternative proof is presented which avoids the encoding argument and is perhaps more in line with other proofs in complexity lower bounds. Of course, everything is an encoding argument, and nothing is an encoding argument, and this post won’t draw a line.

The new proof establishes an intrinsic property of efficient data structures, whereas typical proofs including [PV10] are somewhat tailored to the problem at hand. The property is called the separator and is a main technical contribution of the work. At the high level the separator shows that in any efficient data structure you can restrict the input space a little so that many queries are nearly pairwise independent.

Also, the new proof rules out a stronger object: a sampler (see previous post here on sampling lower bounds). Specifically, the distribution Rank(U) where U is the uniform distribution cannot be sampled, not even slightly close, by an efficient cell-probe algorithm. This implies the data-structure result, and it can be informally interpreted as saying that the “reason” why the lower bound holds is not that the data is compressed, but rather that one can’t generate the type of dependencies occurring in Rank via an efficient cell-probe algorithm, regardless of what the input is.

Building on this machinery, one can prove several results about sampling, like showing that cell-probe samplers are strictly weaker than AC0 samplers. While doing this, it occurred to me that one gets a corollary for data structures which I had not seen in the literature. The corollary is a probe hierarchy, showing that some problem can be solved with zero redundancy (r=0) with O(q) probes, while it requires almost linear r for q probes. For example I don’t know of a result yielding this for small q such as q=O(1); I would appreciate a reference. (As mentioned in the paper, the sampling viewpoint is not essential and just like for Rank one can prove the data-structure corollaries directly. Personally, and obviously, I find the sampling viewpoint useful.)

One of my favorite open problems in the area still is: can a uniform distribution over [m] be approximately sampled by an efficient cell-probe algorithm? I can’t even rule out samplers making two probes!

References

[Pǎt08]   Mihai Pǎtraşcu. Succincter. In 49th IEEE Symp. on Foundations of Computer Science (FOCS). IEEE, 2008.

[PV10]   Mihai Pǎtraşcu and Emanuele Viola. Cell-probe lower bounds for succinct partial sums. In 21th ACM-SIAM Symp. on Discrete Algorithms (SODA), pages 117–122, 2010.

[Vio21]   Emanuele Viola. Lower bounds for samplers and data structures via the cell-probe separator. Available at http://www.ccs.neu.edu/home/viola/, 2021.

[Yu19]    Huacheng Yu. Optimal succinct rank data structure via approximate nonnegative tensor decomposition. In Moses Charikar and Edith Cohen, editors, ACM Symp. on the Theory of Computing (STOC), pages 955–966. ACM, 2019.

by Manu at June 07, 2021 03:49 PM UTC


You know, I loved math. My mom was a math teacher—Joan Cusack

Mary Kay Farley, my dear wife’s mom, and Dorothy Lipton, my mom, have unfortunately both passed away. Kathryn and I miss them greatly. Both women shared keen mathematical skills, a fascination with the game of baseball and a commitment to living a well-ordered life.

Today is not Mother’s Day. We still hope all mothers everywhere are enjoying their day.

We will take this time to thank all of you out there. We missed doing so last month, but the pandemic has distended time anyway. What we are hearing now are stories of mothers and children and grandchildren finally being able to think of seeing each other in person rather than via video.

Another Kind of Parentage

This has blended with musings on our recent post in which I (Dick) noted that Dorit Aharonov is an academic grandchild of mine, in that Avi Wigderson co-supervised her doctoral thesis and I supervised Avi’s.

Years ago we featured on Father’s Day a post with the title “Who’s Your Doktorvater?”—which was a play on the expression “who’s your daddy?” Now it is high time to note that there are many “doctor mothers”—as Dorit has herself become.

One difference from human genealogy is that most often there is only one “doctor parent.” My advisor, David Parnas, has two: Alan Perlis and Everard Williams. From Perlis it is a straight shot back to Siméon Poisson, whose 1800 dissertation was co-advised by Joseph Lagrange and Pierre Laplace. For Lagrange there is a strange note of Leonhard Euler as a virtual advisor, but the real one is Giovanni Beccaria—who has no listed parent. Going through Laplace also dead-ends. But selecting Euler includes a chain that ends in the 1100s with Sharaf al-Dīn al-Ṭūsī, who improved the complexity of approximately solving cubic equations.

I appear not to have any female ancestors in my doctoral genealogy. I have two female PhD graduates, one of whom is a Doktormutter. Ken’s first female doctoral student, co-advised, had a successful thesis defense last week; he has another nearing the ABD stage. But I have known quite a few other “doctor mothers” personally. Today, Ken and I thought to recognize them.

Some Doctor Moms I Know

Here are some that I have had the honor to know. They are in a certain order—do you see what it is? I give only the surname on purpose—click the second name and note its URL for a singular reflection of this.

The last gives us an all-female tree, not just one branch, of people we know. Besides Anna Gilbert, another of Ingrid Daubechies’s students, who herself has advisees, is Cynthia Rudin of Duke, whom Ken knew and taught while she was an undergraduate at Buffalo.

There are others I could mention who went into research labs where there are different relationships besides PhD advising. They include Irene Greif, Tal Rabin, Lynn Conway, and Jean Sammet. I could include Jamie Morgenstern, whom we recently featured and who his advising her first students at the University of Washington—do they have to be “born” yet to count you as a Doktormutter? I’ve left others out—apologies for that—but the ones I’ve listed make a nice {4 \times 4} collage:


My Upbringing

Of these, the one with the most formative impact on me was Helena Rasiowa. I learned advanced logic from her when I was an undergraduate.

Here is a tribute to her by Melvin Fitting:

I once heard Dana Scott criticize her book, The Mathematics of Metamathematics with Roman Sikorski, because, while it took an algebraic approach to logic, it did not carry the work further and consider set theory. If it had, then forcing would have been discovered years earlier than it was. This is not, at heart, a criticism, but a tribute. The building of mathematics always goes on. Foundations, firmly laid, enable later construction, and the foundations laid by that book were powerfully firm.

Ken also points to logic as a formative influence—though from men at Oxford. Both of us were attracted to Gödel-type undecidability issues in complexity theory in the early 1980s. The nexus of logic and algebra has been important to us in different ways, Ken more with finite automata and descriptive complexity. Courses in logic gave both of us a habit of framing problems along formal lines.

Open Problems

Which “doctor mothers” have you known or been influenced by?


[Added and re-formatted photos at top]

by RJLipton+KWRegan at June 07, 2021 04:58 AM UTC

July 13-14, 2021 Foundations of Data Science Institute (FODSI) https://fodsi.us/ml4a.html In recent years there has been increasing interest in using machine learning to improve the performance of classical algorithms in computer science, by fine-tuning their behavior to adapt to the properties of the input distribution. This “data-driven” or “learning-based” approach to algorithm design has the … Continue reading Workshop on Machine Learning for Algorithms

by shacharlovett at June 07, 2021 04:13 AM UTC

 If I was refering to the paper with bibtex entry: 


@misc{BCDDL-2018,

  author    = {Jeffrey Bosboom and

               Spencer Congero and

               Erik D. Demaine and

               Martin L. Demaine and

               Jayson Lynch},

  title     = {Losing at Checkers is Hard},

  year      = {2018},

  note      = {\newline\url{http://arxiv.org/abs/1806.05657}},

}

(The paper is here.)

I would write 

Bosboom et al.~\cite{BCDDL-2018} proved that trying to lose at checkers (perhaps you are playing a child and want to keep up their self-esteem, or a Wookie and don't want your arms to be torn off your shoulders   (see here),  or a Wookie child) is hard. 


Why did I use et al. ? Because it would be a pain to write out all of those names. 

How many names does it take to make you write et al. ? Are there exceptions? 

I have not seen any discussion of this point on the web. So here are my rules of thumb and some questions.(CORRECTION- a commenter points out that this IS discussed on the web. Even so, you can comment on my thoughts or give your thoughts or whatever you like.) 

1) If  there are 3 or more people than use et al. Exception: If the three are VERY WELL KNOWN as a triple. E.g., the double play was Tinker to Evers to Chance. Well, that would not really come up since in baseball nobody ever says the double play was Tinker et al.  More relevant examples:

Aho, Hopcroft, and Ullman

Cormen, Leiserson, and Rivest, also known as CLR

The more recent edition is

Cormen, Leiserson, Rivest, and Stein. I have heard CLRS but I don't know if people WRITE all four names. 

Lenstra-Lenstra-Lovasz also usually mentions all three. 

2) If there is one name do not use et al.  unless that one person has a multiple personality disorder.

3) If there are 2 people it can be tricky and perhaps unfair. If the second one has a long name then I am tempted to use et al. For example

Lewis and Papadimitriou (If I mispelled Christos's name-- well- that's  the point!- to avoid spelling errors I want to use et al. )

Lampropoulos and Paraskevopoulou (the first one is UMCP new faculty!). After typing in the first name I would not be in the mood to type in the second. 

Similar if there are lots of accents in the name making it hard to type in LaTeX (though I have macros for some people like Erdos who come up a lot) then I might use et al. 

(ADDED LATER- some of the commenters object to my `rule' of leaving out the last name if its complicated. That is not really my rule- the point of this post was to get a discussion going about the issue, which I am happy to say has happened.) 

----------

There are other issues along these lines: when to include the first name (when there is more than one person with that last name, e.g. Ryan Williams and Virginia  Vassilevska Williams), when to use middle initials (in the rare case where there is someone with the same first and last name- Michael  J. Fox added the J and uses it since there was an actor named Michael Fox.)

I will soon give a quote from a math paper that amused me, but first some context.  The problem of determining if a poly in n variables over Z has an integer solution is called E(n). By the solution to Hilbert's 10th problem we know that there exists n such that E(n) is undecidable. E(9) is undecidable, but the status of E(8) is unknown (as of May 2021) and has been since the early 1980's. 

Here is the quote (from here).

Can we replace 9 by a smaller number? It is believed so. In fact, A. Baker, Matiyasevich and J.Robinson  even conjectured that E(3) is undecidable over N.

Note that Baker and Robinson get their first initial but Matiyasevich does not.

I suspect that they use J. Robinson since there is another mathematician with last name Robinson: Rafael Robinson who was Julia's Robinson's husband (to my younger readers--- there was a time when a women who got married took her husband's last name). There is at least one other Math-Robinson: Robert W Robinson. I do not think he is closely related. 

Baker: I do not know of another mathematician named Baker. I tried Google, but the Bakers  I found were   not in the right time frame. I also kept finding hits to an anecdote about Poincare and a man whose profession was a baker (see here though its later in that blog post). However, I suspect there was another mathematician named Baker which is why the author uses the first initial.  Its possible the author did not want to confuse Alan  Baker with Theodore Baker, one of the authors of Baker-Gill-Solovay that showed there were oracles that made P = NP and others that made P NE NP.  But somehow, that just doesn't seem right to me. I suspect there is only one mathematician with last name Matijsavic. 

Thomas Alva Edison named his son Thomas Alva Edison Jr.  This was a bad idea but not for reasons of authorship, see here.


by gasarch (noreply@blogger.com) at June 07, 2021 03:23 AM UTC

The stabilizer rank of a quantum state $\psi$ is the minimal $r$ such that $\left| \psi \right \rangle = \sum_{j=1}^r c_j \left|\varphi_j \right\rangle$ for $c_j \in \mathbb{C}$ and stabilizer states $\varphi_j$. The running time of several classical simulation methods for quantum circuits is determined by the stabilizer rank of the $n$-th tensor power of single-qubit magic states. We prove a lower bound of $\Omega(n)$ on the stabilizer rank of such states, improving a previous lower bound of $\Omega(\sqrt{n})$ of Bravyi, Smith and Smolin [BSS16]. Further, we prove that for a sufficiently small constant $\delta$, the stabilizer rank of any state which is $\delta$-close to those states is $\Omega(\sqrt{n}/\log n)$. This is the first non-trivial lower bound for approximate stabilizer rank. Our techniques rely on the representation of stabilizer states as quadratic functions over affine subspaces of $\mathbb{F}_2^n$, and we use tools from analysis of boolean functions and complexity theory. The proof of the first result involves a careful analysis of directional derivatives of quadratic polynomials, whereas the proof of the second result uses Razborov-Smolensky low degree polynomial approximations and correlation bounds against the majority function.

at June 06, 2021 07:23 PM UTC